Site Reliability Engineer - San Francisco, United States - Together AI

Together AI San Francisco, United States

2 weeks ago

Description

As a Site Reliability Engineer (SRE) at Together, you are responsible for keeping all user-facing services and production systems running smoothly.

You are a blend of pragmatic operator and software engineer who applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.

You specialize in systems (operating systems, storage subsystems, networking) and implement best practices for availability, reliability, and scalability, with varied interests in algorithms and distributed systems.

Requirements

5+ years of professional SRE or related experience

Bachelor's degree in Computer Science or a related field (or equivalent work experience)

Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes

Knowledgeable in Prometheus and Grafana for metrics and alerting

Advanced knowledge of cloud services

Proficiency in programming/scripting languages

Demonstrated ability to lead initiatives from problem definition and scoping through design and planning, using epics and blueprints

Excellent oral and written communication

Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

Responsibilities

Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability

Use your on-call shift to prevent incidents from ever happening

Run our infrastructure with Ansible, Terraform, and Kubernetes

Build monitoring that alerts on symptoms rather than on outages

Document runbooks so your findings turn into repeatable actions and then into automation

Improve operational processes (such as deployments and upgrades)

Design, build and maintain core infrastructure that enables scaling to a massive number of concurrent users

Debug production issues across services and levels of the stack

Identify significant projects that result in substantial improvements in reliability, cost savings and/or revenue

Identify changes to the product architecture from the reliability, performance, and availability perspectives, using a data-driven approach

Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage

Plan the growth of Together AI's infrastructure

About Together AI

Together AI is a research-driven artificial intelligence company.

We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models.

We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama.

We invite you to join a passionate group of researchers and engineers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at


