Site Reliability Engineer - San Francisco, United States - Together AI
Description
As a Site Reliability Engineer (SRE) at Together, you are responsible for keeping all user-facing services and production systems running smoothly.
You are a blend of pragmatic operator and software engineer who applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.
You specialize in systems (operating systems, storage subsystems, networking), while implementing best practices for availability, reliability and scalability, with varied interests in algorithms and distributed systems.
Requirements
5+ years of professional SRE or related experience
Bachelor's degree in Computer Science or a related field (or equivalent work experience)
Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
Knowledgeable in Prometheus and Grafana for metrics and alerting
Advanced knowledge of cloud services
Proficiency in programming/scripting languages
Demonstrated ability to lead initiatives from problem definition and scoping through design and planning, using epics and blueprints
Excellent oral and written communication
Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts
Responsibilities
Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability
Use your on-call shift to prevent incidents from ever happening
Run our infrastructure with Ansible, Terraform, and Kubernetes
Build monitoring that alerts on symptoms rather than on outages
Document runbooks so your findings turn into repeatable actions and then into automation
Improve operational processes (such as deployments and upgrades)
Design, build and maintain core infrastructure that enables scaling to a massive number of concurrent users
Debug production issues across services and levels of the stack
Identify significant projects that result in substantial improvements in reliability, cost savings and/or revenue
Identify changes to the product architecture from the reliability, performance, and availability perspectives, using a data-driven approach
Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage
Plan the growth of Together AI's infrastructure
About Together AI
Together AI is a research-driven artificial intelligence company.
We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models.
We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama.
We invite you to join a passionate group of researchers and engineers on our journey to build the next-generation AI infrastructure.
Compensation
We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.
Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.