    Site Reliability Engineer - San Francisco, United States - Together AI

    Description


    As a Site Reliability Engineer (SRE) at Together, you are responsible for keeping all user-facing services and production systems running smoothly.

    You are a blend of pragmatic operator and software engineer who applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.


    You specialize in systems (operating systems, storage subsystems, networking), while implementing best practices for availability, reliability and scalability, with varied interests in algorithms and distributed systems.

    Requirements

    5+ years of professional SRE or related experience

    Bachelor's degree in Computer Science or a related field (or equivalent work experience)

    Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes

    Knowledgeable in Prometheus and Grafana for metrics and alerting

    Advanced knowledge of cloud services

    Proficiency in programming/scripting languages

    Demonstrated ability to lead initiatives through problem definition, scoping, design, and planning, using epics and blueprints

    Excellent oral and written communication

    Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

    Responsibilities

    Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability

    Use your on-call shift to prevent incidents from ever happening

    Run our infrastructure with Ansible, Terraform, and Kubernetes

    Build monitoring that alerts on symptoms rather than on outages

    Document runbooks so your findings turn into repeatable actions and then into automation

    Improve operational processes (such as deployments and upgrades)

    Design, build and maintain core infrastructure that enables scaling to a massive number of concurrent users

    Debug production issues across services and levels of the stack

    Identify significant projects that result in substantial improvements in reliability, cost savings and/or revenue

    Identify changes to the product architecture from the reliability, performance, and availability perspectives, using a data-driven approach

    Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage

    Plan the growth of Together AI's infrastructure

    About Together AI

    Together AI is a research-driven artificial intelligence company.

    We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models.

    We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama.

    We invite you to join a passionate group of researchers and engineers on our journey to build the next generation of AI infrastructure.

    Compensation

    We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

    Equal Opportunity


    Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

    Please see our privacy policy at



  • OpenAI

    Reliability Engineer

    2 weeks ago


    OpenAI San Francisco, United States

    Join the engineering teams that bring OpenAI's ideas safely to the world · The Applied Engineering team works across research, engineering, product, and design to bring OpenAI's technology to consumers and businesses. We seek to learn from deployment and distribute the benefits ...


  • WEX San Francisco, United States

    (*) This is a remote position; however, the candidate must reside within 30 miles of one of the following locations: Boston, MA; Dallas, TX; San Francisco Bay Area, CA; Portland, ME; and Washington, D.C. · About the Team/Role · The WEX Site Reliability Engineering (SRE) team is ...


  • Cloudflare Inc San Francisco, United States

    Available Locations: · Remote Australia, Singapore · Production Engineering is responsible for the world's most reliable, observable, performant, and safe network ecosystem. Our customers rely on our products and systems to safely modify, troubleshoot, and release products with ...


  • Bridgeway Professionals Inc San Francisco, United States

    This opportunity is with a medium sized specialty chemical manufacturer located outside of San Francisco. The plant is PSM regulated, DCS controlled, with a very high standard of safety and overall housekeeping. Millions have been invested in the plant and more upgrades are plann ...


  • Best Secret San Francisco, United States

    About BestSecretGroup · We are a leading European members-only online destination for premium and luxury off-price fashion. Partnering with over 3,000 international brands, our tech-focused mindset and strong commitment to sustainability drives a truly unique experience for our m ...


  • Vertisystem San Francisco, United States

    Duration: 6 months contract · Pay rate: $90/hr on W2 · Job Summary: · It is an exciting time to be part of the organization's CICD and Cloud Site Reliability Engineering (SRE) team. SREs operate right at the intersection of Software Engineering and Infrastructure Engineering. The ...


  • Jones Lang LaSalle, Inc. West Valley City, United States

    The Junior Reliability Engineer is responsible for performing data validation around assets (HVAC, Electrical, Plumbing, etc.) that are managed by both Mobile and Static Facilities Management Technicians at all managed facilities within our West Caro Reliability Engineer, Liabili ...


  • Wasmer San Francisco, United States

    [Full Time] Site Reliability Engineer at Wasmer (United States) | BEAMSTART Jobs · Site Reliability Engineer · Wasmer United States · Date Posted · 25 Mar, 2023 · Work Location · San Francisco, United States · Salary Offered · Not Specified · Job Type · Full Time · Experience R ...


  • DigitalOcean San Francisco, United States

    Do you ever wonder what happens inside the cloud? · DigitalOcean (NYSE: DOCN) simplifies cloud computing so builders can spend more time creating software that changes the world. With our mission-critical infrastructure and fully managed offerings, DigitalOcean enables startups a ...


  • Affinity Executive Search San Francisco, CA, United States

    Plant Reliability Engineer · Location: San Francisco, CA area · Description: · This opportunity is with a medium sized specialty chemical manufacturer located outside of San Francisco. The plant is PSM regulated, DCS controlled, with a very high standard of safety and overall ...


  • OpenAI San Francisco, United States

    About the team: · Reliable services are what enables OpenAI to train the best AI models in the world and to bring the promise of safe, effective AI to the world. The SRE team in research is responsible for defining, measuring, and improving the reliability of the research platf ...


  • Talkdesk San Francisco, United States

    At Talkdesk, we are courageous innovators focused on helping organizations around the world create better customer experiences. Our AI-powered cloud contact center solutions optimize our customers' most critical customer service processes. We are recognized as a Contact Center as ...


  • PostHog Enterprise San Francisco, United States

    PostHog helps engineers build better products. We are a single platform to analyze, test, observe, and deploy new features. We give engineers product analytics, session recording, feature flags, A/B testing, event pipelines, SQL access, and a data warehouse... and there's plenty ...


  • DAOmatch San Francisco, United States

    Aptos is a people-first blockchain on a mission to help billions of people achieve universal and fair access to decentralized assets in a safe and scalable way. Founded by some of the original creators and maintainers that researched, designed, and built the Diem blockchain to ser ...


  • Cypress Human Capital Management, LLC San Francisco, United States

    Site Reliability Engineer (Grafana) · Responsibilities · Collaborate with Service Owners and Observability Leaders to develop a strategy for monitoring the technology stack using Grafana. · Initiate data ingestion by deploying Telegraf and exporters (if necessary), utilizing di ...


  • Appspace San Francisco, United States

    At Appspace, we're passionate about creating better work experiences for people everywhere, and we're looking for people that feel the same way. Our global office locations and flexible work culture help you work wherever and however you're at your best. Plus, we take the time to ...


  • StarTree San Francisco, United States

    At StarTree we're a group of passionate individuals that desire to improve the lives of many by developing tools and technologies that support availability and speed in the world of real-time analytics. · Our aim is to make it simple for every company to delight their users - ex ...