Distributed Training Engineer - San Francisco

Only for registered members San Francisco, United States

1 week ago

Job summary
Distributed Training Engineer to build, optimize, and maintain the critical software stack that powers our large-scale AI training workloads.

  • Maintenance of ML libraries and frameworks, including JAX, PyTorch, CUDA, and ROCm, across multiple environments and hardware configurations.
  • End-to-End Stack Ownership: Build, maintain, and continuously improve the entire ML software stack, from ROCm/CUDA drivers to high-level JAX/PyTorch tooling.




Similar jobs

  • Only for registered members US-SF

    You will work on the technical platform supporting the program, which includes stripe.training and the infrastructure behind it, but the main focus will be on creating environments for learners to engage in hands-on practice. Your projects will span different products, technologi ...

  • Only for registered members San Francisco $250,000 - $460,000 (USD)

    We are building a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve. · Success for us is measured by raising both training throughput (how fast models train) and researcher throughput (how fast ideas become experiments and pro ...

  • Only for registered members San Francisco

    We are looking for a Performance Engineer to drive efficiency improvements across our distributed training stack. As a key member of the team, you will analyze large-scale training runs, identify utilization gaps, and design optimizations that push the boundaries of throughput an ...

  • Only for registered members San Francisco, CA

    The Release Train Engineer (RTE) is a coach for the Agile Release Train (ART), responsible for facilitating ART processes and execution in a Microsoft Azure and .NET development environment. · ...

  • Only for registered members San Francisco

    Job Title: Release Train Engineer · The Release Train Engineer (RTE) is a coach for the Agile Release Train (ART), responsible for facilitating ART processes and execution in a Microsoft Azure and .NET development environment. · Facilitate ART Events: Organize and lead Progra ...

  • Only for registered members San Francisco $380,000 - $555,000 (USD)

    The Sora team is working on making video a key capability of OpenAI's foundation models. As a Distributed Systems/ML engineer, you will work on improving the training throughput for our internal training framework and enable researchers to experiment with new ideas. ...

  • Only for registered members San Francisco, CA

    The Assistant Engineering Director role at Grand Hyatt hotels provides exposure to all facets of hotel engineering management while developing administrative, financial, and leadership skills. · ...

  • Only for registered members San Francisco $340,000 - $425,000 (USD)

    Job summary · We are seeking a Research Engineer to join our Pre-training team, responsible for developing the next generation of large language models. · Responsibilities · Conduct research and implement solutions in areas such as model architecture, data processing, · ...

  • Only for registered members San Francisco

    This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees. · Collaborate with researchers to enable them to develop systems-efficient video models and architectures · Apply the latest techniq ...

  • Only for registered members San Francisco

    We're building a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve. · ...

  • Only for registered members San Francisco, CA

    The Sora team is working on making video a key capability of OpenAI's foundation models. As a Distributed Systems/ML engineer, you will work on improving the training throughput for our internal training framework and enable researchers to experiment with new ideas. · Collaborate with ...

  • Only for registered members San Francisco

    We turn trips into journeys, encounters into experiences, and jobs into careers. · Join a team that is making travel more human. · ...

  • Only for registered members San Francisco, CA

    The Training team is responsible for producing large language models that power our research and products. · Experience landing contributions to major LLM training runs · ...

  • Only for registered members San Francisco

    Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels. · Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability r ...

  • Only for registered members San Francisco $100,000 - $300,000 (USD)

    We are building the world's first general purpose robotic intelligence that is robust and adapts to unseen scenarios without failing. · We believe massive scale through data-driven machine learning is the key to unlocking these capabilities for the widespread deployment of robots ...

  • Only for registered members San Francisco $147,400 - $220,900 (USD)

    We are a group of engineers supporting the training of foundation models at Apple. We build infrastructure to support training foundation models with general capabilities such as understanding and generation of text, images, speech, videos, and other modalities, and apply these models to Appl ...

  • Only for registered members San Francisco, CA

    We are looking for an ML Engineer with 3+ YOE in high-performance computing systems to manage and optimize our computational infrastructure for training and deploying our machine learning models. · Design, implement, and maintain scalable computing solutions for training and depl ...

  • Only for registered members San Francisco $160,000 - $230,000 (USD)

    We are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency. · We invite ...

  • Only for registered members San Francisco $350,000 - $500,000 (USD)

    We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, and policy experts working together to build beneficial AI systems. · ...

  • Only for registered members San Francisco, CA

    We're seeking a Staff-level Engineer to join our Pre-training team, responsible for developing the next generation of large language models. · In this role, you will work at the intersection of cutting-edge research and practical engineering, contributing to the development of safe ...