Distributed Training Engineer - San Francisco
1 week ago

Job summary
Distributed Training Engineer to build, optimize, and maintain the critical software stack that powers our large-scale AI training workloads.
- Maintenance of ML libraries and frameworks, including JAX, PyTorch, CUDA, and ROCm, across multiple environments and hardware configurations.
- End-to-End Stack Ownership: Build, maintain, and continuously improve the entire ML software stack, from ROCm/CUDA drivers to high-level JAX/PyTorch tooling.
Similar jobs
Training Engineer, Training
3 days ago
You will work on the technical platform supporting the program, which includes stripe.training and the infrastructure behind it, but the main focus will be on creating environments for learners to engage in hands-on practice. Your projects will span different products, technologi ...
Training Performance Engineer
1 month ago
We are building a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve. Success for us is measured by raising both training throughput (how fast models train) and researcher throughput (how fast ideas become experiments and pro ...
Training Performance Engineer
2 days ago
We are looking for a Performance Engineer to drive efficiency improvements across our distributed training stack. As a key member of the team, you will analyze large-scale training runs, identify utilization gaps, and design optimizations that push the boundaries of throughput an ...
Release Train Engineer
1 month ago
The Release Train Engineer (RTE) is a coach for the Agile Release Train (ART), responsible for facilitating ART processes and execution in a Microsoft Azure and .NET development environment. ...
Release Train Engineer
1 month ago
Job Title: Release Train Engineer. The Release Train Engineer (RTE) is a coach for the Agile Release Train (ART), responsible for facilitating ART processes and execution in a Microsoft Azure and .NET development environment. Facilitate ART Events: Organize and lead Progra ...
Distributed Training Engineer, Sora
1 month ago
The Sora team is working on making video a key capability of OpenAI's foundation models. As a Distributed Systems/ML engineer, you will work on improving the training throughput for our internal training framework and enable researchers to experiment with new ideas. ...
Assistant Engineering Director in Training
1 month ago
The Assistant Engineering Director role at Grand Hyatt hotels provides exposure to all facets of hotel engineering management while developing administrative, financial, and leadership skills. ...
Research Engineer, Pre-training
1 month ago
Job summary: We are seeking a Research Engineer to join our Pre-training team, responsible for developing the next generation of large language models. Responsibilities: Conduct research and implement solutions in areas such as model architecture, data processing, ...
Distributed Training Engineer, Sora
3 days ago
This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees. Collaborate with researchers to enable them to develop systems-efficient video models and architectures. Apply the latest techniq ...
Training: ML Framework Engineer
3 days ago
We're building a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve. ...
Distributed Training Engineer, Sora
1 month ago
The Sora team is working on making video a key capability of OpenAI's foundation models. As a Distributed Systems/ML engineer, you will work on improving the training throughput for our internal training framework and enable researchers to experiment with new ideas. Collaborate with ...
Assistant Engineering Director in Training
1 month ago
We turn trips into journeys, encounters into experiences, and jobs into careers. Join a team that is making travel more human. ...
Researcher Engineer/Scientist, Training
1 month ago
The Training team is responsible for producing large language models that power our research and products. Experience landing contributions to major LLM training runs. ...
Staff Software Engineer, Training
1 month ago
Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels. Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability r ...
We are building the world's first general purpose robotic intelligence that is robust and adapts to unseen scenarios without failing. We believe massive scale through data-driven machine learning is the key to unlocking these capabilities for the widespread deployment of robots ...
ML Engineer, FM Training Integration
4 weeks ago
We are a group of engineers supporting the training of foundation models at Apple. We build infrastructure to support training foundation models with general capabilities such as understanding and generation of text, images, speech, videos, and other modalities, and apply these models to Appl ...
We are looking for an ML Engineer with 3+ YOE in high-performance computing systems to manage and optimize our computational infrastructure for training and deploying our machine learning models. Design, implement, and maintain scalable computing solutions for training and depl ...
We are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency. We invite ...
Research Engineer, Reward Models Training
1 month ago
We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, and policy experts working together to build beneficial AI systems. ...
Staff Infrastructure Engineer, Pre-training
1 month ago
We're seeking a Staff-level Engineer to join our Pre-training team, responsible for developing the next generation of large language models. In this role, you will work at the intersection of cutting-edge research and practical engineering, contributing to the development of safe ...