- Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes for large-scale training workloads.
- Develop high-performance optimizations to maximize throughput and efficiency.
- Develop reusable frameworks and libraries to improve training reproducibility, reliability, and scalability for new model architectures.
- Establish standards for reliability, maintainability, and security, ensuring systems are robust under rapid iteration.
- Collaborate with researchers and engineers to build scalable infrastructure.
- Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.
- Bachelor's degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar.
- Strong engineering skills, including the ability to contribute performant, maintainable code and to debug in complex codebases.
- Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
- Thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
- A bias for action and the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships.
- Past experience working on distributed training for the world's largest models to make them stable, reliable, and performant.
- Track record of improving research productivity through infrastructure design or process improvements.
- Contributions to open-source ML infrastructure such as PyTorch, XLA, Megatron-LM, or DeepSpeed.
- Location: This role is based in San Francisco, California.
- Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.
- Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
- Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.
Research Engineer, Infrastructure, Training Systems - San Francisco - Thinking Machines Lab
Description
Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.
We are scientists, engineers, and builders who've created some of the most widely used AI products, including ChatGPT and , open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About the Role
We're looking for an infrastructure research engineer to design and build the core systems that enable scalable, efficient training of large models for deployment and research. Your goal is to make experimentation and training at Thinking Machines fast and reliable to ensure our research teams can focus on science, not system bottlenecks.
This role is ideal for someone who blends deep systems and performance expertise with a curiosity for machine learning at scale. You'll take ownership of the training stack end to end, ensuring every GPU cycle drives scientific progress.
Note: This is an "evergreen role" that we keep open on an ongoing basis so that candidates can express interest. We receive many applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. Still, we encourage you to apply. We continuously review applications and reach out to applicants as new opportunities open. You are welcome to reapply if you gain more experience, but please avoid applying more than once every 6 months. You may also find that we put up postings for individual roles for separate project- or team-specific needs. In those cases, you're welcome to apply directly in addition to an evergreen role.
What You'll Do
Minimum qualifications: