Research Engineer, Infrastructure, Training Systems - San Francisco - Thinking Machines Lab

1 day ago

$350,000 - $475,000 (USD) per year

Description

Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.
We are scientists, engineers, and builders who've created some of the most widely used AI products, including ChatGPT and , open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About the Role
We're looking for an infrastructure research engineer to design and build the core systems that enable scalable, efficient training of large models for deployment and research. Your goal is to make experimentation and training at Thinking Machines fast and reliable to ensure our research teams can focus on science, not system bottlenecks.
This role is ideal for someone who blends deep systems and performance expertise with a curiosity for machine learning at scale. You'll take ownership of the training stack end to end, ensuring every GPU cycle drives scientific progress.
Note: This is an "evergreen role" that we keep open on an ongoing basis so that candidates can express interest. We receive many applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. Still, we encourage you to apply. We continuously review applications and reach out to applicants as new opportunities open. You are welcome to reapply if you gain more experience, but please avoid applying more than once every 6 months. You may also find that we post individual roles for separate project- or team-specific needs. In those cases, you're welcome to apply directly in addition to applying to an evergreen role.
What You'll Do

Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes for large-scale training workloads.
Develop high-performance optimizations to maximize throughput and efficiency.
Develop reusable frameworks and libraries to improve training reproducibility, reliability, and scalability for new model architectures.
Establish standards for reliability, maintainability, and security, ensuring systems are robust under rapid iteration.
Collaborate with researchers and engineers to build scalable infrastructure.
Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.

Skills and Qualifications
Minimum qualifications:

Bachelor's degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar.
Strong engineering skills, with the ability to contribute performant, maintainable code and debug in complex codebases.
Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
A bias for action and the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships.

Preferred qualifications - we encourage you to apply if you meet some but not all of these:

Past experience working on distributed training for the world's largest models to make them stable, reliable, and performant.
Track record of improving research productivity through infrastructure design or process improvements.
Contributions to open-source ML infrastructure such as PyTorch, XLA, Megatron-LM, or DeepSpeed.

Logistics

Location: This role is based in San Francisco, California.
Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.
Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.
