Research Engineer, Infrastructure, Training Systems - San Francisco - Thinking Machines Lab

    Thinking Machines Lab San Francisco

    1 day ago

    $350,000 - $475,000 (USD) per year
    Description

    Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.
    We are scientists, engineers, and builders who've created some of the most widely used AI products, including ChatGPT, open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
    About the Role
    We're looking for an infrastructure research engineer to design and build the core systems that enable scalable, efficient training of large models for deployment and research. Your goal is to make experimentation and training at Thinking Machines fast and reliable to ensure our research teams can focus on science, not system bottlenecks.
    This role is ideal for someone who blends deep systems and performance expertise with a curiosity for machine learning at scale. You'll take ownership of the training stack end to end, ensuring every GPU cycle drives scientific progress.
    Note: This is an "evergreen role" that we keep open on an ongoing basis so that candidates can express interest. We receive many applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. Still, we encourage you to apply. We continuously review applications and reach out to applicants as new opportunities open. You are welcome to reapply as you gain more experience, but please avoid applying more than once every 6 months. We may also post individual roles for separate project- or team-specific needs. In those cases, you're welcome to apply directly in addition to applying to an evergreen role.
    What You'll Do

    • Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes for large-scale training workloads.
    • Develop high-performance optimizations to maximize throughput and efficiency.
    • Develop reusable frameworks and libraries to improve training reproducibility, reliability, and scalability for new model architectures.
    • Establish standards for reliability, maintainability, and security, ensuring systems are robust under rapid iteration.
    • Collaborate with researchers and engineers to build scalable infrastructure.
    • Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.
    Skills and Qualifications
    Minimum qualifications:
    • Bachelor's degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar.
    • Strong engineering skills, including the ability to contribute performant, maintainable code and to debug in complex codebases.
    • Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
    • Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
    • A bias for action and the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships.
    Preferred qualifications - we encourage you to apply if you meet some but not all of these:
    • Past experience working on distributed training for the world's largest models to make them stable, reliable, and performant.
    • Track record of improving research productivity through infrastructure design or process improvements.
    • Contributions to open-source ML infrastructure such as PyTorch, XLA, Megatron-LM, or DeepSpeed.
    Logistics
    • Location: This role is based in San Francisco, California.
    • Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.
    • Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
    • Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.

