Technical Program Manager, ML Developer Experience and Infrastructure Reliability - Mountain View, CA

Mountain View, CA, United States

1 week ago

Job summary

We are looking for a Technical Program Manager to lead cross-functional execution and drive the development of machine learning infrastructure. The ideal candidate will have experience in software engineering, large-scale infrastructure environments, and managing complex technical projects involving machine learning. Key responsibilities include defining and investing in a simplified "golden path" for ML development, ensuring smooth day-to-day operations of the reliability triage ecosystem, driving contract-based reliability programs across Onboard domains, and facilitating communication between ML research, infrastructure foundations, and onboard teams.

Qualifications

  • Bachelor's degree in Computer Science or related technical field
  • 5+ years of experience as a Technical Program Manager in a software engineering or large-scale infrastructure environment

Similar jobs

  • Senior Software Engineer, Reliability Infrastructure

    Waymo is an autonomous driving technology company with the mission to be the world's most trusted driver. · ...

    Mountain View, CA

    1 month ago

  • Lead Infrastructure and Reliability Engineer

    We are still early. The playbook is still being written. A single exceptional engineer can reshape how the company operates. · At Luma AI, we operate rapidly scaling 10k+ GPU fleets, pushing utilization, throughput, and reliability hard enough that yesterday's solutions break regul ...

    Palo Alto

    12 hours ago

  • Infrastructure Reliability Engineer, Bare Metal

    We seek a highly skilled and driven Infrastructure Reliability Engineer, Bare Metal to join our team and report to our Senior Director, Customer Experience. · Provide expert-level technical support and in-depth troubleshooting for a wide spectrum of hardware and associated software i ...

    Sunnyvale $122,000 - $163,000 (USD)

    1 month ago

  • Cloud Infrastructure – Site Reliability Engineer

    We are committed to creating a more stable streaming messaging platform for the future. · Familiar with high-availability architecture design, and proficient in at least one of Python, Go, or Java. · ...

    Sunnyvale $104,400 - $171,000 (USD)

    4 weeks ago

  • Infrastructure Reliability Engineer, Bare Metal

    We seek a highly skilled and driven Infrastructure Reliability Engineer, Bare Metal to join our team and report to our Senior Director, Customer Experience. · Provide expert-level technical support and in-depth troubleshooting for a wide spectrum of hardware and associated softwa ...

    Sunnyvale, CA

    1 month ago

  • Cloud Infrastructure – Site Reliability Engineer

    We are committed to creating a more stable streaming messaging platform for future needs. · ...

    Sunnyvale, CA

    3 weeks ago

  • Site Reliability Engineer, HPC Infrastructure

    Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware & silicon design. · ...

    Palo Alto $164,480 - $246,720 (USD)

    1 month ago

  • We are looking for a Technical Program Manager to lead cross-functional execution and drive the "Golden Path" for ML development in our autonomous driving technology company. · Key responsibilities include managing reliability operations, implementing infrastructure stability pro ...

    Mountain View $230,000 - $292,000 (USD) Full time

    1 week ago

  • We are looking for a Technical Program Manager to lead cross-functional execution to define and invest in a simplified "golden path" for ML development for Onboard and Waymo Foundation Model (WaymoFM) development. · The expected base salary range for this full-time position acros ...

    Mountain View, CA, USA

    1 week ago

  • Infrastructure Site Reliability Engineer

    TikTok is looking for an Infrastructure Site Reliability Engineer who will manage complex challenges of scale while using expertise in coding, algorithms, complexity analysis, and large-scale system design. · The team operates with greater speed, alignment, and agility, especially in real-time de ...

    San Jose $118,657 - $187,200 (USD)

    1 month ago

  • Infrastructure Site Reliability Engineer

    Site reliability engineering combines software and systems engineering to build and run large-scale systems. · Engage in the service lifecycle from inception through deployment, operation, and automation. · Design and implement various dashboards and monitoring frameworks for efficient automated intel ...

    San Jose, CA

    1 month ago

  • Site Reliability Engineer, AI/ML Infrastructure

    We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around—our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, · Manage and optimize HPC cluster operations · ...

    Santa Clara

    3 weeks ago

  • Site Reliability Engineer, AI/ML Infrastructure

    We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around—our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, · Manage and optimize HPC cluster operations · Deploy and maintain infrastructure-as-code solutions · Sup ...

    Santa Clara

    1 week ago

  • Site Reliability Engineer, Cloud Infrastructure

    We strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. · Every challenge is an opportunity to learn and innovate as one team. We're resilient and embrace challenges as they come. ...

    San Jose $118,657 - $259,200 (USD)

    1 month ago

  • Site Reliability Engineer, Recommendation Infrastructure

    The USDS TikTok Recommendations Infra SRE team works with engineering and product teams to build and run large-scale, globally distributed, observable, fault-tolerant systems. · Engage in and improve the whole lifecycle of Recommendation systems — from system design consulting th ...

    San Jose $118,657 - $259,200 (USD)

    1 month ago

  • Site Reliability Engineer, Infrastructure and Assurance Services

    The Systems and Networking team is committed to ensuring the seamless operation of TikTok's US physical infrastructure. · ...

    San Jose, CA

    4 weeks ago

  • Senior Site Reliability Engineer, Cloud Infrastructure

    The Systems and Networking team is committed to ensuring the seamless operation of TikTok's US physical infrastructure. We handle the provisioning of physical servers and maintain the TikTok US physical network. We also work closely with our colleagues around the world to build an ...

    San Jose $187,040 - $359,720 (USD)

    1 month ago

  • Site Reliability Engineer, Infrastructure and Assurance Services

    The Systems and Networking team is committed to ensuring the seamless operation of TikTok's US physical infrastructure. We handle the provisioning of physical servers and maintain the TikTok US physical network. Additionally, we engage in collaborative efforts with vendors such ...

    San Jose $118,657 - $259,200 (USD)

    4 weeks ago

  • Sr. Site Reliability Engineer, MLOps, Infrastructure Engineering

    Job summary · As a Site Reliability Engineer, you will be responsible for maintaining and improving our platform to ensure our cross-functional teams have the necessary tools and resources to be productive. · Mature our Machine Learning Operations Platform and advocate best practic ...

    Fremont

    1 month ago

  • We are looking for talented individuals to join our team in 2026. As a graduate, you will get unparalleled opportunities to kickstart your career, pursue bold ideas, and explore limitless growth opportunities. · Participate in and enhance the complete service lifecycle, fr ...

    San Jose, CA

    1 week ago