Technical Program Manager, ML Developer Experience and Infrastructure Reliability - Mountain View, CA
1 week ago

Job summary
We are looking for a Technical Program Manager to lead cross-functional execution and drive the development of machine learning infrastructure. The ideal candidate will have experience in software engineering, large-scale infrastructure environments, and managing complex technical projects involving machine learning. Key responsibilities include defining and investing in a simplified "golden path" for ML development, ensuring smooth day-to-day operations of the reliability triage ecosystem, driving contract-based reliability programs across Onboard domains, and facilitating communication between ML research, infrastructure foundations, and onboard teams.
Qualifications
- Bachelor's degree in Computer Science or related technical field
- 5+ years of experience as a Technical Program Manager in a software engineering or large-scale infrastructure environment
Similar jobs
Waymo is an autonomous driving technology company with the mission to be the world's most trusted driver. · ...
1 month ago
We are still early. The playbook is still being written. A single exceptional engineer can reshape how the company operates. · At Luma AI, we operate rapidly scaling 10k+ GPU fleets, pushing utilization, throughput, and reliability hard enough that yesterday's solutions break regul ...
12 hours ago
We seek a highly skilled and driven Infrastructure Reliability Engineer, Bare Metal to join our team and report to our Senior Director, Customer Experience. · Provide expert-level technical support and in-depth troubleshooting for a wide spectrum of hardware and associated software i ...
1 month ago
We are committed to creating a more stable streaming messaging platform for the future. · Familiar with high-availability architecture design, and proficient in at least one of Python, Go, or Java. · ...
4 weeks ago
We seek a highly skilled and driven Infrastructure Reliability Engineer, Bare Metal to join our team and report to our Senior Director, Customer Experience. · Provide expert-level technical support and in-depth troubleshooting for a wide spectrum of hardware and associated softwa ...
1 month ago
We are committed to creating a more stable streaming messaging platform for future needs. · ...
3 weeks ago
Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware & silicon design. · ...
1 month ago
Technical Program Manager, ML Developer Experience and Infrastructure Reliability
We are looking for a Technical Program Manager to lead cross-functional execution and drive the "Golden Path" for ML development in our autonomous driving technology company. · Key responsibilities include managing reliability operations, implementing infrastructure stability pro ...
1 week ago
Technical Program Manager, ML Developer Experience and Infrastructure Reliability
We are looking for a Technical Program Manager to lead cross-functional execution to define and invest in a simplified "golden path" for ML development for Onboard and Waymo Foundation Model (WaymoFM) development. · The expected base salary range for this full-time position acros ...
1 week ago
TikTok is looking for an Infrastructure Site Reliability Engineer who will manage complex challenges of scale while using expertise in coding, algorithms, complexity analysis, and large-scale system design. · The team operates with greater speed, alignment, and agility, especially in real-time de ...
1 month ago
Site reliability engineering combines software and systems engineering to build and run large-scale systems. · Engage in the service lifecycle from inception through deployment, operation, automate · Design, implement various dashboards, monitoring frameworks for efficient automated intel ...
1 month ago
We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around—our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, · Manage and optimize HPC cluster operations · ...
3 weeks ago
We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around—our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, · Manage and optimize HPC cluster operations · Deploy and maintain infrastructure-as-code solutions · Sup ...
1 week ago
We strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. · Every challenge is an opportunity to learn and innovate as one team. We're resilient and embrace challenges as they come. ...
1 month ago
Site Reliability Engineer, Recommendation Infrastructure
The USDS TikTok Recommendations Infra SRE team works with engineering and product teams to build and run large-scale, globally distributed, observable, fault-tolerant systems. · Engage in and improve the whole lifecycle of Recommendation systems — from system design consulting th ...
1 month ago
Site Reliability Engineer, Infrastructure and Assurance Services
The Systems and Networking team is committed to ensuring the seamless operation of TikTok's US physical infrastructure. · ...
4 weeks ago
The Systems and Networking team is committed to ensuring the seamless operation of TikTok's US physical infrastructure. We handle the provisioning of physical servers and maintain the TikTok US physical network. We also work closely with our colleagues around the world to build an ...
1 month ago
Site Reliability Engineer, Infrastructure and Assurance Services
The Systems and Networking team is committed to ensuring the seamless operation of TikTok's US physical infrastructure. We handle the provisioning of physical servers and maintain the TikTok US physical network. Additionally, · we engage in collaborative efforts with vendors such ...
4 weeks ago
Sr. Site Reliability Engineer, MLOps, Infrastructure Engineering
Job summary · As a Site Reliability Engineer, you will be responsible for maintaining and improving our platform to ensure our cross-functional teams have the necessary tools and resources to be productive. · Mature our Machine Learning Operations Platform and advocate best practic ...
1 month ago
Site Reliability Engineer Graduate (Data Infrastructure) - 2026 Start (BS/MS)
We are looking for talented individuals to join our team in 2026. As a graduate, you will get unparalleled opportunities to kickstart your career, pursue bold ideas, and explore limitless growth opportunities. · Participate in and enhance the complete service lifecycle, fr ...
1 week ago