Staff Machine Learning Infrastructure Engineer - Redwood City - DYNA Robotics Inc

    DYNA Robotics Inc
    DYNA Robotics Inc Redwood City

    1 week ago

    Description
    Company Overview:
    Dyna Robotics makes general-purpose robots powered by a proprietary embodied AI foundation model that generalizes and self-improves across varied environments with commercial-grade performance. Dyna's robots have been deployed at customers across multiple industries. Its frontier model has the top generalization and performance in the industry.
    Dyna Robotics was founded by repeat founders Lindon Gao and York Yang, who sold Caper AI for $350 million, and former DeepMind research scientist Jason Ma. The company has raised over $140M, backed by top investors, including CRV and First Round. We're positioned to redefine the landscape of robotic automation. Join us to shape the next frontier of AI-driven robotics.
    Learn more at
    Position Overview:
    We are seeking an experienced Machine Learning Infrastructure Engineer to join our team and help scale our ML training platform. In this role, you will be responsible for designing, implementing, and maintaining large-scale ML infrastructure to accelerate model iteration and improve training performance across an expanding GPU ecosystem. You will work on cutting-edge high-performance computing systems, optimizing distributed training environments, and ensuring system reliability as we scale.
    Key Responsibilities:
    • Infrastructure Design & Scalability:
      • Architect and implement large-scale ML training pipelines that leverage parallel GPU processing on platforms like GCP or AWS.
      • Enhance our existing infrastructure to fully exploit parallelism and design for future expansion, ensuring that our system is ready to support growth.
    • High-Performance ML Computing & Distributed Systems:
      • Manage and optimize high-performance computing resources.
      • Develop robust distributed computing solutions, addressing challenges like race conditions, memory optimization, and resource allocation.
      • Optimize model training with techniques like mixed precision, ZeRO, LoRA, etc.
    • Job Scheduling & Reliability:
      • Design systems for job rescheduling, automated retries, and failure recovery to maximize uptime and training efficiency.
      • Implement intelligent job queuing mechanisms to optimize training workloads and resource utilization.
    • Storage & Data Handling:
      • Evaluate and implement tradeoffs between different local and networked storage solutions to improve data throughput and access.
      • Develop strategies for caching training data to optimize performance.
    • Collaboration & Continuous Improvement:
      • Work closely with ML researchers and data scientists to understand training requirements and bottlenecks.
      • Continuously monitor system performance, identify areas for improvement, and implement best practices to enhance scalability and reliability.
    Required Qualifications:
    • Bachelor's degree or higher in Computer Science or a related field.
    • At least 7 years of professional experience in the software industry, with a minimum of 2 years in a tech lead role.
    • Proven experience with high-performance computing environments and distributed systems.
    • Demonstrated ability to scale ML training systems and optimize resource utilization.
    • Hands-on experience with job scheduling systems and managing cloud GPU environments (GCP, AWS, etc.).
    • Deep understanding of distributed computing concepts, including race conditions, memory optimization, and parallel processing.
    • Hands-on experience in ML model tuning for performance.
    • Experience with common ML training and inference tools including PyTorch, TensorRT, Triton, Accelerate, etc.
    • Strong analytical and problem-solving skills with the ability to troubleshoot complex system issues.
    • Excellent communication skills to collaborate effectively with cross-functional teams.
    Preferred Qualifications:
    • Experience with container orchestration tools (e.g., Kubernetes) and infrastructure-as-code frameworks.

  • Only for registered members Redwood City

    We're looking for an experienced Infrastructure Engineer to join as a member of our core Datology AI team. As one of our early senior hires, you will partner closely with our founders on the direction of our product and drive business-critical technical decisions. · Design and bu ...

  • Only for registered members Redwood City Full time

    We are seeking an experienced Infrastructure Engineer to join our team at GridCARE. As an Infrastructure Engineer you will be responsible for designing and implementing robust cloud infrastructure with security measures for our B2B SaaS platform. · ...

  • Only for registered members Redwood City, CA

    The ideal candidate will be responsible for designing, implementing, and maintaining robust cloud infrastructure with security measures for our B2B SaaS platform, which leverages machine learning models and cloud-native technologies. · ...

  • Only for registered members Redwood City

    We are looking for a highly motivated Backend Engineer to join our Infrastructure & DevOps team. You will play a critical role in designing, building, and maintaining the backbone of our scalable cloud environments, developer productivity applications, and CI/CD systems. · Your wo ...

  • Only for registered members Redwood City, CA

    We are looking for a highly motivated Backend Engineer to join our Infrastructure & DevOps team. · C3 AI provides excellent benefits and a competitive compensation package. · California Base Pay Range $120,000 - $164,000 USD ...

  • Only for registered members Redwood City Full time $120,000 - $164,000 (USD)

    We are looking for a highly motivated Backend Engineer to join our Infrastructure & DevOps team. · Design and develop internal software and applications using Java, JavaScript, Python, and Shell. · Support engineering workflows and release automation. · ...

  • Only for registered members Palo Alto $180,000 - $440,000 (USD)

    We are seeking a highly skilled Senior Infrastructure Engineer to join our US Government Team. · Develop and optimize software to provision and manage xAI's infrastructure across on-premise, virtual machine, and classified cloud environments. · Enhance the reliability, performance ...

  • Only for registered members Palo Alto

    We are seeking an Infrastructure Engineer to join our team. In this role, you will be responsible for designing and optimizing our Hybrid Cloud Infrastructure Platform across public, private, and on-premise datacenters. · ...

  • Only for registered members Redwood City, CA

    This role works on the core platform infrastructure that powers all of our robots. · Designing and implementing APIs · Designing and implementing infrastructure in a multi-tenant distributed system · ...

  • Only for registered members Redwood City OTHER $155,900 - $261,100 (USD)

    The security team at Poshmark is responsible for securing our application platform, cloud infrastructure and IT systems to protect Poshmark and its 150 million Poshers. · This role involves designing, implementing and maintaining secure AWS cloud and corporate IT infrastructure, ensurin ...

  • Only for registered members Redwood City $155,000 - $198,000 (USD)

    We are looking for a highly motivated Senior Backend Engineer to join our Infrastructure & DevOps team. You will play a critical role in designing, building, and maintaining the backbone of our scalable cloud environments, CI/CD systems, and developer tooling. · ...

  • Only for registered members Redwood City

    This role is for a Staff Cloud/Infrastructure Security Engineer responsible for designing, implementing and maintaining secure AWS cloud and corporate IT infrastructure, ensuring alignment with industry best practices and CIS benchmarks. · Develop bot & fraud attack detection ...

  • Only for registered members Redwood City, CA

    We're looking for seasoned ML Infrastructure engineers with experience designing, building and maintaining training and serving infrastructure for ML research. · Provide infrastructure support to our ML research product · Build tooling to diagnose cluster issues, hardware failures · Moni ...

  • Only for registered members Redwood City, California, USA

    Poshmark is a leading fashion resale marketplace powered by a vibrant, highly engaged community of buyers and sellers and real-time social experiences. · Develop bot and fraud attack detection and mitigation strategies. · Harden corporate IT and SaaS applications through security ...

  • Only for registered members Redwood City $170,000 - $200,000 (USD)

    This role is working on our core platform infrastructure that powers all of our robots. Our robotics stack covers a wide range of functionality. · Designing and implementing APIs · Designing and implementing infrastructure in a multi-tenant distributed system · Evaluating and debuggi ...

  • Only for registered members Palo Alto $180,000 - $370,000 (USD)

    xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. · ...

  • Only for registered members Redwood City, CA

    We are looking for a highly motivated Senior Backend Engineer to join our Infrastructure & DevOps team. · ...

  • Only for registered members Redwood City Full time $180,000 - $250,000 (USD)

    We're looking for an experienced Cloud Infrastructure Engineer to join our core team at DatologyAI. In this role, you will lead the design, build, & operation of highly available, secure and scalable cloud infrastructure that powers our training, inference, & data curation pip ...

  • Only for registered members Redwood City, CA

    Poshmark is a leading fashion resale marketplace powered by a vibrant, highly engaged community of buyers and sellers and real-time social experiences. · ...

  • Only for registered members Redwood City

    We're looking for an experienced Cloud Infrastructure Engineer to join our core team at DatologyAI. In this role, you will lead the design, build and operation of highly available, secure and scalable cloud infrastructure that powers our training, inference and data curation pipeli ...

  • Only for registered members US California (Redwood City)

    The security team at Poshmark is responsible for securing our application platform, cloud infrastructure and IT systems to protect Poshmark and its 150 million Poshers. · ...
