Senior HPC Operations Engineer - Bellevue - Lambda Corporation


    1 week ago

    Description

    Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU.
    If you'd like to build the world's best AI cloud, join us.
    *Note: This position requires presence in our San Francisco/San Jose or Bellevue office location 4 days per week; Lambda's designated work from home day is currently Tuesday.
    Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.
    What You'll Do

    • Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
    • Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
    • Troubleshoot and resolve HPC cluster issues, working closely with on-site physical deployment teams
    • Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
    • Contribute to the creation and maintenance of Standard Operating Procedures
    • Provide regular and well-communicated updates to project leads throughout each deployment
    • Mentor and assist less experienced team members
    • Stay up-to-date on the latest HPC/AI technologies and best practices
    You
    • Are a deeply experienced HPC engineer comfortable with logical provisioning of a cluster
    • Have a strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
    • Have 10+ years of experience deploying and configuring HPC clusters for AI workloads
    • Have an innate attention to detail
    • Have experience with Bright Cluster Manager or similar cluster management tools
    • Are an expert in configuring and troubleshooting:
      • SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
      • Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, Horovod environments
      • Linux-based compute nodes, firmware updates, driver installation
      • SLURM, Kubernetes, or other job scheduling systems
    • Work well under deadlines and structured project plans, while knowing when and how to ask for changes to project timelines
    • Have excellent problem-solving and troubleshooting skills
    • Have flexibility to travel to our North American data centers as on-site needs arise or as part of training exercises
    • Are able to work independently and as part of a team
    • Are comfortable mentoring and supporting junior HPC engineers on cluster deployments
    Nice to Have
    • Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
    • Experience with containerization technologies (Docker, Kubernetes)
    • Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)
    • Keen situational awareness in customer situations, employing diplomacy and tact
    • Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience
    Salary Range Information
    The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
    About Lambda
    • Founded in 2012, with 500+ employees, and growing fast
    • Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
    • We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
    • Our values are publicly available:
    • We offer generous cash & equity compensation
    • Health, dental, and vision coverage for you and your dependents
    • Wellness and commuter stipends for select roles
    • 401(k) plan with 2% company match (USA employees)
    • Flexible paid time off plan that we all actually use
    A Final Note:
    You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
    Equal Opportunity Employer

