    AI/HPC Systems Performance Engineer - Menlo Park, United States - Meta Inc

    Meta Inc Menlo Park, United States

    3 weeks ago

    Description

    Meta's AI Training and Inference Infrastructure is growing exponentially to support ever-increasing use cases of AI. This results in a dramatic scaling challenge that our engineers deal with on a daily basis. We need to build and evolve the network infrastructure that connects myriad training accelerators, such as GPUs. In addition, we need to ensure that the network runs smoothly and meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric, host networking, comms lib, and scheduling infrastructure.

    AI/HPC Systems Performance Engineer Responsibilities

    • Active member of a multi-disciplinary team to develop solutions for large scale training systems.
    • Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues.
    • Identify potential performance issues across the stack: comms lib, RDMA transport, host networking, scheduling and network fabric. Develop and deploy innovative solutions to address the performance issues.
    Minimum Qualifications
    • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
    • BS/MS/PhD in relevant fields (EE, CS), with 4+ years of work experience.
    • Experience with using communication libraries, such as MPI, NCCL, and UCX.
    • Experience with developing, evaluating and debugging host networking protocols such as RDMA.
    • Experience with triaging performance issues in complex scale-out distributed applications.
    Preferred Qualifications
    • Understanding of AI training workloads and the demands they place on networks.
    • Understanding of RDMA congestion control mechanisms on IB and RoCE Networks.
    • Understanding of the latest artificial intelligence (AI) technologies.
    • Experience with machine learning frameworks such as PyTorch and TensorFlow.
    • Experience in developing systems software in languages like C++.

  • Broadcom Corporation

    Performance Engineer

    3 weeks ago


    Broadcom Corporation Palo Alto, United States

    Please Note: · 1. If you are a first-time user, please create your candidate login account before you apply for a job. (Click Sign In > Create Account) · 2. If you already have a Candidate Account, please Sign-In before you apply. · Job Description: · Why will you enjoy this new ...


  • Broadcom Corporation Palo Alto, United States

    Please Note: · 1. If you are a first-time user, please create your candidate login account before you apply for a job. (Click Sign In > Create Account) · 2. If you already have a Candidate Account, please Sign-In before you apply. · Job Description: · Why will you enjoy this new ...


  • Amazon Palo Alto, United States

    Sr. Performance Engineer, Redshift Performance Engineering · Job ID: | Amazon Development Center U.S., Inc. · The Amazon Redshift Performance Engineering team is looking for an experienced performance engineer who is passionate about database and distributed systems performance. ...


  • Amazon Palo Alto, CA, United States

    Sr. Performance Engineer, Redshift Performance Engineering · Job ID: | Amazon Development Center U.S., Inc. · The Amazon Redshift Performance Engineering team is looking for an experienced performance engineer who is passionate about database and distributed systems performance ...


  • Amazon Palo Alto, United States

    The Amazon Redshift Performance Engineering team is looking for an experienced performance engineer who is passionate about database and distributed systems performance. Join our team and help us make the fastest data warehouse even faster · As part of the Redshift Performance E ...


  • META Menlo Park, United States

    Summary: · Meta Platforms, Inc. (Meta), formerly known as Facebook Inc., builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps and services like Messenger, Instagram, and Whats ...


  • NovaTech Solutions West Menlo Park, United States

    Senior Verification Engineer · We are currently seeking a passionate Senior Verification Engineer to join our team in West Menlo Park. Our goal is to co-create a new category of super servers and make a significant impact in the AI and datacenter industry. · Our innovative soluti ...


  • SB Energy Redwood City, United States

    Title: Performance Engineer, Operations · Basic Function · The Performance Engineer, Operations will be the key owner of operational data, analytics and KPIs in the Operations & Maintenance (O&M) team. · The successful candidate will be responsible for implementing, documenting, ...


  • SB Energy Redwood City, United States

    Title: Performance Engineer, Operations · Scroll down for a complete overview of what this job will require Are you the right candidate for this opportunity · Basic Function · The Performance Engineer, Operations will be the key owner of operational data, analytics and KPIs in ...


  • META Menlo Park, United States

    Meta is seeking a Performance and Capacity Engineer to join the Capacity Team to focus on site-wide capacity planning and fulfillment, system building and tooling development. This person would be required to work cross-functionally with a number of teams to ensure optimal operat ...


  • SB Energy Redwood City, United States

    Title: Performance Engineer, Operations · Basic Function · The Performance Engineer, Operations will be the key owner of operational data, analytics and KPIs in the Operations & Maintenance (O&M) team. · The successful candidate will be responsible for implementing, documentin ...


  • Meta Inc Menlo Park, United States

    Meta is seeking a Performance & Capacity Engineer to join the Capacity Team to focus on site-wide performance and capacity optimization at the intersection of all Meta products and services, and all physical infrastructure (Servers, Data Centers, Network). This role will focus on ...


  • Amazon Inc Palo Alto, United States

    Athena and EMR allow AWS customers to run large scale analytics, leveraging open source engines like Trino and Spark. We run millions of customer clusters, enabling processing on vast datasets. In the last 3 years we have improved our engines by a fa Development Engineer, Perform ...

  • Diverse Lynx

    Performance Engineer

    2 weeks ago


    Diverse Lynx Sunnyvale, United States

    Worked on JMeter or equivalent tool (this requires Java knowledge) and not LoadRunner · Strong Java coding experience · Performance Engineering knowledge · Worked with data in past for any big data technologies · Java Coding Experience - · Candidate should be able to read la ...


  • Diverse Lynx Sunnyvale, United States

    Worked on JMeter or equivalent tool (this requires Java knowledge) · and not LoadRunner · Strong Java coding experience · Performance Engineering knowledge · Worked with data in past for any big data technologies · Java Coding Experience · - · Candidate should be able to read ...


  • Nuro Mountain View, CA, United States

    Who We Are Nuro exists to better everyday life through robotics. Founded in 2016, Nuro is a leading autonomous technology company with vehicles on road today in California and Texas. The company's core technology is the Nuro Driver, an integrated autonomous driving system consist ...


  • TikTok Mountain View, CA, United States

    TikTok is the leading destination for short-form mobile video. TikTok has global offices including Los Angeles, New York, London, Paris, Berlin, Dubai, Mumbai, Singapore, Jakarta, Seoul and Tokyo. Our Trust and Safety engineering team is fast growing and responsible for building ...

  • Info Way Solutions

    Performance Engineer

    3 weeks ago


    Info Way Solutions Fremont, United States

    Hi, · This is Tharani from Info Way Solutions. We have an opening for a Performance Engineer at our San Diego, CA location, and the detailed job description is given below · Job title: Performance Engineer · Location: San Diego, CA (Hybrid Mode) · Direct Client · Key Skills: K6 ...