- Serve as an active member of a multidisciplinary team developing solutions for large-scale training systems.
- Responsible for the overall performance of the communication system, including performance benchmarking, monitoring and troubleshooting production issues.
- Identify potential performance issues across the stack: comms lib, RDMA transport, host networking, scheduling and network fabric. Develop and deploy innovative solutions to address them.
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
- BS/MS/PhD in a relevant field (EE, CS), with 4+ years of work experience.
- Experience with using communication libraries, such as MPI, NCCL, and UCX.
- Experience with developing, evaluating and debugging host networking protocols such as RDMA.
- Experience with triaging performance issues in complex scale-out distributed applications.
- Understanding of AI training workloads and the demands they place on networks.
- Understanding of RDMA congestion control mechanisms on IB and RoCE networks.
- Understanding of the latest artificial intelligence (AI) technologies.
- Experience with machine learning frameworks such as PyTorch and TensorFlow.
- Experience in developing systems software in languages like C++.
AI/HPC Systems Performance Engineer - Menlo Park, United States - Meta Inc
Description
Meta's AI Training and Inference Infrastructure is growing exponentially to support an ever-increasing number of AI use cases. This creates a dramatic scaling challenge that our engineers deal with daily. We need to build and evolve the network infrastructure that connects myriad training accelerators, such as GPUs. In addition, we need to ensure that the network runs smoothly and meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric, host networking, comms lib and scheduling infrastructure.
AI/HPC Systems Performance Engineer Responsibilities