Member of Technical Staff, Supercomputing Platform - San Francisco - Magic Inc
Description
Magic's Mission
Magic's mission is to build safe AGI that accelerates humanity's progress on the world's most important problems. We believe the most promising path to safe AGI lies in automating research and code generation to improve models and solve alignment more reliably than humans can alone. Our approach combines frontier-scale pre-training, domain-specific RL, ultra-long context, and inference-time compute to achieve this goal.
About The Role
As an engineer on the Supercomputing Platform & Infrastructure team, you will design, build, and operate the large-scale GPU infrastructure that powers Magic's model training and inference workloads.
A core part of this role is building and maintaining our infrastructure using Terraform-driven infrastructure-as-code practices, ensuring reproducibility, reliability, and operational clarity across clusters spanning thousands of GPUs.
Magic's long-context models create sustained pressure on compute, networking, and storage systems. Long-running distributed jobs, high-throughput data movement, and strict availability requirements demand infrastructure that is automated, observable, and resilient by design. You will own the systems and IaC foundations that make this possible.
This role can evolve into broader ownership of supercomputing platform architecture, shaping how Magic scales GPU clusters and infrastructure reliability as model workloads grow.
What You'll Work On
Design and operate large-scale GPU clusters for training and inference
Build and maintain infrastructure using Terraform across cloud and hybrid environments
Develop modular, scalable IaC patterns for compute, networking, and storage provisioning
Improve deployment reproducibility, environment consistency, and operational safety
Optimize networking and storage systems for high-throughput AI workloads
Automate fault detection and recovery across distributed clusters
Debug complex cross-layer issues spanning hardware, drivers, networking, storage, OS, and cloud
Improve observability, monitoring, and reliability of core platform systems
What We're Looking For
Strong systems engineering fundamentals
Deep, hands-on experience with Terraform, including module design, state management, environment isolation, and large-scale deployments
Experience operating production GPU infrastructure or high-performance distributed systems
Strong understanding of networking and storage systems
Experience with major cloud platforms (GCP, AWS, Azure, OCI, etc.)
Track record of owning production-critical infrastructure end-to-end
Compensation, Benefits, and Perks (US)
Annual salary range of $200K to $550K, depending on experience
Equity is a significant part of total compensation, in addition to salary
401(k) plan with 6% salary matching
Generous health, dental, and vision insurance for you and your dependents
Unlimited paid time off
Visa sponsorship and relocation stipend to bring you to SF, if possible
A small, fast-paced, highly focused team
Magic strives to be the place where high-potential individuals can do their best work. We value quick learning and grit just as much as skill and experience.
Our Culture
Integrity. Words and actions should be aligned
Hands-on. At Magic, everyone is building
Teamwork. We move as one team, not N individuals
Focus. Safely deploy AGI. Everything else is noise
Quality. Magic should feel like magic