Engineering Manager, Observability Platform - US, CA, Santa Clara

2 days ago

$224,000 - $356,500 (USD) per year
Job description

At NVIDIA, we pride ourselves on data-driven decision-making, and the data science platform team is at the heart of this initiative. NVIDIA runs some of the most demanding AI, data, and platform workloads on the planet, and none of it works without a reliable, high-scale observability foundation. We're hiring an Engineering Manager to lead the team that builds and operates NVIDIA's global observability platform: the system that carries every metric, log, trace, profile, and event our engineers rely on to understand and debug their services. This isn't a traditional people-manager role. You'll stay close to the technology, guide architecture decisions, review designs and code, and help the team solve real distributed-systems challenges. You'll work with engineers to shape how services instrument themselves, how we ingest and store high-cardinality telemetry, and how observability fits cleanly into NVIDIA's broader platform ecosystem.

You'll partner directly with platform, infrastructure, and application teams to evolve how telemetry flows across metrics, logs, traces, profiling, and events. You'll coach and mentor engineers, build strong technical habits, and drive a roadmap that keeps the platform reliable and ready for NVIDIA's rapid growth. If you enjoy deep technical work, high-throughput pipelines, open-source observability stacks, and helping engineers do the best work of their careers, this role is built for you.

What you'll be doing:

  • Leading a team of engineers who design and build the core services, pipelines, and storage layers behind NVIDIA's observability platform.

  • Creating a clear technical direction for the team and supporting work that emphasizes simplicity, performance, and maintainability.

  • Defining the architecture for distributed ingestion services, time-series storage, log and trace pipelines, query paths, and multi-region data flows.

  • Partnering with platform, infrastructure, and application teams to define data models, instrumentation patterns, APIs, and integration standards.

  • Strengthening engineering practices through better tooling, automated tests, schema management, API versioning, documentation, and safe rollout processes.

  • Helping engineers solve distributed-systems issues including ingestion load, indexing pressure, compaction behavior, query fan-out, and replication patterns.

  • Driving predictable execution through clear priorities, collaborative planning, and strong alignment across teams.

  • Representing the observability platform across NVIDIA, gathering feedback, and evolving the system to support future AI workloads.

What we need to see:

  • Bachelor's or Master's degree in Computer Science or a related technical field (or equivalent experience).

  • 8+ years overall building distributed systems, with a focus on observability and monitoring systems, and 3+ years managing or leading engineers.

  • Experience with modern observability stacks such as Prometheus, Thanos, Mimir, Loki, OpenSearch, Jaeger, Tempo, or OpenTelemetry, or equivalent experience.

  • Strong foundations in distributed systems concepts including replication, sharding, durability, consensus, and performance tuning.

  • Hands-on experience designing or scaling ingestion pipelines, time-series engines, trace backends, or log indexing systems, especially in high-cardinality environments.

  • Ability to read and review Go or Python code and support engineers through technical decision-making.

  • Clear architectural thinking with a focus on stable APIs, predictable performance, and long-term evolution.

  • Experience mentoring engineers, improving technical judgment, and contributing to a healthy and inclusive engineering culture.

  • Strong communication skills and the ability to explain complex challenges with clarity.

Ways to stand out from the crowd:

  • Experience building or contributing to an observability or telemetry platform used at significant scale.

  • Contributions to open-source projects such as OpenTelemetry, Prometheus, Loki, Thanos, Tempo, Jaeger, ClickHouse, Mimir, or Elasticsearch.

  • Experience with high-throughput systems like Kafka, Flink, Spark, or large-scale data collectors.

  • Deep knowledge of cardinality management, query performance, storage design, or retention optimization.

  • Experience designing multi-region architectures with a focus on consistency, availability, and data locality.

NVIDIA leads the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for exceptional people like you to help us accelerate the next wave of artificial intelligence.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 224,000 USD - 356,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until January 13, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

