    Principal Software Architect - Santa Clara, United States - NVIDIA

    Description

    We are now looking for a Principal Software Architect for AI and HPC.

    At NVIDIA, we are advancing the frontiers of AI capabilities. We seek an expert in high-performance computing and AI to design and develop software resiliency features for training AI models on the world's most powerful and largest supercomputers.

    In this role, you will outline mission requirements for ultra large-scale AI supercomputers, thoroughly investigate and evaluate RAS feature designs, establish software requirements and evaluation metrics, and oversee the complete implementation of RAS features in software. As a leader in HPC and AI software development, you will interact with multiple teams across the organization. Your responsibilities include conducting regular reviews and check-ins with execution teams, ensuring the timely delivery of essential RAS software features such as checkpoint-recovery logic, error detection and attribution, error containment, SDC detection, and other related RAS elements. Leading cross-organizational efforts among various stakeholders and teams, you will coordinate priorities with senior leadership, provide timely updates, and ensure adequate resourcing for the projects.

    What You'll Be Doing:

    • Collaborate with both internal and external customers and partners to define innovative Reliability, Availability, and Serviceability (RAS) requirements and objectives for present and future AI supercomputing products.
    • Oversee and guide the development of RAS features across the entire AI stack, encompassing aspects from job-level scheduling and AI application frameworks (such as PyTorch), down to driver-level and hardware health monitoring on GPUs.
    • Develop and maintain comprehensive software roadmaps, ensuring alignment with diverse engineering teams and synchronizing with engineering and product leadership for strategic coherence.
    • Drive successful implementation and execution of RAS features in software, with demonstrable improvements in end-to-end metrics such as availability during large-scale training runs.

    What We Need to See:

    • A Master's or Ph.D. in Computer Science, Electrical or Computer Engineering from a reputed university, or equivalent professional experience.
    • 15+ years of industry experience in systems architecture or related fields, demonstrating a deep understanding of system complexities.
    • Proven ability to work and communicate effectively in a collaborative environment, bridging multiple engineering disciplines.
    • At least 5 years of hands-on experience in software development, preferably in high-complexity projects involving HPC or AI.

    Ways to Stand Out From the Crowd:

    • Demonstrated experience with large-scale AI supercomputing applications, particularly in training and inference stages.
    • In-depth knowledge of the requirements for large-scale AI workload training and inference.
    • A strong passion for and experience in developing system architectures tailored for AI applications, encompassing CPU, GPU, memory, storage, and networking.
    • Hands-on involvement in the entire lifecycle – from design to deployment – of large-scale High-Performance Computing (HPC) systems.
    • Practical experience in adopting and implementing HPC software development practices in large-scale system environments.

    As NVIDIA makes inroads into the Datacenter business, our team plays a central role in getting the most out of our exponentially growing datacenter deployments as well as establishing a data-driven approach to hardware design and system software development. We collaborate with a broad cross section of teams at NVIDIA, ranging from DL research teams to CUDA Kernel and DL Framework development teams to Silicon Architecture teams. NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative and autonomous, we want to hear from you.

    The base salary range is 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

    You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

    NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


  • d-Matrix Santa Clara, United States

    Location: Santa Clara, CA · Type: Full time · Department: R&D - CTO & Architecture · Compensation: IC6 $180K - $300K · Offers Equity · Offers Bonus · The pay range below is for all roles at this level across all US locations and functions. Individual pay rates depend ...

  • Intel

    Software Architect

    4 weeks ago


    Intel Santa Clara, United States

    **Job Description** · At Intel, we are innovating the future. We create world-changing technology that enables global progress and enriches the lives of every person on earth. Intel is at the heart of the technologies fueling several market disruptions. The Architecture, Strategy ...

  • Adobe

    Software Architect

    2 weeks ago


    Adobe San Jose, United States

    Our Company · Changing the world through digital experiences is what Adobe's all about. We give everyone, from emerging artists to global brands, everything they need to design and deliver exceptional digital experiences. We're passionate about empowering people to create beautiful ...

  • TWO95 International

    Software Architect

    2 weeks ago


    TWO95 International Sunnyvale, United States

    Job Title: Software Architect · Location: San Jose, CA · Type: 6 to 9 months · Rate: $Open (Best Possible) · Job Responsibilities: Work with business and engineering teams to identify and design security software architecture for implementing new solutions, products and mo ...

  • Tata Consultancy Services

    Software Architect

    3 weeks ago


    Tata Consultancy Services San Jose, United States

    Development of the application using Java, Microservices & API integration. · • Sound knowledge of Apache Kafka, ElasticSearch, MongoDB, Cassandra, Oracle PL/SQL, Postman, XML/JSON · • Working knowledge of Agile, Scrum, JIRA. · • Troubleshoot system issues and make changes as needed ...


  • F. Hoffmann-La Roche AG Santa Clara, United States

    Roche fosters diversity, equity and inclusion, representing the communities we serve. When dealing with healthcare on a global scale, diversity is an essential ingredient to success. We believe that inclusion is key to understanding people's varied healthcare needs. Together, we ...


  • SiFive Santa Clara, United States

    About SiFive · As the pioneers who introduced RISC-V to the world, SiFive is transforming the future of compute by bringing the limitless potential of RISC-V to the highest performance and most data-intensive applications in the world. SiFive's unrivaled compute platforms are co ...


  • NVIDIA Santa Clara, United States

    We are now looking for a Principal Software Architect for AI and HPC. · At NVIDIA, we are advancing the frontiers of AI capabilities. We seek an expert in high-performance computing and AI to design and develop software resiliency features for training AI models on the world's mo ...


  • NVIDIA Santa Clara, United States

    NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new unive ...


  • Roche Santa Clara, United States Full time

    Roche fosters diversity, equity and inclusion, representing the communities we serve. When dealing with healthcare on a global scale, diversity is an essential ingredient to success. We believe that inclusion is key to understanding people's varied healthcare needs. Together, we ...


  • NVIDIA Santa Clara, United States

    NVIDIA is looking for a highly motivated Senior Software Engineer to join its dynamic and fast-paced Software Safety Tools and Infrastructure team. Here, you will be working at the forefront of technical innovation with some of the best in the industry by crafting, developing, an ...


  • d-Matrix Santa Clara, United States

    d-Matrix has fundamentally changed the physics of memory-compute integration with our digital in-memory compute (DIMC) engine. The "holy grail" of AI compute has been to break through the memory wall to minimize data movements. We've achieved this with a first-of-its-kind DIMC en ...


  • Cadence Design Systems San Jose, United States

    At Cadence, we hire and develop leaders and innovators who want to make an impact on the world of technology. · The Cadence Compute Systems Group (CSG) develops and licenses IP for system designs. This includes CPUs and high-performance DSPs, DDR and IO controllers, hardware acc ...


  • McAfee San Jose, United States Full time

    Role Overview: · As a highly skilled and experienced Senior Software Architect with experience in software development, McAfee is looking for you to design and implement complex software systems, providing technical leadership to development teams, and ensuring the overall int ...


  • Walmart Global Tech Sunnyvale, United States

    Job Description: · In this position, you will lead a large developer community spanning several orgs and business areas, making strides in Walmart eCommerce across their respective markets and businesses. You will be responsible for the central platform behind supporting ...


  • Walmart Global Tech Sunnyvale, United States

    Job Description: · In this position, you will lead a large developer community spanning several orgs and business areas, making strides in Walmart eCommerce ...


  • NVIDIA Santa Clara, United States

    Senior System Software Architect, Servers · Locations: US, CA, Santa Clara · US, TX, Austin · US, NC, Durham · US, WA, Redmond · US, CO, Boulder · Time type: Full time · Posted 4 Days Ago ...