    IT InfiniBand/GPU - San Jose, United States - Cadence Design Systems

    Description


    At Cadence, we hire and develop leaders and innovators who want to make an impact on the world of technology.

    Cadence is looking for a Sr Staff Systems Engineer at our San Jose location to accelerate strategic customer deployments, ensure on-time bring-up, deployment, and troubleshooting of HPC infrastructure, and support technical roles working with HPC, InfiniBand, and GPUs.


    The successful candidate will be a hands-on technical contributor within the infrastructure team and will be exposed to customer interfaces involving the Windows and Linux operating systems.

    The Systems Engineer will need experience in Linux environments and proficiency in tasks such as shell scripting.

    Role: IT - Sr Staff Systems Engineer

    Location on-site (not remote): San Jose, CA

    Must Haves


    • 15 years of experience in system administration and engineering
    • Minimum of five years of overall experience in technical roles supporting GPU infrastructure setup using InfiniBand
    • Experience with interconnections between InfiniBand and GPUs
    • Experience with GPU-enabled MPIs
    • Experience with NVIDIA CUDA or AMD ROCm
    • Experience with NVIDIA H100 and AMD MI210 GPU servers in a cluster
    • Experience with customer deployments and ensuring on-time bring-up of GPU servers
    • InfiniBand fabric bring-up, configuration, and subnet management on the IB switch


    • Participate in engagements with various SW and FW (BMC/SBIOS/OS/drivers, etc.) teams to develop best-in-class practices and tools; analyze, debug, and resolve critical firmware and software issues affecting workload performance at scale
    • Provide engineering solutions that enable large-scale performance strategies for datacenter GPU computing products and software stacks, maintain technical relationships with internal and external engineering teams, and assist systems engineers in building creative solutions
    • Strong knowledge of Linux operating systems, networking, and security concepts
    • Document and drive acceptance and qualification test plans, procedures, and reports
    Requirements


    • Accelerate strategic customer deployments and ensure on-time bring-up and deployment of HPC infrastructure
    • Develop and implement server and rack-level telemetry; collaborate on and establish continuous improvements in our design flows
    • Recent experience in critical data center technologies such as server architectures, software containers, job schedulers, and parallel computing, including deployment and operation of large-scale systems, resilient system design, and clustering of computing resources
    • Manage HPC clusters, actively communicate with management about any equipment problems, and propose resolutions
    • Establish and maintain IT infrastructure and procedures for customer-facing and internal systems
    • Actively establish technical relationships with our customers' engineers, management, and architects at focus accounts
    • Create and develop test plans for new features on each product. Recommend improvements to enable automated scripting for testing and archiving of results. Develop HPC computing strategies for cloud-based computing, GPU-accelerated computing, etc.
    • Provide remote cluster support to large environments, including scalability/flexibility and troubleshooting end-user issues involving job submission, runtime, and resource access.
    • InfiniBand fabric configuration and administration on Red Hat/CentOS Linux; experience configuring PKeys and troubleshooting the end-to-end InfiniBand environment
    • InfiniBand fabric bring-up, configuration, subnet management, and monitoring on the IB switch and client side for multi-tenancy setups; understanding of IPoIB communication modes
    • Compare InfiniBand network performance with that of other cluster interconnects and debug InfiniBand performance issues
    • Automate configuration management, software updates, and system availability maintenance and monitoring using modern DevOps tools (Ansible, GitLab, etc.)
    • Be a technical specialist on GPU computing and networking products, directly supporting GPU customers
    • Direct experience and strong knowledge of parallel programming, GPU CUDA/ROCm development, and applications.
    • Actively partner with the R&D teams that deliver services on our infrastructure to gather the requirements their services need to run within it
    • Automate repetitive tasks and implement custom solutions using scripting/programming languages such as Bash or Python
    • Configure and troubleshoot a heterogeneous (QDR, FDR, EDR) InfiniBand network and associated subnet manager
    • Experience with high-performance computer interconnects (e.g., 10 and 40 Gigabit Ethernet, InfiniBand)
    • Able to move 50 pounds

    The annual salary range for California is $133,000 to $247,000.

    You may also be eligible to receive incentive compensation: bonus, equity, and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure.

    Please note that the salary range is a guideline and compensation may vary based on factors such as qualifications, skill level, competencies and work location.


    Our benefits programs include paid vacation and paid holidays, a 401(k) plan with employer match, an employee stock purchase plan, a variety of medical, dental, and vision plan options, and more.

    We're doing work that matters. Help us solve what others can't.


    Cadence plays a critical role in creating the technologies that modern life depends on.

    We are a global electronic design automation company, providing software, hardware, and intellectual property to design advanced semiconductor chips that enable our customers to create revolutionary products and experiences.


    Thanks to the outstanding caliber of the Cadence team and the empowering culture that we have cultivated for over 25 years, Cadence continues to be recognized by Fortune Magazine as one of the 100 Best Companies to Work For.

    Our shared passion for solving the world's toughest technical challenges, our dedication to pushing the limits of the industry, and our drive to do meaningful work differentiate the people of Cadence.

    Cadence is committed to creating a diverse environment and is proud to be an equal opportunity employer.

    All qualified applicants will receive consideration for employment without regard to race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, basis of disability, or any other protected class.



