    Hardware Engineer, GPU Infrastructure - Trenton, United States - CoreWeave

    Description


    CoreWeave is seeking a highly skilled and motivated Infrastructure/Hardware Engineer, focusing on GPU and PCIe troubleshooting, to join our Hardware Engineering team, reporting to the Director of Compute Architecture.

    In this role, you will play a crucial part in the design, development, troubleshooting, and optimization of our server hardware infrastructure.

    You will collaborate closely with cross-functional teams, external vendors, and stakeholders to ensure the successful delivery of highly performant and reliable hardware solutions.


    Responsibilities:
    Troubleshoot complex GPU- and PCIe-related failures.

    Partner with external vendors on failure analysis.

    Track component RMAs.

    Develop and maintain hardware/firmware management services.

    Automate all aspects of the server hardware lifecycle.

    Serve as the senior point of contact for hardware escalation and troubleshooting.

    Collaborate with cross-functional teams to define hardware requirements, specifications, and system architecture.

    Create and maintain accurate documentation of hardware designs, specifications, test procedures, and results.

    Analyze and optimize the performance of hardware systems, identify bottlenecks, and propose improvements for enhanced efficiency.

    Establish processes for internal hardware testing, deployment, and performance optimization.

    The ideal candidate will have at least 2 years of professional experience with the following:

    Prior experience supporting and troubleshooting data-center-class GPUs (preferably A100 or newer).

    Proficiency in Ansible/Python and experience programmatically interacting with server BMCs using IPMI or Redfish (preferably Redfish).

    Experience using, integrating, and automating data-center-class GPU diagnostics and troubleshooting tools.

    In-depth knowledge of server hardware, components, and management technologies, particularly GPUs and PCIe devices.

    Proven ability to stay updated with the latest industry technologies and trends.

    Previous experience collaborating with hardware vendors.

    Strong passion for automation, with a commitment to automating processes comprehensively.

    Excellent documentation skills and attention to detail.

    Strong analytical and problem-solving abilities.

    Hybrid Workplace


    Successful candidates will be expected to attend onboarding training at our NJ Headquarters within their first several weeks of employment, followed by quarterly travel of one week each.


    If you reside within a 30-mile radius of our New Jersey, New York, or Philadelphia offices, we're excited for you to join us at the office at least three times a week, recognizing the significance we place on fostering connections, collaboration, and creativity within our office culture.

    Our commitment to operating as a hybrid workplace underscores our dedication to enabling our employees to tailor their work-life balance to their individual preferences.



  • Oracle Trenton, United States

    Job Description · The Oracle Cloud Infrastructure (OCI) Cluster Networking team is building an ultra-high performance network required to support AI workloads. This is your opportunity to join the AI revolution and build systems which allow customers to scale from tens to thousand ...


  • RiseIT™ Solutions Princeton, United States

    Position: Lead Cloud and Security Engineer · Location: Princeton, NJ (Remote with occasional travel to the site) · Duration: Full time · Cloud Engineering Job Responsibilities: · Design, deploy, and maintain cloud infrastructure on Azure, ensuring optimal performance and cost-eff ...

  • InsideHigherEd

    Cloud Engineer

    2 weeks ago


    InsideHigherEd Princeton, United States

    Overview · The Accelerator is looking for a cloud engineer to collaborate with team members on developing, deploying, and enhancing data-intensive applications and processes. This individual will work as part of a small cross-functional team, participating in product design and iter ...

  • InsideHigherEd

    Data Engineer

    2 weeks ago


    InsideHigherEd Princeton, United States

    Overview · The Accelerator seeks a Data Engineer to work with team members to assist in developing, deploying, and improving data-intensive applications and processes. As part of a small cross-functional team, this individual will participate in product design and iterative developm ...


  • CoreWeave Philadelphia, United States

    CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry's fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute intensive use cases - VFX and rendering, machine learning and AI, batch p ...

  • CoreWeave

    Quality Technician

    1 week ago


    CoreWeave Philadelphia, United States Full time

    CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry's fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute intensive use cases - VFX and rendering, machine learning and AI, batc ...


  • CoreWeave Philadelphia, United States

    Job Description · CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry's fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute intensive use cases - VFX and rendering, ...


  • Princeton University Princeton, United States

    Overview · The Princeton Language and Intelligence Initiative at Princeton University invites applications for a Senior Research Software Engineer (RSE). This multidisciplinary initiative has three Research thrusts: (a) Better design, evaluation, safety and understanding of larg ...


  • Princeton University Princeton, United States

    Overview: · The Princeton Language and Intelligence Initiative at Princeton University invites applications for a Senior Research Software Engineer (RSE). This multidisciplinary initiative has three Research thrusts: (a) Better design, evaluation, safety and understanding of lar ...


  • InsideHigherEd Princeton, United States

    Overview · The Princeton Language and Intelligence Initiative at Princeton University invites applications for a Senior Research Software Engineer (RSE). This multidisciplinary initiative has three Research thrusts: (a) Better design, evaluation, safety and understanding of large AI ...

  • CoreWeave

    Stock Administrator

    1 week ago


    CoreWeave Philadelphia, United States Full time

    CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry's fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute intensive use cases - VFX and rendering, machine learning and AI, batc ...