Hardware Engineer, GPU Infrastructure - Trenton, United States - CoreWeave
Description
CoreWeave is seeking a highly skilled and motivated Infrastructure/Hardware Engineer, focusing on GPU and PCIe troubleshooting, to join our Hardware Engineering team, reporting to the Director of Compute Architecture.
In this role, you will play a crucial part in the design, development, troubleshooting, and optimization of our server hardware infrastructure.
You will collaborate closely with cross-functional teams, external vendors, and stakeholders to ensure the successful delivery of highly performant and reliable hardware solutions.
Responsibilities:
Troubleshoot complex GPU- and PCIe-related failures.
Partner with external vendors on failure analysis.
Track component RMAs.
Develop and maintain hardware/firmware management services.
Automate all aspects of the server hardware lifecycle.
Serve as the senior point of contact for hardware escalation and troubleshooting.
Collaborate with cross-functional teams to define hardware requirements, specifications, and system architecture.
Create and maintain accurate documentation of hardware designs, specifications, test procedures, and results.
Analyze and optimize the performance of hardware systems, identify bottlenecks, and propose improvements for enhanced efficiency.
Establish processes for internal hardware testing, deployment, and performance optimization.
The ideal candidate will have at least 2 years of professional experience with the following:
Prior experience supporting and troubleshooting data-center-class GPUs (preferably A100 or newer).
Proficiency in Ansible/Python and experience programmatically interacting with server BMCs using IPMI or Redfish (preferably Redfish).
Experience using, integrating, and automating data-center-class GPU diagnostics and troubleshooting tools.
In-depth knowledge of server hardware, components, and management technologies, particularly GPUs and PCIe devices.
Proven ability to stay updated with the latest industry technologies and trends.
Previous experience collaborating with hardware vendors.
Strong passion for automation, with a commitment to automating processes comprehensively.
Excellent documentation skills and attention to detail.
Strong analytical and problem-solving abilities.
Hybrid Workplace
Successful candidates will be expected to attend onboarding training at our NJ Headquarters within their first several weeks of employment, followed by quarterly travel of one week each.
If you reside within a 30-mile radius of our New Jersey, New York, or Philadelphia offices, we're excited for you to join us at the office at least three times a week, as we place great value on fostering connections, collaboration, and creativity within our office culture.
Our commitment to operating as a hybrid workplace underscores our dedication to enabling our employees to tailor their work-life balance to their individual preferences.