- Develop and implement reliability strategies, standards, and processes customized for datacenter and high-performance computing applications, addressing unique challenges such as thermal management, power integrity, and workload variability.
- Lead reliability testing and qualification activities tailored for datacenter and HPC environments, including stress testing, thermal cycling, and performance degradation analysis.
- Collaborate closely with cross-functional teams, including hardware design, systems engineering, and datacenter operations, to integrate reliability considerations into product development and deployment processes.
- Conduct thorough reliability analyses specific to datacenter and HPC applications, such as MTBF (Mean Time Between Failures) calculations, system-level fault tolerance assessments, and risk mitigation strategies.
- Define reliability requirements and specifications for new products targeting datacenter and HPC markets, working closely with design teams to ensure compliance with industry standards and customer expectations.
- Lead root cause analysis and corrective actions for reliability issues identified in datacenter and HPC environments, driving continuous improvement initiatives and implementing best practices.
- Stay abreast of emerging technologies and industry trends in datacenter and HPC reliability engineering, leveraging this knowledge to enhance the reliability and performance of our systems.
- Bachelor's degree in Engineering or related field; Master's or PhD degree preferred.
- 15+ years of experience in reliability engineering, with a focus on datacenter and high-performance computing applications at the component, board, and system levels.
- Very strong understanding of the physics of failure, to drive material and process improvements for components.
- Strong understanding of reliability principles, methodologies, and tools relevant to datacenter and HPC environments, such as reliability modeling, fault tolerance techniques, and performance optimization strategies.
- Experience working with industry standards and guidelines specific to datacenter and HPC reliability, such as GR-468 and other relevant datacenter component qualification requirements.
- Proven ability to lead cross-functional teams and drive reliability initiatives in fast-paced environments.
- Excellent problem-solving skills and the ability to perform detailed root cause analysis in complex systems.
- Effective communication skills and the ability to collaborate with internal teams and external stakeholders in the datacenter and HPC ecosystem.
Lead Reliability Engineer - Santa Clara, United States - Celestial AI
Description
About Celestial AI
As the industry strives to meet the demands of AI workloads, bottlenecks in data transfers between processors and memory have hindered progress. The Photonic Fabric based Memory Fabric provides an optically scalable solution to the 'Memory Wall' problem, enabling tens of terabytes of memory capacity at full HBM bandwidths with low tens of nanoseconds of latency and extremely low power. The Photonic Fabric based Compute Fabric enables terabyte-class bandwidth between compute nodes at low latency and power. Photonic Fabric delivers a transformative leap in AI system performance, ten years more advanced than existing technologies.
Job Description:
We are looking for a Lead Reliability Engineer to spearhead reliability efforts specifically tailored for datacenter and high-performance computing (HPC) applications. The ideal candidate will have a strong background in reliability engineering with a focus on these critical environments, ensuring the robustness and uptime of our systems in demanding operational scenarios.
ESSENTIAL DUTIES AND RESPONSIBILITIES:
For California location:
As an early-stage startup experiencing explosive growth, we offer an extremely attractive total compensation package, inclusive of a competitive base salary and a generous grant of our valuable early-stage equity. The target base salary for this role is approximately $175, $200. The base salary offered may be slightly higher or lower than the target base salary, based on the final scope as determined by the depth of experience and skills demonstrated by the candidate in the interviews.
We offer great benefits (health, vision, dental, and life insurance) and a collaborative, continuous-learning work environment, where you will get a chance to work with smart and dedicated people engaged in developing the next-generation architecture for high-performance computing.
Celestial AI Inc. is proud to be an equal opportunity workplace and is an affirmative action employer.