Lead Reliability Engineer - Santa Clara, United States - Celestial Services

Celestial Services Santa Clara, United States

3 weeks ago

Description

Job Description:

We are looking for a Lead Reliability Engineer to spearhead reliability efforts specifically tailored for datacenter and high-performance computing (HPC) applications.

The ideal candidate will have a strong background in reliability engineering with a focus on these critical environments, ensuring the robustness and uptime of our systems in demanding operational scenarios.

ESSENTIAL DUTIES AND RESPONSIBILITIES:

Develop and implement reliability strategies, standards, and processes customized for datacenter and high-performance computing applications, addressing unique challenges such as thermal management, power integrity, and workload variability.

Lead reliability testing and qualification activities tailored for datacenter and HPC environments, including stress testing, thermal cycling, and performance degradation analysis.

Collaborate closely with cross-functional teams, including hardware design, systems engineering, and datacenter operations, to integrate reliability considerations into product development and deployment processes.

Conduct thorough reliability analyses specific to datacenter and HPC applications, such as MTBF (Mean Time Between Failures) calculations, system-level fault tolerance assessments, and risk mitigation strategies.

Define reliability requirements and specifications for new products targeting datacenter and HPC markets, working closely with design teams to ensure compliance with industry standards and customer expectations.

Lead root cause analysis and corrective actions for reliability issues identified in datacenter and HPC environments, driving continuous improvement initiatives and implementing best practices.

Stay abreast of emerging technologies and industry trends in datacenter and HPC reliability engineering, leveraging this knowledge to enhance the reliability and performance of our systems.

QUALIFICATIONS:
Bachelor's degree in Engineering or related field; Master's or PhD degree preferred.

15+ years of experience in reliability engineering, with a focus on datacenter and high-performance computing applications at the component, board, and system levels.

Very strong understanding of the physics of failure, to drive material and process improvements for components.

Strong understanding of reliability principles, methodologies, and tools relevant to datacenter and HPC environments, such as reliability modeling, fault tolerance techniques, and performance optimization strategies.

Experience working with industry standards and guidelines specific to datacenter and HPC reliability, such as GR-468 and other relevant datacenter component qualification requirements.

Proven ability to lead cross-functional teams and drive reliability initiatives in fast-paced environments.

Excellent problem-solving skills and the ability to perform detailed root cause analysis in complex systems.

Effective communication skills and the ability to collaborate with internal teams and external stakeholders in the datacenter and HPC ecosystem.

Location:

Bay Area location is preferred.

For California location:

As an early startup experiencing explosive growth, we offer an extremely attractive total compensation package, inclusive of competitive base salary and a generous grant of our valuable early-stage equity.

The target base salary for this role is approximately $175, $200. The base salary offered may be slightly higher or lower than the target base salary, based on the final scope as determined by the depth of the experience and skills demonstrated by the candidate in the interviews.

Reliability Engineer

3 weeks ago

Comtech Telecom Santa Clara, United States Full time Regular

Comtech Telecommunications Corp. has an opportunity in Santa Clara, CA for a Reliability/Failure Analysis Engineer. In this important role, you will collaborate with a diverse team of technical professionals and interact with outside customers, providing solutions to a variety of ...
Reliability Engineer

2 weeks ago

Natron Energy Santa Clara, United States

Natron is seeking a Reliability Engineer to support the development and test of our high-power battery systems for data center UPS and EV charging applications. The occupant of this position will work with the Product Engineering, Reliability, Technology, and Operations teams to ...
Reliability Engineer

4 weeks ago

COMTECH TELECOMMUNICATIONS Santa Clara, United States

Job Description · Comtech Telecommunications Corp. has an opportunity in Santa Clara, CA for a Reliability/Failure Analysis Engineer. In this important role, you will collaborate with a diverse team of technical professionals and interact with outside customers, pr ...
Lead Reliability Engineer

2 weeks ago

Celestial AI Santa Clara, United States

About Celestial AI · As the industry strives to meet the demands of the AI workloads, bottlenecks in data transfers between processors and memory have hindered progress. The Photonic Fabric based Memory Fabric provides an optically scalable solution to the 'Memory Wall' problem, ...
Lead Reliability Engineer

3 weeks ago

Celestial AI Santa Clara, United States

About Celestial AI · As the industry strives to meet the demands of the AI workloads, bottlenecks in data transfers between processors and memory have hindered progress. The Photonic Fabric based Memory Fabric provides an optically scalable solution to the 'Memory Wall' problem, ...
Site Reliability Engineer

6 days ago

NVIDIA Santa Clara, United States

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology—and outstanding people. Today, we're tapping into the unlimited potential of AI to define the next era ...
Site Reliability Engineer

12 hours ago

HCLTech San Jose, United States

About HCLTech: · HCLTech is a global technology company, home to 221,000+ people across 60 countries, delivering industry-leading capabilities centered around digital, engineering and cloud, powered by a broad portfolio of technology services and products. We work with clients ac ...
Service Reliability Engineer

3 weeks ago

Software Technology, Inc Santa Clara, United States

Job Description · Position: Service Reliability Engineer / Sr. DevOps Engineer · Location: Santa Clara, CA · Duration: 1 Year+ · OK with any visa; no OPT please · Local consultants only · Customer will not provide letter for H1B candidates. Please check with t ...
Service Reliability Engineer

1 day ago

Software Technology Inc Santa Clara, United States

Job Description · Position: Service Reliability Engineer / Sr. DevOps Engineer · Location: Santa Clara, CA · Duration: 1 Year+ · OK with any visa; no OPT please · Local consultants only · Customer will not provide letter for H1B candidates. Please check wi ...
Site Reliability Engineer

2 weeks ago

TEKsystems San Jose, United States Contract

Description: · Adobe is looking for an experienced Site Reliability Engineer to join the internal tooling team to support, configure, integrate, upgrade, and automate the use of enterprise tools used across their large Engineering organization. Role will be focused on user interact ...
Reliability Engineer

1 week ago

Apple Cupertino, United States

Summary · Posted: Apr 13, 2024 · Weekly Hours: 40 · Role Number: · Do you ever wonder what goes into making Apple products an amazing user experience? Apple's innovative reliability team is responsible for ensuring that our products exceed our customers' expectations for rob ...
Reliability Engineer

1 week ago

Apple Cupertino, United States

Reliability Engineer · Cupertino, California, United States · Hardware · Do you ever wonder what goes into making Apple products an amazing user experience? Apple's innovative reliability team is responsible for ensuring that our products exceed our customers' expectations for rob ...
Site Reliability Engineer

14 hours ago

Cryptoware Technologies Inc Santa Clara, United States

Job Description · Responsibility · • Lead the effort of global expansion of Huobi's globe-spanning infrastructure. · • Work with engineering teams to make sure new features and changes are deployed quickly and safely. · • Constantly improve our system performance and ...
Reliability Engineer

1 week ago

Apple Cupertino, United States

Summary · Posted: Apr 13, 2024 · Weekly Hours: 40 · Role Number: · Do you ever wonder what goes into making Apple products an amazing user experience? Apple's innovative reliability team is responsible for ensuring that our products exceed our customers' expectations for r ...
Senior Reliability Engineer

3 weeks ago

ServiceNow Santa Clara, United States

Company Description · At ServiceNow, our technology makes the world work for everyone, and our people make it possible. We move fast because the world can't wait, and we innovate in ways no one else can for our customers and communities. By joining ServiceNow, you are part of an ...
Principal Site Reliability Engineer

3 weeks ago

Kofi Group Santa Clara, United States Direct Hire

To Apply for this Job Click Here · Principal Site Reliability Engineer · San Francisco Bay Area, CA · We are partnering with a late-stage Cloud Security company that is looking for a Principal Level SRE · The ideal candidate will have: · Strong sense of architecture and design f ...
Sr Site Reliability Engineer

1 week ago

Palo Alto Networks Santa Clara, United States

Our Mission · At Palo Alto Networks everything starts and ends with our mission: · Being the cybersecurity partner of choice, protecting our digital way of life. · Our vision is a world where each day is safer and more secure than the one before. We are a company built on the fou ...
Site Reliability Engineer

2 weeks ago

Advantis Global is now INSPYR Solutions Sunnyvale, United States

ABOUT THIS FEATURED OPPORTUNITY · The QoS Infrastructure Tools Team is responsible for building and maintaining tools that are essential for Site Reliability Engineers (SREs) and engineers across the organization. The team primarily develops applications using Golang for backend ...
Site Reliability Engineer

2 weeks ago

Lawrence Harvey Sunnyvale, United States

Site Reliability Engineer · Status: Full Time · Compensation: 120k to 145k · Hybrid Requirements: 3 days in office, 2 days remote · Lawrence Harvey has partnered with a leading Chinese fintech startup that is committed to democratizing payment services and empowering people and ...
Site Reliability Engineer

3 days ago

AMISEQ Sunnyvale, United States

Site Reliability Engineer · Sunnyvale, CA - Hybrid · 6-12 Months W2 Contract · Job Description: · Hands-on development building n-tier applications using RESTful services, Java/J2EE, JavaScript, Python, NoSQL. · • Working knowledge of one or more cloud technologies such as AZ ...
