- Architect and lead the development of scalable, secure AI infrastructure on cloud-native platforms to support autonomous driving technologies
- Collaborate closely with ML teams to facilitate seamless integration and optimal performance of AI algorithms
- Identify and address system bottlenecks and instabilities, applying innovative solutions to enhance system reliability and efficiency
- Foster technological advancements through research and implementation of state-of-the-art AI tools and methodologies
- Act as a key technical leader and mentor, promoting a culture of technical excellence and collaborative innovation within the AI infrastructure team
- Bachelor's or Master's in Computer Science, Engineering, or related technical field
- 5-8+ years of experience designing, deploying, and managing GPU clusters for high-performance computing in AI applications, particularly within cloud environments
- Proficient in cloud services (AWS, Azure, ALI Cloud) and building containerized applications using Kubernetes and Docker
- Strong programming skills in Python and Golang, and experience with AI/ML frameworks (TensorFlow, PyTorch)
- Expertise in designing and managing high-availability, high-throughput systems that support machine learning and deep learning workloads
- Demonstrable leadership skills with a track record of mentoring and leading technical teams
- In-depth understanding of data structures, algorithms, and software engineering principles relevant to AI and autonomous systems
- A fun, supportive and engaging environment
- Opportunity to make a significant impact on the transportation revolution by advancing autonomous driving
- Opportunity to work on cutting-edge technologies with the top talent in the field
- Competitive compensation package
- Snacks, lunches and fun activities
Senior Staff AI Infrastructure Site Reliability Engineer - Santa Clara, United States - XPENG Motors
Description
XPeng Motors is one of China's leading smart electric vehicle (EV) companies. We design, develop, and manufacture smart EVs that are seamlessly integrated with advanced Internet, AI and autonomous driving technologies. We are committed to in-house R&D and intelligent manufacturing to create a better mobility experience for our customers. We strive to transform smart electric vehicles with technology and data, shaping the mobility experience of the future.

As a Senior Staff AI Infrastructure SRE, you will be instrumental in leading the design and implementation of robust, cloud-native AI infrastructure solutions that support our autonomous driving initiatives. Your expertise will guide the development of systems capable of handling large-scale, real-time data processing and advanced machine learning models.
Job Responsibilities:
We are an Equal Opportunity Employer. It is our policy to provide equal employment opportunities to all qualified persons without regard to race, age, color, sex, sexual orientation, religion, national origin, disability, veteran status or marital status or any other prescribed category set forth in federal or state regulations.