Senior Linux HPC Engineer - Menlo Park, United States - SLAC National Accelerator Laboratory

SLAC National Accelerator Laboratory Menlo Park, United States

1 week ago

Description

Senior Linux HPC Engineer

Job ID: 5435

Location: SLAC - Menlo Park, CA

Full-Time

Regular

SLAC Job Postings

Position Overview:

Would you like to configure and troubleshoot modern high-performance Linux clusters? Does contributing to breakthrough discoveries in science and medicine excite you? SLAC National Accelerator Laboratory seeks an energetic, motivated developer-operations engineer who enjoys teamwork, learning cutting-edge technologies, and engaging the user community.

SLAC is one of the world's premier research laboratories, with internationally leading capabilities in photon science, accelerator physics, high energy physics (HEP), and energy sciences. The Controls & Data Systems (CDS) Division in the Technology Innovation Directorate (TID) is involved in many national and international projects, which, among others, include the Rubin Observatory, the Linac Coherent Light Source (LCLS) user facility, CryoEM user facilities, the LHC ATLAS detector at CERN, and accelerator controls for LCLS-II.

This position will play a critical role in deploying, maintaining and monitoring the large-scale scientific computing infrastructure that supports SLAC's data analysis and machine learning capabilities. Thousands of scientists worldwide rely on these systems to perform their research activities. We are seeking a creative, resourceful system administrator who can assist in the configuration, deployment and ongoing maintenance of various platforms and services across hundreds of nodes, using best-in-class automation platforms. Our working environment is highly collaborative. Skills and responsibilities are shared across our department. We value strong communication and documentation. Host platforms include bare-metal, virtual machines and Kubernetes.

We encourage free-thinking, open dialog and provide opportunities to explore and implement new technologies and ideas. There is huge potential for career growth. High-performance computing is recognized as a SLAC core competency.

Given the nature of this position, SLAC is open to on-site and hybrid work options.

Your specific responsibilities will be to:

Lead the management and system administration tasks across hundreds of Linux hosts
Maintain and extend our scientific software catalog - working closely with our scientific partners to build workflows and pipelines
Architect, administer and tune our batch scheduling systems
Lead contributions to developing and standardizing our configuration management platform
Help architect and support core monitoring and alerting (notification) capabilities to track health and performance
Support day-to-day operations and troubleshooting of scientific computing services and infrastructure
Help direct and perform end-user support via our incident platforms and communication channels
Maintain all relevant documentation for administration procedures

To be successful in this position you will bring:

Bachelor's degree in computer science, physics or related field and 8 years of relevant experience in information technology, systems administration, or high-performance computing
Proven ability to work effectively in a team environment with excellent organizational and communication skills
In-depth technical understanding and proven success partnering with scientific teams to understand and implement large- and small-scale computational and data-driven workflows
Demonstrated ability to lead projects and teams to completion and help drive our scientific mission
Extensive experience with Linux system management, monitoring, and open-source software
Expertise and experience in frameworks and scripting for large distributed systems (Ansible, Bash and Python preferred)
Expertise with distributed compute and storage systems, high-performance computing systems, and networking
Proven ability and experience in deploying and troubleshooting full-stack software and hardware systems in large, complex clustered environments
Experience leading and promoting best practices

In addition, preferred experience includes:

Expert knowledge of high-throughput and high-performance frameworks and techniques (MPI, containerization, low-level scientific software libraries)
Expert knowledge of configuration management systems (Ansible preferred)
Knowledge of Kubernetes primitives and architecture
Expert-level experience managing and configuring Linux and related applications
Extensive experience with system and service monitoring (Prometheus, InfluxDB, Grafana, Loki)

SLAC Employee Competencies:

Effective Decisions: Uses job knowledge and solid judgment to make quality decisions in a timely manner.
Self-Development: Pursues a variety of venues and opportunities to continue learning and developing.
Dependability: Can be counted on to deliver results with a sense of personal responsibility for expected outcomes.
Initiative: Pursues work and interactions proactively with optimism, positive energy, and motivation to move things forward.
Adaptability: Flexes as needed when change occurs, maintains an open outlook while adjusting and accommodating changes.
Communication: Ensures effective information flow to various audiences and creates and delivers clear, appropriate written, spoken, and presented messages.
Relationships: Builds relationships to foster trust, collaboration, and a positive climate to achieve common goals.

Physical Requirements and Working Conditions:

You are expected to reside locally and work onsite up to 3 days a week.
Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of the job.
May work extended hours during peak business cycles.

Work Standards:

Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for environment, safety and security; communicates related concerns; uses and promotes safe behaviors based on training and lessons learned. Meets the applicable roles and responsibilities as described in the ESH Manual, Chapter 1—General Policy and Responsibilities: http://www-
Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide,

Classification Title: System Administrator 3

Grade: K

Job code: 4833

Duration: Regular Continuing

_The expected pay range for this position is $119,000 to $150,000 per annum. SLAC National Accelerator Laboratory/Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs._

SLAC National Accelerator Laboratory is an Affirmative Action / Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All staff at SLAC National Accelerator Laboratory must be able to demonstrate the legal right to work in the United States. SLAC is an E-Verify employer.

