
    Senior Linux HPC Engineer - Menlo Park, United States - SLAC National Accelerator Laboratory

    SLAC National Accelerator Laboratory Menlo Park, United States

    1 week ago

    Description

    Senior Linux HPC Engineer

    Job ID: 5435

    Location: SLAC - Menlo Park, CA

    Full-Time

    Regular

    SLAC Job Postings

    Position Overview:

    Would you like to configure and troubleshoot modern high-performance Linux clusters? Does contributing to breakthrough discoveries in science and medicine excite you? SLAC National Accelerator Laboratory seeks an energetic, motivated developer-operations engineer who enjoys teamwork, learning cutting-edge technologies, and engaging the user community.

    SLAC is one of the world's premier research laboratories, with internationally leading capabilities in photon science, accelerator physics, high energy physics (HEP), and energy sciences. The Controls & Data Systems (CDS) Division in the Technology Innovation Directorate (TID) is involved in many national and international projects, which, among others, include the Rubin Observatory, the Linac Coherent Light Source (LCLS) user facility, CryoEM user facilities, the LHC ATLAS detector at CERN, and accelerator controls for LCLS-II.

    This position will play a critical role in deploying, maintaining, and monitoring the large-scale scientific computing infrastructure that supports SLAC's data analysis and machine learning capabilities. Thousands of scientists worldwide rely on these systems to perform their research activities. We are seeking a creative, resourceful system administrator who can assist in the configuration, deployment, and ongoing maintenance of various platforms and services across hundreds of nodes using best-in-class automation platforms. Our working environment is highly collaborative. Skills and responsibilities are shared across our department. We value strong communication and documentation. Host platforms include bare-metal, virtual machines, and Kubernetes.

    We encourage free-thinking, open dialog and provide opportunities to explore and implement new technologies and ideas. There is huge potential for career growth. High-performance computing is recognized as a SLAC core competency.

    Given the nature of this position, SLAC is open to on-site and hybrid work options.

    Your specific responsibilities will be to:

    • Lead the management and system administration tasks across hundreds of Linux hosts
    • Maintain and extend our scientific software catalog - working closely with our scientific partners to build workflows and pipelines
    • Architect, administer and tune our batch scheduling systems
    • Lead contributions to developing and standardizing our configuration management platform
    • Help architect and support core monitoring and alerting (notification) capabilities to track health and performance
    • Support day-to-day operations and troubleshooting of scientific computing services and infrastructure
    • Help direct and perform end-user support via our incident platforms and communication channels
    • Maintain all relevant documentation for administration procedures

    To be successful in this position you will bring:

    • Bachelor's degree in computer sciences, physics or related field and 8 years of relevant experience in information technology, systems administration, or high-performance computing.
    • Proven ability to work effectively in a team environment with excellent organizational and communication skills
    • In-depth technical understanding and proven success partnering with scientific teams to understand and implement large- and small-scale computational and data-driven workflows
    • Demonstrated ability to lead projects and teams to completion and help drive our scientific mission
    • Extensive experience with Linux system management, monitoring, and open-source software
    • Expertise and experience in frameworks and scripting for large distributed systems (ansible, bash and python preferred)
    • Expertise with distributed compute and storage systems, high performance computing systems, and networking
    • Proven ability and experience in deploying and troubleshooting full stack software and hardware systems in large complex clustered environments
    • Experience leading and promoting best practices

    In addition, preferred experience includes:

    • Expert knowledge of high throughput and high performance frameworks and techniques (MPI, containerization, low-level scientific software libraries)
    • Expert knowledge of configuration management systems (ansible preferred)
    • Knowledge of kubernetes primitives and architecture
    • Expert experience managing and configuring Linux and related applications
    • Extensive experience with system and service monitoring (prometheus, influxdb, grafana, loki)

    SLAC Employee Competencies:

    • Effective Decisions: Uses job knowledge and solid judgment to make quality decisions in a timely manner.
    • Self-Development: Pursues a variety of venues and opportunities to continue learning and developing.
    • Dependability: Can be counted on to deliver results with a sense of personal responsibility for expected outcomes.
    • Initiative: Pursues work and interactions proactively with optimism, positive energy, and motivation to move things forward.
    • Adaptability: Flexes as needed when change occurs; maintains an open outlook while adjusting to and accommodating changes.
    • Communication: Ensures effective information flow to various audiences and creates and delivers clear, appropriate written, spoken, and presented messages.
    • Relationships: Builds relationships to foster trust, collaboration, and a positive climate to achieve common goals.

    Physical Requirements and Working Conditions:

    • You are expected to reside locally and work onsite up to 3 days a week
    • Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of the job.
    • May work extended hours during peak business cycles.

    Work Standards:

    • Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
    • Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for environment, safety and security; communicates related concerns; uses and promotes safe behaviors based on training and lessons learned. Meets the applicable roles and responsibilities as described in the ESH Manual, Chapter 1—General Policy and Responsibilities: http://www-
    • Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide,

    Classification Title: System Administrator 3

    Grade: K

    Job code: 4833

    Duration: Regular Continuing

    _The expected pay range for this position is $119,000 to $150,000 per annum. SLAC National Accelerator Laboratory/Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs._

    SLAC National Accelerator Laboratory is an Affirmative Action / Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All staff at SLAC National Accelerator Laboratory must be able to demonstrate the legal right to work in the United States. SLAC is an E-Verify employer.

