    Research Cyberinfrastructure Engineer II, HPC and GPU Cluster - Hanover, United States - InsideHigherEd

    Posting date:

    06/10/2024

    Open Until Filled:

    Yes

    Position Number:

    1128918

    Position Title:

    Research Cyberinfrastructure Engineer II, HPC and GPU Cluster (RCIEII)

    Department this Position Reports to:

    Research Cyberinfrastructure

    Hiring Range Minimum:

    $99,400

    Hiring Range Maximum:

    $114,300

    Union Type:

    Not a Union Position

    SEIU Level:

    Not an SEIU Position

    FLSA Status:

    Exempt

    Employment Category:

    Regular Full Time w/end date

    Scheduled Months per Year:

    12

    Scheduled Hours per Week:

    40

    Schedule:

    M-F, 8a-5p

    Location of Position:

    Hanover, NH

    Remote Work Eligibility?:

    Hybrid

    Is this a term position?:

    Yes

    If yes, length of term in months.:

    36

    Is this a grant funded position?:

    No

    Position Purpose:

    The Research Cyberinfrastructure Engineer II (RCIEII) enhances research computing infrastructure, focusing on administration, High-Performance Computing (HPC), cloud, and advanced computing solutions. Responsibilities include building and maintaining a graphics processing unit (GPU) cluster primarily used for artificial intelligence (AI) and machine learning (ML) workloads. This role increases infrastructure security, availability, and scalability, leading automation and system optimization initiatives to advance research capabilities. The RCIEII provides advanced support, develops innovative solutions, and leads projects to enhance research success.

    Description:


    Join Our Team as a Research Cyberinfrastructure Engineer II, HPC and GPU Cluster at Dartmouth

    Are you ready to enhance the future of research computing? Dartmouth is looking for a dynamic Research Cyberinfrastructure Engineer II (RCIEII) to innovate and lead in HPC and GPU cluster administration.

    About the Role:
    As an RCIEII, you will enhance research computing infrastructure, focusing on building and maintaining a GPU cluster for AI and ML workloads. You will ensure infrastructure security, availability, and scalability while leading automation and system optimization initiatives.

    What You'll Do:
    Lead Projects: Manage and optimize HPC environments and cloud-based infrastructures, focusing on high availability and performance.
    Innovate: Implement cutting-edge computing services and applications, integrating GPU technologies into HPC environments.
    Collaborate: Build strategic partnerships with IT departments, technology providers, and research groups to foster collaboration.
    Mentor and Train: Create knowledge-sharing platforms, coordinate hackathons and workshops, and promote continuous development.

    Your Skills and Expertise:

    • Bachelor's degree in Computer Science/IT or equivalent experience.
    • 3+ years in research computing, focusing on HPC system optimization and security.
    • Proficiency in scripting (Python, Bash) and automation tools (Ansible, Terraform).
    • Expertise in Linux, Windows server management, and container technologies (Docker, Kubernetes).
    • Skilled in cloud platforms (AWS, Azure, Google Cloud) and HPC software deployment.

    Why Dartmouth?

    Impactful Work: Contribute to groundbreaking research and innovative projects.
    Collaborative Environment: Work with a diverse and interdisciplinary team of experts.
    Professional Growth: Continuous learning and professional development opportunities.

    Join Us:
    Be a part of a team driving innovation in research computing. Apply now to lead the future of research cyberinfrastructure at Dartmouth.

    Required Qualifications - Education and Yrs Exp:

    Bachelor's degree plus 3-5 years' experience, or an equivalent combination of education and experience

    Required Qualifications - Skills, Knowledge and Abilities:
    • Bachelor's degree or equivalent experience in Computer Science/IT.
    • 3+ years in research computing, focusing on HPC system optimization and security.
    • Proficient in scripting (Python, Bash) and automation tools.
    • Proven project success in enhancing research computing environments.
    • Expertise in Linux and Windows server management.
    • Experienced in Docker and Kubernetes.
    • Familiar with Ansible, Terraform, Puppet for automation.
    • Strong analytical and problem-solving skills.
    • Skilled in cloud platforms (AWS, Azure, Google Cloud).
    • Effective communication and teamwork skills.
    • Leadership experience in mentoring and team development.

    Preferred Qualifications:
    • Advanced degree or certifications in relevant fields.
    • Expertise in AI/ML software and frameworks.
    • Experience with CUDA programming and/or C/C++.
    • Professional certifications (e.g., AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect).
    • Experience in academic/research IT environments.
    • Hands-on data center operations experience.
    • Proficient in HPC software deployment and troubleshooting.
    • Skilled in cloud services for HPC workloads.
    • Experience in developing and maintaining infrastructure documentation.
    • Innovative in developing new services and applications.
    • Comprehensive understanding of security in computing environments.
    • Excellent troubleshooting skills using command-line tools and vendor support.

    Department Contact for Recruitment Inquiries:

    Jonathan Kulp

    Department Contact Phone Number:

    Department Contact for Cover Letter and Title:

    Elijah Gagne

    Department Contact's Phone Number:

    Equal Opportunity Employer:

    Dartmouth College is an equal opportunity/affirmative action employer with a strong commitment to diversity and inclusion. We prohibit discrimination on the basis of race, color, religion, sex, age, national origin, sexual orientation, gender identity or expression, disability, veteran status, marital status, or any other legally protected status. Applications by members of all underrepresented groups are encouraged.

    Background Check:

    Employment in this position is contingent upon consent to and successful completion of a pre-employment background check, which may include a criminal background check, reference checks, verification of work history, conduct review, and verification of any required academic credentials, licenses, and/or certifications, with results acceptable to Dartmouth College. A criminal conviction will not automatically disqualify an applicant from employment. Background check information will be used in a confidential, non-discriminatory manner consistent with state and federal law.

    Is driving a vehicle (e.g. Dartmouth vehicle or off road vehicle, rental car, personal car) an essential function of this job?:

    Not an essential function

    Special Instructions to Applicants:

    This position is a 36-month term position.

    Dartmouth College has a Tobacco-Free Policy. Smoking and the use of tobacco-based products (including smokeless tobacco) are prohibited in all facilities, grounds, vehicles or other areas owned, operated or occupied by Dartmouth College with no exceptions. For details, please see our policy.

    Quick Link:

    Description:

    Cyberinfrastructure Operations

    • Integrates GPU technologies into HPC environments, collaborating with researchers and HPC programmers.
    • Acts as a Subject Matter Expert (SME) in cloud services, HPC, automation, storage, and container technologies (e.g., Docker, Kubernetes), providing advanced support and consultancy.
    • Manages and optimizes HPC environments and cloud-based infrastructures, focusing on high availability, efficient load balancing, and performance across platforms such as AWS and GCP.
    • Designs and implements networking configurations, maintaining security compliance (e.g., FISMA, PCI, GDPR, HIPAA).
    • Develops and refines automation scripts and workflows using tools like Ansible, Terraform, Python, and PowerShell.
    • Coordinates disaster recovery plans, data integrity strategies, oversees hypervisor environments, and ensures computing services' resilience.
    • Provides on-call support, showcasing problem-solving capabilities and promoting knowledge sharing within the team.
    • Implements security measures to protect HPC environments, applications, servers, and storage from cyber threats.
    • Utilizes scalability techniques to ensure HPC systems can accommodate growing research demands.
    • Monitors system availability, implementing redundancy and failover strategies.
    Percentage Of Time:

    40%

    Description:

    Computing and HPC Initiatives

    • Leads initiatives to design and implement computing services and applications addressing specific research challenges.
    • Collaborates with researchers to understand computational needs, translating these into practical, scalable solutions.
    • Oversees the integration of cloud-based solutions for HPC workloads.
    • Designs and manages data storage infrastructures ensuring data integrity, availability, and compliance with policies and regulations.
    Percentage Of Time:

    20%

    Description:

    Collaboration and Relationship Management

    • Builds and nurtures strategic partnerships with IT departments, technology providers, and research groups.
    • Manages joint ventures with academic partners to pilot new technologies in research computing.
    • Engages stakeholders through updates, presentations, and collaborative sessions, ensuring their needs are met.
    Percentage Of Time:

    20%

    Description:

    Training and Development

    • Creates a knowledge-sharing platform for team members to share best practices and solutions.
    • Coordinates hackathons, tech talks, and workshops to stimulate innovation and the adoption of new technologies.
    • Seeks continuous personal development and identifies opportunities for team advancement.
    Percentage Of Time:

    10%

    Description:

    Leadership

    • Serves as the technical lead in critical problem-solving efforts.
    • Cultivates a problem-solving mindset within the team.
    • Reviews team processes and workflows, identifying inefficiencies.
    • Implements process improvements to enhance team productivity and project management.
    Percentage Of Time:

    5%

    Demonstrates a commitment to diversity, inclusion, and cultural awareness through actions, interactions, and communications with others.

    Performs other duties as assigned.