AI/HPC Infrastructure Engineer - Emory, United States - DSO National Laboratories




    AI/HPC Infrastructure Engineer
    Responsibilities

    Infrastructure Design:
    Collaborate with cross-functional teams, including AI R&D engineers and software engineers, to design and continually enhance scalable and efficient on-premise AI infrastructure solutions to train and serve large AI models.

    Create, evolve and maintain the infrastructure roadmap aligned with the organization's AI strategy.

    Scalability and Performance:
    Identify and address performance bottlenecks, latency issues, and scalability challenges in AI infrastructure. Leverage your expertise to optimize resource allocation and improve data processing pipelines.

    Monitoring and Maintenance:
    Establish robust monitoring systems to track the health, performance, and utilization of AI infrastructure components. Proactively identify and resolve issues, ensuring high availability and reliability of AI systems.

    Security and Compliance:
    Implement security measures and best practices to protect AI infrastructure and data. Ensure compliance with relevant regulations, privacy standards, and industry best practices.

    Collaboration and Documentation:
    Work closely with cross-functional teams to understand their requirements and provide technical guidance. Document infrastructure configurations, processes, and troubleshooting procedures to enable efficient knowledge sharing and onboarding.

    Requirements

    Degree in Computer Engineering / Computer Science / Artificial Intelligence


    Familiarity with cluster management tools like Bright, data processing frameworks (e.g., Apache Spark, Apache Beam), machine learning frameworks (e.g., TensorFlow, PyTorch), networking for HPC applications, containerization technologies (e.g., Docker, Kubernetes) and HPC scheduling.

    Infrastructure Optimization:
    Experience in optimizing infrastructure for performance, scalability, and cost-efficiency. Knowledge of distributed systems, network architecture, and storage technologies for AI and/or HPC.

    Problem-Solving Abilities:
    Demonstrated ability to analyse complex problems, propose innovative solutions, and implement them effectively. Strong troubleshooting and debugging skills to resolve infrastructure-related issues.

    Collaboration and Communication:
    Excellent interpersonal skills with the ability to collaborate effectively in a team environment. Strong verbal and written communication skills to convey technical concepts to both technical and non-technical stakeholders.
