
    Senior Principal Engineer Site Reliability - Arlington, United States - Dell

    Description
    Senior Principal Engineer Site Reliability
    Dell Technologies customers rely on our products and services to drive progress, so we take the service we provide extremely seriously. Service Delivery is all about making sure our technical solutions help clients meet their priorities, address their challenges and advance their initiatives. As trusted advisors, we build in-depth knowledge of what each client wants to achieve. Then we make sure the services Dell Technologies provides deliver on all our promises.

    We also work closely with Sales and Global Services colleagues to develop strategic account growth plans, and to identify and pursue sales opportunities.

    Join us to do the best work of your career and make a profound social impact as a Senior Principal Engineer - Site Reliability Engineering on our Service Delivery Team in Austin, Texas.

    What you'll achieve

    The Senior Principal Engineer - Site Reliability Engineering, Service Delivery, supporting Artificial Intelligence/Machine Learning/High Performance Compute solutions, will be responsible for the primary management, administration, support, and ongoing maintenance of customer platforms within a 24x7x365 datacenter environment.

    This is a technical leadership role.

    The ideal candidate will play a crucial role in managing and supporting complex solutions and platforms for our prestigious Fortune 100 clients.

    The engineer will be expected to collaborate positively with fellow team members, senior engineering/architect staff, vendors, and customers.

    The Senior Principal Engineer will help mature and develop processes, create technical standards, and drive operational excellence through consistent delivery and best practices.


    You will:
    Serve as the top technical expert in deploying, upgrading, and troubleshooting Artificial Intelligence/Machine Learning/High Performance Compute solution platforms

    Manage and maintain container platform (Kubernetes, OpenShift) infrastructure, including installation, configuration, and upgrades, and optimize the performance, capacity, and availability of the environment

    Act as an SRE/DevOps expert

    Take the first step towards your dream career

    Every Dell Technologies team member brings something unique to the table.

    Here's what we are looking for with this role:

    Essential Requirements

    Hands-on experience working in an infrastructure managed services environment, supporting complex engineered solutions in production on Artificial Intelligence/Machine Learning/High Performance Compute systems and platforms and on converged/hyper-converged infrastructure, along with fluency in AI/ML pipelines, NVIDIA GPU optimization, InfiniBand networking, machine learning operating systems, and compute orchestration platforms such as Run:ai

    Expert-level knowledge of cluster provisioning and resource schedulers

    Programming experience with Python, Go, Ruby, shell scripting, and PowerShell, along with hands-on experience with ELK, Prometheus, Grafana, Ansible, Git, or similar technologies

    Expertise in Kubernetes, OpenShift, Docker, container networking, and cloud-native platforms/applications

    Strong networking fundamentals, a Converged Infrastructure (CI)/Hyper-Converged Infrastructure (HCI) management certification, and hands-on experience with Azure Kubernetes Service (AKS), Amazon EKS, Google Kubernetes Engine (GKE), and Rancher

    Desirable Requirements
    A BE or MS in Computer Science or Computer Engineering; an acceptable combination of equivalent industry experience will also be considered

    Certified Kubernetes/OpenShift Administrator; VMware NSX-T certification

    Who we are
    We believe that each of us has the power to make an impact. That's why we put our team members at the center of everything we do.

    If you're looking for an opportunity to grow your career with some of the best minds and most advanced tech in the industry, we're looking for you.


    Dell Technologies is a unique family of businesses that helps individuals and organizations transform how they work, live and play.

    Join us to build a future that works for everyone, because Progress Takes All of Us.

    Application closing date: 03/22/2024


    Dell Technologies is committed to the principle of equal employment opportunity for all employees and to providing employees with a work environment free of discrimination and harassment.

    Read the full Equal Employment Opportunity Policy here.
    #LI-Remote



  • MetroStar Washington, United States

    As a Site Reliability Engineering (SRE) Lead, you'll deliver mission-critical services that empower end users. As the ideal candidate, you'll use your extensive experience designing and implementing end-to-end continuous delivery pipelines and experience in AI/ML. You will also u ...


  • Mount Indie Washington, United States

    Job Description · As a Site Reliability Engineer (SRE), you'll continuously drive improvements in observability, performance, and reliability, with the goal of making an impact across the federal government. This role requires a current TS/SCI that has been obtained wit ...


  • GM Financial Arlington, United States

    Overview: · This is a Hybrid Opportunity at our Arlington, TX office (3 days remote, 2 days onsite) · Why GMF Technology? · GM Financial is set to change the auto finance industry and is leading the way in embarking on tech modernization; we have a startup mindset, and preserv ...


  • Allscripts Washington, United States

    Welcome to Veradigm, where our Mission is transforming health, insightfully. Join the Veradigm team and help solve many of today's healthcare challenges being addressed by biopharma, health plans, healthcare providers, health technology partners, and the patients they serve. At V ...


  • Allscripts Washington, United States

    Welcome to Veradigm. Our Mission is to be the most trusted provider of innovative solutions that empower all stakeholders across the healthcare continuum to deliver world-class outcomes. Our Vision is a Connected Community of Health that spans continents and borders. With the larg ...


  • Booz Allen Hamilton Alexandria, United States

    Job Number: R · Site Reliability Engineer · The Opportunity: · Do you love finding ways to make systems more efficient? Do you find it impossible to simply maintain when you could improve? Engineering to make a system more resilient and efficient frees up time and money to bui ...


  • Innovative Computer Solutions Group, Inc Alexandria, United States

    Job Description · Site Reliability Engineer (SRE) mandatory skills/qualifications: · Must be a US Citizen · Must possess a minimum of 3 years of actual industry experience in an SRE role · Must possess a minimum of 10 years of software engineer experienc ...


  • Parsons Oman Alexandria, United States

    We harness the power of innovation so that you can change the world and help our customers solve their most complex challenges · In a world of possibilities, pursue one with endless opportunities. Imagine Next. When it comes to what you want in your career, if you can imagine it, y ...


  • Harbor Compliance Washington, United States

    Job Description · Site Reliability Engineer - Full-time Remote · Advance Your Career with Cutting-Edge Infrastructure at Harbor Compliance · Location: Full-time Remote (Excluding CA, CO, MT, NY) · About Harbor Compliance: · Harbor Compliance is committed to simplif ...


  • Department of Corrections Executive Leadership Washington, United States

    Introduction · The Department of Corrections (DOC) is focused on public safety through the custody and supervision of those in our care. Corrections employees have the opportunity to positively impact the lives of others through careers in a variety of fields. Using cutting-edge ...


  • ARCADIS Farragut, United States

    Job Description · Arcadis is the world's leading company delivering sustainable design, engineering, and consultancy solutions for natural and built assets. · We are more than 36,000 people, in over 70 countries, dedicated to improving quality of life. Everyone has an important ...


  • General Motors Financial Company, Inc. Arlington, United States

    About this role: The Site Reliability Engineering (SRE) team provides leadership, direction, and accountability for building and running large-scale software systems. As a Site Reliability Engineer, you will identify and deliver automation solutions Reliability Engineer, Liabilit ...


  • Azimuth Corporation Springfield, United States

    Job Description · Azimuth Corporation is seeking a Site Reliability Engineer, in support of a government customer in Springfield, VA. The ideal candidate will create capabilities (pipelines, containers, auditing/monitoring, HA, SLO/SLA policy docs) and maintain exi ...


  • Automox Arlington, United States

    Are you ready to own something big? Automox is turning IT admins into IT heroes by replacing traditional tools with our award-winning cloud-native endpoint management platform. Our product works autonomously and so do our teams. We value a 'one team' mentality where everyone's un ...


  • Halvik Vienna, United States

    Job Description · Halvik is a highly successful company that puts people first, and we are looking for someone just like you. We are committed to delivering smarter IT-driven solutions bolstered by quality and innovation to help our customers succeed. Come be a par ...


  • Halvik Vienna, United States

    Halvik is a highly successful company that puts people first, and we are looking for someone just like you. We are committed to delivering smarter IT-driven solutions bolstered by quality and innovation to help our customers succeed. Come be a part of something truly special · S ...


  • Booz Allen Hamilton Chantilly, United States Full time

    Site Reliability Engineer · The Opportunity: · Do you love finding ways to make systems more efficient? Do you find it impossible to simply maintain when you could improve? Engineering to make a system more resilient and efficient frees up time and money to build more capabilities. ...


  • Oracle Reston, United States

    Work with Site Reliability Engineering (SRE) team on the shared full stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible fo ...