Site Reliability Engineering Lead - Winnetka, United States - Medline Industries

    Medline Industries Winnetka, United States

    4 weeks ago

    Description
    Site Reliability Engineer Lead
    Summary

    Medline Industries is seeking a talented Site Reliability Engineer Lead to join our Software Engineering organization, helping to modernize Medline systems using modern architecture and technologies to improve efficiency, productivity, and availability.

    The Site Reliability Engineering Lead will help build a meaningful engineering discipline, combining software and systems expertise to develop creative engineering solutions for operational enhancements.

    This role focuses on support and software development: implementing and configuring monitoring tools and techniques, optimizing existing systems, building infrastructure, and reducing manual work through automation.

    Responsibilities


    Participate in the Agile software development process as a member of scrum teams and work according to the goals of the team, focusing on developing scalable and reliable solutions.

    Analyze and address failure patterns and incidents in a team setting to reduce MTTR and increase MTBF.

    Design suitable monitoring frameworks to accomplish end-to-end monitoring and seamless alerting.

    Analyze performance tests and their results, and identify and document optimization opportunities within Medline's software stack.

    Participate in the full development lifecycle of automated solutions, including seamless alerting and monitoring, and address technical debt.

    Troubleshoot production processing and execute problem resolution through post-issue evaluations, root-cause analysis, and remediation.


    Design and implement performance tests, identify bottlenecks and opportunities for optimization and capacity demands, and present solutions for continuous improvements.

    Implement reliability engineering practices and tests such as chaos engineering and fault injection.

    Design automated solutions for software and product upgrades, change management, and release management.

    Mentor and coach the team to adopt new tools and technologies where applicable.

    Serve as incident commander as needed, and ensure blameless postmortems are conducted and documented for future reference.

    Maintain SRE troubleshooting runbooks, updating them as needed and whenever new architectural components are introduced.

    Drive the use of automation to streamline Ops SOPs and documentation.
