Site Reliability Engineering Director - Newton, MA, United States - Bright Horizons

    Bright Horizons Newton, MA, United States

    1 month ago

    Description

    The Director of Site Reliability Engineering (SRE) will play a pivotal role in ensuring the seamless and reliable operation of consumer and customer-facing digital infrastructure across our lines of business. This leadership position involves overseeing a team of skilled SRE professionals and collaborating closely with cross-functional teams to enhance the performance, scalability, and reliability of complex systems and applications. The Director of SRE is responsible for developing and implementing strategies to optimize the reliability and uptime of our technology, managing incident response, and ensuring consistent use of best practices in automation, monitoring, and incident management. This role requires a deep understanding of cloud technologies, distributed systems, DevOps, software engineering, automation/scripting, observability, and app support/monitoring, as well as a proactive approach to preventing and mitigating potential issues. The Director of SRE must also foster a culture of innovation, continuous improvement, and collaboration within the team to meet the organization's evolving needs and deliver a superior digital experience to users.

    What you will be doing:

    • Strategy and Planning: Develop and implement a comprehensive strategy for site reliability, encompassing scalability, performance, and reliability improvements. Align SRE objectives with overall business goals and technology roadmaps. Foster a spirit of continuous improvement within the SRE function and position it to advance organizational objectives.
    • Leadership and Team Management: Provide strong leadership to the Site Reliability Engineering (SRE) team, fostering a culture of collaboration, innovation, and continuous improvement. Recruit, mentor, and develop a high-performing team of SRE professionals. Instill a can-do attitude in the team from the outset, combined with a passion for automation and engineering excellence.
    • Operational Excellence: Oversee day-to-day operations of the SRE team, ensuring the reliability and availability of digital infrastructure. Establish and enforce best practices for incident response, monitoring, automation, and system reliability. Do so by incorporating tools and technologies that create a 360-degree view of SRE efficiency, including but not limited to DevOps, App Support, Monitoring, Incident Management, Observability, Network/Infra/InfoSec, and Enterprise Architecture.
    • Collaboration: Collaborate with teams across our lines of business, including development, DevOps, App Support, Monitoring, Network/Infra/InfoSec, and Enterprise Architecture, to drive a unified approach to site reliability that optimizes the work of all those teams and improves time-to-market for their respective objectives. Foster strong relationships with leadership and partnering delivery organizations to align SRE efforts with organizational goals.
    • Monitoring and Alerting: Implement robust monitoring and alerting systems to proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
    • Automation and Efficiency: Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
    • System Capacity Planning: Work closely with infrastructure and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand. Anticipate growth and scalability requirements.
    • Incident Management: Establish and oversee effective SRE-focused incident response processes, ensure timely incident resolution, and conduct post-mortems to identify root causes and implement preventive measures.

    What we hope you will bring to this role:

    • Bachelor's degree in Computer Science, Engineering, or a related field.
    • A minimum of 10 years of experience, including at least 3 years in the SRE or DevOps field, with a proven track record of progressively increasing responsibilities and leadership roles.
    • Demonstrated ability to think strategically and develop a vision for site reliability engineering aligned with the organization's business objectives.
    • Strong leadership and people management skills, including experience leading and developing high-performing teams.
    • A 'can do' attitude is necessary, combined with a deep belief that everything can be automated and systems must always be functional.
    • Strong experience with and understanding of software engineering, scripting, build/deployment pipelines, Infrastructure as Code, and SLAs/SLOs/SLIs.
    • Strong understanding of cloud computing platforms (Azure required, Google Cloud a plus), including lift-and-shift environments (VMs, etc.) and cloud-native setups (AKS, serverless, etc.).
    • Strong understanding of and experience with automation tools and programming/scripting/declarative languages (e.g., C#, PowerShell, Python, Bash, Terraform, JavaScript) to develop and implement automated system reliability and performance solutions.
    • Strong understanding of observability, monitoring, and alerting tools (e.g., Azure Application Insights, Datadog, Splunk) and the ability to design and implement effective monitoring strategies.
    • Technical leadership skills, including technical collaboration/communication, problem-solving, and project management, are needed to lead the SRE team in delivering its objectives.
    • Preference may be given to candidates with relevant certifications demonstrating cloud and reliability engineering expertise.

    by Jobble
