Site Reliability Engineer - Marlborough, United States - Amtex Systems


    4 weeks ago

    Description

    Job Title: Lead Site Reliability Engineer
    Duration: 6-month contract-to-hire
    Location: Marlborough, MA (Hybrid)


    Responsibilities:


    Design and manage Java-based microservices, Bash scripts, Redis, and high-availability architectures while strictly adhering to Site Reliability Engineering (SRE) principles.

    Thrive in high-pressure environments, working swiftly and reliably to maintain system integrity and meet service level objectives (SLOs), as measured by service level indicators (SLIs).

    Proactively identify and address potential issues before they impact operations, using observability tools such as New Relic and Scalyr/Splunk along with Bash and Python scripts.

    Lead initiatives to enhance current systems and implement innovative solutions in collaboration with a fast-paced, mission-driven team, focusing on the implementation of SRE best practices.

    Conduct thorough root-cause analyses for production incidents and generate high-quality RCA reports, leveraging SRE methodologies to prevent recurrence.

    Apply software engineering principles to rectify operational challenges and optimize system performance, with a specific focus on implementing SRE-driven solutions.

    Ensure the availability, latency, performance, efficiency, and security of our infrastructure, adhering rigorously to SRE principles and best practices.

    Design and maintain robust production monitoring systems to ensure timely detection and resolution of issues, following SRE guidelines for effective monitoring and alerting.

    Utilize a diverse array of tools to troubleshoot performance and stability issues effectively, employing SRE methodologies to identify and mitigate bottlenecks.

    Evaluate and enhance application and environment security measures, integrating SRE-driven security practices into the development and deployment pipelines.

    Provide support for globally distributed, multi-cloud (public and/or private) environments, implementing SRE strategies for resilience and fault tolerance.

    Automate repetitive tasks at scale to streamline operational workflows and enhance efficiency, focusing on the implementation of SRE-driven automation solutions.

    Adhere to change management processes during implementations and utilize version control for application infrastructure, following SRE principles for reliable and auditable change management.

    Foster an SRE mindset throughout the organization, promoting collaboration and shared responsibility for reliability and performance.


    Qualifications:
    Bachelor's Degree in Computer Science or related field, or foreign equivalent.
    Demonstrated curiosity and self-drive to tackle complex challenges and drive change in a diverse organizational landscape.
    Excellent written and verbal communication skills, with the ability to effectively communicate with engineering management, developers, and leadership.
    Proven ability to adapt to new technologies and learn quickly.
    Minimum of 5 years of experience in Site Reliability Engineering (SRE) or related roles.
