SRE Lead - Hartford, United States - Centraprise

    Technology / Internet
    Description

    SRE Lead

    Hartford CT (Onsite)

    Full-time

    Job Description:

    • Hands-on experience in application production support, including debugging and triaging issues through to resolution.
    • Extensive hands-on automation experience with Ansible.
    • Experience in incident, change, and AWS infrastructure management.
    • Experience using key metrics such as SLOs, SLIs, and error budgets to measure service quality and reliability.
    • Experience using system reliability metrics such as latency, traffic, errors, and saturation.
    • Proactively monitor system and application performance, identifying bottlenecks and areas for improvement.
    • Experience in self-healing automation using tools such as Ansible, Splunk, and Dynatrace; automate operations tasks to increase team efficiency.
    • Experience using Dynatrace, Splunk, and AWS CloudWatch to monitor the health and performance of migrated services.
    • Set up logging and tracing to capture and analyze system behavior and application performance.
    • Configure alerting thresholds and notifications to quickly identify and respond to issues.
    • Experience in building IaC scripts in AWS CloudFormation or Terraform to automate the provisioning and management of AWS resources.
    • Experience optimizing AWS resource usage (e.g., EC2 instances, ECS, RDS databases) for cost-effectiveness and performance.
    • Experience using AWS Cost Explorer and Cloudability to monitor costs, and implementing AWS security best practices, including network configurations, IAM policies, and encryption, to protect data and resources.
    • Experience using CI/CD pipelines and deployment strategies such as blue-green and canary.

    Roles & Responsibilities:

    • Build the SRE team and the Site Reliability Engineering capability.
    • Resolve complex production issues to improve reliability.
    • Identify opportunities to fully automate operations using a variety of tools and skill sets.
    • Application migration and release support.
    • Application support in the production environment.