Site Reliability Engineer - Southfield, United States - Truck-Lite



    Description
    Run the production environment by monitoring availability and taking a holistic view of system health.
    Build software and systems to manage platform infrastructure and applications.
    Improve reliability, quality, and time-to-market of our suite of software solutions.

    Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.

    Provide primary operational support and engineering for multiple large-scale distributed software applications.
    SREs will focus on automation, monitoring, incident resolution, and culture.

    Responsibilities:
    Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
    Partner with development teams to improve services through rigorous testing and release procedures.
    Participate in system design consulting, platform management, and capacity planning.
    Create sustainable systems and services through automation and uplifts.
    Balance feature development speed and reliability with well-defined service-level objectives.
    After incidents, document the actions taken during incident response and turn them into automated solutions.
    Monitor infrastructure using SRE tools and suggest tools as necessary.
    Build monitoring alerts and incident response processes.
    Improve operational processes and team practices.
    Code infrastructure automation across the CI/CD pipeline.
    As the solution scales, ensure reliability through designing, building, and maintaining the core infrastructure.
    Demonstrate strong programming skills and thorough knowledge of systems.
    Bring about cultural shifts to provide a foundation for process changes.

    Requirements:
    Bachelor's degree (or equivalent) in computer science or related discipline.

    Experience with AWS multi-region/multi-AZ deployed systems, auto scaling of EC2 instances, CloudFormation, ELBs, VPCs, CloudWatch, SNS, SQS, S3, Route53, RDS, IAM roles, security groups, blue/green deployments, and A/B testing.

    Comfortable with large-scale production systems and technologies, such as load balancing, monitoring, distributed systems, and configuration management.
    Strong coding skills in at least one programming language, and a desire to pick up more.
    Familiarity with and enthusiasm for software engineering best practices such as testing, continuous integration, and continuous delivery.
    Exposure to cloud platforms, Amazon Web Services (AWS), and APIs.

    Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.

    Strong security mindset.
    Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
    Solid understanding of fundamental technologies such as TCP/IP and HTTP.
    Strong working knowledge of Linux systems and applications.
    Experience with automation tooling such as Chef, Docker, and AWS.
    Ability and willingness to collaborate.
    Strong problem-solving skills and ability to think under pressure.
    Strong analytical and management skills.
    Strong communication and documentation skills.

    Preferred skills and qualifications:
    Previous success in technical engineering.
    Coding experience beyond simple scripts.
