Reliability Engineer - Menomonee Falls, United States - Kohl's Corporation

    Kohl's Corporation Menomonee Falls, United States

    1 month ago

    Description

    JOB RESPONSIBILITIES

  • Follows the software lifecycle, driving reliability, observability, and efficiency across product teams within your domain
  • Identifies repeated toil and finds opportunities for automation and risk reduction
  • Participates in an on-call rotation to respond to production incidents, and conducts blameless retros and root-cause analyses (RCAs) to drive a culture of continuous improvement
  • Proactively identifies failures before they become outages, using chaos engineering techniques to explore edge cases, failure modes, and disaster recovery (DR)
  • Advises on capacity planning and provides continuous assessments of system behavior and consumption, working toward optimization and cost savings
  • Works with product managers to identify and prioritize tech debt for reliability best practices (e.g. SLIs/SLOs/Error Budgets)

    QUALIFICATIONS

    REQUIRED

  • Bachelor's Degree or equivalent in MIS, Computer Science or related field
  • 2+ years of experience in software development
  • Strong programming skills in one or more languages - Java, Python, Go or
  • Experience working with one of the major cloud platforms (GCP, AWS, or Azure)

    PREFERRED

  • Experience in one or more observability platforms - Prometheus, InfluxDB, Grafana, ELK or APM
  • Knowledge of application design patterns, event-driven architecture, database schemas, and testing strategies
  • Experience with large-scale application troubleshooting and performance tuning
  • Knowledge of and experience with continuous integration, continuous deployment, and test-driven development
  • Experience with at least one PaaS or container platform - OpenShift, Cloud Foundry, Kubernetes or equivalent
  • Experience with one or more configuration management systems such as Chef, Ansible, or Puppet
  • Good understanding of systems architecture, UNIX internals, networking topologies, multi-cluster applications, multi-tenant platforms, and systems/network security