Sr System Reliability Engineer - Saint Louis, United States - Fulcrum Digital


    1 week ago

    Description
    Who are we

    Fulcrum Digital is an agile, next-generation digital acceleration company that provides digital transformation and technology services from ideation to implementation. These services apply across a variety of industries, including banking and financial services, insurance, retail, higher education, food, health care, and manufacturing.


    The Role

    • Provide L2 support for production systems such as applications, databases, middleware components, infrastructure, and network components
    • Manage production incidents end-to-end within defined SLAs, with a focus on resolution rather than on who caused them.
    • Interact with various stakeholders such as release managers, program leads, service managers, and development and test leads
    • Review operational readiness requirements such as monitoring and alerting, log rotation and resilience of the components and report the gaps
    • Provide pre-implementation support with activities such as release notes review and implementation dry runs.
    • Protect production components by running health checks, monitoring latency and memory utilization.
    • Automate day-to-day activities and propose changes that improve reliability
    • Participate in CAB and provide feedback on change requests
    • Support the DevOps team in testing the promotion pipelines, and suggest automation of configuration items.
    • Practice incident management best practices and perform RCA.
    • Participate in disaster recovery tests and operational acceptance tests
    • Analyze the technology stack that makes up the product and optimize the recovery time objective (RTO).
    • Work with team members spread across time zones
    • Share knowledge, document improvements and mentor junior resources

    Requirements

    Responsibility Matrix

    • Deployments MTF/Prod
    • Maintenance items (including stop/start, Disaster Recovery-related activities, etc.)
    • Monitoring
    • Support TRTs
    • Incident creation
    • CR for changes in MTF/Prod
    Tools
    • Log Monitoring Tool - Splunk
    • Application Monitoring Tool - Dynatrace
    • Ticketing and incident/problem management tool - Remedy
    • Linux
    • SQL
    • DevOps basics - CI/CD basics, overview of Git, Bitbucket, SonarQube, Fortify, CI (Jenkins), ARA, SaltStack, Chef, Artifactory, MC DevOps toolchain
    Skills
  • Linux & Shell Scripting
  • ITIL / ITSM
  • PL/SQL
  • Application Troubleshooting
  • Monitoring Tool - Splunk (preferred), Dynatrace (preferred) or any other monitoring tool
  • Jenkins - CI/CD
  • Groovy
  • Any Cloud - AWS / Azure / PCF
  • Git basics / Bitbucket
  • Event framework architecture - good to have
  • Ansible/Chef