System Reliability Engineer - St Louis, United States - Fulcrum Digital Inc

    Description

    Who are we

    Fulcrum Digital is an agile, next-generation digital acceleration company that provides digital transformation and technology services from ideation to implementation. These services apply across a variety of industries, including banking and financial services, insurance, retail, higher education, food, health care, and manufacturing.

    The Role

    • Provide L2 support for production systems such as applications, databases, middleware components, infrastructure, and network components
    • Manage production incidents end-to-end within defined SLAs, focusing on resolution rather than on who caused them
    • Interact with stakeholders such as release managers, program leads, service managers, and development and test leads
    • Review operational readiness requirements such as monitoring and alerting, log rotation, and component resilience, and report any gaps
    • Provide pre-implementation support through activities such as release notes review and implementation dry runs
    • Protect production components by running health checks and monitoring latency and memory utilization
    • Automate day-to-day activities and propose changes that improve reliability
    • Participate in CAB and provide feedback on change requests
    • Support the DevOps team in testing promoted pipelines and suggest automation of configuration items
    • Follow incident management best practices and perform RCA
    • Participate in disaster recovery tests and operational acceptance tests
    • Analyze the technology stack that makes up the product and optimize the recovery time objective
    • Work with team members spread across time zones
    • Share knowledge, document improvements, and mentor junior team members

    Requirements

    • Deployments to MTF/Prod
    • Maintenance items (including stop/start, disaster recovery-related activities, etc.)
    • Monitoring
    • Support TRTs
    • Incident creation
    • Change requests (CRs) for changes in MTF/Prod

    Skills

    • Linux & Shell Scripting
    • ITIL / ITSM
    • PL/SQL
    • SQL
    • Application Troubleshooting
    • Incident/problem management ticketing tool - Remedy
    • Monitoring Tool - Splunk (preferred), Dynatrace (preferred), or any other monitoring tool
    • Jenkins (CI/CD) - good to have
    • Groovy - good to have
    • Any cloud platform (AWS / Azure / PCF) - good to have
    • Git basics / Bitbucket - good to have
    • Event Framework architecture - good to have
    • Ansible/Chef - good to have
    • DevOps basics - CI/CD basics, overview of Git, Bitbucket, SonarQube, Fortify, CI (Jenkins), ARA, SaltStack, Chef, Artifactory, MC DevOps Toolchain