Senior Site Reliability Engineer - Des Moines, United States - Workforce Connections
Description
Job Title
Senior Site Reliability Engineer
Contract Duration
6+ Months with possible contract to hire
Location:
Remote - Must reside in U.S.
Prefer EST or CST time zone
Work Hours
Business Hours
Qualifications/Skills Needed
A Bachelor's degree in a quantitative or business field (e.g., statistics, mathematics, engineering, computer science).
Requires 4-6 years of related experience.
AWS
Route 53
Lambda
MongoDB
Kafka
Kubernetes
Load Balancing / Load Redirecting / Load Restricting strategies
Rancher, Axway API Gateway
Monitoring and observability tools such as Prometheus, Grafana, Dynatrace, Splunk, and ELK
Common Responsibilities Will Include
Building, reviewing and maintaining application design and architecture documents.
Ensuring disaster recovery (DR) capabilities are built into each system.
Working with development teams to implement and maintain the DR capabilities.
Participating in DR testing exercises and evaluating the results for continuous improvement.
Helps lead projects focused on managing and maintaining optimum platform infrastructure performance, reliability, and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes, and continuous delivery (CI/CD) tools, processes, and designs.
Develops complex services to automate monitoring activities and provide critical information to facilitate response and resolution of performance and availability issues and incidents.
Understands and advocates for standardized and scalable software tools to ensure that systems operate without interruption at optimum performance, and leads project teams throughout the deployment process.
Essential Functions
Troubleshoots and resolves more complex problems with systems and services, and initiates regular deployment of new versions of the systems and their subcomponents.
Leads more complex projects focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility.
Uses knowledge and experience to identify strategies that increase system reliability and performance through on-call rotation and process optimization.
Identifies and implements necessary manual and automated procedures for improved collaborative response in real time.
Leads lower-level engineers in stress, security, and performance testing.
Resolves issues that come up through support escalation.
Keeps documentation and runbooks up to date to deal effectively with new incidents that might arise.
Leads post-incident reviews and documents findings to inform future decision making.
Reviews proposals to optimize the Software Development Life Cycle (SDLC) to boost service reliability and decides which proposals should move forward.
CLIENT does not discriminate in employment on the basis of race, color, religion, sex (including pregnancy and gender identity), national origin, political affiliation, sexual orientation, marital status, disability, genetic information, age, membership in an employee organization, retaliation, parental status, military service, or other non-merit factor.