- Design and build highly scalable and resilient systems to support our applications and services, incorporating predictive analytics to anticipate reliability risks.
- Develop and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using machine learning anomaly detection to ensure systems meet reliability targets.
- Drive improvements in system reliability, availability, and performance through proactive measures, automation, and intelligent failure prediction.
- Implement and manage comprehensive monitoring and alerting solutions, integrating with intelligent observability platforms that reduce alert noise and correlate events.
- Develop and maintain dashboards and reporting tools that provide data-driven insights, actionable troubleshooting recommendations, and guidance for performance optimization.
- Evaluate and integrate advanced monitoring tools and operational intelligence platforms to enhance observability and root cause identification.
- Lead and participate in incident response efforts, using intelligent log analysis and automated event correlation to speed up troubleshooting and root cause identification.
- Develop and maintain incident management processes incorporating automated decision support systems to improve response times and minimize service disruptions.
- Conduct post-incident reviews, using automated pattern recognition and trend analysis to identify systemic issues and implement preventive measures.
- Analyze performance metrics and logs, supported by advanced observability tools, to detect bottlenecks and inefficiencies.
- Collaborate with development teams to implement automated profiling and optimization recommendations for code and infrastructure improvements.
- Perform capacity planning using machine learning forecasting models to ensure systems can handle current and future loads.
- Develop and implement automation solutions, including intelligent runbook automation, self-healing systems, and automated incident triage.
- Identify and drive process improvements by applying machine learning to operational data for continuous optimization.
- Maintain documentation that includes automation and machine learning guidelines for monitoring, incident management, and SRE best practices.
- Work closely with engineering, operations, and product teams to align reliability and monitoring goals, including automation adoption strategies.
- Communicate effectively with stakeholders, providing regular updates on system health, incidents, performance improvements, and data-driven insights.
- Foster a culture of collaboration, knowledge sharing, and automation best practices within the team and across the organization.
- Bachelor's degree in computer science or a related technical field.
- At least 5 years of experience in Site Reliability Engineering or a similar role.
- Strong proficiency in at least one programming language such as Python, Go, or C#.
- Demonstrated experience applying machine learning and automation to operational workflows such as monitoring, alerting and incident response.
- Expertise with infrastructure as code tools such as Terraform.
- Proven experience working with and monitoring container environments such as Cloud Run and Kubernetes.
- Hands-on experience working within Azure, AWS, or GCP environments (GCP preferred).
- Strong understanding of networking, distributed systems, and cloud infrastructure.
- Familiarity with intelligent monitoring platforms and operational analytics tools such as Prometheus, Grafana, OpenSearch, Sentry, and Google Cloud Observability.
- Excellent problem-solving skills and the ability to work independently and as part of a team.
- Experience with incident management, root cause analysis, and automated operational workflows.
Senior Site Reliability Engineer - Chicago - The Aspen Group
Description
The Aspen Group (TAG) is one of the largest and most trusted retail healthcare business support organizations in the U.S. and has supported over 20,000 healthcare professionals and team members with close to 1,500 health and wellness offices across 48 states in four distinct categories: dental care, urgent care, medical aesthetics, and animal health. Working in partnership with independent practice owners and clinicians, the team is united by a single purpose: to prove that healthcare can be better and smarter for everyone. TAG provides a comprehensive suite of centralized business support services that power the impact of five consumer-facing businesses: Aspen Dental, ClearChoice Dental Implant Centers, WellNow Urgent Care, Chapter Aesthetic Studio, and Lovet Pet Health Care. Each brand has access to a deep community of experts, tools and resources to grow their practices, and an unwavering commitment to delivering high-quality consumer healthcare experiences at scale.
As a Senior Site Reliability Engineer (SRE) at TAG - The Aspen Group, you will be responsible for ensuring the reliability, performance, and scalability of our core systems. This role involves proactively building and managing monitoring solutions, leading incident response, and continuously optimizing system performance to exceed business objectives. We are actively integrating AI and machine learning into our operational workflows, and you will be on the front lines, leveraging intelligent automation and machine learning to build a proactive, resilient infrastructure. This is an opportunity to go beyond SRE by applying cutting-edge technology to solve complex reliability challenges.
Responsibilities:
Intelligent Site Reliability Engineering: