Site Reliability Engineer - Dallas, United States - Avetta
Description
Join Avetta as a Site Reliability Engineer
Site Reliability Engineers are pioneers of our production systems. We believe in proactive discovery and analysis of our entire stack, continually optimizing, tuning, and scaling the system for the best possible end-user experience on a globally distributed, cloud-based SaaS platform. Downtime is not in the SRE's vocabulary. The ability to maintain highly resilient distributed systems, integrate uptime monitors through programmatic APIs, and develop intelligent scaling algorithms is essential for the SRE. In addition, the SRE needs to communicate effectively with both development and product teams to drive technical discovery and help prioritize features that meet and exceed uptime goals and improve the end-user experience.
Essential Duties and Responsibilities:
- Lead the management and monitoring of highly available replicated cloud systems.
- Oversee 24/7 Network Operations Center (NOC) operations, maintaining a minimum 99.9% annual uptime.
- Define golden signals for all services in our core SaaS application.
- Manage NOC engineer teams, including scheduling and responsibilities.
- Design PagerDuty escalation policies across various teams.
- Apply expertise in AWS technologies and build dashboards with leading observability platforms.
- Automate monitors and dashboards using modern programmatic methods.
- Provide regular reports to Engineering leadership and executive teams for continuous improvement.
Minimum Qualifications:
- Minimum B.S. or B.A. in Computer Science.
- Minimum of 5 years of experience as a Site Reliability Engineer, including some experience in managing teams and leading projects.
- Stellar communication and interpersonal skills for effective collaboration with Development & Product teams.
- Proficiency in monitoring the networking stack using distributed tracing and profiling tools.
- Proficiency in building dashboards with New Relic, Kibana, Grafana, Prometheus, and other observability platforms.
- Proficiency with AWS technologies.
- Working knowledge of monitoring RESTful microservices and of basic HTTP protocols.
- Ability to automate monitors and dashboards using REST APIs, GraphQL, and other modern programmatic methods.
- Working knowledge of profiling tools for measuring CPU, memory, I/O, and disk usage and for capturing process thread dumps.
- Experience in managing, integrating, and automating alerting and escalation tools.
- Must be able to work a hybrid schedule (3 days in the office, 2 days working from home) and come into Avetta's Dallas office at 2000 McKinney Ave, Dallas, TX 75201.
Nice to Haves:
- Troubleshooting experience with modern container and networking technologies (Kubernetes, HAProxy, ALB).
- Familiarity with scripting and programming languages such as Bash, Python, and Go.
- Ability to administer and tune load balancer technologies.
- Experience in managing, monitoring, and benchmarking distributed file systems.
- Proficiency in configuration management tools (SaltStack, Ansible, Terraform).
Metrics That Matter:
- System Monitoring: Create and automate system monitors and escalation policies.
- System Management: Respond to and resolve internal requests within business hours.
- High Availability & Resilience: Maintain 99.95% uptime and be the first responder in emergency situations.
- Full-Stack Observability: Build dashboards for end-to-end detection of system anomalies.
- Innovation: Propose new ideas and improvements to the team regularly.
Join us at Avetta and be at the forefront of driving technical excellence and ensuring a seamless experience for our users across the globe.
#LI-HYBRID