Site Reliability Engineer - Reston, VA
13 hours ago

Job description
Overview
Microsoft has an exciting opportunity for a Senior Site Reliability Engineer (SRE) to join the Azure Silver and Sovereign Team as part of the Azure Data Transfer (ADT) team. Azure Data Transfer enables secure access and data transfer between enclaves and supports multiple transfer and access patterns for highly regulated industries. In this role, you will apply SRE principles—availability, latency, performance, efficiency, change management, and incident response—to help ensure ADT is dependable at scale.
We are looking for engineers to join a fast-paced team and solve complex reliability challenges in mission-critical distributed systems spanning data transmission across clouds. Our team works across all facets of isolated system engineering and is deeply involved in defining and improving service health through SLIs/SLOs and error budgets, building automation to reduce toil, strengthening observability (logs, metrics, traces), reducing systemic latency, validating and transforming data, and optimizing throughput and capacity. You will build, deploy, and operate systems that enable a broad set of Azure services to be consumed by customers in highly secured and regulated environments, meeting strict security policy and assurance requirements for public and private sector customers.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
- Owns reliability architecture and end-to-end service understanding (dependencies, failure modes, and customer journeys) for distributed systems at scale. Defines and improves service health via SLIs/SLOs, error budgets, and well-defined operational readiness criteria. Drives cross-team reliability reviews and recommends design changes, runbooks, and safe rollout/rollback strategies that improve availability, latency, performance, and efficiency while managing cost.
- Maintains deep, current expertise in cloud reliability practices and the evolving technology landscape. Drives adoption of new platform capabilities and operational patterns (e.g., progressive delivery, resilience testing, chaos engineering where appropriate). Mentors engineers through design reviews, incident walkthroughs, and knowledge sharing to raise the reliability bar across related services.
- Implements reliable, scalable, and high-performance changes using SRE practices (progressive delivery, feature flags where applicable, safe rollouts/rollbacks). Owns implementation and rollback plans, validates operational readiness, and reduces toil through automation, self-healing, and standardized playbooks.
- Leverages telemetry and production signals to identify reliability risks and recurring failure patterns, then ships configuration changes, code fixes, or automation to address root causes. Expands infrastructure-as-code and operational tooling so teams can manage platforms and services safely and repeatably through code and policy.
- Builds and improves observability (metrics, logs, traces, dashboards, alerts) and uses it to detect, diagnose, and prevent incidents. Defines actionable alerting, reduces noise, and ensures instrumentation supports SLO reporting and rapid troubleshooting. Develops automation to validate telemetry pipelines and to enable automated mitigation and safer incident response.
- Participates in on-call rotations and leads response for complex, high-impact incidents by establishing incident command, assessing impact, coordinating responders, and driving mitigations to restore service within SLOs. Produces and contributes to blameless postmortems with corrective and preventative actions (CPAs), tracks them to completion, and implements automation and guardrails to prevent recurrence.
- Applies secure-by-design and compliance requirements to operations, monitoring, and automation (least privilege, auditability, change control, and data handling). Partners with security, privacy, and compliance teams to identify gaps, prioritize fixes, and implement automated controls and detection to prevent repeated violations.
- Embody our culture and values
Qualifications
Required / Minimum Qualifications:
- Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
Other Requirements:
Security Clearance Requirements: Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role. These requirements include, but are not limited to, the following specialized security screenings:
- The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph. Ability to meet Microsoft, customer, and/or government security screening requirements is required pre-offer and post-hire for this role. Failure to maintain or obtain the appropriate U.S. Government clearance and/or customer screening requirements may result in employment action up to and including termination.
- Clearance Verification: This position requires successful verification of the stated security clearance to meet federal government customer requirements. You will be asked to provide clearance verification information prior to an offer of employment.
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
- Citizenship & Citizenship Verification: This position requires verification of U.S. citizenship due to citizenship-based legal restrictions. Specifically, this position supports United States federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, citizenship will be verified via a valid passport, other approved documents, or a verified U.S. government clearance.
Preferred Qualifications:
- Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, service engineering, or systems engineering OR equivalent experience.
- 3+ years technical experience working with large-scale cloud or distributed systems
- Experience building automation with Ansible and developing/operating CI/CD pipelines (e.g., Azure DevOps, GitHub Actions) to deliver reliable, repeatable deployments.
- Expertise in problem solving and analyzing distributed systems and critical production service environments
- Expertise in Linux (specifically Rocky 9, Red Hat, Mariner, or similar), including throughput management, troubleshooting, and security hardening
Site Reliability Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $158,400 - $258,000 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process.