
    Site Reliability Engineer II - Redmond, WA, United States - Microsoft

    Description
    Microsoft is a company where passionate innovators come to collaborate, envision what can be and take their careers further.

    This is a world of more possibilities, more innovation, more openness, and the sky is the limit thinking in a cloud-enabled world.

    We are looking to hire a Site Reliability Engineer II to join our team.

    Microsoft's Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence.

    The products in our portfolio include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI.

    Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.

    Within Azure Data, the databases team builds and maintains Microsoft's operational database systems.

    We store and manage data in a structured way to enable a multitude of applications across various industries.

    We are on a journey to enable developer-friendly, mission-critical, AI-enabled operational databases across relational, non-relational and OSS offerings.

    Service Reliability Team

    Data and how it is interpreted is essential for every business to succeed.

    The team is responsible for ensuring our critical services are running efficiently, securely and with high reliability.

    We work with many different teams to improve service reliability by continually innovating tooling, automation services and processes to make supporting our products scalable and efficient.

    We do not just value differences or different perspectives.

    We seek them out and invite them in so we can tap into the collective power of everyone in the company.

    As a result, our customers are better served.
    Microsoft's mission is to empower every person and every organization on the planet to achieve more.

    As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals.

    Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

    Required/Minimum Qualifications

    o 4+ years technical experience in software engineering, network engineering, or systems administration
    o OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration
    o OR Master's Degree in Computer Science, Information Technology, or related field.

    Other Requirements

    Ability to meet Microsoft, customer and/or government security screening requirements is required for this role.

    These requirements include, but are not limited to the following specialized security screenings:

    Microsoft Cloud Background Check:

    o This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

    Preferred/Additional Qualifications

    o Experience in developing highly scalable Service Reliability services and extensive experience using cloud online services.
    o Experience in prompt flow engineering and LLM systems.
    o Experience in developing reporting dashboards such as Power BI.

    Site Reliability Engineering IC - The typical base pay range for this role across the U.S. is USD $94,300 - $182,600 per year.

    There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $120,900 - $198,600 per year.

    Certain roles may be eligible for benefits and other compensation.

    Find additional benefits and pay information here:
    Microsoft will accept applications for the role until May 13, 2024. Microsoft is an equal opportunity employer.

    All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances.

    We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.

    If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

    Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

    #azdat #azuredata #databases #reliability #security #llm

    Technical Knowledge and Domain-Specific Expertise

    · Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures. Can identify and recommend optimal configurations of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the reliability and operability of supported products with minimal guidance from other engineers.
    · Develops an understanding of the code, features, and operations of specific products at scale as required to contribute to incremental improvements in product availability, reliability, efficiency, observability, and/or performance; participates in on-boarding, code/design reviews, and regular meetings with the engineering teams that develop and/or manage those products.
    · Researches and maintains an awareness of industry trends, advances in distributed systems and cloud technologies, new tools, and/or processes for maintaining and improving product availability, reliability, efficiency, observability, and/or performance. Contributes to the implementation of new solutions within their team by identifying ways they can be applied to solve persistent problems.

    Contributions to Development and Design

    · Leverages technical expertise in large scale distributed systems and specific products, as well as objective insights drawn from analyses of production telemetry data to suggest changes or add-ons to product features or code to improve the availability, reliability, efficiency, observability, and performance of product components or features supported by their team.
    · Designing and implementing Service Reliability services, tooling and processes.
    · Generating software specifications, proof-of-concepts, and prototype solutions given high level feature requirements.
    · Using data and telemetry to improve feature work and propose feature improvements to existing products.
    · Develops and tests basic changes to optimize code and improve the observability, reliability and operability of a defined range of platform, system, or product components or features with direction from other engineers.
    · Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations and incident responses throughout product development and operations cycles; leverages technical expertise on underlying systems/platforms and insights drawn from engagements with product engineering teams and telemetry analyses to propose potential improvements in code base and designs across components and features of one or more products.

    Driving Operational Excellence

    · Independently develops code or scripts that automate the performance of repetitive and easily scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.
    · Leverages technical expertise and telemetry analysis across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.
    · Identifies opportunities to leverage existing tools and automation to enable product engineering teams to increase the velocity with which they can reliably and safely implement changes in production; monitors the effects of changes across multiple components or features within a single platform or system.
    · Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations; monitors the impact of changes on operations metrics (e.g., Time-to-X).
    · Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, reliability, performance, and/or efficiency of components and features; proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams.
    · Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams and owners to major customer impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings.
    · Develops alerts and instrumentation across components and features to monitor product capacity and resource demands and analyze telemetry data using existing capacity planning models; draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across limited range of use conditions and system parameters.
    · Utilizes insights from performance and resource monitoring tools to identify whether there is a need to optimize the efficiency of component and feature code, or if changes to compute resources are required; models the predicted effect of changes to code and/or compute resources across components or features to document the efficacy of proposed solutions.
    · Shares insights and best practices that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams.

    Employment type: Full-Time
    Work site: Up to 50% work from home
    Role type: Individual Contributor
    Discipline: Site Reliability Engineering
    Profession: Software Engineering


  • Quadrant Technologies Redmond, United States

    Responsibilities include but are not limited to: · Monitor and maintain the Reliability, Availability, and Performance of the Cosmos DB service. · Design and implement Disaster Recovery and Business Continuity plans. · Collaborate with engineering teams to build and enhance tooli ...


  • Space Exploration Technologies Corporation Redmond, United States

    Starshield leverages SpaceX's Starlink technology and launch capability to support national security efforts. While Starlink is designed for consumer and commercial use, Starshield is designed for government use, with an initial focus on earth observ Reliability Engineer, Liabilit ...


  • Quadrant Technologies Redmond, United States

    Responsibilities include but are not limited to: · Monitor and maintain the Reliability, Availability, and Performance of the Cosmos DB service. · Design and implement Disaster Recovery and Business Continuity plans. · Collaborate with engineering teams to build and enhance to ...


  • Space Exploration Technologies Corporation Redmond, United States

    The Equipment Reliability Engineer is responsible for providing engineering support on planned and unplanned repairs, modifications, and upgrades of production equipment in the Starlink programs in Redmond. Equipment Reliability Engineers are the pri Reliability Engineer, Equipme ...


  • Space Exploration Technologies Corp. Redmond, United States

    SpaceX was founded under the belief that a future where humanity is out exploring the stars is fundamentally more exciting than one where we are not. Today SpaceX is actively developing the technologies to make this possible, with the ultimate goal of enabling human life on Mars. ...


  • Quadrant Technologies Redmond, United States

    Role: Site Reliability Engineer · Location: Redmond, WA · Responsibilities include but are not limited to: · Monitor and maintain the Reliability, Availability, and Performance of the Cosmos DB service. · Design and implement Disaster Recovery and Business Continuity plans. · Col ...


  • WaferWire LLC Redmond, United States

    WaferWire is currently seeking a Site Reliability Engineer to join its innovative team. This role involves implementing and maintaining robust DevOps practices within an Azure cloud environment to ensure smooth deployment and operation of services. · The responsibilities include ...


  • WaferWire Cloud Technologies Redmond, United States

    Role: Site Reliability Engineer (SRE) · Location: Redmond, WA [Onsite] · Job Description: · Implement and maintain robust DevOps practices within Azure cloud environment to ensure seamless deployment and operation of services. · Utilize PowerShell scripting expertise to automate ...


  • Quadrant Technologies Redmond, United States

    Role: Site Reliability Engineer · Location: Redmond, WA · Responsibilities include but are not limited to: · Monitor and maintain the Reliability, Availability, and Performance of the Cosmos DB service. · Design and implement Disaster Recovery and Business Continuity plans. ...


  • WaferWire Cloud Technologies Redmond, United States

    WaferWire is currently seeking a Site Reliability Engineer to join its innovative team. This role involves implementing and maintaining robust DevOps practices within an Azure cloud environment to ensure smooth deployment and operation of services. The responsibilities include us ...


  • WaferWire LLC Redmond, United States

    Role: Site Reliability Engineer (SRE) · Location: Redmond, WA [Onsite] · Job Description: · Implement and maintain robust DevOps practices within Azure cloud environment to ensure seamless deployment and operation of services. · Utilize PowerShell scripting expertise to automate ...


  • Microsoft Corporation Redmond, United States

    Microsoft Silicon and Cloud Hardware Infrastructure Engineering (SCHIE) is the team behind Microsoft's expanding Cloud Infrastructure and responsible for powering Microsoft's Intelligent Cloud mission. SCHIE delivers the core infrastructure and foundational technologies for Microsof ...


  • SpaceX Redmond, United States

    SpaceX was founded under the belief that a future where humanity is out exploring the stars is fundamentally more exciting than one where we are not. Today SpaceX is actively developing the technologies to make this possible, with the ultimate goal of enabling human life on Mars. ...


  • QData Redmond, United States contract

    Responsibilities: Work with a team of engineers focused on improving the reliability, scalability, latency, and efficiency of PSTN services powering cloud communications. Managing problem resolution with service providers. Learning existing tools and enhancing them to meet new scale and fe ...


  • QData Redmond, United States contract

    Hi, hope you are doing good... We have an urgent requirement below; please go through the job description and send your updated profile and expected rate ASAP. Please reach me at .com Job Role: Site Reliability Engineer Job Location: Redmond, WA Job Description: Ability to build tooling (dashboa ...


  • Microsoft Corporation Redmond, United States

    Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the team behind Microsoft's expanding Cloud Infrastructure and responsible for powering Microsoft's "Intelligent Cloud" mission. SCHIE delivers the core infrastructure and foundational technologies for M ...


  • Microsoft Redmond, WA, United States

    Microsoft has an exciting opportunity for a Senior Site Reliability Engineer in the Cloud+AI Silver Team. This team will be responsible for deploying and operating a Secure Work Area, including the infrastructure for collaboration within an airgapped environment. In this role, yo ...


  • Microsoft Redmond, WA, United States

    Microsoft is a company where passionate innovators come to collaborate, envision what can be and take their careers further. This is a world of more possibilities, more innovation, more openness, and the sky is the limit thinking in a cloud-enabled world.Microsoft's Azure Data en ...

