Sr Site Reliability Engineer/SRE with Datadog, Dynatrace, and PagerDuty - Houston, United States - Elite Mente LLC

    Job Description

    Sr Site Reliability Engineer

    Location:
    Remote (CST or EST time zone preferred)
    Applicant shall have strong technical experience with Datadog, Dynatrace, and PagerDuty.
    Applicant shall have experience with incident management and IT service management (ITSM).


    Responsibilities:
    This is a strategic and hands-on position where you will work closely with cross-functional teams to develop and optimize Service Management processes (Incident/Problem/Change Management), drive continuous improvement, and enhance our proactive capabilities.
    Monitor system management consoles and respond to alerts.
    Facilitate Major Incident conference calls independently, performing multiple roles including Situation Leader, Scribe, and Communications to executive leadership.

    Lead and coordinate the end-to-end incident management process, from detection and diagnosis to resolution and post-incident analysis, including RCAs, to ensure that correct monitoring and automated alerting are in place to prevent recurrence.

    Help improve problem tracking, root cause analysis, and product availability across Technology.
    Proactively detect and prevent future problems/incidents and initiate the Problem Management process to allow quicker diagnosis and resolution.
    Conduct the weekly Change Advisory Board calls and drive tooling automation (requirements, testing, adoption) to support Change Management Operations.
    Develop trend analysis and prepare service improvement plans to address identified gaps.
    Implement and enforce OLAs/SLAs to ensure effective governance of change requests through the Change Management lifecycle.
    Define and inspect metrics, KPIs, and trend reports for use in the problem management process.

    Build strong relationships with key stakeholders, including senior management, department heads, and external partners, to ensure their support and engagement in incident management initiatives.

    Foster a culture of continuous improvement, staying abreast of industry trends, emerging technologies, and best practices to enhance incident management capabilities.

    Create dashboards and reports to provide insights into operational performance and health.
    Leverage automation to optimize processes and workflows.

    Complete any assigned project work or tasks with a view to improving existing processes and capabilities, and seek out automation opportunities.

    Collaborate with engineering teams to ensure that incident learnings are integrated into the software development lifecycle to improve overall system resilience.


    What we expect of you:
    7+ years of overall service management experience

    Influence:

    Persuades others to support and commit to desired actions; communicates the urgency and importance needed to mobilize others; uses expertise, credibility, and personal style to convince others to accept recommendations or adopt new attitudes.


    Business Acumen:

    Behaviors are aligned with how the business operates; considers trends and competitive information in decisions and actions; aligns with the culture, strategy, priorities, and practices of the business; aligns with and understands the customer and the market to solve problems and capture opportunities.


    Effective Communicator:

    Actively listens to, summarizes, and considers comments to ensure understanding; prepares and delivers proposals and presentations; proactively ensures the timely sharing of relevant information with appropriate people.


    Results Orientation:

    Works towards goals, overcoming obstacles, setbacks, and uncertainty; identifies barriers to goal achievements; plans for contingencies to ensure delivery.


    Initiative:

    Takes prompt and independent action when appropriate; shares ideas; recommends solutions to problems; does what's necessary without being prompted or incented; seeks increasing levels of challenge and opportunity.


    Negotiation:

    Provides strong arguments for positions; achieves desired negotiating outcomes while meeting others' needs; effectively deals with emerging issues; resolves challenging problems; wins concessions without harming relationships.


    Organization and Execution:

    Sets, manages, and completes work on multiple priorities and work activities; organizes work in ways that improve efficiency; completes assignments rapidly and thoroughly; effectively handles conflicting priorities and demands; eliminates roadblocks.


    Problem Solving & Analysis:

    Systematically gathers relevant information and input; considers a broad range of factors; grasps complexities and sees relationships among data, events, or problems; applies fact-based logic; generates alternatives.

    Availability for on-call rotations and off-hours as needed.
    Hands-on experience with monitoring and performance monitoring tools such as Datadog, Dynatrace, Splunk, etc.
    Experience with ServiceNow ITSM modules: Incident Management, Problem Management, Change Management, Reporting and Analytics.
    Azure foundation certification and analytical skills are a plus.
