- Reliability of the platform KPIs (SLIs, SLOs)
- Direct work using data (metrics)
- Continuously analyze and highlight areas of reliability deficiency
- Advocate, influence, and follow up on action items regarding reliability
- Monitoring and metrics
- Collaborate with global engineering stakeholders to establish a higher-level platform dashboard that will allow a shared centralized view of key platform metrics.
- Work closely with service teams to build and maintain an SLO framework
- Support critical hardware and software releases and product launches
- Create, monitor, and maintain platform monitors and assist in triage
- Actively participate in and improve the SmartThings Incident Commander Group
- Actively participate in the on-call rotation for incidents
- Best practice development, adoption, and evaluation
- Review post-mortems for recurring patterns and areas of optimization we should focus on, and address them by working with teams or building company-wide best practices.
- Facilitate a community of practice for operations and site reliability concepts to extend the capabilities of service teams through a culture of trust and team empowerment
- Mentor engineers on Site Reliability Engineering principles, practices, and tools
- Develop Platform Reliability Operational Health Guidance
- Bachelor's degree in Computer Science or Electrical/Computer Engineering or similar experience
- 8+ years of software engineering experience
- 5+ years of operational experience in improving Service Reliability, Availability, and Performance.
- Advanced knowledge of distributed systems and network infrastructure protocols.
- Demonstrated ability to manage, troubleshoot, and resolve incidents in distributed environments.
- Experience solving problems.
- Expertise in analyzing and fixing large-scale distributed systems
- Experience with Observability tooling (e.g. Sumo Logic and Datadog)
- Experience with AWS cloud technologies
- Programming experience with an object-oriented programming language (e.g. Java, Kotlin) and scripting languages (e.g. Python).
- Proficiency in Linux Operating Systems
- Excellent communication skills, including the ability to build trust and influence others
- Experience working across time zones, geographies, languages, and cultures.
- Experience working in the IoT Industry
- Experience leading change initiatives or coaching cloud operations
- A deep understanding of web technologies and site reliability engineering (SRE).
- Experience as a technical lead
- Experience working in a multi-cloud service provider environment
- We offer an attractive compensation package with comprehensive health benefits, including medical, dental, vision, and mental health; an HSA with employer contribution; life & disability insurance; FSAs for health and dependent care expenses; a competitive 401k with a 5% employer match, and more.
- All of our employees enjoy unlimited PTO, 12 paid holidays, and a generous parental leave policy (8 weeks fully paid parental leave and 8 more fully paid weeks for childbirth recovery leave).
- Eligible employees benefit from our education reimbursement program, and all employees enjoy access to learning resources through O'Reilly.
- Our commitment to diversity, equity, inclusion and belonging is embedded into our culture and our work, and everyone has frequent opportunities to join forums and groups and participate in ongoing projects.
Staff Site Reliability Engineer - Mountain View, United States - SmartThings
Description
We're SmartThings, one of the leading IoT ecosystems in the world, creating the most effortless way for anyone to create a smart home. As a wholly owned subsidiary of Samsung, our corporate offices are based in Minneapolis and the Bay Area. More than 270 million people worldwide use SmartThings to control and manage their connected life. SmartThings delivers simple, powerful experiences across Samsung's leading portfolio of phones, TVs, and appliances, and we offer the most versatile smart home experience as an open platform with a rich partner ecosystem. As a founding member of Matter, we are a leader in the industry to help make smart homes more secure, reliable and seamless to use.
Like the smartphone revolution, smart home technology is transforming the way we interact with the world around us. With SmartThings products, we're reducing global emissions, improving service industries, and creating a safer, smarter planet. Come be a part of the transformation with us. Do the SmartThings.
SmartThings Culture & Ways of Working
SmartThings' dynamic culture continuously moves forward with agility and determination, providing an opportunity for impactful contributions across all roles. Our commitment to diversity, equity, inclusion and belonging is deeply ingrained in our core values, fostering a culture that values, celebrates, and honors the unique perspectives and experiences of every individual. Embracing inclusive practices, we strive to cultivate a work environment where everyone thrives.
At SmartThings, we're creating immersive, interconnected experiences for both our customers and our team members. Our workplace mirrors this ethos, offering a versatile hybrid environment that nurtures personal connections and fosters collaborative efforts. It's a place where we can harness the power of teamwork and also delve into focused solo work from the comfort of home a couple of days a week. Joining our team means being based in the vibrant Bay Area and working with us at our Mountain View office three days a week, adding your unique touch to our collective journey.
About The Team
SmartThings is seeking a Staff Site Reliability Engineer to be the technical leader on a newly formed SRE team whose mission is to drive platform reliability and operations improvements across critical areas such as availability, latency, efficiency, capacity, change management, monitoring, and incident response. This is a unique and exciting opportunity to fill a critical platform need by providing technical leadership and expertise at a platform level to complement service team operations and practices, while also helping to guide SmartThings operational practices and culture through team-level engagements.
Key Responsibilities
Skills, Knowledge and Expertise
Inclusive Hiring Practices
If your skills and experience are close to what we're looking for, we encourage you to apply. We know that abilities can be developed in many different ways, and some of the most educational paths have twists and turns. Diversity of thought creates the most creative teams, and we're passionate about adding new perspectives to the conversation at SmartThings. Even if you aren't certain you meet every requirement, we encourage you to apply.
What You Bring Day One (Required Qualifications)
SmartThings Benefits
Compensation for this role for a candidate based in Minneapolis is expected to be between $169,015 and $253,523