- Own and resolve production reliability issues including OOMs, deadlocks, connection pool exhaustion, and race conditions
- Optimize performance across hot paths including spend tracking, database writes, and health checks
- Improve Redis and in-memory cache reliability across multi-pod deployments
- Make the proxy self-healing with graceful degradation, retry logic, and proper health checks when DB or Redis is unavailable
- Build and maintain Prometheus metrics, alerting, and observability for production deployments
- Collaborate directly with customers and the open source community to turn real-world issues into platform improvements
- 1-4 years of experience running Python services in production at scale
- Experience debugging OOMs, memory leaks, race conditions, and deadlocks in live environments
- Strong familiarity with PostgreSQL, Redis, and Kubernetes in live environments
- Comfortable owning production systems and debugging customer-facing incidents
- Solid understanding of distributed systems, connection pooling, and caching layers
- Excited to work in an early-stage, high-ownership, fast-shipping environment
Site Reliability Engineer - San Francisco - Client Services
Description
Site Reliability Engineer (AI Infrastructure - DevTool Start-Up)
$200,000 - $280,000 + Equity + Benefits + PTO
San Francisco, CA
Are you passionate about keeping production AI infrastructure fast, reliable, and self-healing? Do you thrive in environments where you directly own the systems that millions of LLM requests flow through every day?
This is an opportunity to join a fast-growing, profitable startup at the forefront of AI infrastructure, building the reliability layer that powers how real customers deploy and use language models in production. Backed by top-tier investors and trusted by major enterprises, the team has built a unified LLM gateway used as a critical proxy by engineering teams worldwide. Now, they're looking for a founding SRE to own the reliability, performance, and observability of that proxy in production.
As a founding member of the engineering team, you'll take ownership of the systems that keep the core proxy alive under load: debugging OOMs, resolving database connection exhaustion, fixing race conditions, and making the platform resilient when dependencies go down. You'll work directly with senior leadership, engage with a large open source community, and ensure that when customers put their entire AI stack behind this gateway, it never lets them down.
If you're looking for a role where you can combine deep systems debugging with real customer impact and directly influence the infrastructure that underpins modern AI applications, this is an outstanding opportunity.
The Role:
The salary advertised is the bracket available for this position. The actual salary paid will be dependent on your level of experience, qualifications and skill set and will be decided by our client, the employer. Rise are not responsible or liable for any hiring decisions made by the end client.