Site Reliability Engineer - San Francisco - Baseten
Description
About Baseten
Baseten powers inference for the world's most dynamic AI companies, like OpenEvidence, Clay, Mirage, Gamma, Sourcegraph, Writer, Abridge, Bland, and Zed. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting‑edge models into production. With our recent $150M Series D funding, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction, we're scaling our team to meet accelerating customer demand.
The Role
As a Site Reliability Engineer, you'll envision and build robust systems and processes that ensure our infrastructure is scalable, reliable, and efficient. This can range from automating deployments and monitoring systems to optimizing performance and managing incidents.
We all work closely with our users, learning from their past struggles in operationalizing ML, onboarding them onto our platform, and turning our learnings into ideas for improving Baseten.
Example Initiatives
- Multi‑cloud capacity management
- Inference on B200 GPUs
- Multi‑node inference
- Fractional H100 GPUs for efficient model serving
Responsibilities
- Build and maintain scalable infrastructure to support the deployment and operation of machine learning models.
- Establish standards and best practices for reliability and performance across the infrastructure.
- Automate processes when relevant, particularly for managing CI/CD pipelines.
- Own products and projects end‑to‑end, functioning as both an engineer and a project manager, with a focus on user empathy, project specification, and end‑to‑end execution.
- Collaborate with cross‑functional teams to understand project requirements and translate them into technical solutions.
- Mentor junior team members and contribute to knowledge sharing within the organization.
- Navigate ambiguity and exercise good judgment on tradeoffs and tools needed to solve problems, avoiding unnecessary complexity.
- Demonstrate pride, ownership, and accountability for your work, expecting the same from your teammates.
Requirements
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or related field.
- 5+ years of professional work experience in a fast‑paced, high‑growth environment.
- Extensive experience with Kubernetes.
- Experience in building and maintaining scalable infrastructure.
- Experience with infrastructure‑as‑code tools (e.g., Terraform, CloudFormation, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, CircleCI, Jenkins).
- Relevant OSS observability experience (Prometheus, ELK stack, Grafana stack, OpenTelemetry) is a plus.
- Ability to own projects end‑to‑end, from project specification to execution.
- No prior machine learning experience required, but should be open to learning about it.
Benefits
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employee and dependents.
- Generous PTO policy including company‑wide Winter Break (our offices are closed from Christmas Eve to New Year's Day).
- Paid parental leave.
- Company‑facilitated 401(k).
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Apply now to embark on a rewarding journey in shaping the future of AI. If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward‑thinking team, we would love to hear from you.
At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.