About Supabase: Supabase is an open-source Postgres development platform and Firebase alternative on a mission to transform how developers manage backend infrastructure. Its all-in-one suite includes Postgres databases, Authentication, Edge Functions, Realtime, Storage, and Vector Search, and it serves a fast-growing user base managing millions of database instances. Backed by $500M in venture funding and a global community of more than 500,000 members, Supabase is a born-remote, open-source-first organization that builds in public, values technical excellence, and uses its own product stack in everyday internal operations. The company offers systems engineering leaders a fully remote environment in which to work with modern cloud systems, multi-tenant data pipelines, and automation-driven SRE frameworks.

Position Overview

We are seeking a highly analytical, detail-oriented, systems-minded Site Reliability Engineer to join our centralized Service Operations team in a full-time remote role open to qualified candidates anywhere in the world. As we scale to support millions of concurrent Postgres nodes, we are concentrating our platform-wide availability work into a dedicated SRE practice that ties together observability, release engineering, and incident management. This is not a role built around routine manual operations, reactive alert handling, or infrastructure cleanup. You will drive reliability strategy and automation engineering, embedding with feature teams to build the tools, runbooks, and feedback loops that let them own availability themselves. The position calls for an infrastructure or developer-tooling veteran with 7+ years of experience who designs scalable cloud patterns using DevOps practices, builds internal platform extensions or reliability dashboards in Python or another language, and runs high-concurrency cloud deployments confidently in an asynchronous, influence-driven distributed organization.

Key Responsibilities

SRE Practice and Policy Architecture: Work directly with distributed engineering teams to define, document, and embed meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) tied to end-user experience, and enforce error budgets through code.
Operational Readiness Reviews (ORR): Own and evolve the Operational Readiness Review framework, conducting architecture reviews, dependency mapping, capacity audits, and failure mode analyses for major platform updates.
Incident-to-Improvement Orchestration: Maximize the impact of our postmortem process by facilitating root-cause investigations, identifying failure patterns that recur across the platform, and driving systemic code improvements to eliminate recurring operational risks.
Operational Toil Elimination: Identify, track, and quantify recurring manual work across the engineering organization, and replace it with automated, developer-facing reliability tools written in Python or cloud-native scripting.
Sustainable On-Call Design: Help development teams build sustainable on-call practices by optimizing alert routing, reducing alert noise, and ensuring complete runbook coverage.
Maturity and Resilience Tracking: Track infrastructure maturity across the organization, surface foundational design gaps, and advise leadership on engineering remediation priorities.
Asynchronous Cloud Deployment: Write and optimize infrastructure-as-code definitions to manage complex multi-tenant systems on Amazon Web Services (AWS) or other cloud platforms.

Required Skills & Qualifications

At least 7 years of professional experience in Site Reliability Engineering (SRE), production software engineering, infrastructure architecture, or cloud-scale systems optimization.
Expert-level ability to automate infrastructure environments, manage multi-tenant networks, and deploy cloud systems using DevOps practices.
Practical experience developing and testing runbooks, automating diagnostics, or parsing system logs using Python or a comparable language.
A demonstrated software engineering mindset, with a strong track record of writing code, building custom reliability tools (such as SLO dashboards or ORR frameworks), and developing APIs rather than simply adjusting vendor configuration templates.
Hands-on experience operationalizing multi-tenant SLOs/SLIs at scale, including building error budget systems that directly informed product engineering resourcing decisions.
Deep familiarity with distributed cloud infrastructure management (AWS strongly preferred) and programmatic Infrastructure-as-Code frameworks (Pulumi preferred, or advanced Terraform/AWS CDK).
Outstanding written technical communication in business-fluent English, with clear, scannable writing and the ability to influence engineering teams without formal authority across a fully distributed organization.
Location Context: Open to qualified engineers based anywhere in the world; the role is 100% remote.

Preferred Strategic Indicators (Nice to Have)

Prior experience operating large-scale distributed cloud database platforms, managing cluster configurations, or running Postgres at enterprise scale.
Hands-on experience running containers in Kubernetes-based production environments.
Familiarity with cloud-native open-source observability tools, including OpenTelemetry, VictoriaMetrics, or Grafana.

What We Offer

A highly competitive, full-time annual salary on a global scale, calibrated to your SRE expertise, paired with immediate equity ownership through an Employee Stock Ownership Plan (ESOP).
The opportunity to take strategic ownership of the reliability systems protecting database instances for hundreds of thousands of developers worldwide.
A 100% remote setup from anywhere in the world, with full scheduling trust and no commute, complemented by a global co-working allowance or WeWork membership.
Immediate access to top-tier health benefits, including 100% company-paid premium medical coverage for employees and 80% coverage for dependents.
A personal Tech Allowance to set up your ideal laptop, monitor, and accessories, an annual professional development allowance, and highly flexible asynchronous work hours.
Company-funded Annual Team Offsites that bring the entire global team together in a new international city for a week of intense collaboration and connection.
