Reddit · Development · 2h ago

Staff Site Reliability Engineer - Site Experience

United Kingdom
Full-time
Not Disclosed
Senior

Job Description

Key Skills Required

SRE, Go, Kubernetes

About Reddit: Reddit is an internet-scale platform, a global community of communities built on shared interests, passion, and authentic open dialogue. Home to more than 100,000 active subreddits and serving approximately 126 million daily active unique visitors, Reddit is one of the largest and most influential sources of real-time information, content curation, and discussion on the internet.

Position Overview

We are seeking a highly technical, high-agency Staff Site Reliability Engineer to lead reliability engineering initiatives for our critical user-facing systems at internet scale. Sitting at the intersection of infrastructure, core product engineering, and user experience, our Site Experience SRE team ensures that every interaction across the web app, mobile platforms, APIs, feed generation, and real-time messaging remains fast, resilient, and stable. In this technical leadership role, you will design architectural safeguards that hold up under global load, reduce operational risk, eliminate toil through automation, and influence engineering culture across the organization.

Key Responsibilities

  • User Experience Reliability Leadership: Drive operational excellence, scalability, and latency improvements across Reddit’s most business-critical endpoints, search systems, and media delivery layers.
  • Architecture for Hyperscale: Partner with infrastructure and product groups to design highly available distributed systems, guiding architectural decisions around failover paths, cluster redundancy, graceful degradation, and global traffic engineering.
  • Systemic Risk Mitigation: Audit dependencies, microservices, and deployments to uncover systemic bottlenecks, and build proactive mitigation playbooks that steadily reduce the number of severe incidents.
  • Toil Elimination & Automation: Build programmatic remediation tools, deployment safety rails, and reliability guardrails to replace manual or repetitive on-call operational work.
  • Blameless Incident Management: Lead complex, multi-team incident response during global outages, running blameless postmortems, root-cause analysis, and long-term structural fixes.
  • Engineering Standards & Mentorship: Champion company-wide best practices for SLIs/SLOs, capacity management, and release engineering while providing technical leadership and mentorship to SRE and software engineering peers.

Required Skills & Qualifications

  • 8+ years of professional experience as a Site Reliability Engineer, Infrastructure Engineer, or Systems Architect managing large-scale, high-traffic distributed systems.
  • Demonstrated track record of supporting high-throughput, user-facing production environments with stringent availability requirements.
  • Deep systems-level understanding of Linux operating systems, cloud-native container architectures, network routing, and distributed components.
  • Strong programming and scripting skills, preferably in Go or Python.
  • Advanced working knowledge of telemetry and observability, including distributed tracing, logging, structured alerts, and metric aggregation.
  • Location: Fully remote, open only to qualified candidates permanently based in the United Kingdom.

Preferred Qualifications (Nice to Have)

  • Production experience orchestrating containers with Kubernetes and managing public cloud environments.
  • Familiarity with distributed infrastructure tools such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, or Redis.
  • Prior experience optimizing Content Delivery Networks (CDNs), edge infrastructure, or global traffic management rules.
  • Active contributions to open-source software communities or a history of leading large-scale organizational reliability transformations.

What We Offer

  • The opportunity to shape the performance and availability of one of the internet’s most influential platforms.
  • Highly competitive UK compensation package supplemented by a Group Personal Pension Scheme with matching employer contributions.
  • Comprehensive private medical and dental cover, paired with income protection programs.
  • A remote-first work environment with global lifestyle benefit credits, professional development budgets, and caregiving support.
  • Flexible vacation, paid volunteer time off, and generous paid parental leave.
  • Access to mental health resources, coaching support, and local perks such as the Bike to Work scheme.
