Site Reliability Engineer
United States
Job Description
About Runpod: Runpod is the foundational platform for developers to build, deploy, and run custom AI systems that scale. Empowering over 500,000 developers worldwide and operating with an annual recurring revenue run rate exceeding $120M, we build infrastructure purpose-built for modern AI workloads. As a remote-first, globally distributed enterprise, we enable seamless deployment flexibility across cloud, on-prem, and hybrid environments, powering the next generation of artificial intelligence ecosystems.
Position Overview
We are seeking a highly technical Site Reliability Engineer (SRE) to join our core Reliability team. In this high-impact role, you will own the availability, performance, and operational excellence of Runpod’s global distributed platform. Blending software engineering with deep production operations, you will partner directly with Infrastructure and Product Engineering teams to operationalize SLIs/SLOs, strengthen observability, and prevent systemic incidents before they happen. This role is central to maintaining the trust of developers running mission-critical AI/GPU workloads.
Key Responsibilities
- SLI/SLO Architecture: Define, implement, and enforce strict Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical microservices across the platform.
- Incident & Recovery Management: Lead high-stakes incident response workflows, coordinate cross-team mitigation efforts, and conduct thorough blameless postmortems to drive preventative improvements.
- Advanced Observability: Design and refine telemetry, alerting, and dashboards using tools like Prometheus and Grafana, improving signal-to-noise ratios to reduce alert fatigue.
- Toil Reduction & Automation: Build custom internal tooling, deployment guardrails, and automated scripts leveraging Python, Go, or Bash to systematically eliminate manual operational overhead.
- GPU Health Tracking: Improve infrastructural visibility into distributed GPU cluster performance and AI-workload hardware health.
- Production Readiness: Perform exhaustive production readiness reviews (PRR) for new services, offering architectural guidance on fault tolerance, release safety, and scalability.
Required Skills & Qualifications
- 5+ years of professional experience in Site Reliability Engineering (SRE), Production Engineering, or high-availability DevOps environments.
- Deep expertise in Linux systems and networking protocols.
- Strong, production-grade experience managing and scaling containerized deployment ecosystems.
- Demonstrated capability defining and governing SLI/SLO metrics alongside proven leadership in incident response and root-cause analysis (RCA).
- Solid programming and scripting capabilities utilizing Python, Go, or Bash for automation.
- Location: Fully remote, open to qualified candidates who permanently reside in the United States.
Preferred Qualifications (Nice to Have)
- Direct operational experience managing high-performance GPU infrastructure or AI/ML deployment platforms.
- Familiarity with Infrastructure as Code (IaC) and GPU observability tooling in fast-moving, high-growth startup environments.
What We Offer
- Target Base Compensation: $150,000 – $200,000 USD per year (based on candidate experience, technical alignment, and location).
- Meaningful equity via stock options, so that as you drive our infrastructure growth, you share directly in the financial upside.
- Comprehensive healthcare coverage, including medical, dental, and vision plans.
- Flexible Paid Time Off (PTO) allowances ensuring you take the time needed to recharge and avoid operational burnout.
- Inclusive, remote-first culture utilizing Slack for seamless global collaboration on the cutting edge of AI infrastructure.