About Backblaze: Backblaze is the pioneering object storage leader in the open cloud movement, fueling customer success with cloud storage solutions built purposefully to unlock IT budgets, unburden system administrators, and unleash technology innovators. Founded in 2007 and scaling with under $3 million in outside funding until our traditional Nasdaq IPO in 2021, Backblaze generates over $100M in annual revenue as the leading specialized storage cloud. Today, we manage over three billion gigabytes of production data storage for more than 500,000 customers in 175+ countries, supporting businesses, developers, and IT professionals worldwide.

Position Overview

We are seeking a highly analytical, systems-focused Site Reliability Engineer II (SRE II) to help ensure the stability, scalability, and reliability of our distributed services and open storage infrastructure. In this high-trust operational role, you will focus on building robust system automation, maintaining deep observability through telemetry, and strengthening incident response so that customer-facing environments keep performing reliably. Working within an agile infrastructure team, you will partner with Engineering, Product, and Operations to embed proactive reliability practices, eliminate operational toil, and optimize resource performance.

Key Responsibilities

Service Reliability & Operations: Support and safeguard the high availability, durability, and resilience of critical cloud storage services across multi-region production environments.
Observability & SLO Tracking: Monitor cloud service health using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, escalating proactively before thresholds are breached.
Incident Response & On-Call: Participate in distributed team on-call rotations, live incident triage, and blameless post-incident root-cause reviews to continually harden our systems.
Infrastructure Automation: Build and deploy robust automation tools for common operational tasks, systematically reducing manual intervention and engineering toil.
Telemetry Framework Engineering: Contribute directly to core monitoring, logging, and alerting frameworks built on Prometheus, Grafana, Catchpoint, and the ELK stack.
Infrastructure as Code & CI/CD: Configure, maintain, and scale infrastructure deployments using Infrastructure as Code (IaC), configuration management, and CI/CD tools such as Terraform, Ansible, and Jenkins.
Capacity & Disaster Recovery: Assist with proactive capacity planning, hardware lifecycle tracking, and routine disaster recovery exercises.

Required Skills & Qualifications

2–4 years of professional experience as a Site Reliability Engineer, Systems Engineer, or Cloud Operations Administrator working with high-scale software architectures.
A **Bachelor’s degree in Computer Science, Engineering**, or a comparable technical or quantitative field (or equivalent professional experience).
Solid **Linux systems administration** experience, including low-level system troubleshooting and network diagnostics.
Demonstrated proficiency writing automation in at least one scripting or programming language, preferably Python, Go, or Bash.
Familiarity with cloud-native runtime environments, container engines, and microservices architectures using **Kubernetes or Docker**.
Proven understanding of core service reliability concepts, structured root-cause analysis, and ITIL/OSS change and capacity management practices.
Location: Full-time, remote-first role open to qualified infrastructure engineers permanently based in Bangalore, India.

Preferred Qualifications (Nice to Have)

Prior systems engineering experience in a distributed systems, high-volume SaaS, or specialized cloud service provider environment.
Familiarity with multi-account administration or resource management across hyperscaler cloud environments (such as AWS, GCP, or Azure).
Outstanding written communication skills, with a track record of authoring clear operational playbooks, runbooks, and architectural diagrams.

What We Offer

The opportunity to directly optimize, scale, and secure the foundational data pipelines handling billions of gigabytes of storage worldwide.
Highly competitive, transparently structured compensation that reflects your Linux systems depth and infrastructure automation capability.
A stable, full-time, remote-first role with substantial work-from-home flexibility for engineers based in Bangalore.
A workplace that champions diversity, equity, and inclusion at its core, fostering a high-trust engineering culture where individuals can deliver their best work.
