PostHog Development 2h ago
SRE - Infra
🌍Global
Full-time
Not Disclosed
Job Description
About the Role: We're looking for people who like deep ownership of production systems, who are not afraid of working with stateful infrastructure, and who love working with AWS, VMs, and automation, and making messy systems reliable. This is not a typical "keep the lights on" SRE role. The work is about turning a fast-growing, stateful system into a predictable, well-automated platform (provisioning, scaling, rebalancing, recovery). You'll work on the kind of problems that only show up at large scale (petabytes of data, thousands of cores, constant ingestion) across a multi-region, multi-account AWS platform running many services on Kubernetes.
What you'll be doing
- Operating EKS clusters across several environments with Karpenter autoscaling, Cilium networking, and ArgoCD-driven GitOps deployments.
- Managing and evolving a multi-account AWS organization: provisioning, networking, access control, and cross-account connectivity.
- Maintaining the Terraform/Terragrunt IaC platform: modules, automated plan-on-PR / apply-on-merge pipelines, and safe patterns for shared infrastructure.
- Improving operational tooling around deploys, schema changes, backups, restores, and incident response.
- Reducing operational load by identifying repeat pain points and eliminating them through code and self-healing automation.
- Optimizing cloud spend as you go.
- Participating in on-call and incident response, with a strong focus on making incidents rarer over time.
Requirements
- Deep hands-on experience with Kubernetes in production (EKS preferred). You've debugged node pressure, networking issues, and deployment failures at scale (thousands of nodes).
- Strong experience operating production infrastructure on AWS. This means not just a single account, but an understanding of organizational boundaries, IAM, and networking across many accounts.
- Experience automating infrastructure using Terraform or Terragrunt at scale, including module design and state management.
- Solid understanding of Linux systems (disk, memory, networking, failure modes).
- Experience supporting stateful systems (databases, queues, storage systems, etc.).
- Ability to debug and reason about performance and reliability issues in production.
- You're comfortable owning systems end-to-end, including on-call responsibilities.
Nice to have
- Experience with GitOps workflows (ArgoCD) and CI/CD pipelines (GitHub Actions).
- Experience building base-level, AI-agent-enabled infra services for teams that move fast.
- Familiarity with multi-region infrastructure and the consistency/availability tradeoffs that come with it.