Staff Platform Reliability Engineer
Job Description
Who we are
At Domino, we build software that helps the largest, AI-driven organizations build and operate advanced data science and AI solutions at scale. Our platform integrates a streamlined model development environment, MLOps capabilities, and novel features for collaboration, reuse, and reproducibility — all of which make data science teams more productive, reduce time to value, and ensure compliance. Our customers — like Johnson & Johnson, GSK, Bristol Myers, UBS, FINRA and the US Navy — are using our software to solve some of the most important challenges in the world, such as developing new medicines, securing our financial markets, or protecting our country. Backed by Sequoia Capital, Coatue Management, NVIDIA, Snowflake and other leading investors, we have been in business for a decade but are still a small team operating with the spirit of a startup. Especially in the world of AI today, we believe that the future is still being invented — and we want to be the ones building it. For more information, visit www.domino.ai
What we are building
The Automation Team at Domino acts as a force multiplier for engineering, building the tools and systems that enable teams to ship code confidently and consistently. A core part of this mission is Tempest, an in-house platform that orchestrates realistic, long-duration workloads against live Kubernetes clusters and validates the results against real observability data. Today, when scale testing surfaces a bottleneck, a resource misconfiguration, or a regression in system behavior, the team can identify and report the issue — but we need someone who can take the next step: profiling services, tracing root causes through Prometheus and New Relic data, and partnering with platform engineers to drive durable fixes. Focused on iteration and continuous improvement, the team looks for targeted enhancements that create outsized impact, and this role will close the gap between detection and resolution at the infrastructure level.
What your impact will be
- Serve as the technical owner of Tempest, Domino's scale and reliability platform, ensuring it remains reliable, extensible, and aligned with evolving infrastructure needs
- Diagnose and drive resolution of performance bottlenecks and resource misconfigurations surfaced by scale testing — working directly with platform and infrastructure teams to ship fixes, not just file tickets
- Deliver accurate, data-driven sizing recommendations for customer-facing documentation based on rigorous empirical testing across deployment sizes
- Strengthen observability across scale testing by improving Prometheus and New Relic instrumentation, making it faster to pinpoint root causes during and after multi-day load runs
- Establish and operationalize scale testing on cloud platforms, ensuring appropriate sizing and configuration guidance for this increasingly divergent product line
- Partner with platform teams to enable effective scale and reliability testing across additional cloud providers, helping position Domino for future multi-cloud success
- Increase the efficiency and leverage of a small team by building infrastructure automation that scales operationally as the product and customer base grow
What we look for in this role
- Background in SRE, platform engineering, or infrastructure with hands-on experience operating and troubleshooting distributed systems in production Kubernetes environments
- Strong proficiency in Python and comfort working in a large, modular codebase that spans orchestration, infrastructure automation, and systems integration
- Experience with observability stacks (Prometheus, Grafana, New Relic, or similar) — writing queries, building dashboards, and using metrics to diagnose performance and reliability issues at the systems level
- Demonstrated ability to go beyond detection to resolution: profiling services, identifying resource bottlenecks, and working with engineering teams to ship durable fixes
- Familiarity with performance and load testing methodologies (e.g., Locust, k6, or similar) as part of a broader infrastructure or reliability practice
- Clear ownership mindset — self-directed, accountable, and able to communicate priorities and status effectively in a remote, async environment
What we value
- We value a growth mindset: high-performing, creative individuals who dig into problems and see the opportunities for success
- We believe in individuals who seek truth and speak the truth and can be their whole selves at work
- We value all of you who believe improving is always possible. At Domino, everything is a work in progress, and we can do better at everything
- We emphasize an environment of teaching and learning to equip employees with the tools needed to be successful in their function and at the company
- We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply