Staff Backend Engineer - Adaptive Telemetry
Job Description
Grafana Labs is a remote-first, open-source powerhouse, providing an observability platform with over 20M users and helping more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack. We are scaling fast, staying true to our open-source legacy, global collaborative culture, and passion for meaningful work. We encourage applying even if you don't meet every requirement.
The Opportunity:
What is Grafana Cloud?
Grafana Cloud is our composable observability platform, integrating metrics, logs, traces, and profiles with Grafana. It enables customers to use open-source observability software like Prometheus, Mimir, Loki, Tempo, and Pyroscope without the overhead of self-management.
The Databases department manages telemetry databases for Grafana Cloud, including Mimir for metrics, Loki for logs, Tempo for traces, and Pyroscope for profiles.
Adaptive Telemetry Group
The Adaptive Telemetry group, within the Databases department, ensures all telemetry data stored in our databases is valuable. The group develops Adaptive Metrics, Adaptive Logs, Adaptive Traces, and Adaptive Profiles to help users control and optimize their telemetry data, ensuring only the most valuable data is retained based on usage patterns.
As a remote-first and global company, we embrace diversity and new perspectives.
What will you be doing:
- Drive technical strategy and roadmap. Proactively define the architectural vision, prioritize work that unlocks major product or platform improvements, and influence product and engineering decisions.
- Lead end-to-end delivery of large, cross-functional projects. Own planning, design, execution, rollout and long-term operation of large initiatives.
- Own architecture, reliability, performance and cost for critical systems. Make pragmatic architecture choices that balance scalability, availability, latency and cost while ensuring systems remain maintainable and evolvable.
- Define SLOs/SLIs and lead incident response. Establish measurable reliability targets, run high-severity incident response, lead blameless post-mortems, and drive systemic fixes and automation to prevent recurrence.
- Improve observability, automation and operational readiness. Champion telemetry, alerting, runbooks, capacity planning and automation efforts that reduce toil, speed debugging and lower MTTR.
- Align stakeholders and remove blockers. Coordinate across Product, Design and other teams to align priorities, negotiate tradeoffs, and unblock delivery for large initiatives.
- Mentor and grow engineering talent. Coach senior and mid-level engineers, lead design reviews, raise engineering standards, and help teammates make sound technical tradeoffs.
- Represent engineering internally and externally. Communicate technical strategy clearly to non-engineering stakeholders and represent the team in cross-team planning.
We invest heavily in developer productivity, offering modern AI coding assistants with a company-funded usage budget. We encourage pragmatic AI-assisted development for faster prototyping, test generation, refactors, documentation, and incident follow-ups, always paired with strong code review and quality standards. You’ll also have access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro).
What makes you a great fit:
You are a motivated, customer-focused self-starter with a bias towards action and a passion for creating intuitive products that fit customers’ needs.
- Proven delivery of large distributed systems. Experience shipping and operating complex systems that span multiple teams, with clear evidence of technical leadership and impact.
- Strong systems-design instincts. Deep understanding of tradeoffs around latency, consistency, availability, scaling and cost.
- Hands-on cloud and platform experience. Solid experience with cloud-native architectures (microservices, containers/Kubernetes, IaC) and the operational practices that keep them healthy.
- Reliability and performance ownership. Comfortable defining SLOs/SLIs, doing capacity planning, tuning performance, and driving reliability work end-to-end.
- Excellent coding and design skills. You write clear, maintainable, well-tested code and can lead technical designs. We use Go, but experience with Python, C, C++, Rust, or similar languages translates well.
- Comfort with AI-assisted development. We embrace AI and agentic development, so we expect you to be curious about and comfortable with AI-powered developer tools, ideally with practical experience folding them into a team’s workflow.
- Experience with messaging and telemetry. Familiarity with streaming/messaging systems (e.g., Kafka) and observability tooling (Prometheus/Grafana or equivalents).
- Influence without authority. Ability to align cross-functional stakeholders, set priorities and drive outcomes in a remote-first environment.
- Strong communicator. Clear written and verbal communication that works across engineers and non-technical stakeholders.
Why You’ll Thrive at Grafana Labs:
- 100% Remote, Global Culture – As a remote-only company, we bring together talent from around the world, united by a culture of collaboration and shared purpose.
- Scaling Organization – Tackle meaningful work in a high-growth, ever-evolving environment.
- Transparent Communication – Expect open decision-making and regular company-wide updates.
- Innovation-Driven – Autonomy and support to ship great work and try new things.
- Open Source Roots – Built on community-driven values that shape how we work.
- Empowered Teams – High trust, low ego culture that values outcomes over optics.
- Career Growth Pathways – Defined opportunities to grow and develop your career.
- Approachable Leadership – Transparent execs who are involved, visible, and human.
- Passionate People – Join a team of smart, supportive folks who care deeply about what they do.
- In-Person Onboarding – We want you to thrive from day 1, learning alongside your fellow new ‘Grafanistas’ all about what we do and how we do it.
- Balance is Key – We operate a global annual leave policy of 30 days per annum. Three of those days are reserved for Grafana Shutdown Days to allow the team to really disconnect.