LivePerson (NASDAQ: LPSN) is the global leader in enterprise conversations. Hundreds of the world’s leading brands — including HSBC, Chipotle, and Virgin Media — use our award-winning Conversational Cloud platform to connect with millions of consumers. We power nearly a billion conversational interactions every month, providing a uniquely rich data set and safety tools to unlock the power of Conversational AI for better customer experiences.

At LivePerson, we foster an inclusive workplace culture that encourages meaningful connection, collaboration, and innovation. Everyone is invited to ask questions, actively seek new ways to achieve success, and reach their full potential. We are continually looking for ways to improve our products and make things better. This means spotting opportunities, resolving ambiguities, and seeking effective solutions to the problems our customers care about.

Overview:

We are looking for a Site Reliability Engineer (Level II) to support and enhance reliability across the Echo ecosystem. This role is responsible for maintaining existing production systems while actively supporting new platform initiatives and feature rollouts.

The ideal candidate has strong hands-on experience with GKE, GitOps-driven deployments, cloud-native networking, and proactive reliability engineering. This role requires close collaboration with application development teams to ensure safe, reliable, and observable production releases.

You will:

Production Reliability & Ownership

Maintain and support existing products within the Echo ecosystem.
Ensure high availability, performance, and reliability of platform services.
Define, monitor, and improve SLOs, SLIs, and error budgets.
Proactively identify system risks and implement reliability improvements.
Participate in incident response, troubleshooting, and post-incident reviews.

Cloud & Kubernetes (GKE)

Deploy, manage, and optimize workloads on Google Kubernetes Engine (GKE).
Manage cluster capacity, scaling strategies, and resource allocation.
Optimize CPU, memory, and storage utilization to improve performance and reduce cost.
Keep clusters secure and up to date, and ensure best practices are followed.
Troubleshoot networking, service mesh (if applicable), ingress, and service-to-service communication issues.

GitOps & Release Engineering

Implement and manage GitOps-based deployment workflows.
Ensure infrastructure and application changes are version-controlled and automated.
Work closely with developers to safely release code to production using CI/CD best practices.
Support progressive delivery techniques (e.g., canary, blue/green deployments).
Reduce deployment risk through automation and validation mechanisms.

Observability & Monitoring

Implement and enhance observability practices across services.
Build and maintain dashboards, alerts, and health metrics.
Implement and manage OpenTelemetry (OTEL) for tracing and metrics collection.
Ensure proactive alerting aligned with SLOs.
Drive improvements in monitoring coverage and signal quality.

Networking & System Understanding

Apply a strong understanding of Kubernetes networking, services, ingress, load balancing, DNS, and service communication.
Diagnose latency, connectivity, and traffic routing issues.
Understand how distributed services interact across the ecosystem.

You have:

4–7 years of experience in SRE, DevOps, or Platform Engineering roles.
Strong hands-on experience managing production workloads on GKE.
Solid experience with GitOps practices (ArgoCD, Flux, or similar).
Strong understanding of Kubernetes networking and cloud networking fundamentals.
Experience optimizing resource allocation and scaling in Kubernetes.
Experience implementing observability solutions using OpenTelemetry (OTEL).
Experience defining and operating with SLOs and SLIs.
Hands-on experience with CI/CD pipelines and automated deployments.
Strong troubleshooting and incident management experience.

Benefits:

Health: medical, dental, and vision
Time away: vacation and holidays
Development: generous tuition reimbursement and access to internal professional development resources
Equal opportunity employer
#LI-Remote
