Disaster Recovery Patterns for Kubernetes-Based Applications
A question we are often asked is: what is the best approach to delivering a high SLA for application availability? In practice the question usually arrives as “how do we implement DR in Kubernetes?”, but what is generally meant is application availability.
To answer it, organizations must first decide whether they want to hide failure from applications through infrastructure abstraction (as was popularized with VMware Fault Tolerance/High Availability), or whether they want applications to be explicitly resilient to failure across locations (as is recommended by the cloud providers with their multi-region architectures).
Kubernetes, as a fundamental technology, supports both approaches, but it does not neutralize the trade-offs between them. Each model places complexity in a different layer of the stack, carries different operational risks, and scales in very different ways as environments grow. This document outlines two common patterns used to provide disaster recovery for Kubernetes workloads, and examines their characteristics, dependencies, and limitations.
The first pattern focuses on transparent multi-site failover using a single stretched Kubernetes cluster. The second focuses on application-level resilience using multiple independent clusters with external traffic management.
Option 1: Transparent Multi-Site Failover Using a Stretched Kubernetes Cluster
Architectural Overview
In this model, a Kubernetes cluster is treated as a single logical system spanning multiple physical locations. A primary and a secondary site each host an equal number of control-plane nodes and as many worker nodes as are needed to run x% of the application (where x depends on whether load is distributed active/active or active/passive). A third location acts as a witness to maintain quorum for the control plane, hosting a single “tie-break” control-plane node. From the perspective of applications and operators, there is one cluster, one API endpoint, and one set of workloads.
The objective is to ensure that a site failure does not require application awareness or intervention. Workloads should continue running, or restart automatically, without changes to application configuration or client access patterns.
Typical Technical Characteristics
A stretched cluster architecture usually relies on several tightly coupled infrastructure components:
The Kubernetes control plane is distributed across sites, with etcd members placed in each location to maintain quorum. This requires predictable latency and highly reliable connectivity between sites, as etcd is sensitive to both delay and packet loss.
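The quorum arithmetic behind this layout can be sketched in a few lines. The member counts below follow the 2 + 2 + 1 topology described above (two control-plane/etcd members per main site plus a single witness member); the numbers are illustrative, not prescriptive.

```python
# Quorum arithmetic for an etcd cluster stretched across sites.

def quorum(members: int) -> int:
    """Minimum members that must agree for etcd to accept writes (majority)."""
    return members // 2 + 1

def survives_site_loss(site_members: list[int]) -> bool:
    """True if losing any single site still leaves a write quorum."""
    total = sum(site_members)
    return all(total - lost >= quorum(total) for lost in site_members)

layout = [2, 2, 1]                 # primary, secondary, witness
print(quorum(sum(layout)))         # 5 members -> quorum of 3
print(survives_site_loss(layout))  # True: losing any one site leaves >= 3 members
print(survives_site_loss([3, 2]))  # False: losing the 3-member site leaves only 2
```

This is why the witness site matters: without it, any two-site split leaves one side unable to form a majority.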
Layer 2 networking is extended between locations so that pod IPs, service IPs, and node subnets remain consistent regardless of where workloads are running. This often involves stretched VLANs or overlay networks that span data centers.
Persistent storage (if used) is replicated between sites, commonly using synchronous or near-synchronous replication. From Kubernetes’ point of view, a persistent volume must remain accessible (under a consistent identity, e.g. IP address or FQDN) regardless of which site a pod is scheduled in.
Ingress and egress traffic is typically handled by externalized load balancers or network appliances capable of redirecting traffic without changing service endpoints. These components must also be highly available and aware of site health. They normally front the presentation tier of the application, which is exposed via a “hostPort” to ensure that traffic is only directed to worker nodes that actually host pods.
Operational Implications
The primary advantage of this approach is transparency. Applications generally do not need to be modified, and failover can be fast if the underlying infrastructure behaves as expected. This makes the model attractive for legacy applications or commercial software that cannot easily be changed.
The trade-offs are mostly operational. The cluster becomes extremely sensitive to network instability, especially as the distance between sites increases. A transient network issue can impact the control plane, storage replication, or both, even if application workloads themselves are healthy. Often, timeouts are increased to accommodate network disruptions, but these same timeouts then directly delay failover when a real site failure occurs.
The blast radius of failure is also large. Because the cluster is a single failure domain, misconfiguration, failed upgrades, or control plane instability can affect all sites simultaneously. Maintenance operations such as upgrades, certificate rotation, or network changes must be planned and executed with extreme care.
Cost and complexity tend to rise over time. Stretched networking, replicated storage, and specialized load-balancing infrastructure are not only expensive to deploy but also expensive to operate and troubleshoot. This model is typically viable only within metro distances and well-controlled network environments.
Option 2: Application-Level Resilience Using Multiple Independent Clusters
Architectural Overview
In this model, each site runs its own independent Kubernetes cluster, with no shared control plane, networking, or storage. Clusters are treated as isolated failure domains rather than extensions of a single system.
Applications are deployed concurrently across two or more clusters. Rather than relying on Kubernetes to fail workloads over between sites, availability is managed externally through traffic routing, and data consistency is managed explicitly within the application and its data layer (e.g. with database replication).
Typical Technical Characteristics
Each Kubernetes cluster operates independently, with its own control plane, networking, and storage stack. There is no requirement for low-latency connectivity or stretched Layer 2 networks between clusters, beyond what the application itself needs for data replication or coordination.
Applications are deployed into isolated subnets or network segments per cluster. This reduces coupling and ensures that failures remain local to a single environment.
A geo-distributed load balancer or DNS-based traffic manager sits in front of the application. It continuously performs health checks against application endpoints and routes traffic only to healthy backends. If an entire cluster becomes unavailable, traffic is simply directed elsewhere.
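The routing decision reduces to a simple rule: resolve the service name only to backends whose health probe passes. A minimal sketch of that logic follows; probes are stubbed as callables and the addresses are hypothetical, whereas a real traffic manager would issue HTTP(S) checks against each cluster’s application endpoint.

```python
# Sketch of DNS-based traffic-manager routing: answer queries only with
# backends that pass their health probe.

from typing import Callable

def healthy_backends(probes: dict[str, Callable[[], bool]]) -> list[str]:
    """Return addresses of clusters whose health probe currently succeeds."""
    return [addr for addr, probe in probes.items() if probe()]

# Two clusters, one of which has failed (hypothetical addresses).
probes = {
    "203.0.113.10": lambda: True,   # cluster A: healthy
    "203.0.113.20": lambda: False,  # cluster B: failed site
}
print(healthy_backends(probes))     # ['203.0.113.10'] -> DNS answers with cluster A only
```

Because the decision is made outside both clusters, no cluster-to-cluster coordination is needed for failover.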
Stateful components handle consistency at the application or data layer. This may involve database replication, leader election, quorum-based writes, eventual consistency models, or application-specific reconciliation mechanisms, depending on the workload’s requirements.
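For the quorum-based-writes case, the consistency condition commonly used by replicated data stores is worth stating concretely: with N replicas, writes acknowledged by W nodes and reads consulting R nodes remain strongly consistent when R + W > N, because any read quorum must overlap any write quorum. A one-line sketch:

```python
# The read/write quorum overlap rule used by many replicated data stores.

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    return r + w > n

print(quorums_overlap(n=3, w=2, r=2))  # True: classic 3-replica majority quorums
print(quorums_overlap(n=3, w=1, r=1))  # False: a read may miss the latest write
```

Workloads that can tolerate stale reads may deliberately relax this condition in exchange for availability.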
Operational Implications
This approach shifts complexity away from infrastructure and into application design. Applications must tolerate concurrent execution across sites and handle partial failure gracefully. For some legacy workloads, this can require significant redesign or may not be feasible at all.
However, the operational characteristics are more predictable. Each cluster can be operated, upgraded, and even deliberately taken offline without directly impacting others. Failure testing is simpler because disaster scenarios can be exercised by shutting down entire clusters rather than simulating partial infrastructure faults. This configuration also facilitates “blue/green” deployment modes, so its value extends well beyond failure prevention.
The blast radius of failures is smaller by design. A control plane issue, storage problem, or misconfiguration affects only one cluster. Scaling to additional regions or sites becomes a repeatable pattern rather than an architectural redesign. Taken to its extreme, a pool of single-node clusters could in fact deliver higher availability than a single multi-node cluster.
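The back-of-envelope arithmetic behind that last claim is simple, assuming cluster failures are independent and each cluster is available 99% of the time (illustrative numbers, not measurements):

```python
# Availability of a pool of independently failing clusters: the pool is
# down only when every member is down simultaneously.

def pool_availability(per_cluster: float, clusters: int) -> float:
    """Probability that at least one cluster in the pool is up."""
    return 1 - (1 - per_cluster) ** clusters

print(f"{pool_availability(0.99, 1):.4%}")  # 99.0000% - a single cluster
print(f"{pool_availability(0.99, 3):.4%}")  # 99.9999% - three independent clusters
```

The independence assumption is doing real work here: shared networks, shared storage, or a shared control plane (as in Option 1) reintroduce correlated failure and erase the gain.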
This model aligns well with cloud-native principles and long-term scalability, particularly for organizations operating across wide geographic regions or hybrid environments.
Comparative Considerations
While both approaches can provide disaster recovery (and, indeed, fault tolerance), they optimize for very different outcomes.
Stretched clusters prioritize application transparency at the cost of infrastructure complexity and operational risk. They work best when sites are close together, networks are highly reliable, and application change is not an option.
Application-level resilience prioritizes isolation and scalability, but requires explicit ownership of failure handling within the application stack. It is generally better suited to modern applications, geographically distributed deployments, and organizations willing to invest in resilience as a design principle rather than an infrastructure feature.
Importantly, Kubernetes itself does not remove the need to choose. It enables both patterns, but the long-term sustainability of each depends on factors outside Kubernetes, including network topology, storage architecture, application design, and organizational maturity.
So, what’s the right answer?
Disaster recovery in Kubernetes is not a single problem with a single solution, so there is no one “right” answer. Transparent failover and application-level resilience represent fundamentally different philosophies about how systems should behave under failure.
The first attempts to hide failure through infrastructure abstraction. The second assumes failure is inevitable and designs applications to survive it. Kubernetes can support both, but it does not make their trade-offs disappear. Kubernetes’ natural affinity is with the second option.
Selecting the right approach requires an honest assessment of application constraints, operational tolerance for complexity, geographic distribution, and the organization’s ability to design and operate for failure rather than against it.
