Learning Notes #52 - Hybrid Origin Failover Pattern
Today I learned about failover patterns from the AWS blog post https://aws.amazon.com/blogs/networking-and-content-delivery/three-advanced-design-patterns-for-high-available-applications-using-amazon-cloudfront/ . In this post I jot down my understanding of the pattern for future reference.
Hybrid origin failover is a strategy that combines two distinct approaches to handle origin failures effectively, balancing speed and resilience.
The Need for Origin Failover
When an application's primary origin server becomes unavailable, the ability to reroute traffic to a secondary origin ensures continuity. The failover process determines how quickly and effectively this switch happens. Broadly, there are two approaches to implement origin failover:
- Stateful Failover with DNS-based Routing
- Stateless Failover with Application Logic
Each has its strengths and limitations, which the hybrid approach aims to mitigate.
Approach 1: Stateful Failover with DNS-based Routing
In stateful failover, the failover decision is remembered: once the primary origin is judged unhealthy, all traffic goes to the standby origin until the primary recovers. This avoids paying for a failed attempt on every single request.
This method relies on a DNS service with health checks to detect when the primary origin is unavailable. Here's how it works:
- Health Checks: The DNS service continuously monitors the health of the primary origin using health checks (e.g., HTTP, HTTPS).
- DNS Failover: When the primary origin is marked unhealthy, the DNS service resolves the origin's domain name to the secondary origin's IP address.
- TTL Impact: Clients keep using the cached DNS answer until its Time-to-Live (TTL) expires. A low TTL speeds up the switch, but even an optimal configuration adds a delay: roughly 70 seconds at minimum in the AWS setup quoted below (health-check detection time plus a 60-second TTL).
- Stateful Behavior: Once failover occurs, all traffic is routed to the secondary origin until the primary origin is marked healthy again.
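To make the TTL arithmetic concrete for myself, here is a minimal sketch. It assumes the health checker needs `failure_threshold` consecutive failed checks spaced `check_interval_s` apart, and that a client may hold a cached DNS answer for one full TTL. The function name and parameters are my own, not part of any DNS service API, and the 10-second interval in the example is what I understand a "fast" health check to use.

```python
def dns_failover_delay_s(check_interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Rough estimate of the time until traffic reaches the secondary origin."""
    if check_interval_s <= 0 or failure_threshold < 1 or ttl_s < 0:
        raise ValueError("interval must be > 0, threshold >= 1, ttl >= 0")
    # Time for the health checker to declare the primary unhealthy.
    detection_s = check_interval_s * failure_threshold
    # Cached answers may keep pointing at the primary for up to one TTL.
    return detection_s + ttl_s


if __name__ == "__main__":
    # Assumed values: 10 s checks, threshold of 1, 60 s TTL.
    print(dns_failover_delay_s(10, 1, 60))  # prints 70
```

With a slower setup, for example 30-second checks and a threshold of 3, the same estimate grows to 150 seconds, which is why the check settings matter as much as the TTL.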
Implementation from AWS (as-is from the AWS blog; the excerpt covers both approaches)
The first approach is using Amazon Route 53 Failover routing policy with health checks on the origin domain name that's configured as the origin in CloudFront. When the primary origin becomes unhealthy, Route 53 detects it, and then starts resolving the origin domain name with the IP address of the secondary origin. CloudFront honors the origin DNS TTL, which means that traffic will start flowing to the secondary origin within the DNS TTLs. The most optimal configuration (Fast Check activated, a failover threshold of 1, and 60 second DNS TTL) means that the failover will take 70 seconds at minimum to occur. When it does, all of the traffic is switched to the secondary origin, since it's a stateful failover. Note that this design can be further extended with Route 53 Application Recovery Control for more sophisticated application failover across multiple AWS Regions, Availability Zones, and on-premises.
The second approach is using origin failover, a native feature of CloudFront. This capability of CloudFront tries for the primary origin of every request, and if a configured 4xx or 5xx error is received, then CloudFront attempts a retry with the secondary origin. This approach is simple to configure and provides immediate failover. However, it's stateless, which means every request must fail independently, thus introducing latency to failed requests. For transient origin issues, this additional latency is an acceptable tradeoff with the speed of failover, but it's not ideal when the origin is completely out of service. Finally, this approach only works for the GET/HEAD/OPTIONS HTTP methods, because other HTTP methods are not allowed on a CloudFront cache behavior with Origin Failover enabled.
Advantages
- Works for all HTTP methods and request types.
- Ensures complete switchover, minimizing ongoing failures.
Disadvantages
- Relatively slower failover due to DNS propagation time.
- Requires a reliable health-check mechanism.
Approach 2: Stateless Failover with Application Logic
This method handles failover at the application level. If a request to the primary origin fails (e.g., due to a 4xx or 5xx HTTP response), the application or CDN immediately retries the request with the secondary origin.
How It Works
- Primary Request: The application sends a request to the primary origin.
- Failure Handling: If the response indicates a failure (configurable for specific error codes), the request is retried with the secondary origin.
- Stateless Behavior: Each request operates independently, so failover happens on a per-request basis without waiting for a stateful switchover.
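The three steps above can be sketched in a few lines. This is only an illustration of per-request retry logic, not how CloudFront implements it: the `Fetch` callables stand in for real HTTP clients, the two fake origins are hypothetical, and the default set of failover status codes is my own choice.

```python
from typing import Callable, FrozenSet, Tuple

Response = Tuple[int, str]          # (HTTP status code, body)
Fetch = Callable[[str], Response]   # hypothetical origin client: path -> Response

# Methods that this sketch allows to be replayed against another origin.
RETRYABLE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def fetch_with_failover(
    method: str,
    path: str,
    primary: Fetch,
    secondary: Fetch,
    failover_statuses: FrozenSet[int] = frozenset({500, 502, 503, 504}),
) -> Response:
    """Try the primary origin; on a configured failure, retry once on the secondary."""
    if method.upper() not in RETRYABLE_METHODS:
        # Other methods only ever go to the primary in this sketch.
        return primary(path)
    try:
        status, body = primary(path)
    except ConnectionError:
        # Primary unreachable: this one request fails over on its own.
        return secondary(path)
    if status in failover_statuses:
        return secondary(path)
    return status, body


def broken_primary(path: str) -> Response:
    """Fake primary origin that always answers 503."""
    return 503, "primary unavailable"


def healthy_secondary(path: str) -> Response:
    """Fake secondary origin that always answers 200."""
    return 200, f"secondary served {path}"


if __name__ == "__main__":
    print(fetch_with_failover("GET", "/index.html", broken_primary, healthy_secondary))
    print(fetch_with_failover("POST", "/orders", broken_primary, healthy_secondary))
```

Nothing is remembered between calls, so while the primary stays down every GET pays for one failed attempt before it reaches the secondary.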
Implementation from AWS (as-is from the AWS blog; this excerpt describes the hybrid pattern)
The hybrid origin failover pattern combines both approaches to get the best of both worlds. First, you configure both of your origins with a Failover Policy in Route 53 behind a single origin domain name. Then, you configure an origin failover group with the single origin domain name as primary origin, and the secondary origin domain name as secondary origin. This means that when the primary origin becomes unavailable, requests are immediately retried with the secondary origin until the stateful failover of Route 53 kicks in within tens of seconds, after which requests go directly to the secondary origin without any latency penalty. Note that this pattern only works with the GET/HEAD/OPTIONS HTTP methods.
Advantages
- Near-instantaneous failover for failed requests.
- Simple to configure and doesn't depend on DNS TTL.
Disadvantages
- Adds latency for failed requests due to retries.
- Limited to specific HTTP methods like GET, HEAD, and OPTIONS.
- Not suitable for scenarios where the primary origin is entirely down, as every request must fail first.
The Hybrid Origin Failover Pattern
The hybrid origin failover pattern combines the strengths of both approaches, mitigating their individual limitations. Here's how it works:
- DNS-based Stateful Failover: A DNS service with health checks monitors the primary origin and switches to the secondary origin if the primary becomes unhealthy. This ensures a complete and stateful failover within tens of seconds.
- Application-level Stateless Failover: Simultaneously, the application or CDN is configured to retry failed requests with a secondary origin. This provides an immediate failover mechanism for transient or initial failures.
Implementation Steps
- DNS Configuration
  - Set up health checks on the primary origin.
  - Define a failover policy in the DNS service, which resolves the origin domain name to the secondary origin when the primary is unhealthy.
- Application Configuration
  - Configure the application or CDN to use an origin failover group.
  - Specify the DNS-managed origin domain name as the primary origin and the secondary origin's own domain name as the backup.
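As a rough picture of what these two pieces of configuration might look like on AWS, the sketch below builds plain Python dictionaries shaped after the Route 53 record set and CloudFront origin group structures as I understand them. It makes no API calls. The helper names, domain names, IP addresses, origin IDs, health check ID, and the choice of status codes are all hypothetical, and the field names should be checked against the API reference before use.

```python
from typing import Dict, List


def failover_record(name: str, role: str, ip: str, health_check_id: str = "", ttl: int = 60) -> Dict:
    """Build one failover record set; role is 'PRIMARY' or 'SECONDARY'."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.lower(),
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def origin_group(group_id: str, primary_origin_id: str, secondary_origin_id: str,
                 status_codes: List[int]) -> Dict:
    """Build an origin group with a primary member and a secondary member."""
    return {
        "Quantity": 1,
        "Items": [{
            "Id": group_id,
            "FailoverCriteria": {
                "StatusCodes": {"Quantity": len(status_codes), "Items": list(status_codes)},
            },
            "Members": {
                "Quantity": 2,
                "Items": [{"OriginId": primary_origin_id}, {"OriginId": secondary_origin_id}],
            },
        }],
    }


if __name__ == "__main__":
    # Both records share one name, so DNS decides which origin it points at.
    dns_records = [
        failover_record("origin.example.com.", "PRIMARY", "192.0.2.10", "hypothetical-check-id"),
        failover_record("origin.example.com.", "SECONDARY", "198.51.100.20"),
    ]
    # Primary member: the DNS-managed name. Secondary member: the backup's own name.
    group = origin_group("hybrid-group", "dns-managed-origin", "secondary-origin", [500, 502, 503, 504])
    print(dns_records)
    print(group)
```

The point of the layout is that the origin group's primary member is the DNS-managed name, so once DNS flips, the "primary" member itself starts resolving to the secondary origin.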
Behavior
- Initially, if the primary origin encounters issues, requests are retried immediately with the secondary origin.
- Meanwhile, the DNS failover switches all traffic to the secondary origin within tens of seconds, eliminating retry latencies for subsequent requests.
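The two phases can be seen in a tiny timeline model. This is a simplification with made-up numbers: the primary fails at t = 0, the DNS switch lands at `dns_switch_s`, a failed attempt on the primary costs `failed_attempt_ms`, and a successful response from the secondary costs `origin_ms`. The function name and all parameters are my own.

```python
from typing import Tuple


def serve_request(t_s: float, dns_switch_s: float,
                  failed_attempt_ms: float, origin_ms: float) -> Tuple[str, float]:
    """Return (path taken, latency in ms) for a request sent t_s seconds after the primary fails."""
    if t_s < dns_switch_s:
        # DNS still points at the dead primary: fail first, then retry on the secondary.
        return "retried on secondary", failed_attempt_ms + origin_ms
    # DNS now resolves the origin name to the secondary: no failed attempt.
    return "direct to secondary", origin_ms


if __name__ == "__main__":
    for t in (0, 30, 69, 70, 120):
        path, latency = serve_request(t, dns_switch_s=70, failed_attempt_ms=200, origin_ms=50)
        print(f"t={t:>3}s  {path:<22} {latency} ms")
```

In this model requests keep succeeding throughout; the only difference before and after the DNS switch is the extra latency of the failed first attempt.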
Benefits of Hybrid Origin Failover
- Faster Failover: Immediate retries for failed requests minimize initial impact, while DNS failover ensures long-term stability.
- Reduced Latency: After DNS failover, subsequent requests don't experience retry delays.
- High Resilience: Combines stateful and stateless failover for robust redundancy.
- Simplicity and Scalability: Leverages existing DNS and application/CDN features without complex configurations.
Limitations and Considerations
- HTTP Method Constraints: Stateless failover works only for GET, HEAD, and OPTIONS methods, limiting its use for POST or PUT requests.
- TTL Impact: Low TTLs reduce propagation delays but increase DNS query rates, which could lead to higher costs.
- Configuration Complexity: Combining DNS and application-level failover requires careful setup and testing to avoid misconfigurations.
- Secondary Origin Capacity: Ensure the secondary origin can handle full traffic loads during failover.
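For the TTL trade-off, a back-of-the-envelope estimate helps. The sketch assumes a caching resolver under continuous traffic re-queries exactly once each time the TTL expires, which is a simplification; the function name is my own.

```python
import math


def max_lookups_per_resolver_per_day(ttl_s: int) -> int:
    """Upper bound on DNS lookups one caching resolver sends per day for one name."""
    if ttl_s <= 0:
        raise ValueError("ttl_s must be positive")
    seconds_per_day = 86_400
    return math.ceil(seconds_per_day / ttl_s)


if __name__ == "__main__":
    print(max_lookups_per_resolver_per_day(60))   # prints 1440
    print(max_lookups_per_resolver_per_day(300))  # prints 288
```

Dropping the TTL from 300 to 60 seconds makes the DNS switch faster but multiplies this bound by five for every resolver that keeps the name warm.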