Module 16: Reliability and Disaster Recovery
Learning Objectives
By the end of this module, you will be able to:
- Analyze availability requirements by calculating target uptime percentages (nines) and evaluating the cost and complexity trade-offs of achieving higher availability levels
- Assess Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements for a workload and recommend a disaster recovery strategy that meets those objectives
- Evaluate the four AWS disaster recovery strategies (backup and restore, pilot light, warm standby, multi-site active-active) and justify which strategy is appropriate for a given workload based on RTO, RPO, cost, and complexity
- Recommend multi-AZ and multi-Region architectures for high availability by analyzing single points of failure and designing redundancy at each layer
- Analyze resilience patterns (retry with exponential backoff, circuit breaker, bulkhead, timeout, fallback) and evaluate when to apply each pattern in a distributed application
- Assess AWS Backup configurations and recommend backup plans that meet data protection requirements for RDS, DynamoDB, EBS, and S3
- Evaluate Route 53 failover routing configurations and health checks for automated DNS-based disaster recovery
- Justify the use of chaos engineering practices to validate reliability assumptions and identify weaknesses before they cause production incidents
Prerequisites
- Completion of Module 03: Networking Basics (VPC) (multi-AZ VPC architecture, subnets across Availability Zones, and NAT gateway redundancy)
- Completion of Module 04: Compute with Amazon EC2 (Auto Scaling groups that maintain instance health across AZs)
- Completion of Module 06: Databases with Amazon RDS and DynamoDB (RDS Multi-AZ deployments, automated backups, read replicas, and DynamoDB global tables)
- Completion of Module 07: Load Balancing and DNS (ALB cross-zone load balancing, Route 53 routing policies, and health checks)
- Completion of Module 14: Monitoring and Observability (CloudWatch alarms and metrics that detect failures and trigger recovery actions)
- Familiarity with all prior modules, as this module applies reliability principles to infrastructure built throughout the bootcamp
Concepts
Availability: Uptime, SLAs, and the Cost of Nines
Availability measures how often your system is up and reachable. You will see it expressed as a percentage or in "nines" notation:
| Availability | Nines | Downtime per Year | Downtime per Month |
|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours |
| 99.9% | Three nines | 8.77 hours | 43.8 minutes |
| 99.95% | Three and a half nines | 4.38 hours | 21.9 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.38 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.3 seconds |
Each additional nine roughly increases the cost and complexity by an order of magnitude. Moving from 99.9% to 99.99% requires eliminating single points of failure at every layer, implementing automated failover, and testing recovery procedures regularly. Most production web applications target 99.9% to 99.99% availability.
AWS publishes Service Level Agreements (SLAs) for its services. For example, EC2 carries a 99.99% Region-level SLA, RDS Multi-AZ carries 99.95%, and S3 Standard carries 99.9% (check the current SLA pages, as these figures change over time). Because serial dependencies multiply, your application's composite availability is always at or below that of the lowest-availability component in the critical path.
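To see why, multiply the availabilities of every serial dependency in the request path. The sketch below is plain Python with illustrative figures (not official SLA values); substitute the numbers for your own components.

```python
# Illustrative availability figures for serial dependencies -- not official SLAs.
components = {
    "Route 53 + ALB": 0.9999,
    "EC2 fleet behind Auto Scaling": 0.9999,
    "RDS Multi-AZ": 0.9995,
}

composite = 1.0
for availability in components.values():
    composite *= availability            # serial dependencies multiply

minutes_per_year = 365 * 24 * 60
downtime = (1 - composite) * minutes_per_year
print(f"Composite availability: {composite:.4%}")                    # about 99.93%
print(f"Expected downtime: about {downtime:.0f} minutes per year")   # about 368 minutes
```

Even though every individual component is at "three and a half nines" or better, the path as a whole lands below three nines, which is why redundancy has to be designed layer by layer.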
Tip: Availability is a business decision, not just a technical one. Higher availability costs more (redundant infrastructure, automated failover, multi-Region deployment). Work with stakeholders to define the availability target based on the business impact of downtime, then design the architecture to meet that target.
RTO and RPO: Defining Recovery Objectives
Two metrics define your disaster recovery requirements:
- Recovery Time Objective (RTO) is the maximum acceptable time between a disruption and the restoration of service. If your RTO is 1 hour, you must be able to restore service within 1 hour of a failure.
- Recovery Point Objective (RPO) is the maximum acceptable amount of data loss measured in time. If your RPO is 15 minutes, you can lose at most 15 minutes of data (meaning backups or replication must occur at least every 15 minutes).
```
        RPO                 RTO
<──────────────────|───────────────────>
Last backup    Disaster  Service restored
or replication   occurs   and operational
```
RTO and RPO are independent. A system might have an RPO of 1 hour (hourly backups are acceptable) but an RTO of 5 minutes (the service must be restored quickly). The combination of RTO and RPO determines which disaster recovery strategy is appropriate.
| RTO | RPO | Appropriate DR Strategy | Relative Cost |
|---|---|---|---|
| Hours | Hours | Backup and restore | Lowest |
| Minutes to hours | Minutes | Pilot light | Low to moderate |
| Minutes | Seconds to minutes | Warm standby | Moderate to high |
| Near-zero | Near-zero | Multi-site active-active | Highest |
Disaster Recovery Strategies on AWS
AWS supports four disaster recovery strategies, each trading cost for speed of recovery. The Disaster Recovery of Workloads on AWS whitepaper covers these in depth.
Backup and Restore
The simplest and cheapest approach. You copy data and configuration to another Region (using AWS Backup, S3 cross-Region replication, or EBS snapshots) and rebuild from those backups if disaster strikes. No infrastructure runs in the recovery Region during normal operations.
- RTO: Hours (time to provision infrastructure and restore data)
- RPO: Hours (depends on backup frequency)
- Cost: Lowest (you pay only for backup storage, not for running infrastructure)
- Best for: Non-critical workloads where hours of downtime are acceptable
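As a small illustration of the backup side of this strategy, the boto3 sketch below copies an EBS snapshot into a recovery Region (the snapshot ID and Region names are placeholders). AWS Backup cross-Region copy or S3 replication achieve the same goal with a managed policy instead of a script.

```python
import boto3

SOURCE_REGION = "us-east-1"                  # placeholder Regions and snapshot ID
DR_REGION = "us-west-2"
SNAPSHOT_ID = "snap-0123456789abcdef0"

# copy_snapshot is called in the destination (recovery) Region.
ec2_dr = boto3.client("ec2", region_name=DR_REGION)
response = ec2_dr.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="DR copy of nightly snapshot",
)
print("Started cross-Region copy:", response["SnapshotId"])
```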
Pilot Light
A minimal footprint of the production environment stays running in the recovery Region. Core components (database replicas, AMIs, network configuration) are maintained, but compute resources (EC2 instances, ECS tasks) remain off. When disaster strikes, you spin up compute and switch traffic.
- RTO: Minutes to hours (time to scale up compute and switch DNS)
- RPO: Minutes (depends on replication lag)
- Cost: Low to moderate (you pay for database replicas and minimal infrastructure)
- Best for: Workloads that need faster recovery than backup/restore but do not justify the cost of a full standby environment
Warm Standby
A scaled-down but fully running copy of production exists in the recovery Region. This standby environment can handle a fraction of traffic (or none) but is ready to scale up quickly. When disaster strikes, you increase capacity and redirect all traffic.
- RTO: Minutes (the environment is already running; you only need to scale and switch traffic)
- RPO: Seconds to minutes (continuous replication)
- Cost: Moderate to high (you pay for a running environment, though at reduced scale)
- Best for: Business-critical workloads that require recovery within minutes
Multi-Site Active-Active
Your workload runs simultaneously in two or more Regions, with each Region handling live traffic. If one Region fails, the remaining Regions absorb the load. There is no failover delay because every Region is already serving users.
- RTO: Near-zero (traffic is already distributed; failed Region is simply removed from rotation)
- RPO: Near-zero (data is replicated synchronously or near-synchronously across Regions)
- Cost: Highest (you pay for full production infrastructure in multiple Regions)
- Best for: Mission-critical workloads where any downtime is unacceptable (financial trading, healthcare, global SaaS platforms)
Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup and restore | Hours | Hours | $ | Low |
| Pilot light | Minutes to hours | Minutes | $$ | Moderate |
| Warm standby | Minutes | Seconds to minutes | $$$ | High |
| Multi-site active-active | Near-zero | Near-zero | $$$$ | Very high |
Multi-AZ and Multi-Region Architectures
Multi-AZ: The Minimum for Production
In Module 03, you built a VPC with subnets in two Availability Zones. Multi-AZ deployment is the foundation of high availability on AWS. Each AZ is a physically separate data center (or group of data centers) with independent power, cooling, and networking. If one AZ experiences a failure, resources in the other AZ continue operating.
AWS services that support Multi-AZ deployment:
| Service | Multi-AZ Mechanism |
|---|---|
| EC2 + Auto Scaling | Auto Scaling group spans multiple AZs; unhealthy instances are replaced automatically |
| RDS Multi-AZ | Synchronous standby replica in a different AZ with automatic failover |
| ALB | Distributes traffic across targets in multiple AZs |
| ECS/Fargate | Tasks distributed across AZs by the ECS scheduler |
| ElastiCache | Multi-AZ replication with automatic failover |
| NAT Gateway | Deploy one per AZ for AZ-independent outbound internet access |
Multi-Region: For Disaster Recovery and Global Reach
Multi-Region architectures place resources in two or more AWS Regions. This guards against Region-level failures (extremely rare, but possible) and brings your application closer to globally distributed users.
Key AWS services for multi-Region architectures:
| Service | Multi-Region Capability |
|---|---|
| Route 53 | DNS-based failover routing between Regions using health checks |
| S3 Cross-Region Replication | Automatic replication of objects to a bucket in another Region |
| RDS Cross-Region Read Replicas | Asynchronous replication to a read replica in another Region (can be promoted to primary) |
| DynamoDB Global Tables | Multi-Region, multi-active replication with automatic conflict resolution |
| Aurora Global Database | Cross-Region replication with less than 1 second lag and fast failover |
| AWS Elastic Disaster Recovery | Continuous block-level replication of servers to a recovery Region |
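As a concrete example of the Route 53 failover mechanism in the table above, the boto3 sketch below upserts a PRIMARY record tied to a health check and a SECONDARY record pointing at a standby Region. The hosted zone ID, health check ID, record name, and ALB DNS names are placeholders; in practice you would often use alias records rather than CNAMEs.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000EXAMPLE"                 # placeholder values
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"

def failover_change(identifier, role, target_dns, health_check_id=None):
    """Build an UPSERT for one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": identifier,
        "Failover": role,                             # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target_dns}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id     # failover triggers when this fails
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Active/passive failover between Regions",
        "Changes": [
            failover_change("primary-us-east-1", "PRIMARY",
                            "primary-alb.us-east-1.elb.amazonaws.com",
                            PRIMARY_HEALTH_CHECK_ID),
            failover_change("secondary-us-west-2", "SECONDARY",
                            "standby-alb.us-west-2.elb.amazonaws.com"),
        ],
    },
)
```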
Tip: Multi-Region adds significant complexity (data replication lag, conflict resolution, deployment coordination, cost). Start with multi-AZ for high availability. Add multi-Region only when your RTO/RPO requirements or geographic distribution needs justify the additional complexity and cost.
Resilience Patterns for Distributed Applications
When your application is spread across multiple services (microservices, Lambda functions, containers), a failure in one component can ripple outward and take down the whole system. The patterns below, drawn from the Reliability Pillar of the Well-Architected Framework, prevent that cascade.
Retry with Exponential Backoff and Jitter
When a service call fails due to a transient error (network timeout, throttling, temporary unavailability), retry the call after a delay. Increase the delay exponentially with each retry (1s, 2s, 4s, 8s) and add random jitter to prevent all clients from retrying at the same time.
- Attempt 1: immediate
- Attempt 2: wait 1s + random(0-500ms)
- Attempt 3: wait 2s + random(0-500ms)
- Attempt 4: wait 4s + random(0-500ms)
- Give up after the maximum number of retries
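A hand-rolled version of that schedule looks roughly like the sketch below. It is generic Python: `call_dependency` stands in for whatever transiently failing operation you are protecting, and `TransientError` is a placeholder exception type.

```python
import random
import time

class TransientError(Exception):
    """Placeholder for whatever transient failure your dependency raises."""

def call_with_backoff(call_dependency, max_attempts=4, base_delay=1.0, max_jitter=0.5):
    # Attempt 1 is immediate; attempts 2..N wait 1s, 2s, 4s, ... plus random jitter.
    for attempt in range(1, max_attempts + 1):
        try:
            return call_dependency()
        except TransientError:
            if attempt == max_attempts:
                raise                                   # give up after max retries
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, max_jitter)
            time.sleep(delay)
```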
The AWS SDKs implement retry with exponential backoff automatically for most API calls. You can configure the maximum number of retries and the backoff strategy.
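With boto3, for example, retry behaviour is a client configuration rather than a hand-written loop. The client and limits below are just an illustration; `standard` and `adaptive` retry modes both apply exponential backoff with jitter.

```python
import boto3
from botocore.config import Config

# max_attempts counts the initial call plus retries.
retry_config = Config(retries={"max_attempts": 5, "mode": "adaptive"})

dynamodb = boto3.client("dynamodb", config=retry_config)
```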
Circuit Breaker
A circuit breaker monitors the failure rate of calls to a downstream service. When the failure rate exceeds a threshold, the circuit "opens" and subsequent calls fail immediately without attempting the downstream call. After a timeout period, the circuit enters a "half-open" state and allows a limited number of test calls. If the test calls succeed, the circuit closes and normal operation resumes.
- Closed (normal) --> failure rate exceeds threshold --> Open (fail fast)
- Open --> timeout expires --> Half-Open (test calls)
- Half-Open --> test calls succeed --> Closed
- Half-Open --> test calls fail --> Open
Circuit breakers prevent a failing downstream service from consuming resources (threads, connections, time) in the calling service, which could cause the caller to fail as well (cascading failure).
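The state machine above is small enough to sketch directly. The class below is a simplified illustration (the threshold, timeout, and half-open trial count are arbitrary example values), not a production-ready library.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, half_open_trials=2):
        self.failure_threshold = failure_threshold   # consecutive failures before opening
        self.reset_timeout = reset_timeout           # seconds to wait before half-open
        self.half_open_trials = half_open_trials     # successful test calls needed to close
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.trial_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"             # timeout expired: allow test calls
                self.trial_successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"                  # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            raise
        if self.state == "half-open":
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_trials:
                self.state = "closed"                # enough test calls succeeded
                self.failures = 0
        else:
            self.failures = 0
        return result
```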
Bulkhead
The bulkhead pattern isolates components so that a failure in one does not affect others. Named after the watertight compartments in a ship's hull, bulkheads in software limit the blast radius of failures.
Examples on AWS:
- Use separate SQS queues for different message types so that a backlog in one queue does not block processing of others.
- Use separate Lambda functions (with separate concurrency limits) for different API endpoints so that a traffic spike on one endpoint does not exhaust concurrency for others (see the sketch after this list).
- Use separate ECS services for different microservices so that a memory leak in one service does not affect others.
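One concrete way to enforce the Lambda bulkhead in the second bullet is reserved concurrency, which caps how much of the account's concurrency pool a single function can consume. The function names and limits below are hypothetical.

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap each endpoint's function separately so a spike on one cannot starve the others.
bulkheads = {
    "orders-api-handler": 200,
    "search-api-handler": 100,
    "reports-api-handler": 20,
}

for function_name, reserved in bulkheads.items():
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
```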
Timeout
Set timeouts on all external calls (HTTP requests, database queries, API calls). A timeout ensures that a slow or unresponsive dependency does not block the calling service indefinitely. Without timeouts, a single slow dependency can consume all available threads or connections, causing the calling service to become unresponsive.
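For AWS SDK calls specifically, timeouts are again part of the client configuration; the values below are illustrative, not recommendations.

```python
import boto3
from botocore.config import Config

# Fail fast instead of letting a slow dependency hold a connection indefinitely:
# connect_timeout bounds TCP connection setup, read_timeout bounds each response read.
timeout_config = Config(connect_timeout=2, read_timeout=5, retries={"max_attempts": 3})

s3 = boto3.client("s3", config=timeout_config)
```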
Fallback
When a service call fails (even after retries), provide a degraded but functional response instead of returning an error. For example, if a recommendation engine is unavailable, return a default set of popular items instead of showing an error page.
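The recommendation-engine example might look like the sketch below, where `fetch_recommendations` and `POPULAR_ITEMS` are hypothetical placeholders for the real dependency and a precomputed default list.

```python
POPULAR_ITEMS = ["item-101", "item-204", "item-307"]   # precomputed default list

def get_recommendations(user_id, fetch_recommendations):
    """Return personalized recommendations, degrading to popular items on failure."""
    try:
        return fetch_recommendations(user_id)          # may still raise after retries
    except Exception:
        # Degraded but functional: the page still renders something useful.
        return POPULAR_ITEMS
```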
AWS Backup: Centralized Backup Management
AWS Backup gives you a single place to define and enforce backup policies for all your AWS resources. Rather than juggling separate snapshot schedules for RDS, EBS, and DynamoDB (which you configured individually in earlier modules), you create one backup plan that covers everything.
Key concepts:
| Concept | Description |
|---|---|
| Backup plan | A policy that defines backup frequency, retention period, and lifecycle rules |
| Backup vault | A secure container where backups are stored (encrypted with KMS) |
| Backup rule | A schedule within a backup plan (for example, daily at 2:00 AM, retain for 30 days) |
| Resource assignment | Which resources are included in the backup plan (by tag, resource type, or ARN) |
| Cross-Region copy | Automatically copy backups to another Region for disaster recovery |
Supported services include EC2 (AMIs), EBS (snapshots), RDS (snapshots), DynamoDB (backups), EFS (backups), S3 (backups), and Aurora (snapshots).
Tip: Use tag-based resource assignment in your backup plans. For example, assign all resources tagged Backup=daily to a daily backup plan. This ensures that new resources are automatically included in the backup plan when they are tagged correctly, without manual configuration.
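A backup plan with a daily rule, 30-day retention, a cross-Region copy, and tag-based selection could be defined with boto3 roughly as below; the vault names, account ID, Region, and IAM role ARN are placeholders.

```python
import boto3

backup = boto3.client("backup")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-backups",
        "Rules": [
            {
                "RuleName": "daily-2am-retain-30-days",
                "TargetBackupVaultName": "primary-vault",
                "ScheduleExpression": "cron(0 2 * * ? *)",   # daily at 02:00 UTC
                "Lifecycle": {"DeleteAfterDays": 30},
                "CopyActions": [
                    {
                        # Cross-Region copy for disaster recovery (placeholder ARN).
                        "DestinationBackupVaultArn":
                            "arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault",
                        "Lifecycle": {"DeleteAfterDays": 30},
                    }
                ],
            }
        ],
    }
)

# Tag-based assignment: anything tagged Backup=daily is picked up automatically.
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "tagged-daily",
        "IamRoleArn": "arn:aws:iam::123456789012:role/aws-backup-service-role",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "Backup",
                "ConditionValue": "daily",
            }
        ],
    },
)
```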
Chaos Engineering: Testing Reliability
Chaos engineering means deliberately breaking things, in a controlled way, to find out how your system responds before a real outage does it for you. Think of it like a fire drill for your infrastructure.
AWS Fault Injection Service (FIS) lets you run these experiments without building custom failure-injection tooling. It ships with pre-built actions for common scenarios:
| FIS Action | What It Simulates |
|---|---|
| Stop EC2 instances | Instance failure in an Auto Scaling group |
| Throttle EBS I/O | Degraded storage performance |
| Inject Lambda errors | Function failures for testing error handling |
| Disrupt network connectivity | Network partition between services |
| Failover RDS | Database failover to the standby instance |
A chaos engineering experiment follows this process:
- Define a hypothesis. "If one EC2 instance in the Auto Scaling group fails, the ALB will route traffic to the remaining healthy instances, and the Auto Scaling group will launch a replacement within 5 minutes."
- Design the experiment. Use FIS to stop one EC2 instance (a manual, scripted approximation of this step is sketched after the list).
- Run the experiment. Execute the FIS experiment in a staging environment first, then in production during low-traffic periods.
- Observe the results. Monitor CloudWatch metrics, ALB health checks, and Auto Scaling activity.
- Improve. If the system did not recover as expected, fix the issue and re-run the experiment.
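FIS experiments are usually configured through templates in the console, but the essence of the stop-one-instance scenario can be sketched by hand: pick an instance in the Auto Scaling group, stop it, and check the hypothesis. The script below is a rough illustration only (the group name is a placeholder), and it lacks the guardrails, such as stop conditions, that FIS provides.

```python
import random
import time
import boto3

ASG_NAME = "web-asg"                        # placeholder Auto Scaling group name
autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

# Pick one in-service instance at random and stop it (the injected "failure").
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]
victim = random.choice(
    [i["InstanceId"] for i in group["Instances"] if i["LifecycleState"] == "InService"]
)
print("Stopping", victim)
ec2.stop_instances(InstanceIds=[victim])

# Hypothesis check: the group should return to desired capacity within ~5 minutes.
deadline = time.time() + 300
while time.time() < deadline:
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    healthy = [i["InstanceId"] for i in group["Instances"]
               if i["LifecycleState"] == "InService" and i["HealthStatus"] == "Healthy"]
    if victim not in healthy and len(healthy) >= group["DesiredCapacity"]:
        print("Recovered in time: hypothesis holds")
        break
    time.sleep(15)
else:
    print("Did not recover within 5 minutes: investigate and improve")
```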
Tip: Start chaos engineering in non-production environments. Run experiments during business hours when the team is available to respond. Gradually increase the scope and severity of experiments as your confidence in the system's resilience grows.
Instructor Notes
Estimated lecture time: 90 to 105 minutes
Common student questions:
- Q: What is the difference between high availability and disaster recovery? A: High availability (HA) is about minimizing downtime during normal operations by eliminating single points of failure (multi-AZ deployment, Auto Scaling, health checks). Disaster recovery (DR) is about recovering from a major disruption that affects an entire AZ or Region. HA keeps the system running during small failures; DR restores the system after large failures. A well-designed architecture includes both.
- Q: Do I always need multi-Region deployment? A: No. Multi-Region adds significant complexity and cost. Most applications achieve sufficient availability with multi-AZ deployment within a single Region. Consider multi-Region only when your RTO/RPO requirements demand near-zero downtime, when you need to serve users in multiple geographic locations with low latency, or when regulatory requirements mandate data residency in specific Regions.
- Q: How do I choose between the four DR strategies? A: Start with your RTO and RPO requirements. If hours of downtime and data loss are acceptable, use backup and restore (cheapest). If you need recovery in minutes with minimal data loss, use pilot light or warm standby. If you need near-zero downtime and data loss, use multi-site active-active (most expensive). The choice is a trade-off between cost and recovery speed.
- Q: What is the difference between a retry and a circuit breaker? A: A retry attempts the same operation again after a failure, hoping the transient issue has resolved. A circuit breaker stops attempting the operation entirely after repeated failures, preventing the calling service from wasting resources on a dependency that is clearly down. Use retries for transient failures (network blips, throttling). Use circuit breakers when a dependency is consistently failing and retries would only add load to an already struggling service.
Teaching tips:
- Start the lecture by asking students: "Your production database goes down at 3 AM. What happens next?" Walk through the scenario to motivate the need for defined RTO/RPO, automated recovery, and tested DR procedures.
- When explaining the four DR strategies, draw them on a whiteboard as a spectrum from low cost/high RTO to high cost/low RTO. Use a concrete example (an e-commerce site) and ask students which strategy they would choose for different components (product catalog vs. payment processing vs. marketing blog).
- Pause after the resilience patterns section for a group exercise. Present a microservices architecture diagram and ask each team to identify where they would apply retry, circuit breaker, bulkhead, and timeout patterns.
- The chaos engineering section is a good opportunity for a live demo. If time permits, show the AWS FIS console and walk through creating a simple experiment (stopping an EC2 instance in an Auto Scaling group).
- Connect this module to previous modules: multi-AZ VPC (Module 03), Auto Scaling (Module 04), RDS Multi-AZ (Module 06), ALB health checks (Module 07), and CloudWatch alarms (Module 14) are all building blocks of the reliability architecture discussed here.
Key Takeaways
- Define RTO and RPO for every critical workload before designing the architecture; these objectives determine which disaster recovery strategy is appropriate and how much to invest in redundancy.
- Multi-AZ deployment is the minimum for production workloads; it protects against single-AZ failures with automatic failover for most AWS services (RDS, ALB, Auto Scaling, ECS).
- The four DR strategies (backup/restore, pilot light, warm standby, multi-site active-active) represent a spectrum of cost vs. recovery speed; choose based on your workload's business criticality and budget.
- Apply resilience patterns (retry with backoff, circuit breaker, bulkhead, timeout) in distributed applications to prevent cascading failures when individual components fail.
- Test your reliability assumptions through chaos engineering; a disaster recovery plan that has never been tested is a plan that may not work when you need it.
AWS Bootcamp: From Novice to Architect
Author: Samuel Ogunti
License: CC BY-NC 4.0