# Why DR Planning Cannot Wait
Every organization believes a major outage will not happen to them — until it does. AWS region outages, database corruption, ransomware attacks, and human error can take down production systems for hours or days. The cost of downtime varies by industry, but for most SaaS companies, a one-hour outage costs between $10,000 and $500,000 in lost revenue, SLA penalties, and customer trust.
Disaster recovery (DR) planning is a core architectural discipline: it transforms reactive panic into structured recovery procedures with predictable outcomes.
## Understanding RPO and RTO
Two metrics define every DR strategy:
- **Recovery Point Objective (RPO):** The maximum acceptable data loss, measured in time. An RPO of 1 hour means you can lose at most 1 hour of data.
- **Recovery Time Objective (RTO):** The maximum acceptable downtime. An RTO of 4 hours means your systems must be operational within 4 hours of an incident.
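RPO is also directly measurable in operations: if your newest backup is older than the RPO window, you are already out of compliance before anything has failed. A minimal monitoring sketch (the directory, helper name, and threshold are illustrative, not from this article; `stat -c %Y` assumes GNU coreutils):

```bash
#!/bin/bash
# Sketch: fail when the newest backup is older than the RPO window.
# check_rpo, the path, and the 3600s threshold are illustrative assumptions.
check_rpo() {
  local backup_file=$1 rpo_seconds=$2
  # Age of the backup file in seconds (GNU stat)
  local age=$(( $(date +%s) - $(stat -c %Y "$backup_file") ))
  if [ "$age" -gt "$rpo_seconds" ]; then
    echo "RPO VIOLATION: newest backup is ${age}s old (limit ${rpo_seconds}s)"
    return 1
  fi
  echo "OK: newest backup is ${age}s old (within ${rpo_seconds}s RPO)"
}

# Example: a 1-hour RPO means the newest backup must be under 3600s old
# check_rpo "$(ls -t /backups/mysql/*.sql.gz | head -1)" 3600
```

Wired into cron or a monitoring agent, a check like this turns the RPO from a document number into an alert.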
## DR Tier Framework

| Tier | RPO | RTO | Strategy | Monthly Cost (Example) |
|------|-----|-----|----------|------------------------|
| Tier 1 | Near-zero | < 15 min | Multi-region active-active | $15,000+ |
| Tier 2 | < 1 hour | < 1 hour | Warm standby | $5,000-10,000 |
| Tier 3 | < 4 hours | < 4 hours | Pilot light | $1,000-3,000 |
| Tier 4 | < 24 hours | < 24 hours | Backup and restore | $200-500 |
Most organizations need different tiers for different systems. Customer-facing APIs might need Tier 1, while internal reporting can tolerate Tier 4.
## Backup Strategy
### Database Backups

```bash
#!/bin/bash
# Automated database backup script
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="production"
BACKUP_DIR="/backups/mysql"
S3_BUCKET="s3://company-backups/mysql"

# Full backup with point-in-time recovery support
# Note: --master-data was renamed --source-data in MySQL 8.0.26+
mysqldump --single-transaction --routines --triggers \
  --master-data=2 --flush-logs \
  -u backup_user -p"$DB_PASS" "$DB_NAME" | \
  gzip > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Verify backup integrity before uploading
gunzip -t "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Upload to S3 with server-side encryption
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  --storage-class STANDARD_IA \
  --sse aws:kms
# Cross-region replication handles copying to DR region

echo "Backup completed: ${DB_NAME}_${TIMESTAMP}.sql.gz"

# Clean up local backups older than 7 days
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +7 -delete
```
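A backup script only protects you if it actually runs on schedule. Illustrative crontab entries (the script paths and times are assumptions, not from this article) pair the nightly backup with the weekly restore verification:

```
# Illustrative crontab entries; paths and times are assumptions.
# Nightly full backup at 02:00 UTC
0 2 * * * /opt/scripts/db-backup.sh >> /var/log/db-backup.log 2>&1
# Weekly restore verification, Sundays at 04:00 UTC
0 4 * * 0 /opt/scripts/verify-restore.sh >> /var/log/restore-verify.log 2>&1
```

Logging to a file (and alerting when the log goes quiet) catches the silent failure mode where cron itself stops firing.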
### Backup Verification
Backups that have never been tested are not backups — they are assumptions. Schedule weekly restore tests:
```bash
#!/bin/bash
# Weekly backup restore verification
set -euo pipefail

LATEST_BACKUP=$(aws s3 ls s3://company-backups/mysql/ | sort | tail -1 | awk '{print $4}')

# Download to the restore test server
aws s3 cp "s3://company-backups/mysql/${LATEST_BACKUP}" /tmp/

# Restore into a scratch database
mysql -u root -p"$TEST_DB_PASS" -e "CREATE DATABASE IF NOT EXISTS test_restore"
gunzip -c "/tmp/${LATEST_BACKUP}" | mysql -u root -p"$TEST_DB_PASS" test_restore

# Run integrity checks (the production count can drift slightly under live writes)
EXPECTED_USERS=$(mysql -u root -p"$PROD_DB_PASS" -h prod-db -sN -e "SELECT COUNT(*) FROM users" production)
RESTORED_USERS=$(mysql -u root -p"$TEST_DB_PASS" -sN -e "SELECT COUNT(*) FROM test_restore.users")

if [ "$EXPECTED_USERS" -eq "$RESTORED_USERS" ]; then
  echo "PASS: Backup integrity verified ($RESTORED_USERS users)"
else
  echo "FAIL: Count mismatch (expected $EXPECTED_USERS, got $RESTORED_USERS)"
  # Send alert
fi

# Cleanup
mysql -u root -p"$TEST_DB_PASS" -e "DROP DATABASE test_restore"
rm -f "/tmp/${LATEST_BACKUP}"
```
## Multi-Region Architecture

### Warm Standby with AWS
```hcl
# Terraform configuration for the DR region
resource "aws_rds_cluster" "dr" {
  provider           = aws.dr_region
  cluster_identifier = "production-dr"
  engine             = "aurora-mysql"
  engine_version     = "8.0.mysql_aurora.3.05.2"

  # Cross-region read replica of the production cluster
  replication_source_identifier = aws_rds_cluster.production.arn

  tags = {
    Purpose = "disaster-recovery"
  }
}

# Instance class belongs on the cluster instance, not the cluster itself.
# Run a smaller instance in DR and scale up during failover.
resource "aws_rds_cluster_instance" "dr" {
  provider           = aws.dr_region
  identifier         = "production-dr-1"
  cluster_identifier = aws_rds_cluster.dr.id
  engine             = "aurora-mysql"
  instance_class     = "db.r6g.large"
}

resource "aws_autoscaling_group" "dr_app" {
  provider = aws.dr_region

  # Keep minimal capacity in the DR region
  min_size         = 1
  max_size         = 20
  desired_capacity = 1 # Scale up during failover

  launch_template {
    id      = aws_launch_template.app_dr.id
    version = "$Latest"
  }
}
```
### DNS Failover with Route 53
```hcl
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 10
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}

resource "aws_route53_record" "api_dr" {
  zone_id = aws_route53_zone.main.id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }

  set_identifier = "secondary"
}
```
## Recovery Runbooks

### Automated Failover Runbook
```bash
#!/bin/bash
# disaster-recovery-failover.sh
set -euo pipefail

echo "=== DISASTER RECOVERY FAILOVER INITIATED ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "Operator: $USER"

# Step 1: Verify primary is truly down
echo "[Step 1] Verifying primary region failure..."
for i in {1..5}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
    https://api-primary.example.com/healthz || echo "000")
  if [ "$STATUS" == "200" ]; then
    echo "Primary appears healthy. Aborting failover."
    exit 1
  fi
  sleep 2
done
echo "Primary confirmed down after 5 checks."

# Step 2: Promote DR database
echo "[Step 2] Promoting DR database to primary..."
aws rds promote-read-replica-db-cluster \
  --db-cluster-identifier production-dr \
  --region us-west-2

echo "Waiting for DB promotion..."
aws rds wait db-cluster-available \
  --db-cluster-identifier production-dr \
  --region us-west-2

# Step 3: Scale up DR application tier
echo "[Step 3] Scaling DR application servers..."
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-app-asg \
  --desired-capacity 6 \
  --region us-west-2

# Step 4: DNS failover happens automatically via the Route 53 health checks
echo "[Step 4] DNS failover in progress via Route 53 health checks..."

# Step 5: Verify DR environment
echo "[Step 5] Running smoke tests against DR..."
sleep 30
DR_STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
  https://api-dr.example.com/healthz || echo "000")
if [ "$DR_STATUS" == "200" ]; then
  echo "DR environment is healthy."
else
  echo "WARNING: DR health check returned $DR_STATUS"
fi

echo "=== FAILOVER COMPLETE ==="
echo "Action items:"
echo "1. Monitor DR environment for 30 minutes"
echo "2. Notify customers of incident"
echo "3. Begin root cause analysis on primary"
echo "4. Plan failback procedure"
```
## DR Testing Schedule
DR plans must be tested regularly to remain reliable:
- Monthly: Backup restore verification (automated)
- Quarterly: Tabletop exercise with the engineering team
- Semi-annually: Full failover test to DR region during maintenance window
- Annually: Chaos engineering exercise (simulate unexpected failure)
## Communication Plan
During a disaster, clear communication is as important as technical recovery:
- Internal: Incident commander announces failover in dedicated Slack channel
- Status page: Update status.example.com within 5 minutes of incident detection
- Customer email: Send notification within 30 minutes for prolonged outages
- Post-mortem: Publish incident report within 5 business days
## Cost Optimization
DR infrastructure does not need to match production capacity during normal operations. Use the pilot light model:
- Keep minimal compute running in the DR region (1-2 instances)
- Use database read replicas that can be promoted
- Store AMIs and container images in the DR region
- Script the scale-up process for rapid capacity expansion during failover
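To make the pilot-light saving concrete, here is a rough compute-cost comparison. The instance counts and hourly rate are illustrative assumptions, not real AWS pricing:

```bash
#!/bin/bash
# Back-of-the-envelope DR compute cost; all numbers are illustrative.
HOURS_PER_MONTH=730
RATE_CENTS_PER_HOUR=10   # assumed ~$0.10/hr per app instance
PROD_INSTANCES=6         # full production capacity mirrored in DR
PILOT_INSTANCES=1        # pilot-light standby

full_copy=$(( PROD_INSTANCES * RATE_CENTS_PER_HOUR * HOURS_PER_MONTH / 100 ))
pilot=$(( PILOT_INSTANCES * RATE_CENTS_PER_HOUR * HOURS_PER_MONTH / 100 ))
echo "Full-copy DR compute: ~\$${full_copy}/mo; pilot light: ~\$${pilot}/mo"
```

Even with made-up numbers, the shape of the result holds: keeping one instance warm instead of a full mirror cuts steady-state DR compute cost by roughly the ratio of the fleet sizes, which is what moves a system from Tier 2 pricing toward Tier 3.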
For comprehensive disaster recovery planning tailored to your infrastructure, our team conducts DR assessments that map business requirements to technical recovery strategies.