Strategy · November 15, 2025 · 9 min read

Disaster Recovery Plans for Cloud Infrastructure

Design and implement disaster recovery strategies for cloud infrastructure with RPO/RTO planning, multi-region failover, and automated recovery runbooks.

Why DR Planning Cannot Wait

Every organization believes a major outage will not happen to them — until it does. AWS region outages, database corruption, ransomware attacks, and human error can take down production systems for hours or days. The cost of downtime varies by industry, but for most SaaS companies, a one-hour outage costs between $10,000 and $500,000 in lost revenue, SLA penalties, and customer trust.

Disaster recovery planning is a core part of architecture planning that transforms reactive panic into structured recovery procedures with predictable outcomes.

Understanding RPO and RTO

Two metrics define every DR strategy:

Recovery Point Objective (RPO): The maximum acceptable data loss measured in time. An RPO of 1 hour means you can lose at most 1 hour of data.

Recovery Time Objective (RTO): The maximum acceptable downtime. An RTO of 4 hours means your systems must be operational within 4 hours of an incident.
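These targets only mean something if they are monitored. A minimal sketch of an RPO check that compares the age of the newest backup against the target (the timestamp lookup is hard-coded here for illustration; in practice it would come from your backup catalog or an S3 listing):

```shell
#!/bin/bash
# Illustrative RPO check: flag when the newest backup is older than the RPO target.
RPO_SECONDS=3600                                   # 1-hour RPO target
LAST_BACKUP_EPOCH=$(date -d "45 minutes ago" +%s)  # stand-in for a real lookup
NOW=$(date +%s)
AGE=$(( NOW - LAST_BACKUP_EPOCH ))

if [ "$AGE" -le "$RPO_SECONDS" ]; then
  echo "RPO OK: newest backup is ${AGE}s old (target ${RPO_SECONDS}s)"
else
  echo "RPO BREACH: newest backup is ${AGE}s old (target ${RPO_SECONDS}s)"
fi
```

Running the same check against the RTO is a matter of measuring elapsed incident time instead of backup age.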

DR Tier Framework

| Tier | RPO | RTO | Strategy | Monthly Cost (Example) |
|------|-----|-----|----------|------------------------|
| Tier 1 | Near-zero | < 15 min | Multi-region active-active | $15,000+ |
| Tier 2 | < 1 hour | < 1 hour | Warm standby | $5,000-10,000 |
| Tier 3 | < 4 hours | < 4 hours | Pilot light | $1,000-3,000 |
| Tier 4 | < 24 hours | < 24 hours | Backup and restore | $200-500 |

Most organizations need different tiers for different systems. Customer-facing APIs might need Tier 1, while internal reporting can tolerate Tier 4.
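One practical consequence of mixed tiers: record each service's tier somewhere machine-readable so alerts and runbooks can look up its targets instead of hard-coding them. A minimal sketch, with hypothetical service names and the RTO values taken from the table above:

```shell
#!/bin/bash
# Map DR tier -> RTO target in minutes (values from the tier table)
rto_minutes() {
  case "$1" in
    1) echo 15 ;;
    2) echo 60 ;;
    3) echo 240 ;;
    4) echo 1440 ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Hypothetical service -> tier assignments
declare -A SERVICE_TIER=( [checkout-api]=1 [internal-reporting]=4 )

for svc in "${!SERVICE_TIER[@]}"; do
  tier=${SERVICE_TIER[$svc]}
  echo "$svc: Tier $tier, RTO target $(rto_minutes "$tier") min"
done
```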

Backup Strategy

Database Backups

```bash
#!/bin/bash
# Automated database backup script
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="production"
BACKUP_DIR="/backups/mysql"
S3_BUCKET="s3://company-backups/mysql"

# Full backup with point-in-time recovery support
# (MySQL 8.0.26+ renames --master-data to --source-data)
mysqldump --single-transaction --routines --triggers \
  --master-data=2 --flush-logs \
  -u backup_user -p"$DB_PASS" "$DB_NAME" | \
  gzip > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Verify backup integrity before uploading
gunzip -t "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz" || {
  echo "ERROR: backup archive failed integrity check" >&2
  exit 1
}

# Upload to S3 with server-side encryption
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  --storage-class STANDARD_IA \
  --sse aws:kms

# Cross-region replication handles copying to the DR region
echo "Backup completed: ${DB_NAME}_${TIMESTAMP}.sql.gz"

# Clean up local backups older than 7 days
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +7 -delete
```
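A script like this is typically driven by cron (or a scheduler such as EventBridge). The entries below are an example schedule with hypothetical paths, not a prescription:

```
# Example crontab: nightly full dump at 02:00 UTC, weekly restore test on Sundays
0 2 * * * /opt/scripts/mysql-backup.sh >> /var/log/mysql-backup.log 2>&1
0 5 * * 0 /opt/scripts/verify-restore.sh >> /var/log/restore-test.log 2>&1
```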

Backup Verification

Backups that have never been tested are not backups — they are assumptions. Schedule weekly restore tests:

```bash
#!/bin/bash
# Weekly backup restore verification
set -euo pipefail

LATEST_BACKUP=$(aws s3 ls s3://company-backups/mysql/ | sort | tail -1 | awk '{print $4}')
if [ -z "$LATEST_BACKUP" ]; then
  echo "ERROR: no backups found in bucket" >&2
  exit 1
fi

# Download to restore test server
aws s3 cp "s3://company-backups/mysql/${LATEST_BACKUP}" /tmp/

# Restore to test database
gunzip -c "/tmp/${LATEST_BACKUP}" | mysql -u root -p"$TEST_DB_PASS" test_restore

# Run integrity checks
EXPECTED_USERS=$(mysql -u root -p"$PROD_DB_PASS" -h prod-db -e "SELECT COUNT(*) FROM users" -sN)
RESTORED_USERS=$(mysql -u root -p"$TEST_DB_PASS" -e "SELECT COUNT(*) FROM test_restore.users" -sN)

if [ "$EXPECTED_USERS" -eq "$RESTORED_USERS" ]; then
  echo "PASS: Backup integrity verified ($RESTORED_USERS users)"
else
  echo "FAIL: Count mismatch (expected $EXPECTED_USERS, got $RESTORED_USERS)"
  # Send alert
fi

# Cleanup
mysql -u root -p"$TEST_DB_PASS" -e "DROP DATABASE test_restore"
rm -f "/tmp/${LATEST_BACKUP}"
```

Multi-Region Architecture

Warm Standby with AWS

```hcl
# Terraform configuration for DR region
resource "aws_rds_cluster" "dr" {
  provider = aws.dr_region

  cluster_identifier = "production-dr"
  engine             = "aurora-mysql"
  engine_version     = "8.0.mysql_aurora.3.05.2"

  # Read replica of production cluster
  replication_source_identifier = aws_rds_cluster.production.arn

  tags = {
    Purpose = "disaster-recovery"
  }
}

# Cluster instances are declared separately from the cluster itself;
# use a smaller class in DR and scale up during failover
resource "aws_rds_cluster_instance" "dr" {
  provider = aws.dr_region

  cluster_identifier = aws_rds_cluster.dr.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.dr.engine
  engine_version     = aws_rds_cluster.dr.engine_version
}

resource "aws_autoscaling_group" "dr_app" {
  provider = aws.dr_region

  name                = "dr-app-asg"
  vpc_zone_identifier = var.dr_private_subnet_ids  # subnets in the DR region, defined elsewhere

  # Keep minimal capacity in DR region
  min_size         = 1
  max_size         = 20
  desired_capacity = 1  # Scale up during failover

  launch_template {
    id      = aws_launch_template.app_dr.id
    version = "$Latest"
  }
}
```
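These resources reference an aliased AWS provider for the DR region, which must be declared once; `us-west-2` here is an example choice:

```hcl
# Provider alias the DR resources reference (region is an example)
provider "aws" {
  alias  = "dr_region"
  region = "us-west-2"
}
```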

DNS Failover with Route 53

```hcl
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 10
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}

resource "aws_route53_record" "api_dr" {
  zone_id = aws_route53_zone.main.id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }

  set_identifier = "secondary"
}
```

Recovery Runbooks

Automated Failover Runbook

```bash
#!/bin/bash
# disaster-recovery-failover.sh
set -euo pipefail

echo "=== DISASTER RECOVERY FAILOVER INITIATED ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "Operator: $USER"

# Step 1: Verify primary is truly down
echo "[Step 1] Verifying primary region failure..."
for i in {1..5}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
    https://api-primary.example.com/healthz || echo "000")
  if [ "$STATUS" == "200" ]; then
    echo "Primary appears healthy. Aborting failover."
    exit 1
  fi
  sleep 2
done
echo "Primary confirmed down after 5 checks."

# Step 2: Promote DR database
echo "[Step 2] Promoting DR database to primary..."
aws rds promote-read-replica-db-cluster \
  --db-cluster-identifier production-dr \
  --region us-west-2

echo "Waiting for DB promotion..."
aws rds wait db-cluster-available \
  --db-cluster-identifier production-dr \
  --region us-west-2

# Step 3: Scale up DR application tier
echo "[Step 3] Scaling DR application servers..."
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-app-asg \
  --desired-capacity 6 \
  --region us-west-2

# Step 4: DNS failover happens automatically via Route 53 health checks
echo "[Step 4] DNS failover in progress via Route 53 health checks..."

# Step 5: Verify DR environment
echo "[Step 5] Running smoke tests against DR..."
sleep 30
DR_STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
  https://api-dr.example.com/healthz || echo "000")
if [ "$DR_STATUS" == "200" ]; then
  echo "DR environment is healthy."
else
  echo "WARNING: DR health check returned $DR_STATUS"
fi

echo "=== FAILOVER COMPLETE ==="
echo "Action items:"
echo "1. Monitor DR environment for 30 minutes"
echo "2. Notify customers of incident"
echo "3. Begin root cause analysis on primary"
echo "4. Plan failback procedure"
```
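The "verify before acting" gate in Step 1 is worth factoring out, because the same check runs in reverse during failback (confirm the DR side is drained, confirm the primary is genuinely healthy, then repoint). A hypothetical reusable helper:

```shell
#!/bin/bash
# Hypothetical helper: return 0 only if every probe of the URL fails,
# i.e. the endpoint is consistently unreachable.
confirm_down() {
  local url=$1 attempts=${2:-5} delay=${3:-2}
  local i
  for ((i = 1; i <= attempts; i++)); do
    if curl -fsS --max-time 5 "$url" > /dev/null 2>&1; then
      return 1  # endpoint answered at least once: not down
    fi
    sleep "$delay"
  done
  return 0      # every probe failed: treat as down
}

# Usage: abort the runbook unless the primary really is unreachable
# confirm_down https://api-primary.example.com/healthz || exit 1
```

Keeping the gate as a function makes the retry count and delay tunable per runbook rather than copy-pasted.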

DR Testing Schedule

DR plans must be tested regularly to remain reliable:

  • Monthly: Backup restore verification (automated)
  • Quarterly: Tabletop exercise with the engineering team
  • Semi-annually: Full failover test to DR region during maintenance window
  • Annually: Chaos engineering exercise (simulate unexpected failure)

Communication Plan

During a disaster, clear communication is as important as technical recovery:

  1. Internal: Incident commander announces failover in dedicated Slack channel
  2. Status page: Update status.example.com within 5 minutes of incident detection
  3. Customer email: Send notification within 30 minutes for prolonged outages
  4. Post-mortem: Publish incident report within 5 business days

Cost Optimization

DR infrastructure does not need to match production capacity during normal operations. Use the pilot light model:

  • Keep minimal compute running in the DR region (1-2 instances)
  • Use database read replicas that can be promoted
  • Store AMIs and container images in the DR region
  • Script the scale-up process for rapid capacity expansion during failover

For comprehensive disaster recovery planning tailored to your infrastructure, our team conducts DR assessments that map business requirements to technical recovery strategies.
