Strategy · November 15, 2025 · 9 min read

Disaster Recovery Plans for Cloud Infrastructure

Design and implement disaster recovery strategies for cloud infrastructure with RPO/RTO planning, multi-region failover, and automated recovery runbooks.

Why DR Planning Cannot Wait

Every organization believes a major outage will not happen to them — until it does. AWS region outages, database corruption, ransomware attacks, and human error can take down production systems for hours or days. The cost of downtime varies by industry, but for most SaaS companies, a one-hour outage costs between $10,000 and $500,000 in lost revenue, SLA penalties, and customer trust.

Disaster recovery planning is a core part of architecture planning that transforms reactive panic into structured recovery procedures with predictable outcomes.

Understanding RPO and RTO

Two metrics define every DR strategy:

Recovery Point Objective (RPO): The maximum acceptable data loss measured in time. An RPO of 1 hour means you can lose at most 1 hour of data.

Recovery Time Objective (RTO): The maximum acceptable downtime. An RTO of 4 hours means your systems must be operational within 4 hours of an incident.
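These targets only mean something if they are monitored. A minimal sketch of an RPO check that compares the age of the newest backup against the target (the timestamp lookup is hard-coded here for illustration; in practice it would come from your backup catalog or an S3 listing):

```shell
#!/bin/bash
# Illustrative RPO check: flag when the newest backup is older than the RPO target.
RPO_SECONDS=3600                                   # 1-hour RPO target
LAST_BACKUP_EPOCH=$(date -d "45 minutes ago" +%s)  # stand-in for a real lookup
NOW=$(date +%s)
AGE=$(( NOW - LAST_BACKUP_EPOCH ))

if [ "$AGE" -le "$RPO_SECONDS" ]; then
  echo "RPO OK: newest backup is ${AGE}s old (target ${RPO_SECONDS}s)"
else
  echo "RPO BREACH: newest backup is ${AGE}s old (target ${RPO_SECONDS}s)"
fi
```

Running the same check against the RTO is a matter of measuring elapsed incident time instead of backup age.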

DR Tier Framework

| Tier | RPO | RTO | Strategy | Monthly Cost (Example) |
|------|-----|-----|----------|------------------------|
| Tier 1 | Near-zero | < 15 min | Multi-region active-active | $15,000+ |
| Tier 2 | < 1 hour | < 1 hour | Warm standby | $5,000-10,000 |
| Tier 3 | < 4 hours | < 4 hours | Pilot light | $1,000-3,000 |
| Tier 4 | < 24 hours | < 24 hours | Backup and restore | $200-500 |

Most organizations need different tiers for different systems. Customer-facing APIs might need Tier 1, while internal reporting can tolerate Tier 4.
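One practical consequence of mixed tiers: record each service's tier somewhere machine-readable so alerts and runbooks can look up its targets instead of hard-coding them. A minimal sketch, with hypothetical service names and the RTO values taken from the table above:

```shell
#!/bin/bash
# Map DR tier -> RTO target in minutes (values from the tier table)
rto_minutes() {
  case "$1" in
    1) echo 15 ;;
    2) echo 60 ;;
    3) echo 240 ;;
    4) echo 1440 ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Hypothetical service -> tier assignments
declare -A SERVICE_TIER=( [checkout-api]=1 [internal-reporting]=4 )

for svc in "${!SERVICE_TIER[@]}"; do
  tier=${SERVICE_TIER[$svc]}
  echo "$svc: Tier $tier, RTO target $(rto_minutes "$tier") min"
done
```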

Backup Strategy

Database Backups

```bash
#!/bin/bash
# Automated database backup script
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="production"
BACKUP_DIR="/backups/mysql"
S3_BUCKET="s3://company-backups/mysql"

# Full backup with point-in-time recovery support
# (MySQL 8.0.26+ renames --master-data to --source-data)
mysqldump --single-transaction --routines --triggers \
  --master-data=2 --flush-logs \
  -u backup_user -p"$DB_PASS" "$DB_NAME" | \
  gzip > "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Verify backup integrity before uploading
gunzip -t "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz" || {
  echo "ERROR: backup archive failed integrity check" >&2
  exit 1
}

# Upload to S3 with server-side encryption
aws s3 cp "${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  "${S3_BUCKET}/${DB_NAME}_${TIMESTAMP}.sql.gz" \
  --storage-class STANDARD_IA \
  --sse aws:kms

# Cross-region replication handles copying to the DR region
echo "Backup completed: ${DB_NAME}_${TIMESTAMP}.sql.gz"

# Clean up local backups older than 7 days
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +7 -delete
```
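A script like this is typically driven by cron (or a scheduler such as EventBridge). The entries below are an example schedule with hypothetical paths, not a prescription:

```
# Example crontab: nightly full dump at 02:00 UTC, weekly restore test on Sundays
0 2 * * * /opt/scripts/mysql-backup.sh >> /var/log/mysql-backup.log 2>&1
0 5 * * 0 /opt/scripts/verify-restore.sh >> /var/log/restore-test.log 2>&1
```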

Backup Verification

Backups that have never been tested are not backups — they are assumptions. Schedule weekly restore tests:

```bash
#!/bin/bash
# Weekly backup restore verification
set -euo pipefail

LATEST_BACKUP=$(aws s3 ls s3://company-backups/mysql/ | sort | tail -1 | awk '{print $4}')
if [ -z "$LATEST_BACKUP" ]; then
  echo "ERROR: no backups found in bucket" >&2
  exit 1
fi

# Download to restore test server
aws s3 cp "s3://company-backups/mysql/${LATEST_BACKUP}" /tmp/

# Restore to test database
gunzip -c "/tmp/${LATEST_BACKUP}" | mysql -u root -p"$TEST_DB_PASS" test_restore

# Run integrity checks
EXPECTED_USERS=$(mysql -u root -p"$PROD_DB_PASS" -h prod-db -e "SELECT COUNT(*) FROM users" -sN)
RESTORED_USERS=$(mysql -u root -p"$TEST_DB_PASS" -e "SELECT COUNT(*) FROM test_restore.users" -sN)

if [ "$EXPECTED_USERS" -eq "$RESTORED_USERS" ]; then
  echo "PASS: Backup integrity verified ($RESTORED_USERS users)"
else
  echo "FAIL: Count mismatch (expected $EXPECTED_USERS, got $RESTORED_USERS)"
  # Send alert
fi

# Cleanup
mysql -u root -p"$TEST_DB_PASS" -e "DROP DATABASE test_restore"
rm -f "/tmp/${LATEST_BACKUP}"
```

Multi-Region Architecture

Warm Standby with AWS

```hcl
# Terraform configuration for DR region
resource "aws_rds_cluster" "dr" {
  provider = aws.dr_region

  cluster_identifier = "production-dr"
  engine             = "aurora-mysql"
  engine_version     = "8.0.mysql_aurora.3.05.2"

  # Read replica of production cluster
  replication_source_identifier = aws_rds_cluster.production.arn

  tags = {
    Purpose = "disaster-recovery"
  }
}

# Cluster instances are declared separately from the cluster itself;
# use a smaller class in DR and scale up during failover
resource "aws_rds_cluster_instance" "dr" {
  provider = aws.dr_region

  cluster_identifier = aws_rds_cluster.dr.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.dr.engine
  engine_version     = aws_rds_cluster.dr.engine_version
}

resource "aws_autoscaling_group" "dr_app" {
  provider = aws.dr_region

  name                = "dr-app-asg"
  vpc_zone_identifier = var.dr_private_subnet_ids  # subnets in the DR region, defined elsewhere

  # Keep minimal capacity in DR region
  min_size         = 1
  max_size         = 20
  desired_capacity = 1  # Scale up during failover

  launch_template {
    id      = aws_launch_template.app_dr.id
    version = "$Latest"
  }
}
```
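These resources reference an aliased AWS provider for the DR region, which must be declared once; `us-west-2` here is an example choice:

```hcl
# Provider alias the DR resources reference (region is an example)
provider "aws" {
  alias  = "dr_region"
  region = "us-west-2"
}
```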

DNS Failover with Route 53

```hcl
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 10
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}

resource "aws_route53_record" "api_dr" {
  zone_id = aws_route53_zone.main.id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }

  set_identifier = "secondary"
}
```

Recovery Runbooks

Automated Failover Runbook

```bash
#!/bin/bash
# disaster-recovery-failover.sh
set -euo pipefail

echo "=== DISASTER RECOVERY FAILOVER INITIATED ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "Operator: $USER"

# Step 1: Verify primary is truly down
echo "[Step 1] Verifying primary region failure..."
for i in {1..5}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
    https://api-primary.example.com/healthz || echo "000")
  if [ "$STATUS" == "200" ]; then
    echo "Primary appears healthy. Aborting failover."
    exit 1
  fi
  sleep 2
done
echo "Primary confirmed down after 5 checks."

# Step 2: Promote DR database
echo "[Step 2] Promoting DR database to primary..."
aws rds promote-read-replica-db-cluster \
  --db-cluster-identifier production-dr \
  --region us-west-2

echo "Waiting for DB promotion..."
aws rds wait db-cluster-available \
  --db-cluster-identifier production-dr \
  --region us-west-2

# Step 3: Scale up DR application tier
echo "[Step 3] Scaling DR application servers..."
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-app-asg \
  --desired-capacity 6 \
  --region us-west-2

# Step 4: DNS failover happens automatically via Route 53 health checks
echo "[Step 4] DNS failover in progress via Route 53 health checks..."

# Step 5: Verify DR environment
echo "[Step 5] Running smoke tests against DR..."
sleep 30
DR_STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
  https://api-dr.example.com/healthz || echo "000")
if [ "$DR_STATUS" == "200" ]; then
  echo "DR environment is healthy."
else
  echo "WARNING: DR health check returned $DR_STATUS"
fi

echo "=== FAILOVER COMPLETE ==="
echo "Action items:"
echo "1. Monitor DR environment for 30 minutes"
echo "2. Notify customers of incident"
echo "3. Begin root cause analysis on primary"
echo "4. Plan failback procedure"
```
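The "verify before acting" gate in Step 1 is worth factoring out, because the same check runs in reverse during failback (confirm the DR side is drained, confirm the primary is genuinely healthy, then repoint). A hypothetical reusable helper:

```shell
#!/bin/bash
# Hypothetical helper: return 0 only if every probe of the URL fails,
# i.e. the endpoint is consistently unreachable.
confirm_down() {
  local url=$1 attempts=${2:-5} delay=${3:-2}
  local i
  for ((i = 1; i <= attempts; i++)); do
    if curl -fsS --max-time 5 "$url" > /dev/null 2>&1; then
      return 1  # endpoint answered at least once: not down
    fi
    sleep "$delay"
  done
  return 0      # every probe failed: treat as down
}

# Usage: abort the runbook unless the primary really is unreachable
# confirm_down https://api-primary.example.com/healthz || exit 1
```

Keeping the gate as a function makes the retry count and delay tunable per runbook rather than copy-pasted.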

DR Testing Schedule

DR plans must be tested regularly to remain reliable:

  • Monthly: Backup restore verification (automated)
  • Quarterly: Tabletop exercise with the engineering team
  • Semi-annually: Full failover test to DR region during maintenance window
  • Annually: Chaos engineering exercise (simulate unexpected failure)

Communication Plan

During a disaster, clear communication is as important as technical recovery:

  1. Internal: Incident commander announces failover in dedicated Slack channel
  2. Status page: Update status.example.com within 5 minutes of incident detection
  3. Customer email: Send notification within 30 minutes for prolonged outages
  4. Post-mortem: Publish incident report within 5 business days

Cost Optimization

DR infrastructure does not need to match production capacity during normal operations. Use the pilot light model:

  • Keep minimal compute running in the DR region (1-2 instances)
  • Use database read replicas that can be promoted
  • Store AMIs and container images in the DR region
  • Script the scale-up process for rapid capacity expansion during failover

For comprehensive disaster recovery planning tailored to your infrastructure, our team conducts DR assessments that map business requirements to technical recovery strategies.
