Do Your Backups Actually Work? How to Test a Real Restore
Test your backups properly. Database restore, file restore, integrity verification — the 3-2-1 rule explained.
TL;DR
A backup that has never been restored is not a backup — it is a hope. This article walks you through practical, repeatable methods for testing restores of MySQL and PostgreSQL databases, file-level backups using rsync and tar, integrity verification with SHA-256 checksums, automated verification via cron jobs, and cloud backup testing with AWS S3 lifecycle policies and snapshot restores. If you implement even half of what is covered here, you will be far ahead of most organizations when disaster strikes.
Why Untested Backups Aren't Backups
Every sysadmin has heard the horror story: a critical production database goes down, the team reaches for the backups, and discovers they are corrupted, incomplete, or simply do not restore. According to various industry surveys, roughly 30-40% of backup restores fail when actually attempted. The reasons range from silent write errors and misconfigured retention policies to schema changes that make old dumps incompatible.
The fundamental truth is this: your backup strategy is only as good as your last successful restore test. A backup file sitting on a remote server is nothing more than a blob of bytes until you have proven, under controlled conditions, that it can be turned back into a working system.
Untested backups fail for many reasons:
- Silent corruption — disk errors, interrupted transfers, or storage degradation can corrupt backup files without any alert.
- Schema drift — database schemas evolve over time. A dump taken six months ago may not restore cleanly against the current application.
- Permission issues — file ownership and permissions often differ between the backup and restore environments.
- Missing dependencies — a database restore may succeed, but the application fails because a configuration file, an encryption key, or a related service was not included in the backup scope.
- Human error — the backup script had a typo, the cron job was disabled during maintenance and never re-enabled, or the wrong database was being backed up entirely.
The 3-2-1 Backup Rule
Before diving into restore testing, ensure your backup strategy follows the 3-2-1 rule:
- 3 copies of your data (the original plus two backups)
- 2 different storage media (e.g., local disk and object storage)
- 1 offsite copy (geographically separated from your primary infrastructure)
Some organizations extend this to 3-2-1-1-0: one copy must be immutable (write-once), and there must be zero errors in your last restore verification. That final zero is what this entire article is about.
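The rule becomes much more useful once it is auditable by script. The sketch below uses hypothetical locations (a local /backup/mysql directory and an S3 bucket named my-backup-bucket); adapt the copy-counting logic to your own storage layout:

```shell
#!/usr/bin/env bash
# Sketch: audit the 3-2-1 rule for one dataset.
# Paths and bucket name are placeholders; substitute your own.
check_321() {
  # check_321 COPIES MEDIA OFFSITE: prints OK or a violation summary
  local copies=$1 media=$2 offsite=$3
  if [ "$copies" -ge 3 ] && [ "$media" -ge 2 ] && [ "$offsite" -ge 1 ]; then
    echo "3-2-1 OK"
  else
    echo "3-2-1 VIOLATION: copies=$copies media=$media offsite=$offsite"
  fi
}

# Example inventory: original + local copy + offsite copy
COPIES=1                                     # the original
[ -n "$(ls /backup/mysql/*.sql.gz 2>/dev/null)" ] && COPIES=$((COPIES + 1))
OFFSITE=$(aws s3 ls s3://my-backup-bucket/mysql/ 2>/dev/null | wc -l)
[ "$OFFSITE" -gt 0 ] && COPIES=$((COPIES + 1))
MEDIA=2   # local disk + object storage, per the inventory above
check_321 "$COPIES" "$MEDIA" "$([ "$OFFSITE" -gt 0 ] && echo 1 || echo 0)"
```

Run it alongside the restore tests below; a 3-2-1 violation is worth the same alert as a failed restore.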
Prerequisites
- A dedicated restore testing environment (VM, container, or staging server) — never test restores on production
- Access to your backup storage (local, NFS, S3, etc.)
- Installed tools: mysql/mysqldump, psql/pg_dump/pg_restore, rsync, tar, sha256sum, aws CLI (for cloud testing)
- Sufficient disk space on the test system (at least 2x your largest backup)
- Root or sudo access on the test machine
- A documented inventory of what is being backed up and where backups are stored
Step 1: Testing MySQL Restores
Creating a Backup with mysqldump
If you are not already using mysqldump, here is a solid baseline command that produces a consistent, restorable dump:
# Full backup of a single database with routines and triggers
mysqldump \
--single-transaction \
--routines \
--triggers \
--set-gtid-purged=OFF \
-u backup_user -p \
production_db > /backup/mysql/production_db_$(date +%Y%m%d_%H%M%S).sql
# Compress immediately
gzip /backup/mysql/production_db_*.sql
The --single-transaction flag is critical for InnoDB tables — it ensures a consistent snapshot without locking the database.
Performing the Test Restore
# Step 1: Create a disposable test database
mysql -u root -p -e "CREATE DATABASE restore_test CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
# Step 2: Restore the backup into the test database
gunzip -c /backup/mysql/production_db_20260329_020000.sql.gz | \
mysql -u root -p restore_test
# Step 3: Verify row counts against known-good values
mysql -u root -p restore_test -e "
SELECT 'users' AS tbl, COUNT(*) AS row_count FROM users
UNION ALL
SELECT 'orders', COUNT(*) FROM orders
UNION ALL
SELECT 'products', COUNT(*) FROM products;"
# Step 4: Run a checksum comparison
mysql -u root -p -e "CHECKSUM TABLE restore_test.users, restore_test.orders;"
# Step 5: Clean up
mysql -u root -p -e "DROP DATABASE restore_test;"
Validating Application Compatibility
Row counts alone do not prove your backup works. Point a staging instance of your application at the restored database and verify that key operations function correctly: user login, data retrieval, report generation. If your application has a health check endpoint, use it.
# Point staging app at restored DB and run smoke tests
export DATABASE_URL="mysql://root:password@localhost:3306/restore_test"
curl -s http://localhost:8080/health | jq '.database_status'
Step 2: Testing PostgreSQL Restores
Creating a Backup with pg_dump
# Custom format backup (recommended — supports parallel restore)
pg_dump \
-Fc \
--no-owner \
--no-acl \
-h localhost \
-U backup_user \
production_db > /backup/postgresql/production_db_$(date +%Y%m%d_%H%M%S).dump
# Alternative: plain SQL format (human-readable, useful for debugging)
pg_dump \
--no-owner \
--no-acl \
-h localhost \
-U backup_user \
production_db > /backup/postgresql/production_db_$(date +%Y%m%d_%H%M%S).sql
Performing the Test Restore
# Step 1: Create a test database
createdb -U postgres restore_test
# Step 2: Restore from custom format dump
pg_restore \
--no-owner \
--no-acl \
-d restore_test \
-U postgres \
--jobs=4 \
/backup/postgresql/production_db_20260329_020000.dump
# Step 3: Verify table counts
psql -U postgres -d restore_test -c "
SELECT schemaname, relname AS table_name, n_live_tup AS row_count
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;"
# Step 4: Check for restore errors (pg_restore returns warnings)
# Redirect stderr to a log and inspect
pg_restore \
--no-owner \
--no-acl \
-d restore_test \
-U postgres \
/backup/postgresql/production_db_20260329_020000.dump 2> /tmp/restore_errors.log
if [ -s /tmp/restore_errors.log ]; then
echo "WARNING: Restore produced errors:"
cat /tmp/restore_errors.log
fi
# Step 5: Verify data integrity with a known query
psql -U postgres -d restore_test -c "
SELECT COUNT(*) AS total_users,
MAX(created_at) AS latest_record
FROM users;"
# Step 6: Clean up
dropdb -U postgres restore_test
Testing Point-in-Time Recovery (PITR)
If you use WAL archiving for point-in-time recovery, test that as well. Restore the base backup, then replay WAL segments to a specific timestamp:
# PostgreSQL 12+ removed recovery.conf: put these settings in
# postgresql.auto.conf and create an empty recovery.signal file
# in the data directory
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-03-29 14:30:00'
recovery_target_action = 'promote'
PITR is one of the most powerful recovery tools available, but it is also one of the most commonly left untested.
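An end-to-end PITR test can be scripted along the following lines. Treat this as a sketch rather than a drop-in script: the data directory, base backup path, and systemd unit name are assumptions that differ per installation.

```shell
#!/usr/bin/env bash
# Sketch of a PITR restore test (PostgreSQL 12+). All paths are examples.
PGDATA=/var/lib/postgresql/16/main          # assumed data directory
BASE_BACKUP=/backup/postgresql/base.tar.gz  # assumed base backup archive

sudo systemctl stop postgresql
sudo rm -rf "$PGDATA"
sudo mkdir -p "$PGDATA"
sudo tar -xzf "$BASE_BACKUP" -C "$PGDATA"

# Recovery settings go into postgresql.auto.conf on PG 12+
sudo tee -a "$PGDATA/postgresql.auto.conf" > /dev/null <<'EOF'
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-03-29 14:30:00'
recovery_target_action = 'promote'
EOF

# An empty recovery.signal file tells PostgreSQL to enter recovery mode
sudo touch "$PGDATA/recovery.signal"
sudo chown -R postgres:postgres "$PGDATA"
sudo chmod 700 "$PGDATA"
sudo systemctl start postgresql

# After recovery completes, confirm the data stops at the target time
psql -U postgres -d production_db -c "SELECT max(created_at) FROM users;"
```

If max(created_at) is later than your recovery target, the WAL replay went further than intended and the target settings need a second look.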
Step 3: Testing File Restores
Restoring from tar Archives
# Create a backup (for reference)
tar -czf /backup/files/webroot_$(date +%Y%m%d).tar.gz \
-C /var/www html/
# Test restore to a temporary directory
mkdir -p /tmp/restore_test_files
tar -xzf /backup/files/webroot_20260329.tar.gz \
-C /tmp/restore_test_files/
# Verify file count matches
ORIGINAL_COUNT=$(find /var/www/html -type f | wc -l)
RESTORED_COUNT=$(find /tmp/restore_test_files/html -type f | wc -l)
if [ "$ORIGINAL_COUNT" -eq "$RESTORED_COUNT" ]; then
echo "OK: File count matches ($ORIGINAL_COUNT files)"
else
echo "FAIL: Original has $ORIGINAL_COUNT files, restored has $RESTORED_COUNT"
fi
# Deep comparison using diff
diff -rq /var/www/html/ /tmp/restore_test_files/html/
# Clean up
rm -rf /tmp/restore_test_files
Restoring with rsync
# Restore from rsync backup to test directory
rsync -avz --dry-run \
/backup/rsync/webroot/ \
/tmp/restore_test_rsync/
# If dry run looks correct, execute
rsync -avz \
/backup/rsync/webroot/ \
/tmp/restore_test_rsync/
# Verify with rsync's built-in checksum comparison
rsync -avcn \
/var/www/html/ \
/tmp/restore_test_rsync/
# The -c flag forces checksum comparison instead of timestamp/size
# Output should show no differences if backup is current
Step 4: Verifying Backup Integrity
SHA-256 Checksums
Always generate checksums at backup time and verify them before attempting a restore.
# At backup time: generate checksum
sha256sum /backup/mysql/production_db_20260329_020000.sql.gz \
> /backup/mysql/production_db_20260329_020000.sql.gz.sha256
# Before restore: verify checksum
sha256sum -c /backup/mysql/production_db_20260329_020000.sql.gz.sha256
# Expected output: production_db_20260329_020000.sql.gz: OK
# Batch verification of all backups in a directory
cd /backup/mysql
for f in *.sha256; do
if ! sha256sum -c "$f" 2>/dev/null; then
echo "INTEGRITY FAILURE: $f"
fi
done
Verifying Compressed Archives
# Test gzip integrity without extracting
gzip -t /backup/mysql/production_db_20260329_020000.sql.gz && \
echo "Archive OK" || echo "Archive CORRUPT"
# Test tar.gz integrity
tar -tzf /backup/files/webroot_20260329.tar.gz > /dev/null && \
echo "Archive OK" || echo "Archive CORRUPT"
Step 5: Automating Restore Verification with Cron
Manual testing is good. Automated, scheduled testing is better. The following script performs a full restore verification and sends an alert on failure.
#!/usr/bin/env bash
# /usr/local/bin/verify-backup-restore.sh
# Run weekly via cron to validate backup restorability
set -euo pipefail
LOG_FILE="/var/log/backup-verify.log"
ALERT_EMAIL="ops@example.com"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
STATUS="OK"
log() {
echo "[$TIMESTAMP] $1" | tee -a "$LOG_FILE"
}
alert_failure() {
local msg="$1"
log "FAIL: $msg"
echo "Backup restore verification FAILED: $msg" | \
mail -s "[ALERT] Backup Restore Failure" "$ALERT_EMAIL"
STATUS="FAIL"
}
# --- MySQL Restore Test ---
log "Starting MySQL restore verification..."
# "|| true" keeps set -e/pipefail from aborting when no backups exist
LATEST_MYSQL=$(ls -t /backup/mysql/*.sql.gz 2>/dev/null | head -1 || true)
if [ -z "$LATEST_MYSQL" ]; then
alert_failure "No MySQL backup files found"
else
# Check age (fail if older than 25 hours)
BACKUP_AGE=$(( ($(date +%s) - $(stat -c %Y "$LATEST_MYSQL")) / 3600 ))
if [ "$BACKUP_AGE" -gt 25 ]; then
alert_failure "Latest MySQL backup is ${BACKUP_AGE}h old (>25h)"
fi
# Verify checksum
if [ -f "${LATEST_MYSQL}.sha256" ]; then
if ! sha256sum -c "${LATEST_MYSQL}.sha256" > /dev/null 2>&1; then
alert_failure "MySQL backup checksum mismatch"
fi
fi
# Attempt restore
# Credentials come from ~/.my.cnf; never hardcode passwords in scripts
mysql -u root -e "DROP DATABASE IF EXISTS verify_restore; CREATE DATABASE verify_restore;"
if gunzip -c "$LATEST_MYSQL" | mysql -u root verify_restore 2>> "$LOG_FILE"; then
TABLES=$(mysql -u root verify_restore -N -e "SHOW TABLES;" | wc -l)
log "MySQL restore OK: $TABLES tables restored"
if [ "$TABLES" -lt 1 ]; then
alert_failure "MySQL restore produced 0 tables"
fi
else
alert_failure "MySQL restore command failed"
fi
mysql -u root -e "DROP DATABASE IF EXISTS verify_restore;"
fi
# --- PostgreSQL Restore Test ---
log "Starting PostgreSQL restore verification..."
LATEST_PG=$(ls -t /backup/postgresql/*.dump 2>/dev/null | head -1 || true)
if [ -z "$LATEST_PG" ]; then
alert_failure "No PostgreSQL backup files found"
else
dropdb -U postgres --if-exists verify_restore 2>/dev/null
createdb -U postgres verify_restore
if pg_restore --no-owner -d verify_restore -U postgres "$LATEST_PG" 2>> "$LOG_FILE"; then
TABLES=$(psql -U postgres -d verify_restore -tAc \
"SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
log "PostgreSQL restore OK: $TABLES tables restored"
else
alert_failure "PostgreSQL restore command failed"
fi
dropdb -U postgres --if-exists verify_restore
fi
# --- File Backup Integrity Test ---
log "Starting file backup integrity check..."
for archive in /backup/files/*.tar.gz; do
    # Guard against an unmatched glob (no backups at all)
    [ -e "$archive" ] || { alert_failure "No file backup archives found"; break; }
    if ! tar -tzf "$archive" > /dev/null 2>&1; then
        alert_failure "Corrupt archive: $archive"
    fi
done
log "File backup integrity check complete"
# --- Summary ---
log "Verification complete. Status: $STATUS"
exit $([ "$STATUS" = "OK" ] && echo 0 || echo 1)
Add this to your crontab:
# Run backup restore verification every Sunday at 04:00
0 4 * * 0 /usr/local/bin/verify-backup-restore.sh >> /var/log/backup-verify.log 2>&1
Step 6: Cloud Backup Testing (S3, Snapshots)
AWS S3 Backup Verification
# List recent backups in S3
aws s3 ls s3://my-backup-bucket/mysql/ --recursive \
| sort -k1,2 | tail -5
# Download and verify the latest backup
aws s3 cp s3://my-backup-bucket/mysql/production_db_20260329_020000.sql.gz /tmp/
sha256sum /tmp/production_db_20260329_020000.sql.gz
# Verify against stored checksum
aws s3 cp s3://my-backup-bucket/mysql/production_db_20260329_020000.sql.gz.sha256 /tmp/
sha256sum -c /tmp/production_db_20260329_020000.sql.gz.sha256
S3 Lifecycle and Versioning
Ensure your S3 bucket has versioning enabled and appropriate lifecycle policies. This protects against accidental deletion and ransomware:
# Check versioning status
aws s3api get-bucket-versioning --bucket my-backup-bucket
# Enable versioning if not active
aws s3api put-bucket-versioning \
--bucket my-backup-bucket \
--versioning-configuration Status=Enabled
# View lifecycle rules
aws s3api get-bucket-lifecycle-configuration --bucket my-backup-bucket
# Example lifecycle policy: transition to Glacier after 30 days,
# delete after 365 days
aws s3api put-bucket-lifecycle-configuration \
--bucket my-backup-bucket \
--lifecycle-configuration '{
"Rules": [{
"ID": "backup-lifecycle",
"Status": "Enabled",
"Filter": {"Prefix": ""},
"Transitions": [{
"Days": 30,
"StorageClass": "GLACIER"
}],
"Expiration": {"Days": 365},
"NoncurrentVersionExpiration": {"NoncurrentDays": 90}
}]
}'
EC2/EBS Snapshot Restore Testing
# Find the latest snapshot for a volume
LATEST_SNAP=$(aws ec2 describe-snapshots \
--filters "Name=volume-id,Values=vol-0abc123def456" \
--query 'Snapshots | sort_by(@, &StartTime) | [-1].SnapshotId' \
--output text)
echo "Latest snapshot: $LATEST_SNAP"
# Create a test volume from the snapshot
TEST_VOL=$(aws ec2 create-volume \
--snapshot-id "$LATEST_SNAP" \
--availability-zone eu-central-1a \
--volume-type gp3 \
--tag-specifications 'ResourceType=volume,Tags=[{Key=Purpose,Value=restore-test}]' \
--query 'VolumeId' --output text)
echo "Test volume created: $TEST_VOL"
# Attach to a test instance, mount, and verify
aws ec2 attach-volume \
--volume-id "$TEST_VOL" \
--instance-id i-0test123instance \
--device /dev/xvdf
# On the test instance:
# sudo mkdir -p /mnt/restore_test
# sudo mount /dev/xvdf1 /mnt/restore_test
# ls -la /mnt/restore_test/
# sudo umount /mnt/restore_test
# Clean up
aws ec2 detach-volume --volume-id "$TEST_VOL"
aws ec2 wait volume-available --volume-ids "$TEST_VOL"
aws ec2 delete-volume --volume-id "$TEST_VOL"
Troubleshooting
MySQL Restore Fails with "Access Denied"
Ensure the restore user has sufficient privileges. The backup may contain DEFINER clauses referencing a user that does not exist on the test system.
# Strip DEFINER clauses during restore
gunzip -c backup.sql.gz | \
sed 's/DEFINER=[^ ]* //g' | \
mysql -u root restore_test
PostgreSQL Restore Reports "Role Does Not Exist"
Use --no-owner and --no-acl flags, or create the missing roles beforehand:
# Create missing role
psql -U postgres -c "CREATE ROLE app_user WITH LOGIN;"
# Or restore without ownership
pg_restore --no-owner --no-acl -d restore_test backup.dump
tar Restore Fails with "Cannot Change Ownership"
This occurs when restoring as a non-root user. Use --no-same-owner:
tar -xzf backup.tar.gz --no-same-owner -C /tmp/restore_test/
Checksum Mismatch After S3 Transfer
S3 multipart uploads can cause checksum mismatches if the checksum was calculated on the original file. Verify using S3's ETag or use --checksum-algorithm SHA256 during upload:
# Upload with integrity checking
aws s3 cp backup.sql.gz s3://my-bucket/ --checksum-algorithm SHA256
# Verify after download
aws s3api head-object --bucket my-bucket --key backup.sql.gz \
--checksum-mode ENABLED
Backup File Is Older Than Expected
Check if the backup cron job is still running and if the backup script is exiting silently on error:
# Check cron logs
grep -i backup /var/log/syslog | tail -20
# Check if backup process ran recently
ls -lah /backup/mysql/*.sql.gz | tail -5
# Verify cron is active
crontab -l | grep backup
Prevention: Building a Backup Testing Culture
Establish a Restore Testing Schedule
- Weekly: Automated integrity checks (checksum verification, archive testing)
- Monthly: Full automated restore to a test environment with row count and table verification
- Quarterly: Manual end-to-end disaster recovery drill — restore from backup and verify application functionality
- After every major change: Schema migrations, infrastructure changes, or backup tool upgrades
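The weekly and monthly tiers of this schedule map directly onto cron. The second script path below is an assumption; substitute whatever your full-restore script is actually called:

```shell
# Weekly: automated integrity and restore checks (Sunday 04:00, see Step 5)
0 4 * * 0 /usr/local/bin/verify-backup-restore.sh >> /var/log/backup-verify.log 2>&1
# Monthly: full restore into the test environment (1st of the month, 03:00)
0 3 1 * * /usr/local/bin/full-restore-test.sh >> /var/log/backup-verify.log 2>&1
```

Quarterly drills and post-change tests are deliberately left manual; their value lies in humans walking the runbook.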
Document Your Recovery Procedure
Create a runbook that anyone on the team can follow under pressure. Include:
- Where backups are stored (all locations, all credentials)
- Exact restore commands for each service
- Expected recovery time for each component
- Contact information for escalation
- The order in which services must be restored (dependencies)
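A minimal skeleton for such a runbook might look like this (section names and values are suggestions, not a standard):

```text
RESTORE RUNBOOK: <service name>
1. Backup locations:  /backup/..., s3://... (credentials in the team vault)
2. Restore commands:  exact, copy-pasteable, one block per service
3. Expected RTO:      e.g. database 30 min, files 15 min
4. Restore order:     database, then application config, then web tier
5. Verification:      health check URL, smoke test commands
6. Escalation:        on-call contact, vendor support numbers
```

Print a copy. The outage that needs this runbook may also take out your wiki.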
Monitor Backup Health
# Example: Prometheus/node_exporter textfile collector
# Add to the end of your backup script (metric name is an example):
cat > /var/lib/node_exporter/textfile_collector/backup_status.prom <<EOF
backup_last_success_timestamp_seconds $(date +%s)
EOF
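Once the backup script exports such a gauge (the metric name backup_last_success_timestamp_seconds used here is an assumption), a Prometheus alerting rule can page when backups go stale. A sketch, with a 90000-second threshold (25 hours) matching the age check in the verification script:

```yaml
groups:
  - name: backup-alerts
    rules:
      - alert: BackupTooOld
        expr: time() - backup_last_success_timestamp_seconds > 90000
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Last successful backup is older than 25 hours"
```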
Key Takeaways
- Schedule restore tests like you schedule backups — automatically and regularly.
- Verify integrity at every stage: creation, transfer, storage, and restore.
- Test the full stack, not just the database dump. Your application must actually work with the restored data.
- Treat backup failures as P1 incidents. A broken backup pipeline is an active risk to the business.
- Keep your restore documentation current and accessible to the entire team.
Need Expert Help?
Want me to verify your backups and tell you if you're safe? €39.
Book Now — €39
100% money-back guarantee