When a Migration Fails Mid-Process: What’s Really Happening
Data migrations look straightforward on paper: extract, transform, load, validate, cut over. In practice, network blips, long-running locks, schema mismatches, and unexpected data can derail a migration halfway. The result is a dangerous state: partial data changes, inconsistent schemas, duplicate records, corrupted indexes, and confused applications.
Understanding common failure modes helps you choose the safest recovery path:
- Partial writes: Only a subset of rows/documents loaded; some updates applied, others not.
- Inconsistent schemas: DDL applied on the target but not on the source (or vice versa), causing application errors.
- Broken constraints: Foreign keys fail, unique indexes conflict, or triggers misfire.
- Duplicate data: Retries without idempotence create duplicates or update records multiple times.
- Lock contention: Long-running operations block reads/writes, piling up client errors or timeouts.
- Replication divergence: Primaries and replicas drift apart (especially mid-migration).
- Interrupted background operations: Index builds or online schema changes left in an intermediate state.
The steps below explain how to triage, recover, and prevent repeats across PostgreSQL, MongoDB, and MySQL.
Immediate Triage: Stabilize Before You Fix
Think firefighter first, surgeon later. The faster you contain the incident, the easier the recovery.
- Freeze writes to the affected systems.
  - Put the application in read-only mode (feature flag or maintenance window); the sketch after this list shows database-level toggles.
  - Stop ingestion pipelines, ETL jobs, consumers, or migration orchestrators.
  - In sharded or distributed environments, ensure writes aren’t still trickling to a subset.
- Capture the state and evidence.
  - Note timestamps, migration IDs/batch IDs, and last successful checkpoints.
  - Snapshot affected tables/collections if feasible (storage-level or logical).
  - Preserve logs: database logs, migration script output, orchestration logs.
- Assess blast radius quickly.
  - Which tables/collections were touched?
  - Are schemas partially changed (some DDL applied)?
  - Did replicas or downstream systems apply partial changes?
- Decide on rollback vs. roll-forward (don’t do both).
  - Rollback: restore from backup or PITR to a safe point if the window is short and the system is compromised.
  - Roll-forward: fix inconsistencies and complete the migration if the partial work is salvageable and validated.
- Communicate status.
  - Share a short incident update: what’s impacted, current mode (read-only), ETA for next update.
  - Assign clear roles: one lead, one operator per system, one recorder.
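If a database-level write freeze is easier than an application flag, here is a minimal sketch (assuming a PostgreSQL database named appdb and a MySQL primary; MongoDB write freezes are usually handled at the application level or by revoking write roles):

```sql
-- PostgreSQL: new sessions on appdb default to read-only transactions
-- (existing sessions keep their current setting until they reconnect)
ALTER DATABASE appdb SET default_transaction_read_only = on;

-- MySQL: block writes from normal users, then from everyone except replication
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;
```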
Decision Framework: Roll Back or Roll Forward?
Choose based on data integrity, time pressure, and confidence.
- Roll back if:
  - The migration damaged data or schema in non-trivial ways.
  - You lack deterministic, idempotent steps to safely resume.
  - Validation shows widespread inconsistency.
  - PITR or snapshots make rollback fast and low-risk.
- Roll forward if:
  - The partial state is understandable and isolated.
  - The migration is idempotent (safe to re-run).
  - You can reconcile differences deterministically (e.g., by batch ID or a high-water mark).
  - Rollback would be slower or riskier than repair.
If you’re unsure, bias to rollback. It’s often safer than prolonged uncertainty with live data.
PostgreSQL: Recovery Strategies
PostgreSQL has a powerful transactional model—even for many DDL operations—which can make recovery more forgiving if you used transactions properly.
1) If a Transaction Is Still Open or Hung
- Check for active sessions and locks:
  - Use `pg_stat_activity` to identify the migration session.
  - If it’s hung, cancel or terminate it (see the sketch below).
- If the migration was wrapped in a single transaction, a `ROLLBACK` will revert changes.
- If partial data loaded outside a transaction, assess the scope and prepare to reconcile.
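A minimal sketch of the lookup and cancellation, assuming the migration connects with a distinctive application_name (adjust the filter to however you tag the session):

```sql
-- Find the migration session(s) and what they are waiting on
SELECT pid, state, query_start, wait_event_type, query
FROM pg_stat_activity
WHERE application_name = 'migration';

-- Prefer cancelling the running statement first...
SELECT pg_cancel_backend(12345);      -- pid from the query above

-- ...and terminate the whole session only if the cancel doesn't take
SELECT pg_terminate_backend(12345);
```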
2) Point-in-Time Recovery (PITR) with WAL
When the database state is compromised and you have WAL archiving:
- Restore from a base backup taken before the migration.
- Replay WAL to a timestamp or LSN right before the migration started (see the sketch below).
- Validate in a staging environment if possible before promoting.
- If you temporarily diverged to a restored node, use `pg_rewind` to rejoin the cluster after promoting the correct node.
PITR is ideal when the migration isn’t idempotent or touched too many tables inconsistently.
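A minimal sketch of the recovery-target settings on the restored base backup (PostgreSQL 12+ style; the archive path and timestamp are placeholders for your environment and pre-migration point):

```
# postgresql.conf on the restored node, plus an empty recovery.signal file in the data directory
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2025-09-28 10:59:00+00'
recovery_target_action = 'pause'   # inspect the restored state before promoting
```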
3) Clean Up Partial Loads
If the migration was an ETL-style load:
- Use staging tables:
  - Load into `staging_target` first.
  - Validate counts, checksums, and key uniqueness.
  - Move data into `target` with idempotent upserts:

    ```sql
    INSERT INTO target (id, col1, col2, ...)
    SELECT id, col1, col2, ...
    FROM staging_target
    ON CONFLICT (id) DO UPDATE
      SET col1 = EXCLUDED.col1,
          col2 = EXCLUDED.col2;
    ```

- If duplicates were created, deduplicate using `DISTINCT ON (id)` into a new table, then swap:

  ```sql
  CREATE TABLE target_new AS
  SELECT DISTINCT ON (id) ...
  ORDER BY id, updated_at DESC;
  ```

- Swap atomically with a transaction: rename `target` -> `target_old` and `target_new` -> `target`, then drop `target_old` after verification.
4) Sequence Corrections
Failed inserts can leave sequences ahead or behind.
- Reset sequence to max ID:

  ```sql
  SELECT setval('target_id_seq', (SELECT coalesce(MAX(id), 1) FROM target));
  ```
5) Constraints, Triggers, and Validations
- Disable and re-enable constraints carefully:
  - If you temporarily disabled triggers via `session_replication_role = 'replica'`, re-enable and validate.
  - Use deferrable constraints during data loads, adding them as `NOT VALID` so existing rows are checked later:

    ```sql
    ALTER TABLE child
      ADD CONSTRAINT fk_parent FOREIGN KEY (parent_id) REFERENCES parent(id)
      DEFERRABLE INITIALLY DEFERRED NOT VALID;
    ```

  - Validate after load:

    ```sql
    ALTER TABLE child VALIDATE CONSTRAINT fk_parent;
    ```

- Validate data integrity:
  - `SELECT COUNT(*)` comparisons between source and target.
  - Range checks for primary keys.
  - Hash checksums:

    ```sql
    SELECT md5(string_agg(id || ':' || col1 || ':' || col2, ',' ORDER BY id))
    FROM target
    WHERE batch_id = 'xyz';
    ```
6) Lock-Safe Schema Changes
If a DDL operation failed mid-flight due to locks:
- Re-plan DDL to minimize table rewrites:
  - For large tables, avoid adding a column with a `DEFAULT` in a single step on older PostgreSQL versions (before 11 this rewrites the whole table); prefer a nullable column plus an application-level default, then backfill.
  - Split changes into small, reversible steps (add nullable column, backfill in batches, add constraint); see the sketch after this list.
- Use a maintenance window and a low `lock_timeout` so the DDL fails fast instead of queueing behind long-lived locks.
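A minimal sketch of the expand-then-backfill pattern, assuming a hypothetical `new_col` on `target` and a 10,000-row batch size:

```sql
SET lock_timeout = '5s';                      -- give up quickly rather than queue behind other locks
ALTER TABLE target ADD COLUMN new_col text;   -- nullable add: metadata-only, no rewrite

-- Backfill in small batches so row locks stay short; repeat until zero rows are updated
UPDATE target
SET new_col = col1
WHERE id IN (
  SELECT id
  FROM target
  WHERE new_col IS NULL
  ORDER BY id
  LIMIT 10000
);
-- Add defaults, NOT NULL, or constraints as separate, later steps.
```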
7) Testing and Idempotency
- Make migrations idempotent:
  - Guard with `IF NOT EXISTS` on DDL.
  - Use `ON CONFLICT` for upserts.
  - Use `CREATE INDEX CONCURRENTLY` and check `pg_stat_progress_create_index` if supported; if the build fails, drop the partially built index and retry (see the sketch below).
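A minimal sketch of that retry pattern, assuming a hypothetical index on `target (batch_id)`:

```sql
-- Build without blocking writes; IF NOT EXISTS makes re-runs safe
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_target_batch_id ON target (batch_id);

-- A failed concurrent build leaves an INVALID index behind; list them before retrying
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- DROP INDEX CONCURRENTLY idx_target_batch_id;   -- then re-run the CREATE above
```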
MongoDB: Recovery Strategies
MongoDB’s document-level atomicity and flexible schema demand careful planning for multi-document migrations. Multi-document transactions have been available on replica sets since MongoDB 4.0 (and on sharded clusters since 4.2), but not every migration uses them.
1) Use Transactions Where Possible
If the migration ran within a transaction and failed, it likely aborted cleanly. If you mixed transactional and non-transactional operations:
- Identify documents touched by the migration using a migration key, e.g., `migrationId` or `batchId`.
- For incomplete updates, decide to:
  - Delete partial docs in staging collections and re-run the batch, or
  - Complete the updates idempotently using upserts.
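If you re-run a batch, wrapping it in a transaction keeps the retry all-or-nothing. A minimal mongosh sketch, assuming a hypothetical `app.users_v2` collection and placeholder documents:

```javascript
const session = db.getMongo().startSession();
const coll = session.getDatabase("app").users_v2;

session.startTransaction({ writeConcern: { w: "majority" } });
try {
  coll.updateOne(
    { _id: 1 },
    { $set: { col1: "a", status: "complete", migrationId: "b123" } },
    { upsert: true }
  );
  // ...more writes for the batch...
  session.commitTransaction();   // nothing is visible until this succeeds
} catch (e) {
  session.abortTransaction();    // a failure leaves no partial writes behind
  throw e;
} finally {
  session.endSession();
}
```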
2) Idempotent Bulk Operations
For bulk changes:
- Use `bulkWrite` with `ordered: false` to maximize throughput and isolate failures (see the sketch at the end of this subsection).
- Use idempotent patterns:
  - Upsert with `$set` and `$setOnInsert`.
  - Use `retryWrites: true` and `writeConcern: { w: "majority" }`.
- Example approach:
  - When migrating `users` to `users_v2`, include:
    - `_id` preserved.
    - `migrationId: "2025-09-28_01"`.
    - `status: "migrating" | "complete"`.
  - After a successful write, atomically set `status: "complete"`.
If the process fails mid-way, find and fix partial docs:

```javascript
db.users_v2.find({ migrationId: "2025-09-28_01", status: { $ne: "complete" } })
```

Then re-run upserts for these keys or roll back by deleting these docs.
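A minimal sketch of the unordered, idempotent bulk upsert mentioned above (mongosh; the documents and field names are placeholders):

```javascript
const docs = [
  { _id: 1, col1: "a" },
  { _id: 2, col1: "b" },
];

const ops = docs.map(doc => ({
  updateOne: {
    filter: { _id: doc._id },
    update: {
      $set: { col1: doc.col1, migrationId: "2025-09-28_01", status: "complete" },
      $setOnInsert: { createdAt: new Date() },
    },
    upsert: true,
  },
}));

// ordered: false lets the remaining operations run even if one fails
db.users_v2.bulkWrite(ops, { ordered: false, writeConcern: { w: "majority" } });
```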
3) Recover with Oplog or Snapshots
- If your migration corrupted data or indexes:
  - Restore from a known-good snapshot (filesystem snapshot taken with journaling enabled, or a backup solution like Ops Manager/Cloud Manager).
  - Use Oplog-based PITR to replay up to a safe timestamp before migration start (see the sketch below).
  - Validate in a staging node before promoting.
- For sharded clusters, ensure all shards and config servers are consistent before resuming.
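A minimal sketch with mongorestore, assuming a dump taken with `--oplog`, a placeholder path, and a placeholder oplog timestamp (seconds-since-epoch:ordinal) just before the migration started:

```bash
mongorestore --drop --oplogReplay \
  --oplogLimit 1759053540:1 \
  /backups/pre_migration_dump
```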
4) Index and Schema Consistency
- Interrupted index builds can be retried. Modern MongoDB versions have resumable index builds, but verify:
  - List index builds in progress and drop/finish as needed (see the sketch after this list).
  - Rebuild with appropriate commit quorum.
- If you changed schema conventions (e.g., field rename or nesting):
  - Use a backfill job with idempotent updates.
  - Build new indexes first; then switch application reads to new fields with a feature flag.
  - Once stable, drop old indexes and fields.
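A minimal sketch for inspecting builds in progress and current indexes (mongosh; the collection name is a placeholder):

```javascript
// In-progress index builds appear as createIndexes commands in currentOp
db.currentOp({ "command.createIndexes": { $exists: true } })

// Confirm what actually exists on the collection before retrying a build
db.users_v2.getIndexes()
```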
5) Orphaned or Duplicated Documents
- If duplicates exist due to retries (same logical entity repeated with different `_id`), choose a canonical doc:
  - Prefer the doc with the latest `updatedAt`, or the one marked `status: "complete"`.
  - Merge data and delete duplicates in small batches with majority write concern.
- For references across collections, run data consistency checks, then repair orphaned references with a controlled backfill:

  ```javascript
  // Example: verify all orders have a matching user
  const cursor = db.orders.aggregate([
    { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "u" } },
    { $match: { u: { $size: 0 } } },
    { $limit: 100 } // sample for inspection
  ]);
  ```
6) Observability and Change Streams
- Use change streams to monitor migration progress and errors in near real-time (see the sketch below).
- Track lag, error rates, and slow operations to decide when to pause or pivot to rollback.
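A minimal sketch of watching a target collection for migration writes (mongosh; the collection, field names, and migration ID are placeholders):

```javascript
const stream = db.users_v2.watch(
  [{ $match: { "fullDocument.migrationId": "2025-09-28_01" } }],
  { fullDocument: "updateLookup" }   // include the post-update document for update events
);

while (stream.hasNext()) {
  const event = stream.next();
  print(`${event.operationType} on ${JSON.stringify(event.documentKey)}`);
}
```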
MySQL: Recovery Strategies
MySQL’s recovery story depends on engine (use InnoDB), version (atomic DDL in 8.0 for many operations), and whether you used online migration tools.
1) Transactional Recovery (If Applicable)
- If your migration ran inside a transaction and the session is intact, `ROLLBACK` will revert changes.
- Beware DDL: every DDL statement commits implicitly, and atomic DDL (MySQL 8.0+) only makes individual DDL statements all-or-nothing; it does not make a multi-statement migration transactional, and older versions offer weaker guarantees.
2) Point-in-Time Recovery with Binary Logs
When you must revert to pre-migration state:
- Restore from a recent full backup.
- Replay binary logs using `mysqlbinlog` up to a safe time:

  ```bash
  mysqlbinlog --start-datetime="2025-09-28 10:00:00" \
              --stop-datetime="2025-09-28 11:12:00" \
              /var/lib/mysql/binlog.000123 | mysql -u root -p
  ```
- If the migration contaminated data widely, rebuild replicas from the restored primary to avoid divergence.
Note: Generating reverse statements automatically is not standard; prefer restore + roll-forward rather than trying to “undo” statements.
3) Cleaning Up Interrupted Data Loads
- If `LOAD DATA INFILE` or bulk inserts were interrupted:
  - Load into a staging table first, then upsert into the target:

    ```sql
    INSERT INTO target (id, col1, col2)
    SELECT id, col1, col2 FROM staging
    ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2);
    ```

  - If duplicates occurred, deduplicate using a temporary table with a GROUP BY or window approach, then swap.
- Reset `AUTO_INCREMENT` if necessary:
  - Get `SELECT MAX(id) FROM target;`
  - Then `ALTER TABLE target AUTO_INCREMENT = <max+1>;`
4) Foreign Keys and Integrity Checks
- If you disabled `FOREIGN_KEY_CHECKS` during load, you must validate after:
  - Check orphans:

    ```sql
    SELECT c.*
    FROM child c
    LEFT JOIN parent p ON p.id = c.parent_id
    WHERE p.id IS NULL;
    ```

  - Fix orphans before re-enabling strict checks.
- Rebuild affected indexes if needed and run `ANALYZE TABLE` for optimizer stats.
5) Online Schema Changes and Aborted Runs
If you used pt-online-schema-change or gh-ost and the migration failed mid-process:
- gh-ost cleanup:
  - Drop the ghost table (suffix `_gho`) and changelog table (`_ghc`) if cutover didn’t happen; the sketch below shows how to spot leftovers.
  - gh-ost is triggerless, so there are no triggers to remove; just confirm its process and binlog streaming have stopped.
  - Validate the original table is intact and unlocked.
- pt-online-schema-change cleanup:
  - Similar process: verify the triggers the tool created are removed and the `_new` copy table is dropped.
  - If cutover failed, the original remains the source of truth. Confirm row counts and sync if needed.
After cleanup, re-attempt schema change with safer parameters or off-peak timing.
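A minimal sketch of the leftover-artifact check, assuming a hypothetical `orders` table (standard gh-ost and pt-osc naming conventions; adjust patterns to your tool versions):

```sql
SHOW TABLES LIKE '%\_gho';     -- gh-ost ghost table (e.g., _orders_gho)
SHOW TABLES LIKE '%\_ghc';     -- gh-ost changelog table
SHOW TABLES LIKE '%\_new';     -- pt-osc copy table (e.g., _orders_new)
SHOW TRIGGERS LIKE 'orders';   -- pt-osc triggers defined on the original table

-- Only after confirming cutover never happened and the original is intact:
-- DROP TABLE IF EXISTS `_orders_gho`, `_orders_ghc`;
```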
6) Replication Considerations
- If replicas encountered errors due to the migration:
  - Stop replication on the affected replica.
  - Diagnose the failing statement from the relay log (see the sketch below).
  - Decide to fix data and resume, or rebuild the replica from a fresh snapshot.
- Use `pt-table-checksum` and `pt-table-sync` to validate and repair divergence across replicas after recovery.
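A minimal sketch of pausing and inspecting the replica (MySQL 8.0.22+ statement names; older versions use STOP SLAVE / SHOW SLAVE STATUS):

```sql
STOP REPLICA;
SHOW REPLICA STATUS\G   -- check Last_SQL_Error, Last_IO_Error, and the failing coordinates
```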
Cross-Database Validation: Proving You’re Safe to Resume
Regardless of database, you must prove integrity before resuming normal operations.
- Counts and ranges:
  - Compare row/document counts with expected values per batch ID or time window.
  - Validate min/max of primary keys and timestamps.
- Checksums and hashes:
  - For relational tables, compute rolling checksums of key fields.
  - For documents, sample hashes of canonical JSON serialization.
- Referential and logical checks:
  - Orphan detection for relational FKs or document references.
  - Uniqueness validation (no duplicate keys).
  - Business rules: e.g., sums, totals, balances, or status transitions.
- Index health:
  - Validate index presence and cardinality.
  - In PostgreSQL/MySQL: `ANALYZE` for fresh stats.
  - In MongoDB: ensure required compound/partial indexes exist.
- Application smoke tests:
  - Read and write critical paths under read-only toggles first if possible.
  - Canary traffic before full ramp-up.
Practical Examples: Quick Recipes
Below are targeted quick recipes you can adapt.
PostgreSQL: Identify and Re-run a Failed Batch
- Tag rows with `batch_id`.
- Compare counts between `staging` and `target`:

  ```sql
  SELECT
    (SELECT COUNT(*) FROM staging WHERE batch_id = 'b123') AS staging_count,
    (SELECT COUNT(*) FROM target  WHERE batch_id = 'b123') AS target_count;
  ```
- Re-run upsert from staging for missing keys:

  ```sql
  INSERT INTO target (id, col1, col2, batch_id)
  SELECT s.id, s.col1, s.col2, s.batch_id
  FROM staging s
  LEFT JOIN target t ON t.id = s.id
  WHERE s.batch_id = 'b123'
    AND t.id IS NULL
  ON CONFLICT (id) DO UPDATE
    SET col1 = EXCLUDED.col1,
        col2 = EXCLUDED.col2;
  ```
MongoDB: Resume an Idempotent Backfill
- Find incomplete docs and upsert:

  ```javascript
  const cur = db.users_v2.find({ migrationId: "b123", status: { $ne: "complete" } });
  cur.forEach(doc => {
    db.users_v2.updateOne(
      { _id: doc._id },
      {
        $set: { col1: doc.col1, col2: doc.col2, status: "complete" },
        $setOnInsert: { migrationId: "b123", createdAt: new Date() }
      },
      { upsert: true }
    );
  });
  ```
MySQL: Deduplicate After a Faulty Retry
- Create a deduplicated table keeping the latest by `updated_at`:

  ```sql
  CREATE TABLE target_dedup LIKE target;

  INSERT INTO target_dedup
  SELECT t.*
  FROM target t
  JOIN (
    SELECT id, MAX(updated_at) AS max_updated
    FROM target
    GROUP BY id
  ) m ON t.id = m.id AND t.updated_at = m.max_updated;

  RENAME TABLE target TO target_old, target_dedup TO target;
  -- Keep target_old for a while, then drop once validated
  ```
Preemption: Set Yourself Up to Win Next Time
The best recovery is not needing one. Build migrations that assume failure and are safe to resume.
- Backups and PITR:
  - Ensure WAL/Oplog/binlog archiving and verified restorations.
  - Rehearse restores regularly.
- Idempotent operations:
  - Use upserts and conflict-handling (`ON CONFLICT`, `ON DUPLICATE KEY`, MongoDB upsert).
  - Tag writes with batch/migration IDs for easy reconciliation.
  - Design schema changes as additive-first (expand/contract pattern):
    - Add new columns/fields nullable.
    - Backfill in batches.
    - Migrate application reads.
    - Remove old fields last.
- Staging and canaries:
  - Always stage data before swapping.
  - Canary-run migrations on a subset and verify metrics.
- Safe DDL:
  - Prefer online schema change tools (gh-ost, pt-osc) for large MySQL tables.
  - Use concurrent/resumable index builds where supported (PostgreSQL `CONCURRENTLY`, MongoDB resumable builds).
- Observability:
  - Log progress with checkpoints: last processed ID/timestamp (see the sketch after this list).
  - Emit metrics: rows/sec, error counts, retries, lag, lock wait time.
  - Alerts on slowdowns and error thresholds.
- Runbooks and dry runs:
  - Document exact steps for pause, rollback, roll-forward, and validation.
  - Dry-run commands in staging with production-like data volumes.
- Access controls and feature flags:
  - Ability to quickly toggle read-only mode or route traffic away.
  - Application-level flags for schema versions.
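A minimal sketch of a checkpoint table for resumable batches (PostgreSQL-flavored SQL; the table and migration ID are hypothetical):

```sql
CREATE TABLE IF NOT EXISTS migration_checkpoints (
  migration_id text PRIMARY KEY,
  last_id      bigint NOT NULL,
  updated_at   timestamptz NOT NULL DEFAULT now()
);

-- After each successful batch, advance the high-water mark atomically
INSERT INTO migration_checkpoints (migration_id, last_id)
VALUES ('b123', 50000)
ON CONFLICT (migration_id) DO UPDATE
  SET last_id = EXCLUDED.last_id,
      updated_at = now();
```

On restart, the job reads `last_id` for its migration ID and resumes from the next batch instead of reprocessing from the beginning.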
A Tested Runbook Template You Can Adapt
Keep this near your keyboard when pushing the big red button.
- Preparation
  - Confirm backups and PITR are current.
  - Confirm observability: logs, metrics, alerts.
  - Schedule a maintenance window and announce it.
- Start Migration
  - Enable canary mode (small batch).
  - Verify data shape, counts, and app behavior.
  - Scale to full batch if the canary is clean.
- If Failure Occurs
  - Freeze writes and pause all migration jobs.
  - Record timestamps, migration IDs, last successful checkpoints.
  - Snapshot affected datasets if feasible.
- Assess and Decide
  - Integrity check: counts, constraints, indexes, app errors.
  - Decide rollback vs. roll-forward within an agreed decision window.
- Rollback Path
  - Restore from snapshot/backup.
  - PITR to pre-migration timestamp/LSN/binlog position.
  - Validate on a staging or temp node.
  - Promote, rejoin replicas, and resume read/write.
- Roll-Forward Path
  - Identify partial batches by ID/time.
  - Re-run idempotent upserts or delete-and-reload from staging.
  - Fix sequences/auto-increments, validate indexes, enforce constraints.
  - Validate integrity and app smoke tests.
- Post-Recovery
  - Re-enable writes gradually; canary traffic first.
  - Monitor closely (latency, error rates, replication lag).
  - Run a blameless postmortem:
    - What failed and why?
    - What signals were missed?
    - What guardrails will prevent recurrence?
Final Thoughts: Make Failure Boring
Migrations fail in exciting ways; your goal is to make recovery boring. That means planning for failure with idempotent operations, strong observability, and rehearsed runbooks. PostgreSQL’s transactional strengths, MongoDB’s flexible upserts and transactions, and MySQL’s robust binlog/PITR ecosystem each provide solid recovery paths—but only if you’ve prepared.
In the heat of the moment, follow the triage steps: freeze writes, capture state, assess impact, choose rollback or roll-forward, and validate relentlessly. Then invest in prevention—expand/contract patterns, staging tables, batch IDs, and canary runs—so the next emergency becomes a non-event.