When a Migration Fails Mid-Process: What’s Really Happening
Data migrations look straightforward on paper: extract, transform, load, validate, cut over. In practice, network blips, long-running locks, schema mismatches, and unexpected data can derail a migration halfway. The result is a dangerous state: partial data changes, inconsistent schemas, duplicate records, corrupted indexes, and confused applications.
Understanding common failure modes helps you choose the safest recovery path:
- Partial writes: Only a subset of rows/documents loaded; some updates applied, others not.
- Inconsistent schemas: DDL applied on the target but not on the source (or vice versa), causing application errors.
- Broken constraints: Foreign keys fail, unique indexes conflict, or triggers misfire.
- Duplicate data: Retries without idempotence create duplicates or update records multiple times.
- Lock contention: Long-running operations block reads/writes, piling up client errors or timeouts.
- Replication divergence: Primaries and replicas drift apart (especially mid-migration).
- Interrupted background operations: Index builds or online schema changes left in an intermediate state.
The steps below explain how to triage, recover, and prevent repeats across PostgreSQL, MongoDB, and MySQL.
Immediate Triage: Stabilize Before You Fix
Think firefighter first, surgeon later. The faster you contain the incident, the easier the recovery.
- Freeze writes to the affected systems.
  - Put the application in read-only mode (feature flag or maintenance window); the sketch after this list shows database-level toggles.
  - Stop ingestion pipelines, ETL jobs, consumers, or migration orchestrators.
  - In sharded or distributed environments, ensure writes aren’t still trickling to a subset.
- Capture the state and evidence.
  - Note timestamps, migration IDs/batch IDs, and last successful checkpoints.
  - Snapshot affected tables/collections if feasible (storage-level or logical).
  - Preserve logs: database logs, migration script output, orchestration logs.
- Assess blast radius quickly.
  - Which tables/collections were touched?
  - Are schemas partially changed (some DDL applied)?
  - Did replicas or downstream systems apply partial changes?
- Decide on rollback vs. roll-forward (don’t do both).
  - Rollback: restore from backup or PITR to a safe point if the window is short and the system is compromised.
  - Roll-forward: fix inconsistencies and complete the migration if the partial work is salvageable and validated.
- Communicate status.
  - Share a short incident update: what’s impacted, current mode (read-only), ETA for next update.
  - Assign clear roles: one lead, one operator per system, one recorder.
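If a database-level write freeze is easier than an application flag, here is a minimal sketch (assuming a PostgreSQL database named appdb and a MySQL primary; MongoDB write freezes are usually handled at the application level or by revoking write roles):

```sql
-- PostgreSQL: new sessions on appdb default to read-only transactions
-- (existing sessions keep their current setting until they reconnect)
ALTER DATABASE appdb SET default_transaction_read_only = on;

-- MySQL: block writes from normal users, then from everyone except replication
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;
```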
Decision Framework: Roll Back or Roll Forward?
Choose based on data integrity, time pressure, and confidence.
- Roll back if:
  - The migration damaged data or schema in non-trivial ways.
  - You lack deterministic, idempotent steps to safely resume.
  - Validation shows widespread inconsistency.
  - PITR or snapshots make rollback fast and low-risk.
- Roll forward if:
  - The partial state is understandable and isolated.
  - The migration is idempotent (safe to re-run).
  - You can reconcile differences deterministically (e.g., by batch ID or a high-water mark).
  - Rollback would be slower or riskier than repair.
If you’re unsure, bias to rollback. It’s often safer than prolonged uncertainty with live data.
PostgreSQL: Recovery Strategies
PostgreSQL has a powerful transactional model—even for many DDL operations—which can make recovery more forgiving if you used transactions properly.
1) If a Transaction Is Still Open or Hung
- Check for active sessions and locks:
  - Use `pg_stat_activity` to identify the migration session.
  - If it’s hung, cancel or terminate it (see the sketch below).
- If the migration was wrapped in a single transaction, a `ROLLBACK` will revert changes.
- If partial data loaded outside a transaction, assess the scope and prepare to reconcile.
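A minimal sketch of the lookup and cancellation, assuming the migration connects with a distinctive application_name (adjust the filter to however you tag the session):

```sql
-- Find the migration session(s) and what they are waiting on
SELECT pid, state, query_start, wait_event_type, query
FROM pg_stat_activity
WHERE application_name = 'migration';

-- Prefer cancelling the running statement first...
SELECT pg_cancel_backend(12345);      -- pid from the query above

-- ...and terminate the whole session only if the cancel doesn't take
SELECT pg_terminate_backend(12345);
```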
2) Point-in-Time Recovery (PITR) with WAL
When the database state is compromised and you have WAL archiving:
- Restore from a base backup taken before the migration.
- Replay WAL to a timestamp or LSN right before the migration started (see the sketch below).
- Validate in a staging environment if possible before promoting.
- If you temporarily diverged to a restored node, use `pg_rewind` to rejoin the cluster after promoting the correct node.
PITR is ideal when the migration isn’t idempotent or touched too many tables inconsistently.
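A minimal sketch of the recovery-target settings on the restored base backup (PostgreSQL 12+ style; the archive path and timestamp are placeholders for your environment and pre-migration point):

```
# postgresql.conf on the restored node, plus an empty recovery.signal file in the data directory
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2025-09-28 10:59:00+00'
recovery_target_action = 'pause'   # inspect the restored state before promoting
```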
3) Clean Up Partial Loads
If the migration was an ETL-style load:
- Use staging tables:
  - Load into `staging_target` first.
  - Validate counts, checksums, and key uniqueness.
  - Move data into `target` with idempotent upserts:

    ```sql
    INSERT INTO target (id, col1, col2, ...)
    SELECT id, col1, col2, ...
    FROM staging_target
    ON CONFLICT (id) DO UPDATE
      SET col1 = EXCLUDED.col1,
          col2 = EXCLUDED.col2;
    ```

- If duplicates were created, deduplicate using `DISTINCT ON (id)` into a new table, then swap:

  ```sql
  CREATE TABLE target_new AS
  SELECT DISTINCT ON (id) ...
  ORDER BY id, updated_at DESC;
  ```

- Swap atomically with a transaction: rename `target` -> `target_old` and `target_new` -> `target`, then drop `target_old` after verification.
4) Sequence Corrections
Failed inserts can leave sequences ahead or behind.
- Reset sequence to max ID:

  ```sql
  SELECT setval('target_id_seq', (SELECT coalesce(MAX(id), 1) FROM target));
  ```
5) Constraints, Triggers, and Validations
- Disable and re-enable constraints carefully:
  - If you temporarily disabled triggers via `session_replication_role = 'replica'`, re-enable and validate.
  - Use deferrable constraints during data loads, adding them as `NOT VALID` so existing rows are checked later:

    ```sql
    ALTER TABLE child
      ADD CONSTRAINT fk_parent FOREIGN KEY (parent_id) REFERENCES parent(id)
      DEFERRABLE INITIALLY DEFERRED NOT VALID;
    ```

  - Validate after load:

    ```sql
    ALTER TABLE child VALIDATE CONSTRAINT fk_parent;
    ```

- Validate data integrity:
  - `SELECT COUNT(*)` comparisons between source and target.
  - Range checks for primary keys.
  - Hash checksums:

    ```sql
    SELECT md5(string_agg(id || ':' || col1 || ':' || col2, ',' ORDER BY id))
    FROM target
    WHERE batch_id = 'xyz';
    ```
6) Lock-Safe Schema Changes
If a DDL operation failed mid-flight due to locks:
- Re-plan DDL to minimize table rewrites:
  - For large tables, avoid adding a column with a `DEFAULT` in a single step on older PostgreSQL versions (before 11 this rewrites the whole table); prefer a nullable column plus an application-level default, then backfill.
  - Split changes into small, reversible steps (add nullable column, backfill in batches, add constraint); see the sketch after this list.
- Use a maintenance window and a low `lock_timeout` so the DDL fails fast instead of queueing behind long-lived locks.
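A minimal sketch of the expand-then-backfill pattern, assuming a hypothetical `new_col` on `target` and a 10,000-row batch size:

```sql
SET lock_timeout = '5s';                      -- give up quickly rather than queue behind other locks
ALTER TABLE target ADD COLUMN new_col text;   -- nullable add: metadata-only, no rewrite

-- Backfill in small batches so row locks stay short; repeat until zero rows are updated
UPDATE target
SET new_col = col1
WHERE id IN (
  SELECT id
  FROM target
  WHERE new_col IS NULL
  ORDER BY id
  LIMIT 10000
);
-- Add defaults, NOT NULL, or constraints as separate, later steps.
```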
7) Testing and Idempotency
- Make migrations idempotent:
  - Guard with `IF NOT EXISTS` on DDL.
  - Use `ON CONFLICT` for upserts.
  - Use `CREATE INDEX CONCURRENTLY` and check `pg_stat_progress_create_index` if supported; if the build fails, drop the partially built index and retry (see the sketch below).
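A minimal sketch of that retry pattern, assuming a hypothetical index on `target (batch_id)`:

```sql
-- Build without blocking writes; IF NOT EXISTS makes re-runs safe
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_target_batch_id ON target (batch_id);

-- A failed concurrent build leaves an INVALID index behind; list them before retrying
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- DROP INDEX CONCURRENTLY idx_target_batch_id;   -- then re-run the CREATE above
```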
MongoDB: Recovery Strategies
MongoDB’s document-level atomicity and flexible schema demand careful planning for multi-document migrations. Multi-document transactions have been available on replica sets since MongoDB 4.0 (and on sharded clusters since 4.2), but not every migration uses them.
1) Use Transactions Where Possible
If the migration ran within a transaction and failed, it likely aborted cleanly. If you mixed transactional and non-transactional operations:
- Identify documents touched by the migration using a migration key, e.g., `migrationId` or `batchId`.
- For incomplete updates, decide to:
  - Delete partial docs in staging collections and re-run the batch, or
  - Complete the updates idempotently using upserts.
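If you re-run a batch, wrapping it in a transaction keeps the retry all-or-nothing. A minimal mongosh sketch, assuming a hypothetical `app.users_v2` collection and placeholder documents:

```javascript
const session = db.getMongo().startSession();
const coll = session.getDatabase("app").users_v2;

session.startTransaction({ writeConcern: { w: "majority" } });
try {
  coll.updateOne(
    { _id: 1 },
    { $set: { col1: "a", status: "complete", migrationId: "b123" } },
    { upsert: true }
  );
  // ...more writes for the batch...
  session.commitTransaction();   // nothing is visible until this succeeds
} catch (e) {
  session.abortTransaction();    // a failure leaves no partial writes behind
  throw e;
} finally {
  session.endSession();
}
```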
2) Idempotent Bulk Operations
For bulk changes:
- Use `bulkWrite` with `ordered: false` to maximize throughput and isolate failures (see the sketch at the end of this subsection).
- Use idempotent patterns:
  - Upsert with `$set` and `$setOnInsert`.
  - Use `retryWrites: true` and `writeConcern: { w: "majority" }`.
- Example approach:
  - When migrating `users` to `users_v2`, include:
    - `_id` preserved.
    - `migrationId: "2025-09-28_01"`.
    - `status: "migrating" | "complete"`.
  - After a successful write, atomically set `status: "complete"`.
If the process fails mid-way, find and fix partial docs:

```javascript
db.users_v2.find({ migrationId: "2025-09-28_01", status: { $ne: "complete" } })
```

Then re-run upserts for these keys or roll back by deleting these docs.
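A minimal sketch of the unordered, idempotent bulk upsert mentioned above (mongosh; the documents and field names are placeholders):

```javascript
const docs = [
  { _id: 1, col1: "a" },
  { _id: 2, col1: "b" },
];

const ops = docs.map(doc => ({
  updateOne: {
    filter: { _id: doc._id },
    update: {
      $set: { col1: doc.col1, migrationId: "2025-09-28_01", status: "complete" },
      $setOnInsert: { createdAt: new Date() },
    },
    upsert: true,
  },
}));

// ordered: false lets the remaining operations run even if one fails
db.users_v2.bulkWrite(ops, { ordered: false, writeConcern: { w: "majority" } });
```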
3) Recover with Oplog or Snapshots
- If your migration corrupted data or indexes:
  - Restore from a known-good snapshot (filesystem snapshot taken with journaling enabled, or a backup solution like Ops Manager/Cloud Manager).
  - Use Oplog-based PITR to replay up to a safe timestamp before migration start (see the sketch below).
  - Validate in a staging node before promoting.
- For sharded clusters, ensure all shards and config servers are consistent before resuming.
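A minimal sketch with mongorestore, assuming a dump taken with `--oplog`, a placeholder path, and a placeholder oplog timestamp (seconds-since-epoch:ordinal) just before the migration started:

```bash
mongorestore --drop --oplogReplay \
  --oplogLimit 1759053540:1 \
  /backups/pre_migration_dump
```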
4) Index and Schema Consistency
- Interrupted index builds can be retried. Modern MongoDB versions have resumable index builds, but verify:
  - List index builds in progress and drop/finish as needed (see the sketch after this list).
  - Rebuild with appropriate commit quorum.
- If you changed schema conventions (e.g., field rename or nesting):
  - Use a backfill job with idempotent updates.
  - Build new indexes first; then switch application reads to new fields with a feature flag.
  - Once stable, drop old indexes and fields.
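A minimal sketch for inspecting builds in progress and current indexes (mongosh; the collection name is a placeholder):

```javascript
// In-progress index builds appear as createIndexes commands in currentOp
db.currentOp({ "command.createIndexes": { $exists: true } })

// Confirm what actually exists on the collection before retrying a build
db.users_v2.getIndexes()
```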
5) Orphaned or Duplicated Documents
- If duplicates exist due to retries (same logical entity repeated with different `_id`), choose a canonical doc:
  - Prefer the doc with the latest `updatedAt`, or the one marked `status: "complete"`.
  - Merge data and delete duplicates in small batches with majority write concern.
- For references across collections, run data consistency checks, then repair orphaned references with a controlled backfill:

  ```javascript
  // Example: verify all orders have a matching user
  const cursor = db.orders.aggregate([
    { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "u" } },
    { $match: { u: { $size: 0 } } },
    { $limit: 100 } // sample for inspection
  ]);
  ```
6) Observability and Change Streams
- Use change streams to monitor migration progress and errors in near real-time (see the sketch below).
- Track lag, error rates, and slow operations to decide when to pause or pivot to rollback.
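A minimal sketch of watching a target collection for migration writes (mongosh; the collection, field names, and migration ID are placeholders):

```javascript
const stream = db.users_v2.watch(
  [{ $match: { "fullDocument.migrationId": "2025-09-28_01" } }],
  { fullDocument: "updateLookup" }   // include the post-update document for update events
);

while (stream.hasNext()) {
  const event = stream.next();
  print(`${event.operationType} on ${JSON.stringify(event.documentKey)}`);
}
```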
MySQL: Recovery Strategies
MySQL’s recovery story depends on engine (use InnoDB), version (atomic DDL in 8.0 for many operations), and whether you used online migration tools.
1) Transactional Recovery (If Applicable)
- If your migration ran inside a transaction and the session is intact, `ROLLBACK` will revert changes.
- Beware DDL: every DDL statement commits implicitly, and atomic DDL (MySQL 8.0+) only makes individual DDL statements all-or-nothing; it does not make a multi-statement migration transactional, and older versions offer weaker guarantees.
2) Point-in-Time Recovery with Binary Logs
When you must revert to pre-migration state:
- Restore from a recent full backup.
- Replay binary logs using `mysqlbinlog` up to a safe time:

  ```bash
  mysqlbinlog --start-datetime="2025-09-28 10:00:00" \
              --stop-datetime="2025-09-28 11:12:00" \
              /var/lib/mysql/binlog.000123 | mysql -u root -p
  ```
- If the migration contaminated data widely, rebuild replicas from the restored primary to avoid divergence.
Note: Generating reverse statements automatically is not standard; prefer restore + roll-forward rather than trying to “undo” statements.
3) Cleaning Up Interrupted Data Loads
- If `LOAD DATA INFILE` or bulk inserts were interrupted:
  - Load into a staging table first, then upsert into the target:

    ```sql
    INSERT INTO target (id, col1, col2)
    SELECT id, col1, col2 FROM staging
    ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2);
    ```

  - If duplicates occurred, deduplicate using a temporary table with a GROUP BY or window approach, then swap.
- Reset `AUTO_INCREMENT` if necessary:
  - Get `SELECT MAX(id) FROM target;`
  - Then `ALTER TABLE target AUTO_INCREMENT = <max+1>;`
4) Foreign Keys and Integrity Checks
- If you disabled `FOREIGN_KEY_CHECKS` during load, you must validate after:
  - Check orphans:

    ```sql
    SELECT c.*
    FROM child c
    LEFT JOIN parent p ON p.id = c.parent_id
    WHERE p.id IS NULL;
    ```

  - Fix orphans before re-enabling strict checks.
- Rebuild affected indexes if needed and run `ANALYZE TABLE` for optimizer stats.
5) Online Schema Changes and Aborted Runs
If you used pt-online-schema-change or gh-ost and the migration failed mid-process:
- gh-ost cleanup:
  - Drop the ghost table (suffix `_gho`) and changelog table (`_ghc`) if cutover didn’t happen; the sketch below shows how to spot leftovers.
  - gh-ost is triggerless, so there are no triggers to remove; just confirm its process and binlog streaming have stopped.
  - Validate the original table is intact and unlocked.
- pt-online-schema-change cleanup:
  - Similar process: verify the triggers the tool created are removed and the `_new` copy table is dropped.
  - If cutover failed, the original remains the source of truth. Confirm row counts and sync if needed.
After cleanup, re-attempt schema change with safer parameters or off-peak timing.
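A minimal sketch of the leftover-artifact check, assuming a hypothetical `orders` table (standard gh-ost and pt-osc naming conventions; adjust patterns to your tool versions):

```sql
SHOW TABLES LIKE '%\_gho';     -- gh-ost ghost table (e.g., _orders_gho)
SHOW TABLES LIKE '%\_ghc';     -- gh-ost changelog table
SHOW TABLES LIKE '%\_new';     -- pt-osc copy table (e.g., _orders_new)
SHOW TRIGGERS LIKE 'orders';   -- pt-osc triggers defined on the original table

-- Only after confirming cutover never happened and the original is intact:
-- DROP TABLE IF EXISTS `_orders_gho`, `_orders_ghc`;
```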
6) Replication Considerations
- If replicas encountered errors due to the migration:
  - Stop replication on the affected replica.
  - Diagnose the failing statement from the relay log (see the sketch below).
  - Decide to fix data and resume, or rebuild the replica from a fresh snapshot.
- Use `pt-table-checksum` and `pt-table-sync` to validate and repair divergence across replicas after recovery.
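A minimal sketch of pausing and inspecting the replica (MySQL 8.0.22+ statement names; older versions use STOP SLAVE / SHOW SLAVE STATUS):

```sql
STOP REPLICA;
SHOW REPLICA STATUS\G   -- check Last_SQL_Error, Last_IO_Error, and the failing coordinates
```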
Cross-Database Validation: Proving You’re Safe to Resume
Regardless of database, you must prove integrity before resuming normal operations.
- Counts and ranges:
  - Compare row/document counts with expected values per batch ID or time window.
  - Validate min/max of primary keys and timestamps.
- Checksums and hashes:
  - For relational tables, compute rolling checksums of key fields.
  - For documents, sample hashes of canonical JSON serialization.
- Referential and logical checks:
  - Orphan detection for relational FKs or document references.
  - Uniqueness validation (no duplicate keys).
  - Business rules: e.g., sums, totals, balances, or status transitions.
- Index health:
  - Validate index presence and cardinality.
  - In PostgreSQL/MySQL: `ANALYZE` for fresh stats.
  - In MongoDB: ensure required compound/partial indexes exist.
- Application smoke tests:
  - Read and write critical paths under read-only toggles first if possible.
  - Canary traffic before full ramp-up.
Practical Examples: Quick Recipes
Below are targeted quick recipes you can adapt.
PostgreSQL: Identify and Re-run a Failed Batch
- Tag rows with `batch_id`.
- Compare counts between `staging` and `target`:

  ```sql
  SELECT
    (SELECT COUNT(*) FROM staging WHERE batch_id = 'b123') AS staging_count,
    (SELECT COUNT(*) FROM target  WHERE batch_id = 'b123') AS target_count;
  ```
- Re-run upsert from staging for missing keys:

  ```sql
  INSERT INTO target (id, col1, col2, batch_id)
  SELECT s.id, s.col1, s.col2, s.batch_id
  FROM staging s
  LEFT JOIN target t ON t.id = s.id
  WHERE s.batch_id = 'b123'
    AND t.id IS NULL
  ON CONFLICT (id) DO UPDATE
    SET col1 = EXCLUDED.col1,
        col2 = EXCLUDED.col2;
  ```
MongoDB: Resume an Idempotent Backfill
- Find incomplete docs and upsert:

  ```javascript
  const cur = db.users_v2.find({ migrationId: "b123", status: { $ne: "complete" } });
  cur.forEach(doc => {
    db.users_v2.updateOne(
      { _id: doc._id },
      {
        $set: { col1: doc.col1, col2: doc.col2, status: "complete" },
        $setOnInsert: { migrationId: "b123", createdAt: new Date() }
      },
      { upsert: true }
    );
  });
  ```
MySQL: Deduplicate After a Faulty Retry
- Create a deduplicated table keeping the latest by `updated_at`:

  ```sql
  CREATE TABLE target_dedup LIKE target;

  INSERT INTO target_dedup
  SELECT t.*
  FROM target t
  JOIN (
    SELECT id, MAX(updated_at) AS max_updated
    FROM target
    GROUP BY id
  ) m ON t.id = m.id AND t.updated_at = m.max_updated;

  RENAME TABLE target TO target_old, target_dedup TO target;
  -- Keep target_old for a while, then drop once validated
  ```
Preemption: Set Yourself Up to Win Next Time
The best recovery is not needing one. Build migrations that assume failure and are safe to resume.
- Backups and PITR:
  - Ensure WAL/Oplog/binlog archiving and verified restorations.
  - Rehearse restores regularly.
- Idempotent operations:
  - Use upserts and conflict-handling (`ON CONFLICT`, `ON DUPLICATE KEY`, MongoDB upsert).
  - Tag writes with batch/migration IDs for easy reconciliation.
  - Design schema changes as additive-first (expand/contract pattern):
    - Add new columns/fields nullable.
    - Backfill in batches.
    - Migrate application reads.
    - Remove old fields last.
- Staging and canaries:
  - Always stage data before swapping.
  - Canary-run migrations on a subset and verify metrics.
- Safe DDL:
  - Prefer online schema change tools (gh-ost, pt-osc) for large MySQL tables.
  - Use concurrent/resumable index builds where supported (PostgreSQL `CONCURRENTLY`, MongoDB resumable builds).
- Observability:
  - Log progress with checkpoints: last processed ID/timestamp (see the sketch after this list).
  - Emit metrics: rows/sec, error counts, retries, lag, lock wait time.
  - Alerts on slowdowns and error thresholds.
- Runbooks and dry runs:
  - Document exact steps for pause, rollback, roll-forward, and validation.
  - Dry-run commands in staging with production-like data volumes.
- Access controls and feature flags:
  - Ability to quickly toggle read-only mode or route traffic away.
  - Application-level flags for schema versions.
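A minimal sketch of a checkpoint table for resumable batches (PostgreSQL-flavored SQL; the table and migration ID are hypothetical):

```sql
CREATE TABLE IF NOT EXISTS migration_checkpoints (
  migration_id text PRIMARY KEY,
  last_id      bigint NOT NULL,
  updated_at   timestamptz NOT NULL DEFAULT now()
);

-- After each successful batch, advance the high-water mark atomically
INSERT INTO migration_checkpoints (migration_id, last_id)
VALUES ('b123', 50000)
ON CONFLICT (migration_id) DO UPDATE
  SET last_id = EXCLUDED.last_id,
      updated_at = now();
```

On restart, the job reads `last_id` for its migration ID and resumes from the next batch instead of reprocessing from the beginning.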
A Tested Runbook Template You Can Adapt
Keep this near your keyboard when pushing the big red button.
- Preparation
  - Confirm backups and PITR are current.
  - Confirm observability: logs, metrics, alerts.
  - Schedule a maintenance window and announce it.
- Start Migration
  - Enable canary mode (small batch).
  - Verify data shape, counts, and app behavior.
  - Scale to full batch if the canary is clean.
- If Failure Occurs
  - Freeze writes and pause all migration jobs.
  - Record timestamps, migration IDs, last successful checkpoints.
  - Snapshot affected datasets if feasible.
- Assess and Decide
  - Integrity check: counts, constraints, indexes, app errors.
  - Decide rollback vs. roll-forward within an agreed decision window.
- Rollback Path
  - Restore from snapshot/backup.
  - PITR to pre-migration timestamp/LSN/binlog position.
  - Validate on a staging or temp node.
  - Promote, rejoin replicas, and resume read/write.
- Roll-Forward Path
  - Identify partial batches by ID/time.
  - Re-run idempotent upserts or delete-and-reload from staging.
  - Fix sequences/auto-increments, validate indexes, enforce constraints.
  - Validate integrity and app smoke tests.
- Post-Recovery
  - Re-enable writes gradually; canary traffic first.
  - Monitor closely (latency, error rates, replication lag).
  - Run a blameless postmortem:
    - What failed and why?
    - What signals were missed?
    - What guardrails will prevent recurrence?
Final Thoughts: Make Failure Boring
Migrations fail in exciting ways; your goal is to make recovery boring. That means planning for failure with idempotent operations, strong observability, and rehearsed runbooks. PostgreSQL’s transactional strengths, MongoDB’s flexible upserts and transactions, and MySQL’s robust binlog/PITR ecosystem each provide solid recovery paths—but only if you’ve prepared.
In the heat of the moment, follow the triage steps: freeze writes, capture state, assess impact, choose rollback or roll-forward, and validate relentlessly. Then invest in prevention—expand/contract patterns, staging tables, batch IDs, and canary runs—so the next emergency becomes a non-event.