If you’re on-call, you know the 3 AM page: your phone explodes, adrenaline spikes, and suddenly you’re the only person standing between customers and chaos. This is your practical, battle-tested checklist to move from panic to progress—covering database and application server outages, how to stabilize quickly, and how to recover without making things worse.
Objectives: What “Good” Looks Like at 3 AM
Before you touch a key, align on the outcome:
- Stabilize the system and stop the bleeding.
- Protect data integrity and avoid split-brain or corruption.
- Restore the most critical functionality fast (even if degraded).
- Preserve evidence for later analysis.
- Communicate clearly, at predictable intervals, with facts not guesses.
Quick-Start Card: The First 15 Minutes
Keep this mental (or printed) card handy:
- Acknowledge alerts so duplicates don’t page the whole company.
- Declare an incident and spin up an incident channel.
- Assign roles: Incident Commander (IC), Comms Lead, Scribe (one person can fill multiple roles at 3 AM).
- Establish the blast radius: what’s down, who’s impacted, when did it start?
- Make the system safe:
- For databases: stop new writes if integrity is at risk.
- For apps: remove unhealthy instances from rotation, enable maintenance or read-only mode.
- Capture evidence: recent logs, metrics snapshots, system state.
- Communicate externally: initial notice with scope, next update time, and a safe holding statement.
- Choose a stabilization path (rollback/failover/scale) and execute.
Ground Rules That Prevent Secondary Incidents
- Do not reboot or restart the database until you have a snapshot/backup you can restore.
- Avoid config changes without noting them (scribe logs all commands).
- Turn off auto-scaling or auto-healing that could thrash or erase evidence, unless it’s essential to stabilize.
- Prefer reversible actions first (removing from load balancer, rolling back a deployment, rate limiting).
- Never promote a replica without verifying replication health and isolating the old primary to avoid split-brain.
Step 1: Acknowledge, Declare, and Assign Roles
- Acknowledge the alert in your monitoring tool to suppress duplicate pages.
- Start an incident channel (for example, “inc-YYYYMMDD-0701”) and pin key links (dashboards, runbooks, on-call schedule).
- Assign roles:
- Incident Commander: decides and keeps everyone focused.
- Comms Lead: posts updates to internal stakeholders and external status page.
- Scribe: records timeline, commands run, changes made.
- Domain SME(s): database engineer, platform engineer, application owner as available.
Example kickoff message:
- “We are investigating an outage impacting the checkout API and database writes. Incident ID: INC-2893. Comms updates every 15 minutes. Current priority: protect data integrity.”
Step 2: Confirm the Blast Radius and Triage
Questions that reduce uncertainty fast:
- What’s actually failing (API, background jobs, admin UI, read vs. write operations)?
- Since when? What changed in the last 60–90 minutes (deploys, migrations, infra changes, cert renewals, scheduled batch jobs)?
- Is the database reachable? If so, are there timeouts, lock waits, or disk/memory pressure?
- Is the application returning 5xx from all instances or just a subset?
Evidence to collect (screenshots or links):
- Application error rate, p95 latency, and saturation metrics.
- Database CPU, memory, I/O, connections, locks, disk space.
- Infrastructure: node health, network routes, load balancer target health.
Fast commands and checks (choose what fits your stack):
- Check DB connections: select count(*) from pg_stat_activity; or show processlist; in MySQL. Look for blocking queries.
- Check disk: df -h and du -sh on log directories; check WAL/binlog directories.
- Check logs: journalctl -u <service> --since "15 min ago", or your cloud logging console.
- Kubernetes: kubectl get pods -n <namespace>; kubectl describe pod <pod>; check for CrashLoopBackOff and failing readiness probes.
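If it helps to have these in one place, here is a minimal, read-only triage sketch. It assumes a Postgres primary running under systemd and a kubectl context already pointed at the affected cluster; the unit names, namespace, and paths are placeholders for your own stack.

```bash
#!/usr/bin/env bash
# Quick triage: read-only checks only, safe to run mid-incident.

# Disk pressure on the data and log volumes (adjust paths to your layout).
df -h /var/lib/postgresql /var/log

# Connection count and sessions waiting on locks (Postgres).
psql -X -c "SELECT count(*) AS total,
                   count(*) FILTER (WHERE wait_event_type = 'Lock') AS waiting_on_locks
            FROM pg_stat_activity;"

# Recent service logs (replace 'postgresql' and 'myapp' with your unit names).
journalctl -u postgresql --since "15 min ago" --no-pager | tail -n 50
journalctl -u myapp --since "15 min ago" --no-pager | tail -n 50

# Kubernetes: pod health and recent events in the affected namespace.
kubectl get pods -n production -o wide
kubectl get events -n production --sort-by=.lastTimestamp | tail -n 20
```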
Step 3: Make the System Safe
Your priority is to stop additional damage and reduce customer impact.
- If the database is unstable:
- Put the app into read-only mode if possible (see the read-only sketch after this list).
- Disable write-heavy background jobs.
- Halt batch imports or ETL pipelines.
- If the application is unstable:
- Remove unhealthy nodes from the load balancer.
- Roll back the most recent deployment if that correlates with the incident start.
- Reduce incoming traffic via rate limiting or temporary maintenance mode for the most affected endpoints.
- If you suspect data inconsistency:
- Snapshot disks or take a filesystem snapshot before any restarts.
- Preserve WAL/binlogs so you can do point-in-time recovery.
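If your application has no read-only switch of its own, you can impose a read-only posture at the database instead. A minimal sketch, assuming superuser access to a Postgres or MySQL primary; note that default_transaction_read_only is only a session default that sessions can override, so an application-level flag remains the stronger control.

```bash
# Postgres: make new transactions read-only by default (a soft brake,
# not a hard guarantee -- sessions can still override it).
psql -X -c "ALTER SYSTEM SET default_transaction_read_only = on;"
psql -X -c "SELECT pg_reload_conf();"

# MySQL: block writes from normal accounts, then from superusers too.
mysql -e "SET GLOBAL read_only = ON;"
mysql -e "SET GLOBAL super_read_only = ON;"

# Revert after the incident:
#   psql -X -c "ALTER SYSTEM RESET default_transaction_read_only;" ; psql -X -c "SELECT pg_reload_conf();"
#   mysql -e "SET GLOBAL super_read_only = OFF; SET GLOBAL read_only = OFF;"
```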
Step 4: Protect Data Integrity
This step is where many 3 AM incidents turn from “annoying” to “catastrophic” if rushed.
- Always snapshot volumes before restarting the database or applying repair commands (a snapshot and replication-check sketch follows this list).
- Validate replication health:
- Postgres: check pg_stat_replication on primary; select now() - pg_last_xact_replay_timestamp() on replica.
- MySQL: show slave status \G and confirm Seconds_Behind_Master, IO/SQL threads.
- Prevent split-brain:
- Fence or isolate the old primary before promoting a replica (remove from DNS/load balancer, stop services, or cut network access).
- If disk is near-full on the DB:
- Archive or move logs and old WAL/binlogs to another volume.
- Expand the disk or attach a new volume; do not delete WAL/binlogs arbitrarily without understanding recovery implications.
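Here is a minimal sketch of the snapshot-first and replication checks above. It assumes the data directory sits on an LVM logical volume (the volume group and LV names are placeholders); on cloud block storage you would take a volume snapshot through your provider's console or CLI instead.

```bash
# 1) Snapshot the data volume before any restart or repair
#    (placeholder VG/LV names; requires free extents in the volume group).
lvcreate --snapshot --size 20G --name pgdata_snap /dev/vg_data/pgdata

# 2) Replication health as seen from the Postgres primary.
psql -X -c "SELECT application_name, state, sync_state,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
            FROM pg_stat_replication;"

# 3) Replay lag as seen from a replica.
psql -X -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"

# MySQL 8.0.22+ (older versions: SHOW SLAVE STATUS\G): confirm both threads run and lag is bounded.
mysql -e "SHOW REPLICA STATUS\G" | egrep 'Running|Seconds_Behind'
```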
Step 5: Diagnose Fast With a Decision Tree
Use a simple forked path to pinpoint likely causes.
- Everything broke at once?
- Check for global changes: deployment, feature flag, configuration change, cloud provider event, TLS cert expiry, DNS change.
- If yes, roll back or revert first. Safe, reversible fixes beat perfect diagnostics at 3 AM.
- Database down or flapping?
- Disk full: check df -h; WAL/binlogs or slow query logs can fill disks.
- Lock contention: many sessions "idle in transaction" or waiting on locks; identify the blockers.
- Memory/CPU saturation: sudden spike from a runaway query or migration.
- Corruption: errors in logs about pages or indexes; requires careful recovery.
- Application degraded but DB healthy?
- Connection pool exhaustion: a connection leak, max connections set too low, or overly aggressive retries with too little backoff.
- Crash loop after deploy: offending config or code path; roll back fast.
- External dependencies failing: payment gateway, cache cluster, message broker.
- TLS/DNS issues: expired certificates, misconfigured DNS, or certificate chain changes.
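For the "everything broke at once" branch, a few external checks confirm or rule out global causes quickly. A minimal sketch assuming Kubernetes deployments and a public HTTPS endpoint; the deployment name, namespace, and hostname are placeholders.

```bash
# Did a deploy land recently?
kubectl rollout history deployment/checkout-api -n production | tail -n 5

# Is the TLS certificate still valid, and who issued it?
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
  | openssl x509 -noout -enddate -issuer

# Does DNS still resolve where you expect?
dig +short example.com
# Compare against a public resolver in case your internal resolver is the problem.
dig +short example.com @8.8.8.8
```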
Step 6: Stabilization Playbooks for Common Failure Modes
Here are targeted actions with minimal blast radius.
A) Database Disk Full (Postgres/MySQL)
Symptoms: Write failures, database suddenly unreachable, logs about “No space left on device.”
Actions:
- Free space fast:
- Rotate or compress logs; move them off the main volume.
- Archive WAL/binlogs that are safely replicated and backed up; confirm retention policy.
- Temporarily reduce write pressure:
- Apply read-only mode or rate limit writes.
- Pause batch jobs and heavy migrations.
- Expand storage:
- Increase the volume size, or attach a new volume and move the log directory onto it (or add a tablespace there to relieve the data volume).
- Validate:
- Confirm free space thresholds and alarm levels.
- Re-check replication and restart only if necessary (and only after snapshotting).
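A minimal sketch of the "free space fast" step, assuming a Postgres host with typical Debian/Ubuntu paths (adjust to your install). It only measures usage and compresses already-rotated logs; it never deletes WAL.

```bash
# Where is the space going?
du -xh --max-depth=1 /var/lib/postgresql /var/log | sort -h | tail -n 15

# How large is the WAL directory? (inspect only -- never delete WAL by hand)
du -sh /var/lib/postgresql/*/main/pg_wal

# Reclaim space from rotated logs older than a day.
find /var/log -name "*.log.[0-9]*" ! -name "*.gz" -mtime +1 -exec gzip {} \;

# If WAL is piling up, check whether archiving is failing before touching anything.
psql -X -c "SELECT archived_count, failed_count, last_failed_wal, last_failed_time
            FROM pg_stat_archiver;"
```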
Prevention:
- Alerts at 70/80/90% on data and WAL/binlog volumes.
- WAL/archive/log rotation policies and archiving to object storage.
- Capacity planning and predictable retention.
B) Lock Contention and Long-Running Transactions
Symptoms: Queries blocked, high latency, CPU normal, many sessions “waiting.”
Actions:
- Identify blockers: on Postgres, select pid, state, query from pg_stat_activity where wait_event is not null; on MySQL, show engine innodb status.
- Carefully kill the smallest-blast-radius blocker (often a long idle-in-transaction) after validating it’s safe.
- Add lock timeouts at the app level to avoid indefinite waits.
- If caused by a migration, roll back or rewrite the migration to be online-safe (add new columns as nullable, backfill in batches, then add defaults and constraints).
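On Postgres, a minimal sketch for finding the blocker looks like this; pg_blocking_pids shows who is holding things up, and the terminate call is left commented out so a human confirms the pid first.

```bash
# Sessions with a non-empty blocked_by array are victims; the pids inside it are blockers.
psql -X -c "SELECT pid, state, now() - xact_start AS xact_age,
                   pg_blocking_pids(pid) AS blocked_by,
                   left(query, 80) AS query
            FROM pg_stat_activity
            WHERE cardinality(pg_blocking_pids(pid)) > 0
               OR state = 'idle in transaction'
            ORDER BY xact_age DESC NULLS LAST;"

# After confirming the blocker is safe to kill (often a long idle-in-transaction):
# psql -X -c "SELECT pg_terminate_backend(<pid>);"

# Conservative timeouts the application can set to fail fast instead of queueing:
# SET lock_timeout = '5s'; SET idle_in_transaction_session_timeout = '60s';
```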
Prevention:
- Migrations reviewed for locking behavior.
- Application-level timeouts and circuit breakers.
- Dashboards that show top blocking sessions.
C) Crash-Looping Application After Deploy
Symptoms: Health checks fail, CrashLoopBackOff in Kubernetes, 5xx error spikes aligned with deploy timestamp.
Actions:
- Roll back to the last known good release immediately.
- Remove affected pods/instances from rotation to stop flapping.
- Increase replicas of the previous version to absorb traffic.
- Collect logs and diffs from the failing version for later analysis.
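A minimal rollback sketch for a Kubernetes deployment; the names are placeholders, and evidence is captured first so it survives the rollback.

```bash
# Capture logs from a crashing pod before it is replaced.
kubectl get pods -n production | grep CrashLoopBackOff
kubectl logs <crashing-pod> -n production --previous --tail=200 > /tmp/inc-crash.log

# Roll back to the previous revision and wait for it to settle.
kubectl rollout undo deployment/checkout-api -n production
kubectl rollout status deployment/checkout-api -n production --timeout=5m

# Confirm which image is now serving traffic.
kubectl get deployment checkout-api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```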
Prevention:
- Blue/green or canary deployments with automatic rollback on SLO breach.
- Health checks that match real user flows.
- Feature flags to gate risky changes without deploys.
D) Connection Pool Exhaustion and Thundering Herd
Symptoms: App timeouts, DB max_connections hit, elevated CPU from connection churn.
Actions:
- Raise pool size cautiously only if DB can handle it; otherwise reduce concurrency at the app tier.
- Implement exponential backoff and jitter on retries.
- Kill stale connections and verify the pool’s max lifetime settings to avoid long-lived idle connections.
- Consider adding a connection proxy (e.g., pgbouncer) in transaction mode for spikes.
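A minimal sketch of the diagnosis side, plus the backoff-with-jitter pattern written in shell purely for illustration; in practice the retry policy belongs in your application or HTTP client configuration, and the health URL here is a placeholder.

```bash
# How close are we to max_connections, and what are those connections doing?
psql -X -c "SHOW max_connections;"
psql -X -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"

# Exponential backoff with jitter (illustrative only).
attempt=0
until curl -fsS https://api.example.com/healthz > /dev/null; do
  attempt=$((attempt + 1))
  [ "$attempt" -ge 6 ] && { echo "giving up after $attempt attempts"; break; }
  sleep_for=$(( (2 ** attempt) + (RANDOM % 3) ))
  echo "attempt $attempt failed; retrying in ${sleep_for}s"
  sleep "$sleep_for"
done
```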
Prevention:
- Right-size connection pools relative to DB cores.
- Enforce connection timeout and retry budgets in code.
E) TLS Certificate Expired
Symptoms: Sudden spike in TLS/SSL handshake errors (for example, 525/526 behind a CDN), services unable to call each other.
Actions:
- Renew certificate in the CA portal or automate re-issuance.
- Redeploy or reload the proxy/load balancer that terminates TLS.
- Validate full chain and intermediate certs; ensure clients trust the updated chain.
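A quick way to confirm expiry and chain issues from the outside; the hostname is a placeholder, and nginx stands in for whatever terminates TLS in your stack.

```bash
# When does the serving certificate expire, and who issued it?
echo | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer -enddate

# Is the full chain being presented? Expect more than one certificate block.
echo | openssl s_client -connect api.example.com:443 -servername api.example.com -showcerts 2>/dev/null \
  | grep -c 'BEGIN CERTIFICATE'

# After renewal, validate config and reload the TLS terminator.
sudo nginx -t && sudo systemctl reload nginx
```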
Prevention:
- Automated renewal with alerting at 30/14/7/3/1 days to expiry.
- Playbook for emergency manual renewal.
F) Replica Promotion / Failover
Use only if the primary is truly unavailable and you have confirmed replication state.
Actions:
- Ensure the old primary is fenced: stop services, detach networking, or remove from LB/DNS.
- Promote the most up-to-date replica:
- Postgres: pg_ctl promote or use your HA tool (Patroni/Pacemaker).
- MySQL: stop replication on the chosen replica and reset it appropriately, ensuring read_only is off on the new primary.
- Point applications to the new primary; verify reads/writes.
- Rebuild the old primary as a replica later. Don’t reintroduce it without a full resync.
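If an HA tool such as Patroni is not doing this for you, a manual promotion looks roughly like the sketch below. Hostnames and the data directory are placeholders, and every step assumes you have already confirmed the old primary is dead or fenced.

```bash
# 1) Fence the old primary: stop the service and keep it stopped.
ssh old-primary 'sudo systemctl stop postgresql && sudo systemctl disable postgresql'

# 2) On the candidate replica, confirm it is caught up before promoting.
psql -X -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(),
                   now() - pg_last_xact_replay_timestamp() AS replay_delay;"

# 3) Promote (Postgres 12+ supports a SQL-level promote).
psql -X -c "SELECT pg_promote(wait => true);"
# Or from the shell: pg_ctl promote -D /var/lib/postgresql/16/main

# 4) Verify the new primary accepts writes.
psql -X -c "SELECT pg_is_in_recovery();"   # should now return false
```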
Prevention:
- Regular failover drills.
- Automated fencing and promotion with clear criteria and logs.
Step 7: Validate Recovery Before You Declare Success
After applying a fix, run through a short battery of checks:
- Functional: can users log in, create/update core resources, and check out?
- Data consistency: spot-check recent writes, verify no partial updates.
- Performance: latency, error rate, and saturation back to baseline?
- Persistence: are warnings still repeating in the logs, or has the storm passed?
- Replication: replicas healthy and caught up; lag within policy?
- Observability: alarms cleared, and custom synthetic checks passing?
Roll traffic back gradually if you shifted it away:
- Start with 10–20% traffic to the recovered tier, monitor, then ramp up.
- Watch for oscillations in error rates and latency, and for heat maps that reveal noisy neighbors.
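A minimal smoke-check sketch; the endpoints are placeholders for your own critical user flows. The point is to exercise real user-facing paths and confirm replication, rather than trusting green dashboards alone.

```bash
# Hit the health endpoint and a representative read endpoint; report the status codes.
for url in https://api.example.com/healthz https://api.example.com/v1/products; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "$url -> $code"
done

# Confirm replicas are caught up before declaring success.
psql -X -c "SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
            FROM pg_stat_replication;"
```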
Step 8: Communication That Builds Trust
During an incident, silence damages credibility. Adopt predictable cadences.
Internal template:
- “INC-2893 update (03:12 UTC): Root cause not confirmed. Checkout writes disabled to protect data. Error rates improving. Next update at 03:27 UTC.”
External status page template:
- “We’re investigating an issue affecting database-backed operations. To protect data integrity, we’ve placed some features in read-only mode. We’re actively working to restore full functionality. Next update in 15 minutes.”
When closing:
- “This incident is resolved. Impact window: 02:51–03:31 UTC. Root cause: DB disk exhaustion due to WAL growth. Mitigations: expanded disk, adjusted WAL archiving. We’re monitoring and will publish a full post-incident review.”
Cadence rules:
- Update every 15–30 minutes during active impact, even if there’s “no change.”
- Avoid speculating; stick to verifiable facts and next steps.
Step 9: Post-Incident Review (Within 72 Hours)
A good review turns pain into progress. Keep it blameless and specific.
Include:
- Executive summary (one paragraph): what broke, who was impacted, how long.
- Timeline: detection, decision points, actions, recovery.
- Root cause and contributing factors: distinguish triggers from underlying weaknesses.
- What went well / what was hard.
- Action items with owners and deadlines:
- Immediate (this week).
- Near-term (this quarter).
- Structural (requires budget or cross-team changes).
- Artifacts: graphs, logs, diffs, and the exact commands used.
Examples of high-impact action items:
- Add alerts for DB disk at 70/80/90% and WAL/archive directories specifically.
- Introduce read-only mode toggle and test it monthly.
- Implement blue/green deployments for the app tier.
- Add connection pooling layer and tune pool sizes by environment.
- Enforce schema migration standards with online-safe patterns.
Ready-to-Use Checklists
On-Call Pre-Shift Readiness
- Confirm you can access VPN, cloud console, and production shells.
- Bookmark runbooks, dashboards, and status page tooling.
- Test your pager escalation (send a test to yourself).
- Know your backup/failover points: which replicas are promotable?
- Ensure you have the authority to roll back or promote; know who to wake up.
First 5 Minutes
- Acknowledge alerts.
- Declare the incident; open channel; assign roles.
- Identify affected services and user impact.
- Stabilize: remove unhealthy nodes from rotation, enable read-only mode if needed.
- Snapshot evidence: logs, metrics, replication state.
- Post the first communication update.
First 30 Minutes
- Decide: rollback, failover, scale, or fix-in-place.
- Protect data: snapshot before restarts, confirm replication health.
- Execute one change at a time; scribe logs everything.
- Verify recovery with functional checks and metrics.
- Communicate progress and next update time.
Do / Don’t at 3 AM
- Do choose reversible actions first.
- Do limit parallel changes; serialize to reduce ambiguity.
- Do keep a running timeline.
- Don’t restart databases blindly.
- Don’t promote replicas without fencing the old primary.
- Don’t rely on memory; document every command and config edit.
Practical Examples
Example 1: Postgres WAL Filled the Disk
- Symptoms: sudden write failures; app errors on write; disk at 100%.
- Actions: switched app to read-only; archived old WAL to object storage; expanded volume by 20%; restarted archiver; confirmed replication health.
- Recovery: writes re-enabled; added alerts and daily WAL archive verification job.
- Lessons: a long-running failed backup process silently stopped WAL archiving.
Example 2: Java App Crash Loop After Feature Flag Change
- Symptoms: crash loop within 1 minute; started after a silent feature flag flip.
- Actions: removed failing instances from LB; disabled the flag; new deploy not required; stabilized within 10 minutes.
- Prevention: mandatory canary percentage for new flags; add circuit breaker around the new code path.
Example 3: MySQL Replica Accidentally Promoted Without Fencing
- Symptoms: data divergence; conflicting updates; two primaries briefly active.
- Actions: fenced old primary immediately; picked the most advanced node; reconciled conflicted rows via binlog comparison; restored consistency.
- Prevention: automated fencing in HA tool; “promotion requires two-person review” policy.
Tooling That Makes 3 AM Sane
- Monitoring and APM: high-signal dashboards for error rates, latency, saturation, and top queries.
- Log aggregation with queryable filters over the last 60 minutes.
- Database observability: lock graphs, slow query dashboards, replication lag.
- Incident tooling: a bot to create incident channels, assign roles, and post reminders.
- Status page with templated updates and roles-based access.
- Runbook repository: short, accurate, and tested; linked from dashboards.
- Chaos and failover drills scheduled quarterly to keep muscle memory fresh.
Database-Specific Tips
- Postgres:
- Keep autovacuum healthy; monitor bloat; set sensible max_wal_size.
- Use pgbouncer in transaction mode for bursty traffic.
- For corruption: pg_checksums and pg_amcheck can validate; consider logical dump from a clean replica if needed.
- MySQL:
- Monitor InnoDB buffer pool hit ratio and redo log pressure.
- Ensure GTID-consistent replication and validated backups for PITR.
- Use pt-kill carefully to terminate runaway queries; prioritize low-risk sessions.
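For the corruption and runaway-query tips above, a small sketch; pg_amcheck ships with PostgreSQL 14+, pt-kill is part of Percona Toolkit, and both are shown in report-only mode until you deliberately escalate.

```bash
# Postgres: validate heap and index structure without taking the DB down
# (read-only, but adds I/O -- prefer running it against a replica).
pg_amcheck --all --jobs=4 --progress

# MySQL: report (do not yet kill) queries busy for more than 60 seconds.
pt-kill --busy-time 60 --interval 10 --print
# Re-run with --kill in place of --print once you have reviewed the list.
```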
Kubernetes and VM Nuances
- Kubernetes:
- Distinguish readiness from liveness probes to avoid kill loops during dependency outages.
- Enforce resource requests/limits to prevent node evictions.
- Use pod disruption budgets and surge deploys to reduce downtime.
- VM/Bare Metal:
- Service managers (systemd) should have sensible restart policies with backoff.
- Maintain golden images with up-to-date certificates and agents.
Pre-Production Practices that Save 3 AM Incidents
- Staging that mirrors production scale and data shapes (synthetic data for privacy).
- Migrations tested with production-sized datasets.
- Feature flags default off with gradual rollout.
- Load tests with chaos elements (kill pods, add latency, fill disks).
- Backups restored regularly to a recovery environment and verified.
A Minimal Runbook Template You Can Copy
- Purpose: Recover from DB or app server outage while protecting data.
- Preconditions: On-call access, incident tooling, dashboards linked.
- First actions: acknowledge → declare → roles → stabilize → snapshot evidence.
- Decision points:
- Recent deploy? Roll back first.
- DB at risk? Read-only mode; snapshot; confirm replication.
- Disk full? Free or expand; archive logs/WAL.
- App crash loop? Remove from rotation; revert config/flag/deploy.
- Validation: functional checks, performance metrics, replication status.
- Communication: internal every 15–30 min; external with impact, actions, and next update.
- Closure: monitoring back to green, notes captured, PIR scheduled with owners.
Final Thoughts
You don’t need heroics at 3 AM—you need a calm, repeatable checklist that prioritizes safety over speed. Stabilize, protect data, restore the most critical capabilities, and keep stakeholders informed. Then, in daylight, do the deeper forensic work and invest in the fixes that mean the next on-call can sleep through the night.
Keep this guide in your runbook repo, test it during game days, and tune it to your stack. The difference between chaos and control is preparation, discipline, and a reliable checklist—especially when it’s 3 AM.