Post Mortem: API Downtime

By Matthew Tse

On March 6, 2026, the ImprovMX API was unavailable for roughly 12 minutes. This post mortem documents what happened, how we responded, and the steps we've taken to prevent similar incidents.

What happened

At approximately 14:32 UTC, our API monitoring detected elevated error rates. The API became unresponsive for roughly 12 minutes before being restored at 14:44 UTC.

Root cause

A routine database migration triggered an unexpected lock on a critical table, causing API requests to queue up and eventually time out. The migration was part of a planned schema update to improve query performance.
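For readers unfamiliar with this failure mode, here is a minimal sketch of how a held lock starves other clients. It uses SQLite from Python's standard library purely for illustration; it is not our production database or migration tooling, and the `aliases` table, the function name, and the timeout value are hypothetical.

```python
import os
import sqlite3
import tempfile


def demo_lock_contention(timeout_s: float = 0.2) -> tuple[bool, bool]:
    """Return (blocked_while_locked, ok_after_release). Illustrative only."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")

        # Stand-in for the migration. isolation_level=None means we issue
        # BEGIN / ROLLBACK ourselves instead of relying on implicit ones.
        migration = sqlite3.connect(path, isolation_level=None)
        migration.execute("CREATE TABLE aliases (id INTEGER PRIMARY KEY, name TEXT)")
        migration.execute("INSERT INTO aliases (name) VALUES ('hello')")

        # Stand-in for an API request: waits at most timeout_s for a lock.
        api = sqlite3.connect(path, timeout=timeout_s)

        # Hold an exclusive lock, as a long-running schema change might.
        migration.execute("BEGIN EXCLUSIVE")
        try:
            api.execute("SELECT count(*) FROM aliases").fetchone()
            blocked = False
        except sqlite3.OperationalError:
            # The reader waited for timeout_s, then gave up.
            blocked = True

        # Terminating the migration releases the lock; reads work again.
        migration.execute("ROLLBACK")
        rows = api.execute("SELECT count(*) FROM aliases").fetchone()[0]

        api.close()
        migration.close()
        return blocked, rows == 1
```

In this sketch the read fails only while the lock is held and succeeds as soon as the transaction is rolled back, which mirrors the recovery pattern described in the Resolution section.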

Impact

During the 12-minute window, API requests returned 503 errors. Email forwarding was not affected: inbound emails continued to be received and queued normally. Only API-dependent operations (dashboard, alias management, account settings) were impacted.

Resolution

The engineering team identified the blocking migration within 5 minutes of the first alert and terminated it. The database recovered automatically once the lock was released. All queued requests were processed normally.

Prevention

We've implemented migration dry-runs that simulate lock behavior before executing on production. Additionally, all future migrations will run during our lowest-traffic window with automatic rollback triggers.
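As a rough illustration of the idea, the sketch below runs a migration inside a single transaction and rolls it back on any error, on a time-budget overrun, or when invoked as a dry run. It is an assumption-laden example built on SQLite and Python's standard library, not our actual tooling; `run_guarded_migration`, `budget_s`, and `dry_run` are hypothetical names. Note that the budget is only checked between statements, so it cannot interrupt a single statement that is itself blocked; a real setup would also rely on a server-side lock timeout.

```python
import sqlite3
import time


def run_guarded_migration(conn, statements, budget_s, dry_run=False):
    """Apply statements in one transaction; roll back on error or overrun.

    conn must be in autocommit mode (isolation_level=None) so that the
    explicit BEGIN / COMMIT / ROLLBACK below control the transaction.
    Returns True if every statement ran cleanly within the budget.
    """
    conn.execute("BEGIN")
    started = time.monotonic()
    try:
        for sql in statements:
            conn.execute(sql)
            # Checked after each statement, not during it.
            if time.monotonic() - started > budget_s:
                raise TimeoutError("migration exceeded its time budget")
    except (sqlite3.Error, TimeoutError):
        conn.execute("ROLLBACK")  # automatic rollback trigger
        return False

    # A dry run exercises the statements but never keeps the result.
    conn.execute("ROLLBACK" if dry_run else "COMMIT")
    return True


# Hypothetical usage against an in-memory database:
# conn = sqlite3.connect(":memory:", isolation_level=None)
# run_guarded_migration(conn, ["ALTER TABLE aliases ADD COLUMN note TEXT"],
#                       budget_s=5.0, dry_run=True)
```

Keeping the rollback decision inside the runner means a failed or slow migration leaves the schema exactly as it found it, without relying on someone to intervene manually.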

Timeline

  • 14:32 UTC - Alerts triggered
  • 14:35 UTC - Engineering team engaged
  • 14:37 UTC - Root cause identified
  • 14:44 UTC - Service fully restored
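
The intervals quoted in this post follow directly from the timeline. A small sketch of the arithmetic, using only the timestamps listed above (the `TIMELINE` constant and function name are illustrative):

```python
from datetime import datetime

# Timestamps (UTC) and events copied from the timeline above.
TIMELINE = [
    ("14:32", "Alerts triggered"),
    ("14:35", "Engineering team engaged"),
    ("14:37", "Root cause identified"),
    ("14:44", "Service fully restored"),
]


def minutes_since_first_alert(timeline):
    """Map each event to whole minutes elapsed since the first entry."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    return {
        event: int((datetime.strptime(stamp, "%H:%M") - start).total_seconds() // 60)
        for stamp, event in timeline
    }
```

For the timeline above this gives 3 minutes to engagement, 5 minutes to root cause, and 12 minutes to full restoration.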