Post Mortem: SQL Servers Overloaded
By Matthew Tse
On October 5, 2025, our SQL servers became overloaded, impacting email deliveries and API requests.
Summary
CPU utilization on the SQL servers reached 100%, causing cascading failures across the email delivery pipeline and the API.
Timeline
- 11:20 UTC - CPU alerts triggered.
- 11:25 UTC - Engineering team engaged.
- 11:40 UTC - Identified a runaway query from a background job.
- 11:45 UTC - Terminated the offending process.
- 12:00 UTC - Systems fully recovered.
Root Cause
A daily background analytics job was triggered with incorrect parameters, causing it to process the entire dataset instead of the incremental batch. The full run consumed all available database connections and CPU.
Impact
For approximately 40 minutes, email deliveries were delayed (queued in SQS) and API requests timed out. No emails were lost: all queued messages were delivered once the database recovered.
Resolution
We killed the runaway process, added parameter validation to the background job, and implemented connection pool limits for background processes, separate from production traffic.
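To illustrate the kind of parameter validation described above, here is a minimal sketch in Python. It is not the actual job code: the names `validate_window`, `BatchWindow`, and `MAX_WINDOW`, as well as the two-day ceiling, are assumptions made for the example. The idea is to reject any input that could widen an incremental run into a full-dataset run before a query is ever issued.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical ceiling on how far back a single incremental run may reach.
MAX_WINDOW = timedelta(days=2)


@dataclass(frozen=True)
class BatchWindow:
    start: datetime
    end: datetime


def validate_window(start, end, max_window=MAX_WINDOW):
    """Return a BatchWindow, or raise ValueError if the range looks unsafe."""
    if start is None or end is None:
        # A missing bound is the kind of input that can silently widen
        # a query to the whole table, so it is refused outright.
        raise ValueError("both start and end are required")
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    if start >= end:
        raise ValueError("start must be earlier than end")
    if end - start > max_window:
        raise ValueError(f"window {end - start} exceeds limit {max_window}")
    return BatchWindow(start, end)


# Example: a one-day window passes, a thirty-day window is rejected.
run_end = datetime(2025, 10, 5, tzinfo=timezone.utc)
ok = validate_window(run_end - timedelta(days=1), run_end)
```

Failing loudly at this point means a misconfigured trigger produces an error in the job scheduler rather than load on the database.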
Prevention
Background jobs now run on a read replica with strict resource limits. Production database connections are reserved exclusively for the API and delivery pipeline.
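The separation between background and production connections can be sketched as two independent budgets. This is an illustrative example only, not our production configuration: the `ConnectionBudget` class and the limits of 80 and 2 are hypothetical, and in a real deployment the background budget would guard connections to the read replica while the production budget guards the primary.

```python
import threading
from contextlib import contextmanager


class ConnectionBudget:
    """Caps how many connections one class of workload may hold at once."""

    def __init__(self, name, limit):
        self.name = name
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def slot(self):
        # Fail fast rather than queue: a background job that cannot get a
        # slot should back off instead of piling up behind the limit.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"{self.name}: connection budget exhausted")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical limits; real values depend on the database's connection cap.
production = ConnectionBudget("production", limit=80)
background = ConnectionBudget("background", limit=2)


def run_background_query(execute):
    """Run a callable while holding one slot from the background budget."""
    with background.slot():
        return execute()
```

Because the two budgets share no slots, a background job that exhausts its own budget cannot take connections away from the API or the delivery pipeline.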