
Post Mortem: Connection Spike Leads to Cascading SQL Performance Degradation

By Matthew Tse

A sudden connection spike caused cascading SQL performance degradation and delayed email deliveries for several hours.

Summary

On September 23, 2025, a sudden spike in database connections led to SQL performance degradation that affected email deliveries for approximately 3 hours.

Timeline

  • 06:45 UTC - Monitoring detected connection count spike.
  • 07:00 UTC - API latency began increasing.
  • 07:15 UTC - Engineering team engaged.
  • 08:30 UTC - Root cause identified.
  • 09:45 UTC - Full recovery confirmed.

What Happened

A DNS resolution issue in our infrastructure caused multiple services to simultaneously reconnect to the database, creating a thundering herd effect. The sudden influx of connections overwhelmed the connection pooler.

Impact

Email deliveries were delayed by 1-3 hours during the incident. API response times were elevated. No emails were lost — all queued messages were eventually delivered.

Root Cause

An internal DNS TTL expiration coincided with a brief DNS resolver hiccup, causing all application instances to simultaneously re-establish database connections. The connection pooler was not configured to handle this spike.
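To illustrate the mechanism, here is a minimal simulation (hypothetical fleet size and timings, not measurements from the incident) comparing a synchronized reconnect, where every instance hits the database the instant the TTL expires, against reconnects spread out with random jitter:

```python
import random

def peak_concurrent_attempts(start_times, window=1.0):
    """Largest number of reconnect attempts landing in any `window`-second span."""
    times = sorted(start_times)
    peak = 0
    for i, t in enumerate(times):
        # Count attempts falling within [t, t + window).
        j = i
        while j < len(times) and times[j] < t + window:
            j += 1
        peak = max(peak, j - i)
    return peak

random.seed(42)
instances = 200  # hypothetical fleet size

# Synchronized: every instance reconnects the moment the TTL expires.
synced = [0.0] * instances

# Jittered: each instance first waits a random delay of up to 30 seconds.
jittered = [random.uniform(0, 30) for _ in range(instances)]

print(peak_concurrent_attempts(synced))    # prints 200: the whole fleet at once
print(peak_concurrent_attempts(jittered))  # far fewer per one-second window
```

The pooler sees the same total number of connections either way; jitter only changes how sharply they arrive, which is exactly the difference between a surge it can absorb and one that overwhelms it.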

Resolution

  • Increased connection pooler limits.
  • Implemented connection jittering to prevent thundering herd scenarios.
  • Added DNS caching at the application level.
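The application-level DNS cache can be sketched roughly as follows. This is a minimal illustration rather than our actual implementation; the TTL value and injectable resolver are assumptions for the example:

```python
import socket
import time

class DnsCache:
    """Cache resolved addresses for a fixed TTL so a brief resolver hiccup
    does not force every lookup (and every reconnect) to hit DNS at once."""

    def __init__(self, ttl_seconds=60, resolve=None):
        self.ttl = ttl_seconds
        # The resolver is injectable for testing; defaults to a real DNS lookup.
        self._resolve = resolve or (lambda host: socket.gethostbyname(host))
        self._cache = {}  # host -> (address, expiry timestamp)

    def lookup(self, host):
        now = time.monotonic()
        entry = self._cache.get(host)
        if entry and entry[1] > now:
            return entry[0]  # still fresh; no DNS round trip
        address = self._resolve(host)
        self._cache[host] = (address, now + self.ttl)
        return address
```

The design trade-off is that instances may briefly use a slightly stale address after a failover, which for this class of incident is far cheaper than the whole fleet re-resolving and reconnecting simultaneously.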

Prevention

  • Implemented gradual reconnection logic with randomized backoff.
  • Upgraded our connection pooler to PgBouncer for better surge handling.
  • Added connection count alerting with lower thresholds.
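The gradual reconnection logic can be sketched as exponential backoff with full jitter. This is a sketch under stated assumptions, not our production code; the `connect` callable, base delay, cap, and attempt limit are all placeholders:

```python
import random
import time

def reconnect_with_backoff(connect, base=0.5, cap=30.0, max_attempts=10):
    """Retry `connect()` with exponentially growing, fully jittered delays.

    Full jitter (a delay drawn uniformly from [0, min(cap, base * 2**attempt)])
    spreads a fleet's retries out in time instead of synchronizing them,
    which is what turns a thundering herd into a gradual trickle.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
```

Full jitter is preferable here to plain exponential backoff because all instances share the same failure trigger (the TTL expiry), so without randomization their retry schedules would stay perfectly aligned.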