
Post Mortem: Connection Spike Leads to Cascading SQL Performance Degradation

By Matthew Tse

A sudden connection spike caused cascading SQL performance degradation and delayed email deliveries for several hours.

Summary

On September 23, 2025, a sudden spike in database connections led to SQL performance degradation that affected email deliveries for approximately 3 hours.

Timeline

  • 06:45 UTC - Monitoring detected connection count spike.
  • 07:00 UTC - API latency began increasing.
  • 07:15 UTC - Engineering team engaged.
  • 08:30 UTC - Root cause identified.
  • 09:45 UTC - Full recovery confirmed.

What Happened

A DNS resolution issue in our infrastructure caused multiple services to simultaneously reconnect to the database, creating a thundering herd effect. The sudden influx of connections overwhelmed the connection pooler.

Impact

Email deliveries were delayed by 1-3 hours during the incident. API response times were elevated. No emails were lost — all queued messages were eventually delivered.

Root Cause

An internal DNS TTL expiration coincided with a brief DNS resolver hiccup, causing all application instances to simultaneously re-establish database connections. The connection pooler was not configured to handle this spike.
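To illustrate the mechanism, here is a minimal simulation (hypothetical fleet size and timings, not measurements from the incident) comparing a synchronized reconnect, where every instance hits the database the instant the TTL expires, against reconnects spread out with random jitter:

```python
import random

def peak_concurrent_attempts(start_times, window=1.0):
    """Largest number of reconnect attempts landing in any `window`-second span."""
    times = sorted(start_times)
    peak = 0
    for i, t in enumerate(times):
        # Count attempts falling within [t, t + window).
        j = i
        while j < len(times) and times[j] < t + window:
            j += 1
        peak = max(peak, j - i)
    return peak

random.seed(42)
instances = 200  # hypothetical fleet size

# Synchronized: every instance reconnects the moment the TTL expires.
synced = [0.0] * instances

# Jittered: each instance first waits a random delay of up to 30 seconds.
jittered = [random.uniform(0, 30) for _ in range(instances)]

print(peak_concurrent_attempts(synced))    # prints 200: the whole fleet at once
print(peak_concurrent_attempts(jittered))  # far fewer per one-second window
```

The pooler sees the same total number of connections either way; jitter only changes how sharply they arrive, which is exactly the difference between a surge it can absorb and one that overwhelms it.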

Resolution

  • Increased connection pooler limits.
  • Implemented connection jittering to prevent thundering herd scenarios.
  • Added DNS caching at the application level.
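The application-level DNS cache can be sketched roughly as follows. This is a minimal illustration rather than our actual implementation; the TTL value and injectable resolver are assumptions for the example:

```python
import socket
import time

class DnsCache:
    """Cache resolved addresses for a fixed TTL so a brief resolver hiccup
    does not force every lookup (and every reconnect) to hit DNS at once."""

    def __init__(self, ttl_seconds=60, resolve=None):
        self.ttl = ttl_seconds
        # The resolver is injectable for testing; defaults to a real DNS lookup.
        self._resolve = resolve or (lambda host: socket.gethostbyname(host))
        self._cache = {}  # host -> (address, expiry timestamp)

    def lookup(self, host):
        now = time.monotonic()
        entry = self._cache.get(host)
        if entry and entry[1] > now:
            return entry[0]  # still fresh; no DNS round trip
        address = self._resolve(host)
        self._cache[host] = (address, now + self.ttl)
        return address
```

The design trade-off is that instances may briefly use a slightly stale address after a failover, which for this class of incident is far cheaper than the whole fleet re-resolving and reconnecting simultaneously.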

Prevention

  • Implemented gradual reconnection logic with randomized backoff.
  • Upgraded our connection pooler to PgBouncer for better surge handling.
  • Added connection count alerting with lower thresholds.
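The gradual reconnection logic can be sketched as exponential backoff with full jitter. This is a sketch under stated assumptions, not our production code; the `connect` callable, base delay, cap, and attempt limit are all placeholders:

```python
import random
import time

def reconnect_with_backoff(connect, base=0.5, cap=30.0, max_attempts=10):
    """Retry `connect()` with exponentially growing, fully jittered delays.

    Full jitter (a delay drawn uniformly from [0, min(cap, base * 2**attempt)])
    spreads a fleet's retries out in time instead of synchronizing them,
    which is what turns a thundering herd into a gradual trickle.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
```

Full jitter is preferable here to plain exponential backoff because all instances share the same failure trigger (the TTL expiry), so without randomization their retry schedules would stay perfectly aligned.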