Post Mortem: API Downtime

Summary

We recently attempted to perform a database migration that caused our API to go down. As a result, users were unable to access their accounts.

Timeline (Eastern US Time)

04:58AM [Downtime Begins]: Yosif attempts a database migration. The migration produces an error that causes our API to go down.
05:04AM [First Alert Fired]: Automated alerting notifies Matthew Tse, who is on call, of an API failure. Matthew acknowledges the alert.
05:10AM [First Responder Signs On]: Matthew Tse signs on and begins investigating.
05:10AM [Customers Alerted]: Matthew Tse posts an incident to the status page: https://status.improvmx.com/incident/842325
05:14AM [Mitigation Attempted][Recovery Complete]: We push code that reverts the breaking migration. The API service is back to normal.
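
The timestamps above imply a few response intervals that are useful when judging the alerting changes described later. The following is a minimal sketch that derives them. The event labels and the helper minutes_between are names chosen for this illustration, not part of any tooling we run, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the timeline above (Eastern US Time, same day).
# The dictionary keys are shorthand labels invented for this sketch.
TIMELINE = {
    "downtime_begins": "04:58AM",
    "first_alert": "05:04AM",
    "responder_signs_on": "05:10AM",
    "recovery_complete": "05:14AM",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day clock times like '04:58AM'."""
    fmt = "%I:%M%p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Time from the start of the outage until the first automated alert.
time_to_alert = minutes_between(TIMELINE["downtime_begins"], TIMELINE["first_alert"])
# Time from the alert until the responder was online and investigating.
alert_to_response = minutes_between(TIMELINE["first_alert"], TIMELINE["responder_signs_on"])
# Total outage duration.
time_to_recover = minutes_between(TIMELINE["downtime_begins"], TIMELINE["recovery_complete"])

print(f"Time to alert:     {time_to_alert} min")
print(f"Alert to response: {alert_to_response} min")
print(f"Total downtime:    {time_to_recover} min")
```

Based on the timeline, this gives 6 minutes to the first alert, 6 more minutes until the responder signed on, and 16 minutes of total downtime.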

What Went Wrong

The root cause was a gap in our deployment procedure: a necessary manual step was skipped. We have updated our documentation to prevent this from happening again. We have also added alert notifications during deploys, which should shorten our time to alert in the future.