Issue: Elevated Error Rates
Cause: Increased database traffic from a query retry mechanism that failed to close properly after query results were returned
Incident Timeline on the 14th of June:
15:40 UTC - Our Technical Operations Team began investigating alerts and an increase in database traffic. The additional traffic came from a retry mechanism that failed to close properly after query results were returned.
16:07 UTC - By this time, the source queries had been identified and closed. Once the impact was clear, we alerted customers to the issue.
16:29 UTC - With stability confirmed, we moved to the monitoring phase.
17:00 UTC - The incident was closed after further confirmation of stability.
Resolution:
Our resilient architecture kept services online, but this incident revealed opportunities for improvement. We have applied a fix that will help services better handle scenarios like this in the future.
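
As a generic illustration of the failure mode described in this report (and not a description of the actual fix), a retry wrapper can guard against runaway retries in two ways: it stops as soon as a result is returned, and it caps the number of attempts. The sketch below is a minimal example under those assumptions. The name run_with_retry and its parameters are hypothetical and do not refer to any real component of the service.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    query: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `query` (a hypothetical callable), retrying only on failure.

    The loop ends as soon as a result is returned and never exceeds
    `max_attempts`, so retries cannot keep generating traffic.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            # Returning here exits the loop: a successful result
            # never triggers another attempt.
            return query()
        except Exception:
            if attempt == max_attempts:
                # Give up and surface the error instead of retrying forever.
                raise
            # Exponential backoff: wait longer before each new attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing the sleep function as a parameter is a design choice that keeps the sketch easy to test without real delays.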