US1 - Connector Delays
Incident Report for Gainsight US
Postmortem

Incident:
Beginning around 12:00 UTC on June 13, engineers were alerted to elevated queue levels for Connector services in CS-US1.

Root Cause:
A leader node was found to have higher-than-usual disk activity, which prevented optimal job execution for Connector services.

Recovery Action:
Engineers scaled up the number of Connector instances as a temporary correction. Additionally, they skipped long-running and duplicate jobs to speed recovery.

Preventive Measures:
System configuration adjustments have been made to prevent a recurrence of this issue.

Posted Aug 11, 2023 - 04:20 UTC

Resolved
This incident has been resolved. A subset of customers experienced connector queue delays during the incident window. We will add RCA details as they become available.
Posted Jun 14, 2023 - 00:09 UTC
Monitoring
A fix has been implemented and we are monitoring the results. The Connectors queue was blocked for analysis and troubleshooting during this incident. We have since unblocked the queue, and any duplicate sync jobs were aborted with no data impact. Please expect delays while the queue clears.
Posted Jun 13, 2023 - 20:04 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 13, 2023 - 18:43 UTC
Investigating
Beginning around 12:00 UTC today, we detected a delay in Connectors traffic and adjusted accordingly. As queue delays persist, we are investigating further and will post updates as more information becomes available.
Posted Jun 13, 2023 - 17:29 UTC
This incident affected: Gainsight CS - US1 Region (US1 Data Ingestion Queue).