Incident Summary for issue on 28 May 2024 (External)
Gainsight CS - EU - Elevated errors in NXT Authentication
On 2024-05-28, between 07:31 and 08:45 UTC, users of the Gainsight Application in the CS EU Cloud experienced intermittent availability issues. The Gainsight UI was inaccessible for approximately 75 minutes during this window.
Root Cause:
Investigations have identified the following cause of the incident:
- An infrastructure component, specifically the backend worker service (Kubernetes Karpenter), was upgraded to a newer version to apply critical security patches and other updates.
- This change had already been successfully executed in the STAGE and other PROD environments.
- During the EU environment upgrade, all metadata configurations were transferred except for one critical rule.
- The missing rule permitted UDP communication with the DNS servers.
- Due to the absence of this rule, DNS requests could not be resolved, causing microservices on newly provisioned worker nodes to fail. Microservices on older worker nodes were unaffected.
- These failures resulted in a significant number of stale threads/connections in a short time frame, rendering the API Gateway unresponsive.
- Updating the missing rule in the Network Security Group and reprovisioning the worker nodes resolved the issue.
- Pending rule jobs were either skipped or resubmitted as necessary.
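
The failure mode described above (DNS queries over UDP being blocked from newly provisioned worker nodes) can be checked directly from a node. The following is a minimal sketch, not the tooling used during the incident: the function names `build_dns_query` and `udp_dns_reachable` are hypothetical, and it assumes the DNS server listens on the standard UDP port 53. It sends one hand-built A-record query and reports whether any matching response comes back.

```python
import socket
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query (hypothetical helper)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def udp_dns_reachable(server_ip: str, hostname: str,
                      port: int = 53, timeout: float = 2.0) -> bool:
    """Return True if the server answers a UDP DNS query within the timeout."""
    query = build_dns_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, port))
            response, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and network errors, e.g. a blocked UDP path.
            return False
    # A valid reply echoes the query ID and has the QR (response) bit set.
    return (
        len(response) >= 12
        and response[:2] == query[:2]
        and bool(response[2] & 0x80)
    )
```

Running `udp_dns_reachable("10.0.0.2", "example.com")` on a new node, where `10.0.0.2` stands in for the environment's actual DNS server address, would return `False` if the UDP rule were missing, while the same call on an older node with working connectivity would return `True`.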
Recovery Action:
- Updated the missing UDP rule in the Network Security Group.
- Restarted all affected services.
Preventive Measures:
- Verify that network rules are consistent before and after any upgrade. This process has been initiated.
- Schedule critical security updates, and even low-risk infrastructure changes, during non-peak hours to minimize impact, regardless of previous successes in other environments.
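
The rule consistency check in the first measure can be illustrated with a simple set comparison. This is a sketch under stated assumptions rather than a description of the process that was initiated: the `Rule` fields and the `missing_rules` function are hypothetical, and real Network Security Group rules carry more attributes (priority, source, action) that would need to be exported from the cloud provider before comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Simplified, hypothetical view of one network security rule."""
    direction: str    # "inbound" or "outbound"
    protocol: str     # "tcp" or "udp"
    port: int
    destination: str


def missing_rules(before: list[Rule], after: list[Rule]) -> list[Rule]:
    """Return rules present before the upgrade but absent afterwards."""
    lost = set(before) - set(after)
    # Sort for stable, readable output.
    return sorted(
        lost, key=lambda r: (r.direction, r.protocol, r.port, r.destination)
    )


# Illustrative data only: snapshots taken before and after an upgrade.
before_upgrade = [
    Rule("outbound", "tcp", 443, "api-gateway"),
    Rule("outbound", "tcp", 53, "dns-servers"),
    Rule("outbound", "udp", 53, "dns-servers"),
]
after_upgrade = [
    Rule("outbound", "tcp", 443, "api-gateway"),
    Rule("outbound", "tcp", 53, "dns-servers"),
]

for rule in missing_rules(before_upgrade, after_upgrade):
    print(f"Missing after upgrade: {rule}")
```

A non-empty result from such a comparison could be used to block the rollout until the difference is reviewed.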