Gainsight CS - EU - Elevated errors in NXT Authentication
Incident Report for Gainsight
Postmortem

Incident Summary for issue on 28 May 2024 (External)

On 2024-05-28 between 07:31 and 08:45 UTC, users of the Gainsight Application in the CS EU Cloud experienced intermittent application availability issues. The Gainsight UI was inaccessible for approximately 75 minutes during this window.

Root Cause:

Investigations have identified the following cause of the incident:

  • An infrastructure component, specifically the backend worker service (Kubernetes Karpenter), was upgraded to a newer version to apply critical security patches and other updates.
  • This change had already been successfully executed in the STAGE and other PROD environments.
  • During the EU environment upgrade, all metadata configurations were transferred except for one critical rule.
  • The missing rule is the one that allows UDP communication with the DNS servers.
  • Due to the absence of this rule, DNS requests could not be resolved, causing microservices on newly provisioned worker nodes to fail. Microservices on older worker nodes were unaffected.
  • These failures resulted in a significant number of stale threads/connections in a short time frame, rendering the API Gateway unresponsive.
  • Updating the missing rule in the Network Security Group and reprovisioning the worker nodes resolved the issue.
  • Pending rule jobs were either skipped or resubmitted as necessary.
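
The failure mode described above (DNS queries over UDP blocked on newly provisioned nodes) can be detected with a direct probe. The following is a minimal sketch, not the check Gainsight used: it assumes plain DNS over UDP on port 53, and the function names `build_dns_query` and `udp_dns_reachable` are hypothetical. It sends one A-record query and reports whether any response comes back.

```python
import socket
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query packet (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def udp_dns_reachable(server: str, hostname: str,
                      timeout: float = 2.0, port: int = 53) -> bool:
    """Return True if the server answers a UDP DNS query within the timeout.

    A blocked UDP path typically shows up as a timeout, which is
    reported here as False.
    """
    query = build_dns_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            response, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and ICMP "port unreachable"
            return False
    # Accept only a response that echoes our query ID and has the QR bit set.
    return (
        len(response) >= 12
        and response[:2] == query[:2]
        and bool(response[2] & 0x80)
    )
```

Running such a probe from a freshly provisioned worker node against the cluster's DNS server address (an environment-specific value not given in this report) would return False while the UDP rule is missing, before any microservice is scheduled onto the node.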

Recovery Action:

  1. Updated the missing UDP rule in the Network Security Group.
  2. Restarted all affected services.

Preventive Measures:

  1. Ensure network rules consistency before and after any upgrade – this process has been initiated.
  2. Schedule critical security updates and even low-risk infrastructure changes during non-peak hours, despite previous successes in other environments, to minimize impact.
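
The first preventive measure, checking network rule consistency before and after an upgrade, could be automated along the following lines. This is an illustrative sketch only: the `Rule` fields, the `missing_rules` helper, and the sample rules are hypothetical and do not reflect the actual Network Security Group configuration. In practice the two rule sets would be exported with the cloud provider's own tooling.

```python
from typing import Iterable, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical, simplified representation of one network rule."""
    direction: str
    protocol: str
    port: int
    destination: str


def missing_rules(before: Iterable[Rule], after: Iterable[Rule]) -> List[Rule]:
    """Return rules present before the upgrade but absent afterwards."""
    return sorted(set(before) - set(after))


# Illustrative snapshots; the values are placeholders, not real configuration.
before_upgrade = [
    Rule("outbound", "tcp", 443, "any"),
    Rule("outbound", "tcp", 53, "dns-servers"),
    Rule("outbound", "udp", 53, "dns-servers"),
]
after_upgrade = [
    Rule("outbound", "tcp", 443, "any"),
    Rule("outbound", "tcp", 53, "dns-servers"),
]

dropped = missing_rules(before_upgrade, after_upgrade)
for rule in dropped:
    print(f"Missing after upgrade: {rule}")
```

A non-empty result would be grounds to halt the rollout before new worker nodes are provisioned.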
Posted Jun 03, 2024 - 07:14 UTC

Resolved
This incident has been resolved.
Posted May 28, 2024 - 09:10 UTC
Update
We are continuing to monitor for any further issues.
Posted May 28, 2024 - 08:45 UTC
Monitoring
The fix has been implemented and all services are back to normal. The queues have also been released, and the jobs will catch up in the next couple of hours.
We are monitoring closely.
Posted May 28, 2024 - 08:45 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted May 28, 2024 - 08:08 UTC
Update
The Gainsight NXT application is still down. We are working with the upstream service provider.

We will post updates as soon as they are available.
Posted May 28, 2024 - 07:31 UTC
Investigating
We are investigating errors while logging into the Gainsight NXT application.

We will post updates as soon as they are available.
Posted May 28, 2024 - 06:46 UTC
This incident affected: Gainsight CS - EU Region (Gainsight CS EU Application).