Resolved

Root Cause Analysis:

Issue Reported On: 2/07/2020 5:51 AM GMT / 11:21 AM IST

Issue Resolved On: 2/07/2020 6:19 AM GMT / 11:49 AM IST

On 2/07/2020, starting at approximately 5:51 AM GMT / 11:21 AM IST, we began receiving customer complaints about intermittent slowness and 503 Service Unavailable errors while resolving their respective Supersites. Around the same time, our Engineering Teams received multiple database connectivity alerts from our internal Monitoring System. Additionally, a few Supersite Server Containers went into an unhealthy state.

The Engineering Teams quickly got together and began troubleshooting. Initial investigations pointed to two issues: one relating to Memcache response times and the other to database connectivity.

The team determined that the Memcache issue was caused by a misconfiguration introduced during one of our previous code uploads. Once this configuration was corrected, Memcache began functioning normally.

The database connectivity issue persisted, and the team began digging deeper. The Database Administration Team confirmed that there were no issues on the database side. Our System graphs revealed that the database was not receiving any traffic, particularly for requests hitting the Primary Supersite Server, where TCP packets were being dropped.

The team immediately removed that Server from the cluster and routed all requests through the other Servers in the cluster. Once this was done, all Supersites began functioning normally.

After this, the System Administration Team carried out further troubleshooting and rebooted the Primary Server. A few System-level tweaks were made, after which the TCP packet drop issue was fixed.

Long Term Action Items:

1. Move file synchronization from NFS to Syncthing

2. Reduce static data content and increase redundancy

3. Faulty containers should be dropped from rotation automatically
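As an illustration of action item 3, the following is a minimal sketch of how a container could be taken out of rotation automatically after repeated failed health checks. It does not describe the actual Supersite setup: the names `Backend`, `Rotation`, `record_check`, and the threshold of 3 consecutive failures are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True


@dataclass
class Rotation:
    # Hypothetical threshold: eject after this many failed checks in a row.
    failure_threshold: int = 3
    backends: dict = field(default_factory=dict)

    def add(self, name):
        self.backends[name] = Backend(name)

    def record_check(self, name, healthy):
        """Record one health-check result and update rotation membership."""
        backend = self.backends[name]
        if healthy:
            # A passing check resets the counter and restores the backend.
            backend.consecutive_failures = 0
            backend.in_rotation = True
        else:
            backend.consecutive_failures += 1
            if backend.consecutive_failures >= self.failure_threshold:
                backend.in_rotation = False

    def active(self):
        """Names of backends currently eligible to receive requests."""
        return sorted(b.name for b in self.backends.values() if b.in_rotation)
```

In this sketch, a single failed check does not eject a container, which avoids removing a healthy one over a transient error; only a run of consecutive failures does, and one passing check brings it back.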

Posted July 3rd, 2020 1:47 pm GMT

Identified

Update:

The Network Operations Team identified that one of the Supersite Servers was dropping TCP packets randomly, which was causing the inter-module connectivity issues.

The team has removed that Server from the cluster and switched operations over to the other Servers in the cluster.

The plan now is to restart that Server and continue further troubleshooting in order to identify the Root Cause of the issue.

The team has also identified the peripheral causes below that contribute to the slow loading of the Supersites and is working on fixing them in parallel.

1. MySQL Database connectivity issues

2. mcrouter configuration issues

3. High disk I/O

4. High CPU utilization

As of now, all Supersites are resolving normally. We will update this thread as more information becomes available.

Posted July 2nd, 2020 12:21 pm GMT

Investigating

Update:

We observed that certain critical modules were not able to communicate with each other, resulting in intermittent inter-module connectivity issues.

The System Administration Team, along with the Engineering Team, is reviewing System logs to identify the exact cause of the issue.

Posted July 2nd, 2020 7:29 am GMT

Investigating

We have identified an issue with the Supersite wherein you may see intermittent slowness or encounter the following error while trying to resolve your Supersite:

503 Service Unavailable

The Supersite and System Administration Teams are working on the issue. Preliminary investigations indicate that load on the Database Servers could be causing it. Further investigations are underway, and we will update this post with more details shortly.

Posted July 2nd, 2020 6:16 am GMT

Intermittent Slowness and Errors While Resolving the Supersite

Opened on July 2nd, 2020 6:16 am GMT, last updated July 3rd, 2020 1:47 pm GMT
