Root Cause Analysis:
At around 11:09 AM IST (05:39 AM GMT), we received a few down alerts from our Monitoring System. Because these alerts were recovering every few seconds, we initially thought it was a network-related issue at those particular locations.
We then noticed that all Supersites were intermittently returning one of the following errors:
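The pattern described above (alerts that recover every few seconds) can be told apart from a sustained outage by counting state changes in a window of check results. The following is only a minimal sketch of that idea, not our Monitoring System's actual logic; the function names and the `flap_threshold` value are hypothetical.

```python
from typing import Sequence


def count_transitions(results: Sequence[bool]) -> int:
    """Count how often consecutive checks changed state (up <-> down)."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


def classify(results: Sequence[bool], flap_threshold: int = 3) -> str:
    """Classify a window of health-check results (True = up, False = down).

    flap_threshold is an assumed value chosen for illustration only.
    """
    if all(results):
        return "healthy"
    if not any(results):
        return "down"
    if count_transitions(results) >= flap_threshold:
        return "flapping"
    return "degraded"
```

A window such as `[True, False, True, False, True]` changes state four times and is classified as flapping, which points at a service restarting repeatedly rather than at a single location losing connectivity.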
1. 503 Service Unavailable
2. SQLSTATE[HY000] [2002] Connection refused
3. SQLSTATE[HY000] [2006] MySQL server has gone away
4. SQLSTATE[HY000] [2002] No route to host
Our System Administration Team immediately began troubleshooting and noticed that all our Supersite container health checks were failing and that the application was unable to establish a steady connection with the Database Servers. The team immediately looped in our Database Administrators, and further troubleshooting began.
The Database Administration Team noticed that ProxySQL was restarting every few seconds and, as a result, database connections were being dropped intermittently. On checking further, the team found the following error in the logs:
http://prntscr.com/q88tyu
This error was caused by the following bug:
https://github.com/sysown/proxysql/issues/2131
The Database Administration Team immediately updated ProxySQL from version 2.0.4 to 2.0.6 on one of the proxies and monitored the Server after the update. Database connections were stable and the Server was no longer restarting randomly. Once this fix was confirmed to work, it was applied to the other Proxy as well, and everything stabilized. All Supersites began resolving normally once again.
When we moved to a clustered setup in September 2019, ProxySQL 2.0.6 had just been released and still had a few bugs, which is why we had not updated our Servers to it. These bugs were recently fixed.
Root Cause: Bug in ProxySQL 2.0.4.
Action Items: Proactively update Server Software to the latest stable versions available.
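As an illustration of this action item, a version check along the following lines could flag proxies still running a release older than the one that fixed the bug. This is a hedged sketch only: the helper names are hypothetical, it handles plain dotted numeric versions such as 2.0.4 only, and how the installed version is collected from each Server is left out.

```python
from typing import Tuple

# 2.0.6 is the version that resolved the restarts described in this report.
MIN_FIXED_VERSION = "2.0.6"


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '2.0.4' into (2, 0, 4)."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed: str, minimum: str = MIN_FIXED_VERSION) -> bool:
    """Return True when the installed version is older than the minimum."""
    return parse_version(installed) < parse_version(minimum)
```

Comparing integer tuples rather than raw strings keeps a version such as 2.0.10 correctly ordered after 2.0.6.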
There was a temporary network-related issue with our Supersite Servers, which caused a few Supersites to not resolve. Our System Admin team looked into this issue, after which connectivity was restored.
We are monitoring our servers to ensure there are no further such issues and shall post a detailed Root Cause Analysis soon.
We are currently encountering an issue with our Supersite servers, due to which you may receive intermittent error messages such as "503 Service Unavailable" while accessing the Supersite.
Our System Admin team is already working on this, and efforts are ongoing to resolve the issue at the earliest. We will update this thread as soon as we have any further information.