Here is the RCA for the Supersite Downtime we faced on the 31st of July, 2020.
Issue:
At around 02:00 AM IST, we received several alerts from our monitoring system about the Supersite cache. The cache problem affected the Supersite's accessibility as well as the visibility of certain products.
Our System Administration Team, along with the Development Team, immediately began troubleshooting the issue. They noticed that the /login.php page on certain Supersite nodes was under a DDoS attack.
Impact:
The Supersite Admin area was inaccessible, the Express Cart's drop-down menu was missing products, and no products other than the Domain registration services were visible on the Supersite.
Root Cause:
While analyzing the request logs on the servers, the System Administration team found that a high number of requests were being made to the /login.php file on a particular Supersite node. The team quickly identified this as a heavy DDoS attack, with all of the requests coming from a single IP address.
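The kind of log analysis described above can be sketched as follows. This is only an illustration, not the team's actual tooling: it assumes the access logs are in the common log format (client IP first, request path as the seventh whitespace-separated field), and the function name `top_clients` is hypothetical.

```python
from collections import Counter


def top_clients(log_lines, path="/login.php"):
    """Count requests per client IP for one path.

    Assumes common/combined log format lines such as:
    203.0.113.7 - - [31/Jul/2020:02:00:01 +0530] "POST /login.php HTTP/1.1" 200 512
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 7:
            continue  # skip lines that do not match the assumed format
        ip, request_path = parts[0], parts[6]
        # Ignore any query string when comparing the path.
        if request_path.split("?", 1)[0] == path:
            counts[ip] += 1
    # Busiest clients first.
    return counts.most_common()
```

A single address dominating the result for /login.php would match the pattern the team observed.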
The System Administration team added an ACL for the affected URL to mitigate the high number of requests. In the meantime, Cloudflare DDoS protection was enabled for the affected URL to contain the attack. After monitoring for some time, the team tweaked the ACL to allow traffic only from Cloudflare IPs, which helped resolve the issue.
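The logic of an ACL that only admits traffic from a known set of proxy ranges can be sketched as below. This is a minimal illustration under stated assumptions, not the ACL that was actually deployed: the ranges shown are placeholder documentation blocks, not Cloudflare's real ranges (those would need to come from Cloudflare's published IP list), and `is_allowed` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT Cloudflare's real ranges.
ALLOWED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_allowed(client_ip, allowed=ALLOWED_RANGES):
    """Return True if client_ip falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in allowed)
```

With such a check in front of the affected URL, requests that bypass the proxy and hit the origin directly are rejected.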
Action Items:
Implement iptables-based rate-limiting for Supersite requests
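The idea behind per-IP rate-limiting can be illustrated with a small sliding-window sketch. This is Python rather than an iptables ruleset, so it only shows the logic, not the planned implementation; the class name `PerIpRateLimiter` and the thresholds used with it are assumptions made for illustration.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most max_requests per client IP within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject this request
        hits.append(now)
        return True
```

Each IP is tracked separately, so one client sending a flood of requests is throttled without affecting other visitors.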
Our development team observed that the Supersite server cache was not rebuilt automatically. This affected several Supersite and Express Cart functions: the Supersite admin area could not be accessed, the drop-down menu on the Express Cart had missing products, and no products other than the Domain registration services were visible on the Supersite.
Our development team has manually rebuilt the cache, after which the issue was fixed. However, our team is continuing to monitor the situation to avoid any further discrepancies. Kindly monitor this post for further updates.
We are currently experiencing intermittent issues while accessing the Supersite admin area.
Our System Administration team has been alerted and they are already working towards resolving it at the earliest. We regret the inconvenience caused to you and shall continue to post updates regarding the issue here at regular intervals.
Your patience and cooperation are highly appreciated.
Affected Services