February 04, 2016

WSO2 Cloud Incident Report: Feb 4, 2016

WSO2 Cloud recently suffered a serious degradation of service: API Publisher and Store became slow and did not function properly. Some users received a 404 error when trying to access the UI. Below is the detailed post-mortem analysis of the incident.

Start time: February 4, 2016, 8:35 PST
Recovery time: February 4, 2016, 9:45 PST

Impact:
  • Publisher and Store UIs were not functioning properly.
  • Because of this, API-related administrative tasks could not be performed.
  • However, already hosted APIs were functioning properly: https://uptime.cloud.wso2.com/
Root cause: One of the Publisher/Store nodes started consuming too much memory. The node was initially able to recover by itself but later became unresponsive. This also affected the other node in the cluster, which started to slow down. Because the first node was unresponsive but not completely down, the load balancer kept routing some users to it. Those users received a 404 error, while the remaining users experienced extreme slowness.

Actions:
  1. Our monitoring system failed to read the memory consumption of the problematic node at the time of the incident, so no alarm was raised. We have fixed this.
  2. We have tuned the alarm thresholds so that we are notified earlier than we were during this incident.
  3. We have enabled Java Flight Recorder to collect statistics that we can use to investigate similar situations in the future.
  4. We have improved our health-checking tool to probe individual instances rather than going through the load balancer. This will help us identify these kinds of situations quickly.

With the measures taken in #2 and #3, we will gather the information needed to identify the culprit and then escalate to the internal support team for a fix.
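To illustrate two of the actions above, here is a minimal Python sketch of what per-instance probing and a "missing reading" alarm could look like. It is not our actual tooling: the node names, URLs, the `/health` path, and the 0.8 threshold are all hypothetical and chosen only for the example. The idea is to query each node directly, so that a node returning 404 cannot hide behind a healthy peer, and to treat an absent memory reading as an alarm condition rather than as silence.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional

# Hypothetical per-node addresses (assumption: not taken from the report).
INSTANCES = {
    "node-1": "http://node-1.example.internal/health",
    "node-2": "http://node-2.example.internal/health",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request sent directly to one node."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code we want to report.
        return err.code


def probe_instances(
    instances: Dict[str, str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, str]:
    """Probe every instance on its own address, bypassing the load balancer."""
    results = {}
    for name, url in instances.items():
        try:
            status = fetch(url)
        except OSError as err:
            # Covers URLError, timeouts, and refused connections.
            results[name] = f"unreachable ({err.__class__.__name__})"
            continue
        if status == 200:
            results[name] = "healthy"
        else:
            results[name] = f"unhealthy (HTTP {status})"
    return results


def memory_alarm(used_fraction: Optional[float], threshold: float = 0.8) -> bool:
    """Alarm when memory use reaches the threshold OR the reading is missing."""
    if used_fraction is None:
        # A failed read is itself a reason to alert, not a reason to stay quiet.
        return True
    return used_fraction >= threshold
```

Calling `probe_instances(INSTANCES)` would return one status string per node. The `fetch` parameter exists only so the probe logic can be exercised without a network.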