WSO2 Cloud experienced a serious service degradation on 12th January 2016: users were unable to log in to the cloud for a few hours.
Start time: 12th January 2016, 08:33 PST
Recovery time: 12th January 2016, 12:17 PST
Impact:
- Users were not able to log in to the cloud.
- Sign-up was not working.
- The API Gateway kept serving API calls at normal performance levels during the incident, apart from a 5-minute gateway downtime during the database restart: http://uptime.cloud.wso2.com/
Root cause:
- One of the housekeeping tasks running in our Identity Servers failed because it could not acquire a lock on a database table. The same table stores user sessions, and while it was locked the system could not complete new user logins.
- Since we could not determine which component was holding the lock on the table, we had to restart the database server to get the system back on track.
Corrective actions:
- We have decreased the frequency of the aforementioned housekeeping tasks, as advised by our Identity Server team.
- We have also raised a ticket with our internal support team to prevent future failures of this task.
- We are investigating further to find out which component was holding the table lock and to fix it.
- We are looking into alerting and maintenance processes to ensure quicker resolution time in the future.
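
As an illustration of the kind of safeguard the last item points at, here is a minimal sketch of a housekeeping job that gives up and raises an alert when it cannot get its lock in time, rather than waiting indefinitely. It is not the Identity Server implementation: `run_housekeeping`, `cleanup`, `alert`, and `lock_timeout_s` are hypothetical names, and an in-process `threading.Lock` stands in for the database table lock.

```python
import threading
import time
from typing import Callable


def run_housekeeping(table_lock: threading.Lock,
                     cleanup: Callable[[], None],
                     alert: Callable[[str], None],
                     lock_timeout_s: float = 5.0) -> bool:
    """Run cleanup under the lock; alert instead of waiting forever.

    All names here are hypothetical. The threading.Lock is only a
    stand-in for the real database table lock.
    """
    # Bounded wait: give up after lock_timeout_s seconds.
    acquired = table_lock.acquire(timeout=lock_timeout_s)
    if not acquired:
        alert(f"housekeeping skipped: lock not acquired within {lock_timeout_s}s")
        return False

    started = time.monotonic()
    try:
        cleanup()
        return True
    finally:
        table_lock.release()
        held = time.monotonic() - started
        # Also flag runs that held the lock for suspiciously long.
        if held > lock_timeout_s:
            alert(f"housekeeping held the lock for {held:.1f}s")
```

With a bounded wait, a contended lock turns into an early alert naming the affected task, which is one way to shorten the time to diagnosis.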