Rainbow Services experienced disruptions on Monday, November 16, 2020, from 10:56AM CET to 12:20PM CET.
What Happened:
Incident description:
An internal race condition within our core messaging service led to an application crash and inconsistent behavior, halving our service processing capacity in the EMEA Region. We mitigated the incident by rerouting traffic to another datacenter in EMEA while progressively restarting the impacted servers, which degraded service in the meantime.
During the restart and rerouting phase, some components, such as those managing Telephony Services or Bubbles, had to be restarted, making these features temporarily unavailable.
Incident Time frame:
- From 10:56AM CET to 11:10AM CET: The operations team detected slowness in the databases of a datacenter in the EMEA Region.
- From 11:11AM CET to 11:35AM CET: The slowness caused delays and some failures in queries issued against the databases. Some components were restarted to restore the situation. Other running servers still considered the restarted components to be available and could keep sending them requests, resulting in inconsistent behavior.
- From 11:36AM CET to 11:47AM CET: All database servers in this datacenter were restarted and all traffic was routed to another datacenter in the EMEA Region. Services slowly started to recover.
- From 11:48AM CET to 12:20PM CET: All databases in the EMEA Region datacenters were back up, and the queue was gradually drained until the situation returned to normal.
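The inconsistent behavior in the second phase came from servers holding a stale view of which components were available. As an illustration only, and not a description of how Rainbow is implemented, a heartbeat-based liveness check is one common way to avoid sending requests to a component that has stopped responding. The class name `PeerRegistry`, its methods, and the timeout value are hypothetical.

```python
import time
from typing import Callable, Dict, Iterable, List


class PeerRegistry:
    """Hypothetical sketch: tracks the last heartbeat of each peer component
    and treats peers that have been silent for too long as unavailable."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, peer: str) -> None:
        """Record that `peer` has just reported in."""
        self._last_seen[peer] = self._clock()

    def is_alive(self, peer: str) -> bool:
        """A peer is alive only if it has reported in within the timeout."""
        seen = self._last_seen.get(peer)
        return seen is not None and (self._clock() - seen) <= self.timeout_s

    def live_peers(self, candidates: Iterable[str]) -> List[str]:
        """Keep only the candidates that are currently considered alive."""
        return [p for p in candidates if self.is_alive(p)]
```

With such a check, a server would send requests only to the peers returned by `live_peers`, so a component that is restarting drops out of rotation once its heartbeats stop, and rejoins when they resume.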
Incident impact:
Note that the impact depends on the region of the Rainbow company, not on the region of the individual Rainbow user.
- From 10:55AM CET to 11:00AM CET:
  - Slowness in the application resulted in a degraded level of service.
- From 11:01AM CET to 11:11AM CET:
  - Some users may have experienced Bubble issues when trying to create a group conversation or to start or join a Web Conference.
  - Slowness in the application resulted in a degraded level of service.
- From 11:12AM CET to 11:20AM CET:
  - Some users were disconnected and could not reconnect.
  - Some users lost their mobile notifications and needed to open the mobile app again to retrieve them.
  - Some users may have experienced Bubble issues when trying to create a group conversation or to start or join a Web Conference.
  - Slowness in the application resulted in a degraded level of service.
- From 11:21AM CET to 12:13PM CET:
  - Some users lost their Telephony Services.
  - Some users may have experienced Bubble issues when trying to create a group conversation or to start or join a Web Conference.
  - Slowness in the application resulted in a degraded level of service.
- From 12:14PM CET to 12:16PM CET:
  - Slowness in the application resulted in a degraded level of service.
Corrective Measures:
- Upgrade some components to a newer version to improve the robustness of the Rainbow infrastructure.
- Add scripts that automatically restart databases when congestion is detected.
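As a rough sketch of what the second measure could look like, and not the actual scripts, the decision logic below triggers a restart only after several consecutive slow samples and then waits for a cooldown before it can trigger again. The class name `CongestionWatchdog`, the latency metric, and all thresholds are assumptions chosen for illustration. The restart itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class CongestionWatchdog:
    """Hypothetical sketch: decides when a congested database should be restarted."""

    latency_threshold_ms: float   # a sample above this counts as a breach
    consecutive_breaches: int     # breaches in a row required to trigger a restart
    cooldown_samples: int         # samples to ignore after a restart was triggered
    _breaches: int = field(default=0, init=False)
    _cooldown_left: int = field(default=0, init=False)

    def observe(self, latency_ms: float) -> bool:
        """Feed one latency sample; return True if a restart should be triggered."""
        if self._cooldown_left > 0:
            # Give the database time to come back before judging it again.
            self._cooldown_left -= 1
            return False
        if latency_ms > self.latency_threshold_ms:
            self._breaches += 1
        else:
            self._breaches = 0  # a healthy sample resets the streak
        if self._breaches >= self.consecutive_breaches:
            self._breaches = 0
            self._cooldown_left = self.cooldown_samples
            return True
        return False
```

Requiring consecutive breaches avoids restarting on a single slow query, and the cooldown prevents a restart loop while the database is still warming up.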
COMMUNICATION HISTORY:
Monday, November 16, 2020 - 07:10PM CET
EMEA users were impacted today by an unplanned outage and experienced degraded Rainbow service from 10:55AM CET to 12:16PM CET. Our Operations teams acted immediately to recover the service as soon as our monitoring tools detected the incident.
A detailed Root Cause Analysis is ongoing and will be published on Nov. 17.