Rainbow services experienced disruptions on Monday, December 07, 2020, from 08:21AM CET to 11:00PM CET.
What Happened:
Incident description:
A power outage affecting the IaaS (Infrastructure as a Service) provider of our CALA datacenters made Rainbow services unavailable in this region. The power outage was caused by a water leak in the datacenter's cooling system.
The unavailability of the CALA region's datacenters caused a significant increase in traffic in other regions, particularly EMEA and DE, during their peak business hours. The resulting slowness degraded the level of service for users in those regions, exclusively for Bubbles and Web Conferences. Android users may also have had trouble connecting to and accessing Rainbow services.
A patch was produced and deployed so that each region can better manage traffic when another region is completely disconnected.
The resolution of the cooling problem by our IaaS provider then allowed us to restart our servers and restore services in the CALA region.
Incident Time frame:
- From 08:22AM CET to 08:45AM CET: The operations team detects slowness in some components of a datacenter in the CALA region.
- From 08:46AM CET to 08:50AM CET: The slowness in the CALA region continues to trigger alerts for the operations team, and a sharp increase in traffic in the EMEA and DE regions is causing congestion in certain components.
- From 08:50AM CET to 10:15AM CET: The CALA region is now totally disconnected and congestion in the EMEA and DE regions continues. Some components are restarted in these regions in an attempt to alleviate congestion, but without success.
- From 10:15AM CET to 11:55AM CET: The CALA region is still totally disconnected and congestion in the EMEA and DE regions continues. Congestion is related to the excess traffic coming from the CALA region. An attempt to virtually isolate the CALA region is launched, but without success.
- From 12:00PM CET to 12:15PM CET: The CALA region is still totally disconnected. A first patch is prepared and deployed to reduce congestion in the EMEA and DE regions. The deployment succeeds and halves the latency.
- From 12:16PM CET to 12:55PM CET: The CALA region is still completely disconnected. A second patch is prepared and deployed to remove the remaining congestion in the EMEA and DE regions. The deployment succeeds and there is no more congestion once the patch is applied.
- From 12:56PM CET to 02:10PM CET: The CALA region is still completely disconnected. Users in the EMEA and DE regions are still experiencing problems creating new Bubbles or connecting from the Android application.
- From 02:11PM CET to 03:57PM CET: The CALA region is still completely disconnected. A third patch is prepared and deployed to solve the Android connection problems and the problems creating new Bubbles in the EMEA and DE regions.
- From 03:58PM CET to 09:04PM CET: The CALA region is still completely disconnected. Residual effects of congestion in the EMEA and DE regions are causing minor slowdowns on some Rainbow BOTs.
- From 09:05PM CET to 09:45PM CET: The IaaS provider of the datacenter in the CALA region fixes the cooling outage and the region is back online.
- From 09:46PM CET to 11:00PM CET: All Rainbow services are restarting and are gradually being made available to users in the CALA region. The return of the CALA region also helps to mitigate the slower speeds affecting some BOTs in the EMEA and DE regions.
Incident impact:
Note that the impact depends on the region of the Rainbow company, not on the region of the individual Rainbow user.
CALA region:
- From 08:22AM CET to 09:45PM CET:
  - Rainbow services (Connection, Messaging, Bubbles, Conferences, Telephony Services ...) are fully unavailable.
- From 09:46PM CET to 11:00PM CET:
  - Rainbow services are restarting and are gradually being made available.
EMEA and DE regions:
- From 08:22AM CET to 08:45AM CET:
  - Slowness in the application results in a degraded level of service.
- From 08:46AM CET to 12:55PM CET:
  - Bubble issues when trying to create a group conversation, retrieve the history of a conversation, or start or join a Web Conference.
  - Unable to connect from the Android application.
  - Slowness in the application results in a degraded level of service.
- From 12:56PM CET to 03:57PM CET:
  - Bubble issues when trying to create a new group conversation. Existing Bubbles are not affected.
  - Unable to connect from the Android application.
  - Slowness in the application results in a degraded level of service.
- From 03:58PM CET to 09:46PM CET:
  - Slight slowdowns may affect some Rainbow BOTs.
Corrective Measures:
- Make all components globally more resilient, e.g. in case one or more regions are disconnected.
- Evaluate setting up redundant datacenter locations (multiple IaaS providers) and servers in separate buildings.
- In addition, we continue to implement the contingency plan announced two weeks ago.
Communication History:
Information about this outage and its resolution has been published regularly on the Status website through two incidents: