Rainbow Services experienced disruptions from Monday, August 31 at 09:55PM CEST to Tuesday, September 01 at 01:34AM CEST.
What Happened:
Incident description:
An issue lasting a few minutes on a database server in the EMEA region caused an overall slowdown of the application. While the database was temporarily unavailable, all user requests (connection, get Bubble messages, get presence of contacts, etc.) accumulated in a queue on the servers.
Because of this congestion, some components, such as those managing Telephony Services or Bubbles, had to reconnect, which made these features unavailable.
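The queuing effect described above can be illustrated with a small toy model. This is only a sketch: the function names and every number (arrival rate, service rate, outage length) are hypothetical and are not measurements from this incident. It shows the general point that when requests keep arriving during a short outage, the backlog can take much longer to drain than the outage itself lasted.

```python
def simulate_backlog(arrival_per_s, service_per_s, outage_s, horizon_s):
    """Return the queue length at the end of each second.

    The server processes nothing during the first `outage_s` seconds,
    then processes up to `service_per_s` requests per second.
    All parameters are hypothetical illustration values.
    """
    backlog = 0
    history = []
    for t in range(horizon_s):
        backlog += arrival_per_s              # requests keep arriving
        if t >= outage_s:                     # the database is available again
            backlog -= min(backlog, service_per_s)
        history.append(backlog)
    return history


def seconds_until_drained(history, outage_s):
    """First second after the outage at which the queue is empty, else None."""
    for t, backlog in enumerate(history):
        if t >= outage_s and backlog == 0:
            return t
    return None


# Hypothetical load: 100 requests/s arriving, capacity of 110 requests/s,
# and a 3-minute (180 s) outage.
history = simulate_backlog(arrival_per_s=100, service_per_s=110,
                           outage_s=180, horizon_s=4000)
peak = max(history)
drained_at = seconds_until_drained(history, outage_s=180)
print(f"peak backlog: {peak} requests, queue empty after {drained_at} s")
```

With these assumed values, a 3-minute outage leaves a backlog that takes roughly 30 more minutes to clear, because the spare capacity (10 requests per second) is small compared with the incoming load. The model ignores component reconnections, which would add further load.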
Incident Time frame:
- From 09:55PM CEST to 09:58PM CEST: The operations team detected slowness in the databases.
- From 09:58PM CEST to 10:00PM CEST: The slowness caused timeouts and some failures in users' queries to the database. A chain reaction increased the queuing of requests and disconnected certain components.
- From 10:00PM CEST to 01:34AM (day+1) CEST: All components reconnected and the queue gradually emptied until the service returned to normal.
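For readers who want to compute durations from this time frame, the following minimal sketch parses the report's 12-hour times, including the "(day+1)" marker, and derives the total incident duration. The helper name `parse_report_time` is hypothetical, and the year 2020 is an assumption made only so that a concrete date exists; the report itself does not state the year.

```python
from datetime import date, datetime, timedelta


def parse_report_time(text, base_date):
    """Parse a time such as '09:55PM CEST' or '01:34AM (day+1) CEST'.

    `base_date` is the day the incident started; '(day+1)' shifts the
    result to the following day. The time zone label is dropped because
    every time in the report uses the same zone.
    """
    next_day = "(day+1)" in text
    clock = text.replace("(day+1)", "").replace("CEST", "").strip()
    parsed = datetime.strptime(clock, "%I:%M%p")   # 12-hour clock with AM/PM
    moment = datetime.combine(base_date, parsed.time())
    return moment + timedelta(days=1) if next_day else moment


# Assumed year (not stated in the report): 2020.
start_day = date(2020, 8, 31)
start = parse_report_time("09:55PM CEST", start_day)
end = parse_report_time("01:34AM (day+1) CEST", start_day)
duration = end - start
print(f"total incident duration: {duration}")
```

Under these assumptions the overall incident window is 3 hours and 39 minutes.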
Incident impact:
Note that the impact is determined by the region of the Rainbow company, not by the region of the individual Rainbow user.
- From 09:55PM CEST to 10:00PM CEST:
  - Slowness in the application results in a degraded level of service.
- From 10:00PM CEST to 10:26PM CEST:
  - Users may experience Bubble issues when trying to create a group conversation or to start or join a Web Conference.
  - Slowness in the application results in a degraded level of service.
- From 10:26PM CEST to 10:36PM CEST:
  - Some users lose their Telephony Services.
  - Users may experience Bubble issues when trying to create a group conversation or to start or join a Web Conference.
  - Slowness in the application results in a degraded level of service.
- From 10:36PM CEST to 10:50PM CEST:
  - Users lose their mobile notifications and need to open the mobile app again to retrieve notifications.
  - Some users lose their Telephony Services.
  - Users may experience Bubble issues when trying to create a group conversation or to start or join a Web Conference.
  - Slowness in the application results in a degraded level of service.
- From 10:50PM CEST to 11:45PM CEST:
  - Some users lose their Telephony Services.
  - Slowness in the application results in a degraded level of service.
- From 11:45PM CEST to 12:02AM (day+1) CEST:
  - Some users lose their Telephony Services.
- From 01:04AM (day+1) CEST to 01:14AM (day+1) CEST:
  - Some users lose their Telephony Services.
- From 01:15AM (day+1) CEST to 01:25AM (day+1) CEST:
  - Slowness in the application results in a degraded level of service.
- From 10:26PM CEST to 12:02AM (day+1) CEST:
  - Some users lose their Telephony Services.
- From 01:04AM (day+1) CEST to 01:14AM (day+1) CEST:
  - Some users lose their Telephony Services.
Corrective Measures:
- An update of the databases will be performed in October to make the server more robust against this type of incident.
- In addition, new tests will be added to the validation environment to study and prevent this type of incident.