Rainbow Services experienced an outage from Friday, September 06, 2019 - 9:30 CEST to Friday, September 06, 2019 - 10:00 CEST. All services are back online.
On Friday 6th September at 09:18 CEST, the Rainbow solution experienced a global outage in the EMEA region. The main EMEA database server suffered an electrical power loss, which prevented the application layer from working properly. Core services resumed by 09:45 CEST (27 min), and our monitoring confirmed that all customers' PBX telephony extensions had fully reconnected by 10:00 CEST.
We are deeply sorry for the inconvenience. We started immediate actions to resume services two minutes after the failure, and we have also taken improvement measures to prevent such incidents from occurring again.
All users in EMEA suffered a complete service blackout for at least 5 minutes. An operational decision was then taken to restart some core services within the following minutes, which forced user disconnections and reconnections for the next 17 minutes, until all core services were back to normal.
- 2019-09-06 09:18 CEST (T+0min): the EMEA primary database server suffered an electrical power loss and rebooted. Thanks to the active master/slave database mechanisms running continuously, all data was kept synchronized on the secondary database servers, so no data was lost. However, the automatic active/passive failover did not work as expected because of a database inconsistency.
- 2019-09-06 09:20 CEST (T+2min): our monitoring system diagnosed the issue and alerted the 24/7 on-call engineers. By this time the whole Operations team was on deck to analyze the situation.
- 2019-09-06 09:23 CEST (T+5min): the primary database server had rebooted and was ready to serve requests again. An operational decision was taken not to proceed with a manual failover, since the master server was functional at that moment. Database consistency had to be checked first, and this took time.
- 2019-09-06 09:24 CEST (T+6min): although the service was restored automatically, the core application layer behaved inconsistently, so the full feature set could not be provided to all users. An operational decision was taken to restart the EMEA core services (a lengthy process), favoring stability and coherency over uptime.
- 2019-09-06 09:34 CEST (T+16min): all our Business Partners were kept continuously informed of the ongoing issue and of the live actions taken to fix it, throughout the incident.
- 2019-09-06 09:40 CEST (T+22min): the restart of the core application service was complete.
- 2019-09-06 09:45 CEST (T+27min): the restart of the various side services was complete; core IM and WebRTC call/conference services were fully restored for all users.
- 2019-09-06 09:53 CEST (T+35min): the first PBXs were back in service.
- 2019-09-06 10:00 CEST (T+41min): the last PBX reconnected and was back in service. PBX telephony services were fully restored for all customers.
- 2019-09-06 10:00 CEST (T+42min): all services were functional. End of outage.
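The T+ offsets in the timeline are minutes elapsed since the power loss at 09:18. As a minimal sketch, they can be recomputed from the timestamps as shown here; the milestone labels and the function name are illustrative and are not part of any Rainbow tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline (local time on 2019-09-06).
# The labels are shortened paraphrases chosen for this example.
OUTAGE_START = "09:18"
MILESTONES = {
    "monitoring alert": "09:20",
    "primary database rebooted": "09:23",
    "core services restart decided": "09:24",
    "core application restarted": "09:40",
    "IM and WebRTC restored": "09:45",
    "first PBXs back": "09:53",
    "end of outage": "10:00",
}


def minutes_since_start(hhmm: str, start: str = OUTAGE_START) -> int:
    """Return whole minutes between `start` and `hhmm` on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(hhmm, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    for label, hhmm in MILESTONES.items():
        print(f"T+{minutes_since_start(hhmm)}min  {hhmm}  {label}")
```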
Actions taken to prevent the same issue from happening again:
- Our infrastructure is already highly redundant, with multiple datacenters in each region and high availability with failover mechanisms. Nevertheless, we worked with our IaaS provider to improve hardware stability and avoid such situations by migrating to more reliable, redundant Power Supply Units -> Action initiated, to be completed by 21st September in all our datacenters worldwide.
- Our database failover management was configured to require a final human decision, although the rest of the recovery process is automated. This choice was made because we value data integrity over uptime, to ensure that no data loss or corruption can happen. We have taken immediate action to rework our database failover management, making it more automated and transparent to the application layer (and therefore to end users) while still guaranteeing data integrity -> Action ongoing, to be completed by end of Q3/2019.
- Our infrastructure is already designed to keep several data backups on isolated platforms in different locations, to ensure full data recovery in a disaster recovery scenario.
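The trade-off described in the failover item above (data integrity first, with a human sign-off before promoting a secondary) can be illustrated with a small decision function. This is only a sketch under stated assumptions: every name in it (`ClusterState`, `decide_failover`, the `Action` values) is hypothetical, and it does not represent Rainbow's actual failover implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP_PRIMARY = "keep primary"
    PROMOTE_SECONDARY = "promote secondary"
    WAIT_FOR_OPERATOR = "wait for operator approval"
    BLOCK = "block failover until consistency is checked"


@dataclass
class ClusterState:
    # Hypothetical health signals; a real system would derive these
    # from monitoring and from a database consistency check.
    primary_healthy: bool
    secondary_consistent: bool


def decide_failover(
    state: ClusterState,
    require_human_approval: bool,
    operator_approved: bool = False,
) -> Action:
    """Pick a failover action, never promoting an inconsistent secondary."""
    if state.primary_healthy:
        return Action.KEEP_PRIMARY
    if not state.secondary_consistent:
        # Integrity over uptime: do not promote data that may be wrong.
        return Action.BLOCK
    if require_human_approval and not operator_approved:
        return Action.WAIT_FOR_OPERATOR
    return Action.PROMOTE_SECONDARY
```

With `require_human_approval=True`, a healthy and consistent secondary is still not promoted until an operator approves. Setting it to `False` models a more automated policy that keeps the consistency check as a hard gate.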
We are sorry for this incident and its possible consequences. Rest assured that we are putting in place all the necessary measures for event detection, automation, and infrastructure resilience to ensure that such an issue does not happen again.
Friday, September 06, 2019 - 11:30 CEST
Rainbow Services have been fully restored since 10:00 CEST
We have monitored the Rainbow infrastructure and confirm that all services have been operational since 10:00 CEST.
Friday, September 06, 2019 - 10:45 CEST
We are monitoring the Rainbow infrastructure to ensure all is operational.
Friday, September 06, 2019 - 10:05 CEST
We are still working on these issues. Our entire team is focused on restoring services as soon as possible. The following services are still degraded:
Users are currently regaining their telephony features.
Friday, September 06, 2019 - 9:55 CEST
The reboot is completed. Users can now connect and use Rainbow services.
Friday, September 06, 2019 - 9:45 CEST
We are restarting some components in our EMEA datacenter to fix the outage. The reboot may take some time. We will keep you informed of the progress in this conversation.
Friday, September 06, 2019 - 9:30 CEST
Rainbow Services are currently unavailable for users connected to the EMEA datacenter.
Not sure which region applies to you?
Remember that the region of the Rainbow Company prevails, not the Rainbow user's region.