On November 9, 2017, at 7:15am CET+1, our cloud IaaS provider simultaneously experienced a major power supply failure and a loss of optical network connectivity, taking down 3 million websites and services, unfortunately including Rainbow.
Our SRE (Site Reliability Engineering) team was notified of the issue only 7 minutes later and immediately initiated our alert chain procedure. A notice was published on our public Help Center website at 08:00.
A dedicated communication was sent by email to our customer administrators at 11:01.
While this unprecedented outage was completely out of our control, we owe our customers our deepest apologies.
Rainbow services were restored in degraded mode by 4:45pm, and the complete infrastructure was restored to its original state by 11pm. The root cause was a fault at the data-center power supply entry point, which caused both redundancy points to shut down simultaneously.
In the immediate aftermath of power restoration, many core routers required reconfiguration. Some of the 40k servers, including Rainbow ones, had to be replaced because of power-damaged components, which led to hours of struggle.
We believed we had taken all necessary measures on the Rainbow operations side to ensure full-scale redundancy of all our services, but experience has proven us wrong. We need to be better prepared, and we will be.
While all application services are designed to be highly available, the data center itself was our single point of failure. Our roadmap already included adding geographical redundancy to our data centers in early 2018. This plan has been accelerated so that it is ready before the end of 2017.
As of 2018, our core infrastructure will be hosted on 3 geographically isolated sites in France, and the same principles will be applied when we open in new countries and continents.
In addition, communication to our Rainbow administrators will be improved to shorten the delay between an outage and the dedicated communication.