Rainbow services in the DE region experienced disruptions from Tuesday, 22nd August 10:08AM CEST to Wednesday, 23rd August 02:00PM CEST.
What Happened:
Incident Time frame
Tuesday, 22nd August:
- From 10:08AM CEST to 10:40AM CEST: German Rainbow users experienced problems such as disconnections, dropped conferences, loss of telephony services, and no audio during web calls. The Rainbow operations team was alerted immediately by infrastructure telemetry and started analyzing the incident.
- At 10:40AM CEST: Root cause identified: Maintenance planned by our IaaS provider had an unexpected impact on some components of the Rainbow infrastructure, freezing them each time a host went down for maintenance.
- From 10:40AM CEST to 12:05PM CEST: The incident was escalated to our IaaS provider. As the maintenance could not be interrupted, several measures were taken to reduce the impact. Redirecting traffic to the EMEA region, for example, helped to relieve pressure on the DE region while waiting for the maintenance to be completed.
- From 12:05PM CEST to 01:33PM CEST: The maintenance was completed, and certain components were restarted to ease the congestion caused by a large number of users reconnecting. More than half of all users could again use Rainbow without any noticeable disruption.
- From 01:33PM CEST to 03:21PM CEST: New congestion occurred following a large number of connections after the lunch break. Some users had trouble loading their conversations or saw inconsistencies in notifications. A script was run to restore these functions once the congestion had ended. The situation then returned to normal for all users.
- At 03:30PM CEST: The redirection of traffic to the EMEA region was stopped, and all traffic was again routed to the DE region.
- From 04:25PM CEST to 04:57PM CEST: Some components still had inconsistencies that had not been identified after the morning's incident. These inconsistencies caused slowness and disconnections for some users. The impact was much smaller than in the previous incident. A Rainbow maintenance operation was scheduled for the evening to restore full service. Details of the night's operations are available here.
- From 08:30PM CEST to 12:00AM CEST: The Rainbow operations team performed preventive maintenance to resolve any remaining inconsistencies for end users. As a consequence, German users were expected to be disconnected once or twice, possibly with slow reconnection for a few minutes. Mobile client notifications could be unavailable for 30 minutes. No impact was expected on established communications or conferences.
Wednesday, 23rd August:
- From 09:00AM CEST to 12:02PM CEST: German Rainbow users experienced further slowdowns and outages of telephony services (Hybrid only). The Rainbow operations team was immediately alerted by infrastructure telemetry and began to analyze the incident.
- At 12:02PM CEST: Root cause identified: One database had shown no inconsistencies during the previous day's maintenance and was therefore not processed during the operation. This database later proved to be faulty and caused the new incident.
- From 12:02PM CEST to 02:00PM CEST: The database was quickly restored, allowing a return to normal. Almost all users regained full access to their services by 12:28PM CEST.
Incident description:
Tuesday, 22nd August:
Our IaaS provider scheduled maintenance for the DE region on Tuesday, 22nd August. While this maintenance should have had no impact, it froze certain infrastructure components, causing disconnections and degradation of Rainbow services for German users.
Unable to stop the maintenance, the operations team took a series of measures to reduce the impact of the incident until the maintenance was completed. In particular, traffic was redirected to the EMEA region. This redirection greatly reduced the impact for companies that had authorized all recommended IP addresses on their firewalls. Once the maintenance was completed, services were restored gradually, slowed down by congestion due to the large number of connections.
Preventive maintenance to resolve any remaining inconsistencies for end users was carried out in the evening (more details here). During this maintenance, German users could be disconnected once or twice, with a possible slow reconnection for a few minutes.
Wednesday, 23rd August:
The previous day's preventive maintenance restored almost all the faulty components. One database, which showed no inconsistencies during the maintenance, was not processed during the operation. This database later proved to be faulty and caused Wednesday's incident.
Once the failure had been identified, the database was quickly restored by the operations team, putting an end to the incident.
Corrective Measures:
- Improve preventive maintenance notification and better evaluate the risks with our IaaS provider.
- Increase service robustness in the DE data center: plan and strategy in preparation.
- Implement special monitoring on dedicated components pending the execution of the robustness plan mentioned above for the DE data center.