On the morning of Monday, July 17th, Rainbow experienced several service impairments: some EMEA users found it difficult or impossible to communicate through bubbles, and some were unable to make or receive calls.
- 8:00 CEST: The Rainbow service started facing congestion on some EMEA clusters. As a result, some users experienced slowness when refreshing conversations or managing phone calls from Rainbow applications. Rainbow engineers started working to mitigate the impact.
- 8:40 CEST: As traffic rose to an unusually high peak, some API calls failed because of the timeout protection mechanism. Consequently, some users experienced inconsistencies in presence and bubble content, which made it impossible to answer or make calls.
- The Rainbow team worked on several mitigations to reduce the pressure on backend components.
- 11:40 CEST: The mitigation actions fully restored the service for all users. The incident was closed.
- The cause is still under analysis, to better understand all the converging events that led to the congestion. For now, we know that the combination of an unusually high peak of activity on some clusters and some disk failures on other clusters is the likely cause of this behavior. Such a combination is unlikely, and therefore rare and difficult to predict. It happened; we need to fix this.
Additional Information & actions:
- The restart of `conversations` services needs to be optimized to be faster and gentler, especially during peak hours.
- Infrastructure monitoring is being reinforced to anticipate such events, and database metrics are being enriched.
- The specific high-load pattern on `conversations` will be reproduced in the Rainbow lab.
- The Rainbow client will be made more resilient to API errors on `conversations`, so that such errors have limited impact on telephony services.
- Initially, the impact on telephony services was not described on status.openrainbow.com. The qualification of end-user impact on the status page will be improved.
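
To illustrate the client resilience action above, here is a minimal Python sketch of one common approach: retrying transient API failures with capped exponential backoff and jitter, then degrading gracefully instead of blocking other features. It is not the Rainbow client's actual implementation; `TransientApiError`, `call_with_backoff`, and all parameter values are hypothetical names and defaults chosen for illustration.

```python
import random
import time


class TransientApiError(Exception):
    """Hypothetical error type standing in for a timeout or 5xx response."""


def call_with_backoff(fetch, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fetch() and retry transient failures with capped exponential backoff.

    Returns (result, True) on success, or (None, False) once all attempts
    are exhausted, so the caller can degrade gracefully instead of blocking.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(), True
        except TransientApiError:
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            # Jitter: wait a random time in [0, min(max_delay, base_delay * 2^attempt)).
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
    return None, False
```

With this shape, a client that receives `(None, False)` for a `conversations` request could keep showing cached content and leave call controls usable, while the randomized delays avoid many clients retrying at the same instant against an already congested backend.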