Resumo

On February 9th notifications service started showing degradation around 13:50 UTC, resulting in an increase in notification delivery delays. Our team started investigating.

Around 14:30 UTC the service started to recover as the team continued investigating the incident. Around 15:20 UTC degradation resurfaced, with increasing delays in notification deliveries and small error rate (below 1%) on UI and API endpoints related to notifications.

At 16:30 UTC, we mitigated the incident by reducing contention through throttling workloads and performing a database failover. The median delay for notification deliveries was 80 minutes at this point and queues started emptying. Around 19:30 UTC the backlog of notifications was processed, bringing the service back to normal and declaring the incident closed.

The incident was caused by the notifications database showing degradation under intense load. Most notifications-related asynchronous workloads, including notifications deliveries, were stopped to try to reduce the pressure on the database. To ensure system stability, a database failover was executed. Following the failover, we applied a configuration change to improve the performance. The service started recovering after these changes.

We are reviewing the configuration of our databases to understand the performance drop and prevent similar issues from happening in the future. We are also investing in monitoring to detect and mitigate this class of incidents faster.

Fontes

© 2026 Está fora do ar?. Site não oficial.