Postmortem for incident on 16.04.2020
Yesterday, April 16th, between 19:29 and 19:59 UTC, we rejected part of our incoming traffic due to an overload situation following a data center failover.
At 18:45 UTC we saw hardware failures on multiple machines in one of our Frankfurt data centers. At first it seemed like a minor issue, as backup machines jumped in to replace the failed ones. As the errors started to cascade, we decided to move all traffic away from the affected data center.
At 19:15 UTC we started rerouting all traffic to our remaining locations in Amsterdam, Los Angeles, and Frankfurt. Due to the COVID-19 crisis, we are currently seeing traffic up to 25% higher than average, which put the remaining data centers under severe pressure.
At 19:29 UTC we began to see an overload failure cascading from one data center to another.
We reduced the incoming traffic by delaying event data from the SDK, which stabilized the load.
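The postmortem doesn't describe the mechanism behind delaying SDK event data, but a common pattern is for the ingest endpoint to shed load by answering with a retry signal once its backlog crosses a threshold, so SDKs hold events client-side instead of the server dropping them. A minimal sketch (class and parameter names are hypothetical, not our actual implementation):

```python
from collections import deque


class EventIngester:
    """Sketch of load shedding: once the backlog exceeds a threshold,
    ask clients (SDKs) to delay and retry instead of dropping data."""

    def __init__(self, max_backlog=1000, retry_after_s=30):
        self.max_backlog = max_backlog
        self.retry_after_s = retry_after_s
        self.backlog = deque()

    def ingest(self, event):
        if len(self.backlog) >= self.max_backlog:
            # Overloaded: tell the SDK to hold the event and retry later,
            # mirroring an HTTP 429 response with a Retry-After header.
            return {"status": 429, "retry_after": self.retry_after_s}
        self.backlog.append(event)
        return {"status": 202}
```

The key property is that no event is discarded: overload turns into added latency on the client side rather than data loss.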
At 19:59 UTC we had fixed all configuration issues and were tracking all incoming traffic again.
By 21:40 UTC we had restored access to the dashboard and the KPI services.
We have also been replaying the previously delayed SDK data and have allowed partners to replay their clicks and impressions to fix any missing attributions.
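For replay to fix attributions, deferred events need to carry their original timestamps rather than the time they are finally sent. A sketch of how an SDK-side buffer could do this (the names `SdkEventBuffer`, `track`, and `replay` are illustrative assumptions, not our SDK's API):

```python
import time


class SdkEventBuffer:
    """Sketch: buffer events the backend deferred and replay them later
    with their original timestamps so attribution stays correct."""

    def __init__(self, send):
        self.send = send   # callable posting one event to the backend
        self.pending = []  # events waiting for replay

    def track(self, event):
        # Stamp at creation time, not send time, so a delayed upload
        # still attributes the click/impression to the right moment.
        event.setdefault("created_at", time.time())
        resp = self.send(event)
        if resp["status"] == 429:
            self.pending.append(event)  # keep locally for replay
        return resp

    def replay(self):
        """Re-send deferred events once the backend recovers;
        returns how many are still pending afterwards."""
        still_pending = [e for e in self.pending
                         if self.send(e)["status"] != 202]
        self.pending = still_pending
        return len(self.pending)
```

Partners replaying their own clicks and impressions would follow the same principle: re-submission with the original event time, so late data slots back into the correct attribution window.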
At the moment we are working with all major partners to double-check that all data is consistent and to replay additional data where necessary.
To avoid such a chain of events in the future, we will increase our hardware reserves; delivery of new servers starts next week. Specifically, we will increase our SSL offloading capacity by 60%, and our backend server capacity by 25% in Europe and 87% in the US.
We sincerely apologize for any inconvenience this incident may have caused you. We are committed to providing you with the highest uptime possible and to continuously improving our processes and systems.