Postmortem for incident on 16.04.2020
Yesterday, April 16th, between 19:29 and 19:59 UTC, we rejected part of our incoming traffic due to an overload situation following a data center failover.
At 18:45 UTC we saw hardware failures on multiple machines in one of our Frankfurt data centers. At first it seemed like a minor issue, as backup machines jumped in to replace the failed ones. As the errors started to cascade, we decided to move all traffic away from the affected data center.
At 19:15 UTC we started rerouting all traffic to our remaining locations in Amsterdam, Los Angeles, and Frankfurt. Due to the COVID-19 crisis, we are currently seeing traffic up to 25% higher than average, which put the remaining data centers under severe pressure.
At 19:29 UTC we began to see an overload failure cascading from one data center to another.
We reduced the incoming traffic by delaying event data from the SDK, which stabilized the load.
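The postmortem doesn't describe the mechanism behind delaying SDK event data, but a common pattern is for the ingest endpoint to shed load by answering with a retry signal once its backlog crosses a threshold, so SDKs hold events client-side instead of the server dropping them. A minimal sketch (class and parameter names are hypothetical, not our actual implementation):

```python
from collections import deque


class EventIngester:
    """Sketch of load shedding: once the backlog exceeds a threshold,
    ask clients (SDKs) to delay and retry instead of dropping data."""

    def __init__(self, max_backlog=1000, retry_after_s=30):
        self.max_backlog = max_backlog
        self.retry_after_s = retry_after_s
        self.backlog = deque()

    def ingest(self, event):
        if len(self.backlog) >= self.max_backlog:
            # Overloaded: tell the SDK to hold the event and retry later,
            # mirroring an HTTP 429 response with a Retry-After header.
            return {"status": 429, "retry_after": self.retry_after_s}
        self.backlog.append(event)
        return {"status": 202}
```

The key property is that no event is discarded: overload turns into added latency on the client side rather than data loss.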
At 19:59 UTC we had fixed all configuration issues and were tracking all incoming traffic again.
By 21:40 UTC we had restored access to the dashboard and the KPI services.
We have also been replaying the previously delayed SDK data and have allowed partners to replay their clicks and impressions to fix any missing attributions.
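For replay to fix attributions, deferred events need to carry their original timestamps rather than the time they are finally sent. A sketch of how an SDK-side buffer could do this (the names `SdkEventBuffer`, `track`, and `replay` are illustrative assumptions, not our SDK's API):

```python
import time


class SdkEventBuffer:
    """Sketch: buffer events the backend deferred and replay them later
    with their original timestamps so attribution stays correct."""

    def __init__(self, send):
        self.send = send   # callable posting one event to the backend
        self.pending = []  # events waiting for replay

    def track(self, event):
        # Stamp at creation time, not send time, so a delayed upload
        # still attributes the click/impression to the right moment.
        event.setdefault("created_at", time.time())
        resp = self.send(event)
        if resp["status"] == 429:
            self.pending.append(event)  # keep locally for replay
        return resp

    def replay(self):
        """Re-send deferred events once the backend recovers;
        returns how many are still pending afterwards."""
        still_pending = [e for e in self.pending
                         if self.send(e)["status"] != 202]
        self.pending = still_pending
        return len(self.pending)
```

Partners replaying their own clicks and impressions would follow the same principle: re-submission with the original event time, so late data slots back into the correct attribution window.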
At the moment we are working with all major partners to double-check that all data is consistent and to replay additional data where necessary.
To avoid such a chain of events in the future, we will increase our hardware reserves; delivery of new servers starts next week. Specifically, we will increase our SSL offloading capacity by 60%, and our backend server capacity by 25% in Europe and 87% in the US.
We sincerely apologize for any inconvenience this incident may have caused you. We are committed to providing you with the highest uptime possible and to continuously improving our processes and systems.