During the last 24 hours, most Zendesk customers have experienced outages on two separate occasions.
Zendesk customer data is stored in so-called shards, which are essentially separate MySQL databases. One of these shards crashed hard yesterday, Thursday, at 10:42am PT (19:42 CET). We switched to the hot backup immediately, and the site began serving data five minutes later. By 11:10am, site performance had returned to normal levels. Fifteen minutes later, a configuration change made under heavy load caused the second database to shut down unexpectedly. The database restarted immediately. However, the affected sites recovered in read-only mode and remained up but not writable until 11:44am PT, when the database returned to normal.
So, despite having hot backups and procedures in place, most customers experienced a full hour of interruptions yesterday.
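To illustrate the general idea of shards with a hot backup, here is a minimal Python sketch. It is not Zendesk's actual implementation: the `Database` and `Shard` classes, the modulo routing rule in `shard_for`, and the `healthy` and `read_only` flags are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Database:
    """Stand-in for one MySQL server; the flags are simulated state."""
    name: str
    healthy: bool = True
    read_only: bool = False


@dataclass
class Shard:
    """One shard: a primary database plus its hot backup."""
    primary: Database
    hot_backup: Database

    def active(self) -> Database:
        # Serve from the primary while it is healthy, otherwise fail over.
        if self.primary.healthy:
            return self.primary
        if self.hot_backup.healthy:
            return self.hot_backup
        raise RuntimeError("no healthy database in this shard")


def shard_for(account_id: int, shards: Sequence[Shard]) -> Shard:
    # Hypothetical routing rule: pick a shard by account id modulo shard count.
    return shards[account_id % len(shards)]


def can_write(db: Database) -> bool:
    # A database that came back in read-only mode is up but not writable.
    return db.healthy and not db.read_only
```

In this sketch, marking a shard's primary as unhealthy makes `active()` return the hot backup, and setting `read_only` on that backup models a site that stays up but rejects writes.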
Immediately after the crash, we initiated root cause analysis and involved our external MySQL support providers. This morning, we identified the root cause of the crash, which turned out to be a rare data corruption bug in the database version we’re running.
Unfortunately, the same shard crashed again this morning at 5:08am PT (14:08 CET) before we could implement the identified patches. This time the hot backup took over without issues. However, since the incident happened early in the morning local time, there was a delay in our initial response, and most customers experienced service interruptions for up to 25 minutes.
Tonight, we will have a 10-minute maintenance window at 7pm PT (Saturday 4:00 CET) to upgrade our database software.
Patching the software and getting rid of the MySQL-related bug is the first step. We will also make sure that all stable patches are applied before incidents occur, rather than after the fact.
The next step is to improve our hot swap database capabilities; for this we are doubling our database capacity and adding more hardware in the first week of June.
Finally, we will improve our internal notification and escalation procedures so that we respond more swiftly when incidents happen, even in the wee hours of the morning.
On behalf of the Zendesk Ops team and all of Zendesk, I want to apologize to our customers for the interruption. Twice in 24 hours is just unacceptable, and we will do whatever it takes to ensure that it doesn’t happen again. We’ve experienced 100% uptime in the last two months, and during the last 12 months we’ve only seen one month drop below 99.9% uptime. But we realize that all of that doesn’t matter when something like this happens.
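To put the uptime figures in perspective, here is a rough arithmetic sketch in Python. It assumes a 30-day month and, as an upper bound, counts the whole 10:42am to 11:44am window (about 62 minutes) plus the 25 minutes this morning as downtime, even though the site was partly available during that time. The function names are made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_target: float, days: int = 30) -> float:
    """Minutes of downtime a month can absorb while still meeting the target."""
    return (1.0 - uptime_target) * days * MINUTES_PER_DAY


def uptime_percent(downtime_minutes: float, days: int = 30) -> float:
    """Uptime percentage for a month with the given minutes of downtime."""
    total = days * MINUTES_PER_DAY
    return 100.0 * (total - downtime_minutes) / total


# A 99.9% target over 30 days leaves roughly 43.2 minutes of downtime.
budget = downtime_budget_minutes(0.999)

# Upper-bound estimate for the two incidents: about 62 + 25 = 87 minutes.
incident_uptime = uptime_percent(62 + 25)
```

Under these assumptions, the two incidents together exceed the monthly allowance for 99.9% and put the month at roughly 99.8% uptime.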
We are disappointed and we will work harder to do better. Your trust is what matters to us.