Managing Explosive Growth at Zendesk Ops

You haven’t heard much from us at Zengineering over the past few months. But that’s not to say we haven’t been busy. In fact, we’ve been extremely busy dealing with the explosive growth of Zendesk and Zendesk’s customer base.

The scale of Zendesk hasn’t received much attention, and in the infrastructure and operations team we like it that way. Our goal is to keep Zendesk running smoothly and to let the business grow without any of our customers noticing. Don’t remember what Zendesk’s maintenance page looks like? Neither do we, and we much prefer it that way!

So let me give you some insight into the numbers behind the operation of Zendesk. All of these figures relate to a typical weekday inside the operational machine that is Zendesk:

# of new tickets: > 150,000
# of emails processed: > 500,000
# of website requests: > 10 million
# of database queries: > 100 million
(That’s an average load of > 1000 queries per second (QPS); our peak load is over 4000 QPS.)
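
As a quick sanity check on that average, the daily query count can be divided by the number of seconds in a day. The sketch below assumes, purely for illustration, that the load is spread evenly over 24 hours; the function name is made up for this example, and 100 million is the lower bound quoted above rather than a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_qps(queries_per_day: int) -> float:
    """Average queries per second, assuming an evenly spread daily load."""
    return queries_per_day / SECONDS_PER_DAY


# 100 million is the quoted lower bound, so the result is a floor, not a measurement.
print(round(average_qps(100_000_000)))  # 1157
```

Real traffic is concentrated in business hours, which is why the peak figure is several times the average.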

Average page load time: < 200 ms (that’s 1/5 of a second)

How do we achieve these numbers while delivering high performance and enterprise-class reliability?

Here are some of our secrets:

  • Fully redundant and distributed environment. We have multiple servers handling each type of request. Multiple machines handle incoming and outgoing email, multiple machines handle web requests, and multiple machines handle background data processing (reports, etc.). We’re proud to be in a situation where every device is part of a cluster and we can transparently balance the load between them at all times.
  • High-performance hardware. We’ve partnered with FusionIO to provide us with ultra-high-performance flash memory storage for our Plus+ plan customers’ database data. This is truly bleeding-edge performance (we’ve measured 50,000+ IOPS on our systems, which is better than most large SANs).
  • Regular backups and hot and warm spares. To make sure data integrity is never compromised, we back up all databases both nightly and in real time. Some of the spares are on flash storage and others are on traditional RAID arrays. We restore from these backups regularly and use the spares for internal purposes to make sure they are always up to date and fully functional.

And here are some of the things we’re working on:

  • A sharded database environment. This will allow us to balance different customers across different database servers so as to distribute the load evenly. We’ll be bringing up additional databases and connecting them to our infrastructure in the next month. We’ll cover the details of how we sharded in a separate post.
  • Cloud data storage. To ensure high performance and protect against data loss, we are starting to use Rackspace’s CloudFiles solution to store ticket attachments and other large files entrusted to us by our customer and user base.
  • Increasing networking scalability. We’re upgrading our load balancers to F5 BigIPs so we can keep pace with the increasing traffic to our servers.
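
To make the sharding idea above more concrete, here is a minimal sketch of one common approach: a directory that maps each customer account to a database server and places new accounts on the least-loaded one. This is an illustration only, not a description of how Zendesk shards. The class, method, account, and server names are all hypothetical, and "load" here is simply the number of accounts on a shard, which is a crude stand-in for real query volume.

```python
from collections import Counter


class ShardDirectory:
    """Hypothetical directory mapping customer accounts to database shards."""

    def __init__(self, shards):
        if not shards:
            raise ValueError("at least one shard is required")
        self._shards = list(shards)
        self._assignment = {}  # account_id -> shard name
        # Assumption: load is approximated by the number of accounts per shard.
        self._load = Counter({shard: 0 for shard in self._shards})

    def shard_for(self, account_id):
        """Return the account's shard, assigning the least-loaded one on first use."""
        if account_id not in self._assignment:
            shard = min(self._shards, key=lambda s: self._load[s])
            self._assignment[account_id] = shard
            self._load[shard] += 1
        return self._assignment[account_id]

    def add_shard(self, shard):
        """Bring a new database online; existing accounts stay where they are."""
        self._shards.append(shard)
        self._load[shard] = 0


directory = ShardDirectory(["db1", "db2"])
for account in ["acme", "globex", "initech", "umbrella"]:
    directory.shard_for(account)
```

Keeping the mapping in a directory, rather than hashing the account ID, means an additional database can be connected later without moving existing customers; only new accounts are routed to it until the load evens out.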

Zendesk has grown into a world-class SaaS organization, offering the largest customer support platform on the web, where our customers serve more than 20 million end users worldwide. And as we continue to grow, we will keep tackling our complex and multi-faceted growth issues, all the while remaining nimble and careful. And no matter how large we grow, we will continue to give you the personalized and flexible service upon which we’ve built our reputation.

  • Alexander

    Amazing!

  • http://twitter.com/lucasjans Lucas Jans

    You guys rock. Quite impressive.

  • http://twitter.com/travelton Travis Swientek

    Agreed! I’m always impressed by the performance and reliability of Zendesk. Speed is never an issue…