Last March, I wrote about how fortunate we are in operations (ops) to go unnoticed because of our consistently high uptime. Since then, we’ve continued to fly under the radar by maintaining more than 99.9% uptime, even celebrating four 100% months (with copious amounts of whiskey, mind you) while growing considerably. So what’s our secret sauce for maintaining stability? Planning aggressively.
Each month, our engineering team looks at our customer growth and sales pipeline and forecasts our usage for the following two to three months. During this process, we brainstorm the what-ifs and plan for any that we deem foreseeable beyond our forecasted needs. We strive to launch efforts months before they’re needed.
In my last post, I discussed a few of these preemptive initiatives that we were actively working on, and I’m proud to say that we accomplished the following:
Sharded databases – As our customer base and data volume grow, using one large database would cripple Zendesk through stability issues and slow load times. To avoid this, we moved to a sharded environment that spreads our customers across multiple smaller databases. We then keep backups of each of these, and backups of the backups (redundancy FTW!) to help make sure things run smoothly and can survive the worst.
Quadrupling the number of servers – There’s also a limit to how many requests a server can process, so we decided to quadruple our number of servers to keep things flowing smoothly as we grow.
Cloud data storage – We moved our ticket attachment storage to Rackspace’s CloudFiles and Amazon’s S3 storage solutions as a way to help protect them and reduce our storage load.
Increasing network scalability – To keep pace with our ever-increasing traffic, we upgraded our load balancers to F5 BigIPs that keep our requests and data routed quickly and reliably.
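To make the sharding idea concrete, here is a minimal sketch of how a customer account could be mapped to one of several smaller databases. It is an illustration only, not our production code: the shard names, the `shard_for` helper, and the choice of a stable hash with a modulo are all assumptions made for the example.

```python
import hashlib

# Hypothetical shard names; the real number and naming of shards is not stated here.
SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]


def shard_for(account_id: str, shards=SHARDS) -> str:
    """Return the shard assumed to hold all data for one customer account."""
    # Use a stable hash (not Python's built-in hash(), which varies per process)
    # so the same account always maps to the same shard.
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for account in ("acme", "globex", "initech"):
        print(account, "->", shard_for(account))
```

A plain modulo reshuffles most accounts whenever the shard count changes, so a lookup table that records each account's shard explicitly is a common alternative when customers need to be moved between databases.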
With our eyes fixed on the horizon like sea captains of yore, we’re currently planning a few major moves for 2012, including the following:
Adding a second datacenter – In a long-awaited move, we’re building out a new datacenter. We’ll use it to further increase our capacity and serve more customers faster. Not only will our capacity increase, but we will also make sure that all customer data exists in multiple physical locations. For those of you familiar with the acronyms, this will form the backbone of our business continuity planning (BCP) and disaster recovery (DR) strategies. Because this is a big move, we’ve been doing it slowly and carefully, and we expect to have it up and running in the first half of 2012.
Partitioning our application – A new datacenter is no good unless our application is prepared to run on more than one server cluster at once. Thus, we’ve put in a huge amount of effort to break up the application into multiple pieces and to allow each piece to run in multiple places. This effort will launch in tandem with the datacenter, and while it’ll be transparent to you, it’ll be the culmination of months of work by us!
There are also a couple of things we’re constantly improving:
Improving system monitoring – Every 5 minutes we run more than 1000 checks across our service! Next year we’ll probably add at least another 2000 checks. By having our systems find out when something is broken, we make sure our customers don’t have to!
Reducing our downtime – We hate downtime, and we’re constantly refining our failover processes to reduce the time it takes to recover from hardware failure. In 2012, we’re aiming to have automated systems promote new hardware as soon as a failure is detected.
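The two items above fit together: checks notice that something is broken, and failover replaces it. The following is a rough sketch of that loop, under stated assumptions. The `Pool` class, `run_checks`, and `promote_on_failure` are hypothetical names invented for this example, and the health checks are plain callables standing in for whatever a real monitoring system would run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Pool:
    """A hypothetical group of hosts: one active, the rest on standby."""
    active: str
    standbys: List[str] = field(default_factory=list)


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check once and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that crashes is treated as a failure, not ignored.
            ok = False
        if not ok:
            failed.append(name)
    return failed


def promote_on_failure(pool: Pool, is_healthy: Callable[[str], bool]) -> Optional[str]:
    """If the active host is unhealthy, promote the first healthy standby.

    Returns the promoted host, or None if no promotion took place.
    """
    if is_healthy(pool.active):
        return None
    for host in list(pool.standbys):
        if is_healthy(host):
            pool.standbys.remove(host)
            pool.active = host  # the failed host simply leaves the pool
            return host
    return None  # nothing healthy to promote; a human needs to step in
```

In a scheduled setup, something like `run_checks` would run on a fixed interval (every 5 minutes in our case) and `promote_on_failure` would be called for any pool whose check failed. The sketch leaves out alerting and data synchronization, which a real failover would need to handle first.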
As you can see, we go to great lengths to plan for the future and ensure consistency. This is just a small glimpse of everything our operations team is currently working on, because they like to keep their reputation as the unsung heroes of Zendesk. This is the first of a three-part series, with more to come on our current infrastructure setup, how we automate our processes, and just how much data we’re processing each and every day.