Upgrade: The road to 1.9

Zendesk is now 2-3 times faster thanks to our recent upgrade to Ruby 1.9.3!

https://rpm.newrelic.com/public/charts/pW2SBFRzui

As our application and traffic continue to grow, we here in Zendesk Engineering are continually identifying and removing performance bottlenecks.

Much to our dismay, we have reached a point where there are very few easy optimizations left within our codebase – the “win 90% from 10%” rule can be tricky with Rails apps due to their tendency towards many, many small methods.

Much of our performance cost was due to Ruby interpreter overhead, and plenty more was garbage collection – our “main app” is over 80,000 LOC, and contains 170 libraries. Even on REE, Ruby 1.8 made us pay a pretty heavy penalty for having a codebase of this size.

A line in the sand

Every once in a while, one of us in engineering would get ambitious and take a stab at getting some part of our test suite running under 1.9. But we’re quite busy, you know, actually building stuff in engineering, and the attempts were spaced out over a few weeks. Which is enough time to allow the rest of the team to add shiny, new, incompatible code. The Sisyphean nature of this would cause us to wander away and get coffee.

When we officially launched the 1.9 upgrade project, we needed a way to prevent our team from introducing new 1.9 incompatibilities while we worked on the old ones.

Here’s what we came up with. We started with this simple patch in our test suite

and followed this process:

  1. Run the test suite under 1.9, collect failures, and add mark_19_incompat at the head of any test file that doesn’t pass under 1.9.
  2. Get a CI build passing under 1.9. Many parts of the codebase are still broken, but at least we know the entry points that cause failures, and can require the engineering team to keep both the “legacy” (REE) and the 1.9 build green.
  3. The less glamorous but useful work begins: Start fixing and removing mark_19_incompat from tests.
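
The patch itself is not reproduced here, so what follows is only a minimal sketch of how a helper like mark_19_incompat could work, not our actual code. It assumes the helper raises a marker exception under 1.9 and that the test runner loads files through a wrapper that rescues it. The names Incompatible19, ruby_19? and load_test_file are hypothetical.

```ruby
# Hypothetical sketch; the names below are illustrative assumptions.

# Raised to abandon a test file that is not yet 1.9-clean.
class Incompatible19 < StandardError; end

# True when the given version string is 1.9 or newer.
def ruby_19?(version = RUBY_VERSION)
  major_minor = version.split(".").first(2).map { |part| part.to_i }
  (major_minor <=> [1, 9]) >= 0
end

# Call at the head of a test file. A no-op on 1.8; on 1.9 it raises,
# so the rest of the file is never evaluated.
def mark_19_incompat(version = RUBY_VERSION)
  raise Incompatible19, "not yet passing under 1.9" if ruby_19?(version)
end

# Assumed runner hook: load a test file, reporting whether it was
# actually loaded (true) or skipped as 1.9-incompatible (false).
def load_test_file(path)
  load path
  true
rescue Incompatible19
  false
end
```

With something like this in place, removing the mark_19_incompat line from a file is all it takes to put that file back under the 1.9 build.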

Fixes and common problems

First off, we followed a bunch of the excellent advice from others who have been through the same ordeal: the Harvest 1.9.3 upgrade was very handy, and the mysql2 upgrade was absolutely crucial. We ended up sticking with syck as our YAML parser in order to remove YAML serialization from our critical path.

UTF-8 Encoding issues

Character encoding will definitely be the big line item when attempting a 1.9 upgrade. This class of errors ranged from simple one-liners, i.e. slapping # encoding: utf-8 at the top of the file, to more in-depth issues with Marshal.load and YAML-parsing – we’ll get to those later. Luckily, being of Danish origin, we had a decent amount of UTF-8 in our test suite already – we can’t recommend enough that you fill out your test fixtures with plenty of funky looking characters.

Array#to_s

This one bit us more times than we’d like to admit. There were quite a few places in our codebase where we called .to_s on a single-element array and expected the element back as a string. Under 1.8 that works, because Array#to_s joins the elements; under 1.9 it returns the inspect form, brackets and quotes included. These bugs often manifested themselves mysteriously. In retrospect, a better approach may have been to raise exceptions when calling incompatible methods.
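
A small illustration of the difference, with a made-up tags array standing in for our real data. Calling join (or picking the element explicitly) behaves the same on both interpreters:

```ruby
tags = ["urgent"]

# Ruby 1.8: Array#to_s behaved like Array#join and returned "urgent".
# Ruby 1.9 and later: Array#to_s is the same as Array#inspect.
as_inspect = tags.to_s   # "[\"urgent\"]" on 1.9+

# Version-independent ways to get the element back as a string.
joined = tags.join       # "urgent"
first  = tags.first.to_s # "urgent"

puts as_inspect
puts joined
puts first
```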

Marshal compatibility

We ran some tests and found that Ruby 1.8 and 1.9 were bi-directionally compatible with regards to Marshal.dump/load, after adding this little monkey patch (utf-8, again!).
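
The monkey patch is not reproduced here. As a rough sketch of the kind of fix involved, and not the patch we shipped, the helper below assumes the mismatch is that strings dumped by 1.8 carry no encoding information and therefore come back as ASCII-8BIT under 1.9. The names tag_utf8 and marshal_load_utf8 are hypothetical.

```ruby
# Hypothetical helper: retag binary strings as UTF-8 when their bytes
# are valid UTF-8, walking arrays and hashes recursively.
def tag_utf8(obj)
  case obj
  when String
    return obj unless obj.encoding == Encoding::ASCII_8BIT
    candidate = obj.dup.force_encoding(Encoding::UTF_8)
    # Keep genuinely binary data untouched.
    candidate.valid_encoding? ? candidate : obj
  when Array
    obj.map { |item| tag_utf8(item) }
  when Hash
    obj.each_with_object({}) do |(key, value), out|
      out[tag_utf8(key)] = tag_utf8(value)
    end
  else
    obj
  end
end

# Load marshaled data and normalize string encodings afterwards.
def marshal_load_utf8(data)
  tag_utf8(Marshal.load(data))
end
```

A wrapper like this only touches plain strings, arrays and hashes; strings held inside other objects would need their own handling.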

Deep, deep YAML oddness

In 99% of cases, the YAML serialized to the database was easily readable and writable from both 1.8 and 1.9. We came across one very deep nook, though, where certain strings came back from YAML.parse with binary (known as ASCII-8BIT in Ruby-land) encoding. Head scratching ensued. Eventually, we found this odd little quirk in Ruby 1.8:

This caused certain UTF-8 heavy strings to be encoded as “BINARY” types in YAML. These types were then assigned incorrect encodings when read back in 1.9. The fix was simple enough, if obscure:

Forgive bad browsers

We were able to reduce encoding error noise quite a bit by falling back to encoding in Latin 1 (ISO-8859-1) when encountering invalid UTF-8 on the front-end. Since the majority of our problems are from users on older clients who are only attempting to browse pages, we have seen very few problems with this simple solution.
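
A minimal sketch of that fallback, assuming the raw request value arrives as a byte string. The method name forgiving_utf8 is made up for illustration and is not the code we run in production:

```ruby
# Hypothetical helper: interpret input as UTF-8 when it is valid,
# otherwise treat the bytes as Latin-1 and transcode them to UTF-8.
def forgiving_utf8(raw)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Every byte sequence is valid ISO-8859-1, so this cannot fail on
  # invalid input; the result may still be mojibake for other charsets.
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

puts forgiving_utf8("caf\xC3\xA9".b) # valid UTF-8 bytes pass through
puts forgiving_utf8("caf\xE9".b)     # Latin-1 bytes are transcoded
```

The trade-off is that input in some third encoding is accepted and silently misread rather than rejected, which suited us since most of the affected requests were read-only page views.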

Productionizing the thing

Our test suite was passing. Our Soviet-bloc QA team said “da”. We still needed to battle-test the thing. There were a few routes we could have taken here. A smaller start-up might have simply thrown a Hail Mary at the upgrade, cutting all the servers over to 1.9 and scrambling to fix the errors. A larger company could have probably dedicated a cluster of servers and mirrored traffic to it, collecting and fixing the errors it surfaced. We found ourselves in a somewhat unfortunate middle ground, so we upgraded a single app server and fed it traffic for 15 minutes, collecting all the errors returned.

Our rollout phase lasted roughly a week. Within a few days we were running half of our unicorns on 1.9. This allowed us to keep on top of the list of problems, and users were able to successfully retry requests that failed due to 1.9 issues.

We were surprised by the raw efficiency of 1.9 – our servers, running 1.8, ran pretty hot through the course of the day, with big spikes (likely due to garbage collection). The upgrade eased off our load troubles and gave us some scale-room on the frontend.

What’s next?

We’ve been in the process of splitting our main application into smaller components. We’re also working on upgrading to Rails 3, and hope this will allow us to accelerate the process. Additionally, we’ll be keeping an eye on projects like JRuby and Sidekiq for multi-threaded processing.

  • http://www.facebook.com/marco.jansen.754 Marco Jansen

    Nice write up! 

    Did you do any GC tuning when using REE in the old situation? When we started to use REE we had similarly long average GC times. Once we started to tweak RUBY_HEAP_MIN_SLOTS / RUBY_HEAP_SLOTS_INCREMENT / RUBY_GC_MALLOC_LIMIT / RUBY_HEAP_SLOTS_GROWTH_FACTOR, we were able to reduce the GC time to a level similar to what you see on the right side of your graph.

  • Philippe Le Rohellec

    Very nice article. What about the memory usage of your rails processes when transitioning from REE to ruby 1.9.3? Did it go up? I’m worried about the lack of copy on write in 1.9.3.

  • Ben Osheroff

    We ended up using more or less the same amount of memory under ree/1.9.  In my experience REE still has a lot of mutation on heap pages (iterating arrays, etc) that eventually causes one to lose a lot of copy-on-write friendly benefits.

    One artifact of ruby 1.9 vs. a nicely tuned REE install is that 1.9 appears difficult to tune properly to not hold on to memory — a particularly memory intensive request can cause one of our unicorn servers to bloat to >1GB, and it’s hard to convince 1.9 that it doesn’t need to hold onto that. 

    We’re looking into porting some of REE’s tuning parameters into 1.9, as they contain more-or-less the same garbage collector.

  • Ben Osheroff

    Hi Marco,

    Yeah, we were. FWIW, here’s the list of variables we were using before the upgrade:

    export RUBY_GC_MALLOC_LIMIT=60000000
    export RUBY_HEAP_SLOTS_GROWTH_FACTOR=1
    export RUBY_HEAP_MIN_SLOTS=500000
    export RUBY_HEAP_SLOTS_INCREMENT=1

    Of these, RUBY_GC_MALLOC_LIMIT and RUBY_HEAP_MIN_SLOTS still apply. The GROWTH_FACTOR variable is one of the nicer ones that we’re thinking of porting; it can give you lots of smaller heaps instead of a few giant ones.

  • Anonymous

    Have you tried out Unicorn’s OOBGC?  I was able to eliminate a large chunk of GC that was slowing pages down as well as slow the growth of our application’s memory footprint as it runs.

    I didn’t realize that RUBY_HEAP_SLOTS_GROWTH_FACTOR isn’t part of 1.9.  This could be more of a problem than I thought.