TLB: Rocket boosters for your build

Most software developers and testers would agree that keeping code well covered with automated tests (programmed tests that run without manual intervention) is an essential ingredient in driving a software project to success. Such a test suite prevents regressions, helps flesh out the design, allows collective ownership of complex and logic-heavy code, acts as functional, executable documentation for new project members, makes refactoring and clean-up easy, allows frequent releases, and the list could go on and on. However, the number of tests a project has is usually directly proportional to its features, complexity and maturity, and those numbers only ever go up. As tests increase, so does the time it takes to execute the suite. This is especially true for integration and functional tests, because these typically take longer to execute than unit tests. At some point, teams decide not to run these tests for every commit because running the whole suite takes far too long, and this is about the time in the life of most projects when things start rolling downhill. I haven’t set out to enumerate the bad things that can happen to your project once it starts slipping down that slope, so let’s step back and consider a better alternative.

So you have a big test suite that takes over 10 minutes to execute, and you wonder if there is anything you can do to make it faster? Here is something we have been working on for over a year now, and we call it TLB. TLB is an acronym for Test Load Balancer, but we like to use TLB for convenience and carbon footprint reasons. It is an open-source, BSD-licensed tool that splits the load of test execution into several chunks (which we call partitions), allowing your tests to execute in parallel across several physical machines or VMs, with each one executing only one small chunk (one partition).

In short, here is what the tool can do for you:
Say you have a project with 1000 test suites that takes about 10 minutes to execute. With a few lines’ worth of changes in your build script and 4 more computers deployed for the job, TLB can help you bring the total time down to 2 minutes. We call this time-balancing: it partitions your tests in such a way that all partitions carry an equal amount of load in terms of time. This means all partitions take almost the same time to complete, and given that each of the 5 partitions (the original machine plus the 4 new ones) runs only 1/5 of the whole, it takes only 1/5 of the time: 10 minutes / 5, which brings us to the 2-minute number. Similarly, a test suite that takes 5 hours can be cut down to just 20 minutes with 15 machines to parallelize it.
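
To make the idea concrete, here is a minimal Python sketch of one way time-balancing can work: a greedy split that hands each suite, longest first, to whichever partition currently has the least total time. This is an illustration only and not TLB's actual algorithm; the function names and the suite durations are made up for the example.

```python
import heapq


def time_balance(durations, partitions):
    """Greedy longest-first split (illustrative, not TLB's implementation).

    durations: dict mapping suite name -> recorded run time (any numeric unit).
    Returns a list of `partitions` lists of suite names.
    """
    # Min-heap of (accumulated_time, partition_index); the least-loaded
    # partition is always at the top.
    heap = [(0, i) for i in range(partitions)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(partitions)]
    # Longest suites first; ties broken by name so the split is deterministic.
    for name, cost in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + cost, idx))
    return buckets


def partition_times(durations, buckets):
    """Total recorded time of each partition."""
    return [sum(durations[name] for name in bucket) for bucket in buckets]


# Hypothetical data: 1000 suites with run times between 300 and 900 ms.
demo = {f"Suite{i:04d}": 300 + (i % 7) * 100 for i in range(1000)}
print(partition_times(demo, time_balance(demo, 5)))
```

With a split like this, the partition totals never differ by more than the duration of the longest single suite, which is why the wall-clock time gets close to the total time divided by the number of partitions.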

Alright, that’s exciting, so how do I use this thing, you ask? Let’s first understand what TLB does not do. TLB does not launch multiple processes across machines, it does not invoke your build task, and it does not manage the machines you run your build on. There are very good tools around that can help you do all of that. You use the tool of your choice to invoke the build task, e.g. ‘ant test’, ‘rake spec’ or ‘buildr test’. For instance, launching processes across different physical machines at the same time can be offloaded to a CI server (like Hudson or Go) or a command-line driven tool like Capistrano or Cluster-SSH. Once the build task (an ant or rake invocation, for example) is triggered on multiple machines, each of these processes executes only a part of the whole set of tests. By design, TLB is non-intrusive: all that changes is a few lines in the project’s build script.

The only other piece you need, in addition to invoking your build task, is a daemon we call TLB Server. This is a repository that stores data captured while running tests (for instance running time, result, etc.). Some algorithms (e.g. time-based partitioning or the failed-first orderer) depend on such data. TLB can work against either an instance of TLB Server or a Go Server. TLB Server is part of the TLB project and is available as a download (check the tlb-server or tlb-complete archive). Go is a ThoughtWorks Studios product, a continuous integration and release management application. You can use the utility script bundled in the server archive to manage the TLB Server process. Please check the documentation page (this is a release-specific link that points to 0.3) for details on both configurations.
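
As a rough illustration of why an orderer needs stored results, here is a small Python sketch of a failed-first ordering. It assumes the previous results are available as a simple mapping from test name to pass/fail; that data shape and the function name are hypothetical and say nothing about how TLB Server actually stores or serves its data.

```python
def failed_first(tests, last_results):
    """Reorder tests so that those that failed last time run first.

    tests: list of test names in their default order.
    last_results: dict mapping test name -> True (passed) / False (failed);
                  this shape is an assumption made for the sketch.
    Tests with no recorded result are treated as passed. The sort is
    stable, so the relative order inside each group is preserved.
    """
    return sorted(tests, key=lambda name: last_results.get(name, True))


# Hypothetical previous run: FooTest and BazTest failed.
previous = {"BarTest": True, "BazTest": False, "FooTest": False}
print(failed_first(["BarTest", "BazTest", "FooTest", "NewTest"], previous))
```

Without a record of the previous run there is nothing to sort by, which is the kind of dependency on stored data described above.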

Multiple partitions of a build obviously need to use the same TLB Server instance, which means TLB Server must be reachable over the network from all machines that are to execute partitions of a project’s test suite. One TLB Server instance can be shared by several projects or by multiple builds of the same project, so all projects and builds can share a single organization-wide or office-wide TLB Server. TLB Server binds to port 7019 (unless overridden), so you’d want to unblock that port for inbound traffic on any firewalls/filters on the server machine (and, similarly, for outbound traffic on the machines that are going to execute the partitions). I emphasize this here because firewalls can sometimes be really nasty and confusing.
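
If you suspect a firewall is getting in the way, a quick connectivity check from each build machine can save some head-scratching. The Python sketch below only tests whether a TCP connection to the given host and port can be opened; it is a generic check, not part of TLB, and the function name is made up for the example.

```python
import socket


def can_reach(host, port=7019, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    7019 is the default TLB Server port mentioned above; pass a different
    port if you have overridden it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

You would call it with the host name of your own TLB Server machine, for example `can_reach("tlb-server.example.com")`, where the host name is a placeholder. A `False` result from a build machine while the server is running usually points at a blocked port or a wrong host name.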

The Java support library in the latest release of TLB (version 0.3) supports the popular Java build tools Apache Ant and Apache Buildr. The Ruby library bundles support for Rake. On the testing framework side, TLB supports the widely used Java testing tools JUnit and Twist, and the Ruby library bundles support for the two most popular Ruby testing frameworks, Test::Unit and RSpec (both 1.x and 2.x). The next TLB release (0.4) will include support for Maven (a Java build tool) and Cucumber (a Ruby testing tool), among other features.

You can download the latest TLB distribution from http://code.google.com/p/tlb/downloads/list. Ruby support is available as RubyGems (namely tlb-testunit, tlb-rspec2 and tlb-rspec1). The archive named setup-examples bundles tiny projects we use for demonstration purposes. Each project is a unique combination of language, testing framework and build framework, and has a shell script named run_balanced.sh that can be executed to have TLB partition the test suite and execute the partitions serially, one after another. The shell script, in short, just starts the TLB Server and executes the test task with the appropriate environment variables set. In a real-world situation, each of these partitions would be executed on a different machine or process tree, in parallel. To run the ant-junit example, for instance, you’d want to:

$ wget http://tlb.googlecode.com/files/setup-examples-g0.3.0.tar.gz
$ tar -zxvf setup-examples-g0.3.0.tar.gz
$ cd setup-examples-g0.3.0/examples/ant_junit
$ ./run_balanced.sh #count balances due to lack of data
$ ./run_balanced.sh #time balances (has data from the previous run)

The last two steps should each make two partitions and execute them one after another. While executing tests, the script also prints messages to help the user understand major life-cycle events.
To try out the rspec2 integration, you’d want to do something similar to:

$ gem install tlb-rspec2
$ cd setup-examples-g0.3.0/examples/rspec2_example #assuming you already unarchived
$ ./run_balanced.sh #will count balance
$ ./run_balanced.sh #will time balance

Note: When the very first invocation is made, TLB does not yet have the data it needs to partition tests accurately (i.e. to use the time-balancing algorithm), so it falls back to the count-balancing algorithm. You need to run the script twice to actually see it time-balance. Expect this behavior on your actual project as well: the very first time you execute your build against a new TLB Server, balancing will not be as accurate. From the second time onwards it will do a much better job.
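
The difference between the two runs can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not TLB's code: the function names are hypothetical, and the rule used here (fall back to counting whenever any test lacks a recorded time) is an assumption made for the example.

```python
def split_by_count(tests, partitions):
    """Count balancing: partition sizes differ by at most one test."""
    ordered = sorted(tests)
    return [ordered[i::partitions] for i in range(partitions)]


def split_by_time(tests, durations, partitions):
    """Time balancing: longest test first, into the least-loaded partition."""
    buckets = [[] for _ in range(partitions)]
    loads = [0] * partitions
    for name in sorted(tests, key=lambda n: (-durations[n], n)):
        idx = loads.index(min(loads))  # first least-loaded partition
        buckets[idx].append(name)
        loads[idx] += durations[name]
    return buckets


def balance(tests, durations, partitions):
    """Return (strategy, buckets); use recorded times only if all are known."""
    if not all(name in durations for name in tests):
        return "count", split_by_count(tests, partitions)
    return "time", split_by_time(tests, durations, partitions)


tests = ["A", "B", "C", "D"]
# First run: no recorded data yet, so tests are split by count.
print(balance(tests, {}, 2))
# Second run: hypothetical recorded times (seconds) are now available.
print(balance(tests, {"A": 9, "B": 1, "C": 1, "D": 1}, 2))
```

In the first call both partitions get two tests each regardless of how slow test A is; in the second, A gets a partition to itself because its recorded time outweighs the other three combined.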

The distribution archive named tlb-complete bundles all TLB artifacts: the Java support library, TLB Server, the alien support library (of interest only to developers adding support for new non-JVM languages), setup-examples, etc. The other archives bundle subsets of tlb-complete. Please check the README file on the downloads page for details.

The Quick Start page on the TLB website is usually a good place to get started in just a few minutes. However, we highly recommend going through the Concepts and Configuration pages as well, which explain details that will allow you to tweak your TLB configuration to suit your project, tool set and environment and to zero out any potential impedance mismatch. The Concepts documentation is also the first step towards enhancing TLB. For instance, if you come up with an algorithm that suits your project better, you can implement it in Java and have TLB use that instead of the canned ones. The detailed documentation (this is a release-specific link that points to 0.3) covers all the knobs you can tweak and explains the effects and implications of the choices you make.

The setup-examples also come in handy while setting up TLB on a new project. We recommend borrowing build script snippets from the corresponding dummy projects (in the setup-examples or tlb-complete archive) while wiring TLB up for your project(s). Once you are done configuring your project to use TLB (you have imported the libraries and tweaked the build script), you’d want to refer to the Configuration section (release-specific link, points to 0.3), which enumerates and documents the configuration parameters.

Here is the slide deck we use to drive TLB talks. It gets into a little bit of TLB internals and should help you understand TLB better. This post is only meant to be a quick introduction to TLB and is not exhaustive; please refer to the TLB website for exhaustive documentation and other details. Should you have any questions, feature requests, feedback or suggestions related to TLB, feel free to reach us at test-load-balancer@googlegroups.com.

So go ahead, give your test suite rocket boosters and watch it fly by. Happy TDDing!

