Wednesday, September 16, 2015

Sandboxes, Distributed Computing, and Closures

The sandbox abstraction comes up over and over again in the context of distributed computing.  We didn't invent it, but it appears so frequently that it's worth giving it a name and explaining how it works.  This abstraction is our model of execution in the Makeflow workflow system, the Work Queue master-worker framework, the Umbrella environment generator, and other systems, and this is what enables these tools to interoperate.

The sandbox abstraction is a way of specifying a remote execution so that it can be efficiently delivered and precisely reproduced.  To run a task, the user states the following:

run C = T( A, B ) in environment E

In this example, A and B are the input files to the task, T is the command to be run, and C is the output file produced.  (Obviously, there can be more input and output files as needed.)
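
To make the notation concrete, here is one way that specification might be written down as plain data before being handed to an execution system.  This is just an illustrative sketch in Python; the file names and environment name are invented.

    # A hypothetical description of "run C = T(A, B) in environment E".
    # The names below are made up for illustration; a real system has
    # its own task format.
    task = {
        "command": "python simulate.py params.txt > output.dat",  # T
        "inputs":  ["simulate.py", "params.txt"],                  # A and B
        "outputs": ["output.dat"],                                 # C
        "environment": "centos6.tar.gz",                           # E
    }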

The sandbox itself is a private namespace in which the task can run, isolated from the outside world.  This allows the task to see the input and output files differently than the caller does.  A sandbox can be implemented in many ways: the simplest is just a plain old directory, but it could be a Linux container or even a whole virtual machine.

The environment E is all the additional data that needs to be present within the sandbox: the operating system, the filesystem tree, program binaries, scripts, and so forth.  The environment must be compatible with the sandbox technology.  For example, a tarball is a sufficient environment for executing within a directory sandbox, while a virtual machine image is needed for a virtual machine sandbox.

Now, the pieces must all come together:  The sandbox must be created and the environment unpacked within it.  The input files must be moved to the execution site and copied or otherwise connected to the sandbox.  The task is run, producing the output, which must then be moved outside of the sandbox to the desired location.  Then, the sandbox may be discarded.
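
That lifecycle is easy to sketch for the simplest case, a plain-directory sandbox.  The code below is illustrative only (plain Python standard library, not any of our tools), but it shows the steps in order:

    import shutil, subprocess, tarfile, tempfile
    from pathlib import Path

    def run_in_sandbox(command, inputs, outputs, environment_tarball=None):
        """Sketch of the sandbox lifecycle: create, populate, run, extract, discard."""
        sandbox = Path(tempfile.mkdtemp(prefix="sandbox-"))      # create the sandbox
        try:
            if environment_tarball:                              # unpack the environment E
                with tarfile.open(environment_tarball) as tar:
                    tar.extractall(sandbox)
            for f in inputs:                                     # copy inputs A and B inside
                shutil.copy(f, sandbox / Path(f).name)
            subprocess.run(command, shell=True, cwd=sandbox, check=True)  # run the task T
            for f in outputs:                                    # move output C back out
                shutil.move(str(sandbox / f), f)
        finally:
            shutil.rmtree(sandbox)                               # discard the sandbox

    # e.g. run_in_sandbox("sort words.txt > sorted.txt",
    #                     inputs=["words.txt"], outputs=["sorted.txt"])

Because the task runs with the sandbox as its working directory, everything it reads and writes is resolved relative to the sandbox, which is exactly what makes the whole package easy to ship to a remote machine.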

Once you begin to execute all tasks using the sandbox abstraction, many things become easier.
  • Executing tasks at remote sites becomes very easy, because all of the necessary dependencies are explicit and can be moved around the world.  (e.g. Work Queue)
  • Similar tasks running on the same machine can share input objects, to improve efficiency.  (e.g. Umbrella)
  • Multiple tasks can be chained together while respecting independent namespaces.  (e.g. Makeflow)
Of course, none of these properties is accidental: they have good precedent in the realm of language theory.  A sandbox execution is really just a closure: a function combined with an environment, that is, a set of bindings from names to values.
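
The analogy is easy to see in a few lines of code.  In this little sketch, the returned function closes over the bindings in its environment, and the resulting closure can be passed around and called later, just as a task plus its environment can be shipped to a remote sandbox and run there:

    def make_task(environment):
        # 'environment' is a set of bindings from names to values.
        def task(x):
            # The inner function closes over 'environment': code and
            # captured bindings travel together, like a sandboxed task.
            return environment["scale"] * x + environment["offset"]
        return task

    t = make_task({"scale": 3, "offset": 1})
    print(t(10))   # prints 31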

Wednesday, May 20, 2015

Writing Solid Tests is (Still) Hard

We have a nice little automatic build-and-test system for the Cooperative Computing Tools which brings together the capabilities of Github, Condor, and Docker.

Every proposed merge to the codebase is packaged up as a build job which is dispatched to our Condor pool.  Some of those jobs run natively on bare hardware, some run on virtual machines, and some run in Docker containers, but all of them are managed by Condor, so we can have a zillion builds going simultaneously without too much conflict.

The result is that anytime someone proposes a pull request, it gets run through the system and a few minutes later we get a new row on the web display that shows whether each platform built and tested correctly.  It's very handy, and provides for objective evaluation and gentle pressure on those who break the build.

(I should point out here that Patrick Donnelly and Ben Tovar have done a bang-up job of building the system, and undergraduate student Victor Hawley added the Docker component.)

Some days the board is all green, and some days it looks more like this:
[screenshot of the build-and-test dashboard, with a number of red entries]

But the hardest part of this seems to be writing the tests properly.  Each test is a little structured script that sets up an environment, runs some component of our software, and then evaluates the results.  It might start up a Chirp server and run some file transfers, or run Parrot on a tricky bit of Unix code, or run Makeflow on a little workflow to see if various options work correctly.
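
The shape of a test is roughly the same every time: a setup phase, a run phase, and an evaluation phase, with the pass/fail decision at the end.  (Our real tests are shell scripts; the skeleton below is a made-up Python rendering of the same idea.)

    import subprocess, sys

    def prepare():
        # set up the environment: scratch files, maybe a server to talk to
        subprocess.run("echo hello > input.txt", shell=True, check=True)

    def run():
        # exercise one component of the software under test
        return subprocess.run("tr a-z A-Z < input.txt > output.txt",
                              shell=True).returncode == 0

    def evaluate():
        # check the result against what we expected
        return open("output.txt").read().strip() == "HELLO"

    prepare()
    sys.exit(0 if run() and evaluate() else 1)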

Unfortunately, there are many ways that the tests can fail without revealing a bug in the code! We recently added several platforms to the build, resulting in a large number of test failures.  Some of these were due to differences between Unix utilities like sh, dd, and sed on the various machines. Others were more subtle, resulting from race conditions in concurrent actions.  (For example, should you start a Master in the foreground and then a Worker in the background, or vice versa?)  There is a certain art to being able to write a shell script that is portable and robust.
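
One habit that helps with that last class of failure is to poll for readiness instead of assuming it: have the test wait until the Master is actually accepting connections before it launches the Worker.  A small sketch of the idea (the port number and timeout here are invented):

    import socket, time

    def wait_for_port(host, port, timeout=30):
        """Poll until something is listening on host:port, or give up."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((host, port), timeout=1):
                    return True
            except OSError:
                time.sleep(0.5)
        return False

    # start the master in the background, then:
    # assert wait_for_port("localhost", 9123), "master never came up"
    # ...and only then start the worker.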

There is also a tension in the complexity of the tests.  On one hand, you want short, focused tests that exercise individual features, so that they can be completed in a few minutes and give immediate feedback.

On the other hand, you also want to run big complex applications, so as to test the system at scale and under load.  We don't really know that a given release of Parrot works at scale until it has run on 10K cores for a week for a CMS physics workload.  If each core consumes 30W of power over 7 days, that's a 50 megawatt-hour test!  Yikes!
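
(The arithmetic: 10,000 cores × 30 W × 168 hours is about 50,400 kWh, or roughly 50 megawatt-hours.)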

Better not run that one automatically.