Wednesday, September 16, 2015

Sandboxes, Distributed Computing, and Closures

The sandbox abstraction comes up over and over again in the context of distributed computing.  We didn't invent it, but it appears so frequently that it's worth giving it a name and explaining how it works.  This abstraction is our model of execution in the Makeflow workflow system, the Work Queue master-worker framework, the Umbrella environment generator, and other systems, and this is what enables these tools to interoperate.

The sandbox abstraction is a way of specifying a remote execution so that it can be efficiently delivered and precisely reproduced.  To run a task, the user states the following:

run C = T( A, B ) in environment E

In this example, A and B are the input files to the task, T is the command to be run, and C is the output file produced.  (Obviously, there can be more input and output files as needed.)
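To make the notation concrete, here is a minimal sketch of how such a task statement might be held as data.  The TaskSpec class and its field names are hypothetical, invented for illustration; they are not the actual data structures used by Makeflow, Work Queue, or Umbrella.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical record of one task: everything it needs is stated explicitly."""
    command: str        # T: the command to be run
    inputs: tuple       # A, B, ...: names of the input files
    outputs: tuple      # C, ...: names of the output files
    environment: str    # E: the environment the task runs in

    def describe(self):
        # Render the spec in the notation used above.
        ins = ", ".join(self.inputs)
        outs = ", ".join(self.outputs)
        return f"run {outs} = {self.command}( {ins} ) in environment {self.environment}"


spec = TaskSpec(command="T", inputs=("A", "B"), outputs=("C",), environment="E")
print(spec.describe())
```

Because the record names every input, output, and the environment, nothing about the task is left implicit in the machine where it happens to be submitted.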

The sandbox itself is a private namespace in which the task can run, isolated from the outside world.  This enables the task to perceive the input and output files differently than the caller does.  A sandbox can be implemented in many ways: the simplest is just a plain old directory, but it could also be a Linux container or even a whole virtual machine.

The environment E is all the additional data that needs to be present within the sandbox: the operating system, the filesystem tree, program binaries, scripts, etc., which are represented by L1, L2, and L3 above.  The environment must be compatible with the sandbox technology.  For example, a tarball is a sufficient environment for executing within a directory sandbox, while a virtual machine image is needed for a virtual machine sandbox.

Now, the pieces must all come together:  The sandbox must be created and the environment unpacked within it.  The input files must be moved to the execution site and copied or otherwise connected to the sandbox.  The task is run, producing the output, which must then be moved outside of the sandbox to the desired location.  Then, the sandbox may be discarded.
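The steps above can be sketched for the simplest case, a plain directory sandbox with a tarball environment.  This is only an illustration under stated assumptions: the function name run_in_sandbox and its parameters are hypothetical, it is not how Makeflow, Work Queue, or Umbrella are implemented, and a directory provides a private namespace for file names but no real isolation.

```python
import shutil
import subprocess
import tarfile
import tempfile
from pathlib import Path


def run_in_sandbox(command, inputs, outputs, environment=None):
    """Run `command` (an argument list) inside a fresh directory sandbox.

    inputs:      dict of name-inside-sandbox -> path outside the sandbox
    outputs:     dict of name-inside-sandbox -> destination path outside
    environment: optional path to a trusted tarball to unpack in the sandbox
    """
    # Create the sandbox: here, just a private temporary directory.
    sandbox = Path(tempfile.mkdtemp(prefix="sandbox-"))
    try:
        # Unpack the environment E within the sandbox.
        if environment is not None:
            with tarfile.open(environment) as tar:
                tar.extractall(sandbox)

        # Copy the input files in, under the names the task expects.
        for name, source in inputs.items():
            shutil.copy(source, sandbox / name)

        # Run the task with the sandbox as its working directory.
        subprocess.run(command, cwd=sandbox, check=True)

        # Move the outputs outside of the sandbox to the desired location.
        for name, destination in outputs.items():
            shutil.move(str(sandbox / name), str(destination))
    finally:
        # The sandbox may now be discarded.
        shutil.rmtree(sandbox, ignore_errors=True)
```

Note that the two dictionaries are what let the task see A, B, and C under its own names, whatever the files are called on the caller's side.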

Once you begin to execute all tasks using the sandbox abstraction, many things become easier.
  • Executing tasks at remote sites becomes very easy, because all of the necessary dependencies are explicit and can be moved around the world.  (e.g. Work Queue)
  • Similar tasks running on the same machine can share input objects, to improve efficiency.  (e.g. Umbrella)
  • Multiple tasks can be chained together while respecting independent namespaces.  (e.g. Makeflow)
Of course, none of these properties is accidental: they have a good precedent in the realm of language theory.  A sandbox execution is really just a closure: a function combined with an environment, where the environment is a set of bindings from names to values.
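For readers who have not met closures before, here is a small sketch of the analogy in Python.  The names make_task and concat are hypothetical, chosen only to mirror the C = T( A, B ) notation used earlier.

```python
def make_task(T, **bindings):
    """Combine a function T with an environment of name -> value bindings."""
    def run():
        # T and bindings are captured from the enclosing scope, so the
        # task carries everything it needs wherever it is eventually called.
        return T(**bindings)
    return run


def concat(A, B):
    return A + B


# Bind the names A and B to values now; execute later, possibly elsewhere.
task = make_task(concat, A="hello ", B="world")
C = task()
print(C)
```

The returned function plays the role of the sandboxed task: the caller only has to invoke it, because the bindings travel with it.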