Thursday, February 2, 2017

Writing a Compilers Textbook

To my surprise, I am in the final steps of writing a textbook!  You can see a sample chapter today at

The effort began in the fall of 2016, as I was putting together my materials for CSE 40243, our undergraduate class in Compilers and Language Design.  This class focuses on the challenges of engineering a working language: students implement a working compiler that translates a C-like language into X86 assembly.

While there are a variety of solid textbooks that are great for a graduate course in compiler theory and optimization, none quite had the flavor I was looking for.  Nearly every CS grad needs to write a parser, evaluator, or translator for some kind of little language in their career, but relatively few need to dig deeply into assembly language optimization.  So, I wanted to focus on language design choices and show that simple languages are not hard to implement.

I began to combine my handwritten chalkboard notes and some sample code into a LaTeX document, and the next thing you know, I have seven chapters written.  I expect to finalize everything in the spring 2017 semester.

What has made it relatively easy so far is that my compiler automatically generates many of the figures and code examples automatically, so relatively few things have to be drawn by hand.  For example, this sample AST is produced automatically by the compiler emitting Graphviz DOT code from the internal representation.  Neat, eh?

Following the example of Remzi and Andrea Arpaci-Dusseau with OSTEP the book will be made available for free online in PDF form, and also in an inexpensive hardcover edition printed on-demand.

Stay tuned for the release later in 2017...

Tuesday, April 5, 2016

NunyaOS: An Experimental OS Kernel

This semester, I am organizing an experimental class around the design of an operating system kernel.  Six students formed a team in response to a call for volunteers, and now busy designing NunyaOS, an experimental OS kernel.  Building on top of the Basekernel, they have built a system that boots an X86 machine, reads a CD-ROM filesystem, runs multiple processes in paged virtual memory, and has a simple windowing system.  We are off too a good start.

To try it out, download the source, build it, and run it in a VM like this:
qemu-system-i386 --cdrom basekernel.iso

The key organizing principle of NunyaOS is hierarchical containment.  This means that each process lives within a security container.  Within that container, the process has complete authority to manipulate its resources.  It also has the power to create sub-containers and then place child processes within them.  The containment can be applied to each of the resources within the system -- currently the filesystem, the window system, and the memory allocator.  As a result, each process lives a in a sort of a lightweight virtual machine, where it perceives itself to be the superuser.

For example, here are a few nested containers, each with their own filesystem root, display, and memory allocation:

Ideally, every child process will live in a container, so that we can eliminate attack vectors between code provided from different sources.  For example, your desktop should run your web browser in a container, your web browser should run each tab in a container, and each tab should run downloaded code (like a video codec) in yet another container.  In this way, untrusted code has very little leeway to affect other elements of your system.

Of course, this idea changes the customs by which processes interact with each other.  We can no longer build programs that scatter data all over the filesystem, and expect others to read it.  There are many challenges here, and we have only begun to dig into them.

Wednesday, September 16, 2015

Sandboxes, Distributed Computing, and Closures

The sandbox abstraction comes up over and over again in the context of distributed computing.  We didn't invent it, but it appears so frequently that it's worth giving it a name and explaining how it works.  This abstraction is our model of execution in the Makeflow workflow system, the Work Queue master-worker framework, the Umbrella environment generator, and other systems, and this is what enables these tools to interoperate.

The sandbox abstraction is a way of specifying a remote execution in a way that can be efficiently delivered and precisely reproduced.  To run a task, the user states the following:

run C = T( A, B ) in environment E

In this example, A and B are the input files to the task, T is the command to be run, and C is the output file produced.  (Obviously, there can be more input and output files as needed.)

The sandbox itself is a private namespace in which the task can run, isolated from the outside world.  This enables the task to perceive the input and output files in a different way than the called.  A sandbox can be implemented in many ways: the simplest is just a plain old directory, but it could a Linux container or even a whole virtual machine.

The environment E is all the additional data that needs to be present within the sandbox: the operating system, the filesystem tree, program binaries, scripts, etc, which are represented by L1, L2, and L3 above.  The environment must be compatible with the sandbox technology.  For example, a tarball is a sufficient environment for executing within a directory sandbox, while a virtual machine image is needed for a virtual machine sandbox. 

Now, the pieces must all come together:  The sandbox must be created and the environment unpacked within it.  The input files must be moved to the execution site and copied or otherwise connected to the sandbox.  The task is run, producing the output, which must then be moved outside of the sandbox to the desired location.  Then, the sandbox may be discarded.

Once you begin to execute all tasks using the sandbox abstraction, many things become easier.
  • Executing tasks at remote sites becomes very easy, because all of the necessary dependencies are explicit and can be moved around the world.  (e.g. Work Queue
  • Similar tasks running on the same machine can share input objects, to improve efficiency.  (e.g. Umbrella)
  • Multiple tasks can be chained together while respecting independent namespaces.  (e.g. Makeflow)
Of course, all of these properties are not accidental: they have a good precedent in the realm of language theory.  A sandbox execution is really just a closure, which is the name for a function combined with an environment, which is a set of bindings from names to values.

Wednesday, May 20, 2015

Writing Solid Tests is (Still) Hard

We have a nice little automatic build-and-test system for the Cooperative Computing Tools which has nicely brought together the capabilities of Github, Condor, and Docker.

Every proposed merge to the codebase is packaged up as a build job which is dispatched to our Condor pool.  Some of those jobs run natively on bare hardware, some jobs run on virtual machines, and some are running in Docker containers, but all of them are managed by Condor, so we can have a zillion builds going simultaneously without too much conflict.

The result is that anytime someone proposes a pull request, it gets run through the system and a few minutes later we get a new row on the web display that shows whether each platform built and tested correctly.  It's very handy, and provides for objective evaluation and gentle pressure on those who break the build.

(I should point out here that Patrick Donnelly and Ben Tovar have done a bang-up job of building the system, and undergraduate student Victor Hawley added the Docker component.)

Some days the board is all green, and some days it looks more like this:

But the hardest part of this seems to be writing the tests properly.  Each test is a little structured script that sets up an environment, runs some component of our software, and then evaluates the results.  It might start up a Chirp server and run some file transfers, or run Parrot on a tricky bit of Unix code, or run Makeflow on a little workflow to see if various options work correctly.

Unfortunately, there are many ways that the tests can fail without revealing a bug in the code! We recently added several platforms to the build, resulting in a large number of test failures.  Some of these were due to differences between Unix utilities like sh, dd, and sed on the various machines. Others were more subtle, resulting from race conditions in concurrent actions.  (For example, should you start a Master in the foreground and then a Worker in the background, or vice versa.)  There is a certain art to being able to write a shell script that is portable and robust.

There is also a tension in the complexity of the tests.  On one hand, you want short, focused tests that exercise individual features, so that they can be completed in a few minutes at give immediate feedback.

On the other hand, you also want to run big complex applications, so as to test the system at scale and under load.  We don't really know that a given release of Parrot works at scale until it has run on 10K cores for a week for a CMS physics workload.  If each core consumes 30W of power over 7 days, that's a 50 megawatt-hour test!  Yikes!

Better not run that one automatically.

Monday, May 19, 2014

Toward a Common Model of Highly Concurrent Programming

(This is the short version of a talk I gave at the MTAGS workshop at Supercomputing 2013.  See the slides here.)

Historically, highly concurrent programming has been closely associated with high performance computing.  Two programming models have been dominant: shared memory machines in which concurrency was expressed via multiple threads, and distributed memory machines in which concurrency was expressed via explicit message passing.  It is widely agreed that both of these programming models are very challenging, even for the veteran programmer.  In both cases, the programmer is directly responsible for designing the program from top to bottom and handling all of the issues of granularity, consistency, and locality necessary to achieve acceptable performance, with very little help from the runtime or operating systems.

However, a new approach to concurrent programming has been emerging over the last several years, in which the user programs in a much higher level language and relies upon the system to handle many of the challenging underlying details.  To achieve this, the program is successively decomposed into simpler representations, such that each layer of the system can gradually adapt it to the hardware available.

The layers can be described as follows:
  • declarative language (DL) for compactly representing a complete program.
  • directed graph (DAG) to represent the expanded program and its resources.
  • bag of independent tasks (BOT) with explicit input and output dependencies.
  • A shared-nothing cluster to which data and tasks must be assigned.
Several different research communities have arrived at this computing model somewhat independently: the high performance computing community, the scientific workflow community, and the cloud computing community.  In each case, the scale and complexity of the systems in use eventually made it impossible for the programmer or the user to take responsibility for all of the challenges of parallel/distributed computing.  Although each community employs different technologies and has distinct optimization goals, the overall structure of these systems is surprisingly similar.

A (very incomplete) selection of systems that follow this model:

Layer Cloud Stack Workflow Stack HPC Stack
Declarative Language (DL)    Pig Weaver Swift-T
Directed Acyclic Graph (DAG)Map-Reduce    Makeflow -
Bag of Tasks (BOT)JobTracker Work Queue Master Turbine
Distributed Data HDFS Work Queue Workers    MPI

Each layer of the system fulfills a distinct need.  The declarative language (DL) at the top is compact, expressive, and easy for end users, but is intractable to analyze in the general case because it may have a high order of complexity, possibly Turing-complete.  The DL can be used to generate a (large) directed acyclic graph (DAG) that represents every single task to be executed.  The DAG is not a great user-interface language, but it is much more suitable for a system to perform capacity management and optimization because it is a finite structure with discrete components.  A DAG executor then plays the graph, dispatching individual tasks as their dependencies are satisfied.  The BOT consists of all the tasks that are ready to run, and is then scheduled onto the underlying computer, using the data dependencies made available from the higher levels.

Why bother with this sort of model?  It allows us to compare the fundamental capabilities and expressiveness of different kinds of systems.  For example, in the realm of compilers, everyone knows that a proper compiler consists of a scanner, a parser, an optimizer, and a code generator.  Through these stages, the input program is transformed from a series of tokens to an abstract syntax tree, an intermediate representation, and eventually to assembly code.  Not every compiler uses all of these stages, much less the same code, but by using a common language, it is helpful to understand, compare, and design new systems.

Friday, February 14, 2014

Visualizing 10,000 Cores

Our Condor pool at the University of Notre Dame has been slowly growing, in no small part due to our collaboration with the Center for Research Computing, where it is now scavenging unused cycles from HPC clusters at the CRC.  When the dedicated batch system leaves a node unused, Condor is started on that node and keeps going until the dedicated system wants the node back.  Depending on the time of year, that leaves anywhere between 4K and 10K nodes available in the Condor pool.

We have tried a number of approaches at visualizing this complex system over the years.  Our latest tool, the Condor Matrix Display started as a summer project by Nick Jaeger, a student from the University of Wisconsin at Eau Claire.  The display shows a colored bar for each slot in the pool, where the width is proportional to the number of cores.

With a quick glance, you can see how many users are busy and whether they are running "thin" (1 core) or "fat" (many core) jobs.  Sorting by the machine name gives you sense of how each sub-cluster in the pool is used:

While sorting by users gives you a sense of what users are dominating the pool:

The display is always a nice way of viewing the relatively new feature of "dynamic slot" in Condor.  A large multi-core machine is now represented as a single slot with multiple resources.  For example, this bit of the display shows a cluster of 8-core machines where some of the machines are unclaimed (green), some are running 4-core jobs (blue), and some are running 1-core jobs (green):

Monday, February 6, 2012

Some Open Computer Science Problems in Workflow Systems

In the previous article, I extolled the virtues of Makeflow, which has been very effective at engaging new users and allowing them to express their workflows in a way that facilitates parallel and distributed computing. We can very consistently get new users going from one laptop to 100 cores in a distributed system very easily.

However, as we develop experience in scaling up workflows to thousands of cores across wide area distributed systems, a number of interesting computer science challenges have emerged. These problems are not specific to Makeflow, but can be found in most workflow systems:

Capacity Management
Just because a workflow expresses thousand-way concurrency doesn't mean that it is actually a good idea to run it on one thousand nodes! The cost of moving data to and from the execution nodes may outweigh the benefit of the added computational power. If one uses fewer nodes than the available parallelism, then it may be possible to pay the data movement cost once, and then exploit it multiple times. For most workflows, there is a "sweet spot" at which performance is significantly maximized. Of course, users don't want to discover this by experiment, they need some tool to recommend an appropriate size for the given workflow.

Software Engineering Tools
A workflow is just another kind of program: it has source code that must be managed, dependencies that must be gathered, and a history of modification to be tracked. In this sense, we are badly in need of tools for manipulating workflows in coherent ways. For example, we need a linker that can take a workflow, find all the dependent components, and gather them together in one package. We need a loader that can take an existing workflow, load it into a computing system, and then update file names and other links to accomodate it. We need a profiler that can report on the time spent across multiple runs of a workflow, so as to determine where problem spots may be.

Portability and Reproducibility
Makeflow itself enables portability across execution systems. For example, you can run your application on Condor or SGE without modification. However, that doesn't mean that your applications are actually portable. If one cluster runs Blue Beanie Linux 36.5 and another runs Green Sock Linux 82.7, your chances of the same executable running on both are close to zero. Likewise, if you run a workflow one day, then set it aside for a year, it's possible that your existing machine has been updated to the point where the old workflow no longer runs.

However, if we also explicitly state the execution environment in the workflow, then this can be used to provide applications with what they need to run. The environment might be as simple as a directory structure with the applications, or as complex as an entire virtual machine. Either way, the environment becomes data that must be managed and moved along with the workflow, which affects the performance and cost issues discussed above.

Everything in computing must be composable. That is, once you get one component working, the very next step is to hook it up to another so that it runs as a subcomponent. While we can technically hook up one Makeflow to another, this doesn't currently happen in a way that results in a coherent program. For example, the execution method and resource limits don't propagate from one makeflow to another. To truly enable large scale structures, we need a native method of connecting workflows together that connects not only the control flow, but the resource allocation, capacity management, and everything else discussed above.

Effortless Scalability

As a rule of thumb, I tell brand new users that running a Makeflow on 10 cores simultaneously is trivial, running on 100 cores is usually easy, and getting to 1000 cores will require some planning and debugging. Going over 1000 cores is possible (our largest system is running on 5000 cores) but requires a real investment of time by the user.

Why does scale make things harder? One reason is that computer systems are full of artificial limits that are not widely know or managed effectively. On a Unix-like system, a given process has a limited number of file descriptors per process and a limited number of files per directory. (Most people don't figure this out until they hit the limit, and then the work must be restructured to accomodate.) A complex network with translation devices may have a limited number of simultaneously network connections. A data structure that was small to ignore suddenly becomes unmanageable when there are 10,000 entries.

To have a software system that can scale to enormous size, you need to address these known technical issues, but also have methods of accomodating limits that you didn't expect. You also need an architecture that can scale naturally and observe its own limits to understand when they are reached. An ideal implementation would know its own limits and not require additional experts in order to scale up.


Each of these problems, though briefly described, are pretty hefty problems once you start digging into them. Some of them are large enough to earn a PhD. (In fact, some are already in progress!) They all have the common theme of making data intensive workflows manageable, useable, portable, and productive across a wide variety of computing systems.

More to follow.