Dockerized IPython and GraphLab Create for Machine Learning

So here's a fun surprise: I love machine learning just like everyone else! I have played around with a few ML concepts in a couple of weekend projects, but nothing serious. Earlier this year I came across a Coursera ML course which seemed like a great place to start a more formal education about the subject. Sweet! I'm a self-taught programmer, but taking university CS courses as a non-degree-seeking student helped fill the gaps in my knowledge. I see the value of formal education.

To prepare for the class, they want you to set up a Python 2.7 environment (Anaconda), IPython, and GraphLab Create. IPython is an interactive computing environment. It was first built for Python (obviously), but it supports all kinds of languages now. One of the cool features is built-in support for data visualizations. We're specifically concerned with the IPython "Notebook" feature set, which these days lives on as Jupyter (and that's what we'll actually launch in the container). We are also using GraphLab Create, a commercial product spawned out of a project from CMU and released by Dato. The CEO of Dato is one of the primary instructors of the course.

Real quick, GraphLab Create is a commercial product, as I've mentioned. Dato offers a free "student" license for this product, which is what we will be using for the course. This blog post assumes that you have already signed up for one of these educational licenses, and you have already been given your license key (which looks, for example, like ABCD-0123-EF45-6789-9876-54FE-3210-DCBA). You will need both this license key and the email address you signed up with in order to continue.

Now, all this software is great (really, you'll see what I mean when you start using it), but it seems like quite a lot of stuff that I'd rather not have installed directly on my filesystem if I can help it. If you have spoken to me at all in the last two and a half years, you know that I am a big fan of Docker. Probably 80% of this blog is about Docker. And so it should come as no surprise that I've got compartmentalization on my mind. I install and use just about everything Dockerized. These tools are phenomenally useful for this illuminating ML course, but I want them containerized.

I've taken the installation instructions for Anaconda Python and GraphLab Create and put them into a Dockerfile, which you'll have a chance to look at a little further down. Before I get to that, I want to point out that if you look closely at the install instructions for GraphLab Create, you'll see a mention of getting your Nvidia GPU to work with the software in order to speed things along. For machine learning specifically, having a GPU do the heavy lifting can shave days or even weeks off of your compute time.

CPUs are fine for most projects and will probably work just fine for this course, but I had seen the question of CUDA support come up before specifically with respect to Docker. I had heard that CUDA was not available to Docker containers because of the difficulty of making the Nvidia drivers available to the containerized process. Well, I took it as an opportunity for research, and it just so happens Nvidia has very recently released an application called nvidia-docker specifically for making the GPU available to Docker containers! You can follow that link for all kinds of interesting information, but suffice it to say nvidia-docker is a drop-in replacement for the docker executable which you use to run images that you want to be CUDA-capable. They also offer similar functionality in a daemon plugin called nvidia-docker-plugin. You can read about the differences in the nvidia-docker documentation.
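Before going any further, it's worth a quick sanity check that a container can actually see the GPU. The smoke test suggested in the nvidia-docker README at the time was (roughly) to run nvidia-smi inside Nvidia's CUDA image; if your drivers are wired up correctly, you should see the same device table you get on the host:

# pull Nvidia's CUDA base image and run nvidia-smi inside a throwaway container
nvidia-docker run --rm nvidia/cuda nvidia-smi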

A quick note about nvidia-docker: I ran into a problem with the .deb I had installed per their instructions, because I am using the latest version of Docker (1.11), and as of this writing, they haven't released an updated .deb with the working code. That meant I had to compile my own nvidia-docker binary. It's super easy (and doubly so for anyone with the technical wherewithal to be taking a machine learning course): just git clone https://github.com/NVIDIA/nvidia-docker, then cd nvidia-docker && make, then sudo make install, and you've got a working binary!
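Spelled out, that's the whole build:

# grab the source, build it, and install the resulting nvidia-docker binary
git clone https://github.com/NVIDIA/nvidia-docker
cd nvidia-docker
make
sudo make install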

Also, I ran into import errors during the ML course (specifically with matplotlib). I ended up solving this by installing python-qt4 inside the Docker container right off the bat.

So finally, here's what my Dockerfile looks like:

FROM ubuntu:14.04
MAINTAINER curtisz <software@curtisz.com>

# get stuff
RUN apt-get update -y && \
	apt-get install -y \
		curl \
		python-qt4 && \
	rm -rf /var/cache/apt/archives/*

# get more stuff in one layer so unionfs doesn't store the 400mb file in its layers
WORKDIR /tmp
RUN curl -o /tmp/Anaconda2-4.0.0-Linux-x86_64.sh http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh && \
	chmod +x ./Anaconda2-4.0.0-Linux-x86_64.sh && \
	./Anaconda2-4.0.0-Linux-x86_64.sh -b && \
	rm ./Anaconda2-4.0.0-Linux-x86_64.sh
# make the anaconda stuff available
ENV PATH=${PATH}:/root/anaconda2/bin

## anaconda
RUN conda create -n dato-env python=2.7 anaconda
# (use JSON format to force interpretation by /bin/bash)
RUN ["/bin/bash", "-c", ". activate dato-env"]
RUN conda update pip

## install graphlab create with creds provided in --build-arg in 'docker build' command:
ARG USER_EMAIL
ARG USER_KEY
RUN pip install --upgrade --no-cache-dir https://get.dato.com/GraphLab-Create/1.9/${USER_EMAIL}/${USER_KEY}/GraphLab-Create-License.tar.gz

## install ipython and ipython notebook
RUN conda install ipython-notebook

## upgrade GraphLab Create with GPU Acceleration
RUN pip install --upgrade --no-cache-dir http://static.dato.com/files/graphlab-create-gpu/graphlab-create-1.9.gpu.tar.gz

CMD jupyter notebook

I ended up having to update this Dockerfile when Dato released GraphLab Create version 1.9. It was as easy as changing "1.8.5" to "1.9" in the Dockerfile. Everything else was the same. Keep this in mind if you find that you need to install a newer version of GraphLab Create. Now, you'll build the image with this command, making sure to replace the email and license key with your own details:

docker build -t=graphlab --build-arg "USER_EMAIL=genius@example.edu" --build-arg "USER_KEY=ABCD-0123-EF45-6789-9876-54FE-3210-DCBA" .

The build will take a few minutes. It downloads a few hundred megabytes of stuff. When you're done with that, you can launch IPython with the following command:

nvidia-docker run -d --name=graphlab -v "`pwd`/data:/data" --net=host graphlab:latest

Voila! You've got this whole operation running and you can access your IPython notebook by going to http://localhost:8888/ in your browser!

Please note: When it comes time to use GraphLab Create, you will be able to browse its UI normally, because we have specified --net=host in the docker run command, which shares the host's network stack with the container. The reason we do it this way is that GraphLab Create uses tcp/0 to set its server port; that means the system chooses a random high port number, which prevents us from publishing a specific port with an EXPOSE Dockerfile directive (or a -p port assignment in the docker run command). Sharing the host's network stack with the container can have security implications if you run an untrusted application in that container. The applications we're using for this course are fine; it's just something you should be aware of.

Finally, I've been asked to include a tiny Docker crash course in case it's new to you. So our particular run command also mounts the ./data/ directory into the container at /data! This means you can download notebooks and datasets for the course and put them in that directory, and they'll be accessible in the container under the /data directory. For example, you would use sf = graphlab.SFrame('/data/people-example.csv') to load the sample data. In your terminal, you can use docker logs graphlab to see the container's logs, but don't forget you can swap out -d with -it in your docker run command if you want to create an interactive session for the container so that you can see the output in your terminal. You can also drop into a shell on the running container with docker exec -it graphlab /bin/bash and poke around if you need to. Killing the container happens with docker stop graphlab and deleting the container happens with docker rm graphlab. The Docker documentation is generally well-written, concise, and accurate. The source is also very approachable, as is the Docker community itself! Don't be afraid to drop by #docker on Freenode IRC if you need help!
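For reference, here are those container-management commands gathered in one place:

# follow the notebook server's output
docker logs -f graphlab
# open an interactive shell inside the running container
docker exec -it graphlab /bin/bash
# stop and delete the container when you're finished
docker stop graphlab
docker rm graphlab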

For your convenience, I have created a Github repository with the Dockerfile and related scripts, as well as the sample starter data provided by Coursera.

Good luck!

Running Crons in Docker with Supervisord

Recently I've had an interesting conversation in #docker on Freenode with a guy that's been trying to get crons working inside his Docker container. I hadn't yet had a chance to look at that, and so we took off on a late-night debug session exchanging Dockerfiles via Pastebin. He has a bunch of other stuff going on, but at the core, he's just running an Apache webserver instance and then wants to run some crons in that container as well. I took his Dockerfile and related scripts, and pared them down to the bare minimum, commenting out everything that wasn't related directly to getting Apache and cron to work. You can take a look at what I came up with:

FROM ubuntu:14.04
MAINTAINER curtisz <software@curtisz.com>

# we install stuff this way to keep it all on one layer
# (which reduces the overall size of our image)
RUN apt-get update -y && \
	apt-get install -y \
		cron \
		apache2 \
		supervisor && \
	rm -rf /var/lib/apt/lists/*

# apache stuff
RUN mkdir -p /var/lock/apache2 /var/run/apache2 /etc/supervisor/conf.d/
ENV APACHE_RUN_USER www-data
ENV APACHE_RUN_GROUP www-data
ENV APACHE_LOG_DIR /var/log/apache2
ENV APACHE_LOCK_DIR /var/lock/apache2
ENV APACHE_PID_FILE /var/run/apache2.pid
# empty out the default index file
RUN echo "" > /var/www/html/index.html

# cron job which will run hourly
# (remember that COPY is better than ADD for plain files or directories)
COPY ./crons /etc/cron.hourly/
RUN chmod +x /etc/cron.hourly/crons
# test crons added via crontab
RUN echo "*/1 * * * * uptime >> /var/www/html/index.html" | crontab -
RUN (crontab -l ; echo "*/2 * * * * free >> /var/www/html/index.html") 2>&1 | crontab -

# supervisord config file
COPY ./supervisord.conf /etc/supervisor/conf.d/supervisord.conf
 
EXPOSE 80
WORKDIR /var/www/html/
CMD /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf

You can tell what's going on here; it's pretty straightforward. To get this working, we need to install apache2, supervisor, and of course, cron. The next few lines are configuration options for Apache. Then finally we get to the test crons. I'm dumping a simple cron shell script into /etc/cron.hourly in the container and making it executable, and then creating two new crons via crontab. Please note how I append the second cron with crontab - by first listing the previously-added crons with crontab -l. If you don't do this, whatever you pipe into crontab will overwrite whatever's in there now. My crons are deliberately dumb: they just dump something easy into Apache's index.html file so we can prove they're running. My little crons file looks like this:

#!/bin/sh
ps aux | grep apache > /var/www/html/index.html

Some things to notice about this file... First, it starts with a typical shell script shebang (#!/bin/sh). It should also be executable. Lastly, because the scripts in /etc/cron.hourly are executed by run-parts, the filename can only contain letters, digits, underscores, and hyphens. It cannot contain a dot, which means naming it something like crons.sh won't work.
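If you're not sure whether run-parts will accept a given filename, you can ask it directly from inside the running container (we start one below); it lists the scripts it would execute without actually running them:

# a script with a dot in its name (like crons.sh) simply won't appear in this list
docker exec crontest run-parts --test /etc/cron.hourly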

Next, we're adding the config file for supervisord. Then we simply expose the port, change our working directory, and specify our CMD to start supervisord. Speaking of which, this is what our supervisord.conf file looks like:

[supervisord]
nodaemon=true
logfile = /var/log/supervisord.log
logfile_maxbytes = 50MB
logfile_backups=10
 
[program:cron]
autorestart=false
command=cron -f
 
[program:apache2]
autorestart=false
command=/usr/sbin/apache2ctl -D FOREGROUND

Pretty standard fare. And so let's build our image and start a container:

docker build -t="crontest" .
docker run -it --name="crontest" -p 8080:80 crontest:latest

And there you have it! Give your crons a few minutes to execute, then fire up a browser on your localhost and point it to http://localhost:8080/index.html. You should see the output of our test crons there at the tail of the file. Refresh to see more.

That's all there is to it! Previously, I've used the cron system available on the Docker host, which certainly has its benefits, first and foremost not having to run supervisord inside the container. Since Docker is just a fancy way to run a process, you want to avoid loading up your container with a bunch of cruft. It's not a VM! But when you can't avoid it and you really need to run crons alongside your containerized processes, it's no sweat to get it going.
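For comparison, the host-side approach is just an ordinary crontab entry on the Docker host that pokes a running container with docker exec. A minimal sketch (the container name and script path here are made up):

# on the Docker host's crontab: run a script inside an already-running container every hour
0 * * * * docker exec my-app /usr/local/bin/hourly-task.sh >> /var/log/my-app-cron.log 2>&1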

Using a Private Docker v2 Registry with Nginx-Proxy

Today in #docker on Freenode there was a person with a problem with their v1 Docker registry. I think I jinxed it when I said it was "extremely easy" to get a v2 registry running behind an Nginx proxy. It turned into a nightmare, and I'm sharing the design process to help anyone else that might need to debug problems with a similar setup.

So I had previously spent time getting a private registry to work behind the jwilder/nginx-proxy image, which is a great reverse proxy for docker containers. There is a lot of movement in the Docker world, especially in the ecosystem of orchestration applications around it, and these days plenty of other tools do what the nginx-proxy image does. Personally I like Consul and Serf from HashiCorp. Incidentally, they also make a hell of a nice application -- Vault -- that solves most of the problems around sharing sensitive configuration information. Anyways, for single-host-multiple-container environments, I still prefer proxying with the nginx-proxy image. I use it to front all my web applications and our GitLab installation, so it only made sense to front my v2 registry with the same proxy.

The way the nginx-proxy image works is that it listens on tcp/80 and tcp/443 and watches the host machine's /var/run/docker.sock for docker events; other containers are then started with a VIRTUAL_HOST environment variable. When that happens, the docker-gen utility regenerates the Nginx configuration from a template and starts routing requests to the container. You can also mount htpasswd files and SSL certs into the proxy to manage authentication and HTTPS connections. It also supports custom directives and custom templates.
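To make that concrete, here's roughly what putting an ordinary web container behind the proxy looks like once nginx-proxy itself is running (the hostname is a placeholder, and jwilder/whoami is just a tiny demo web server):

# the only thing the backend container needs is a VIRTUAL_HOST variable;
# docker-gen sees the new container and regenerates the Nginx config for it
docker run -d --name "whoami" -e "VIRTUAL_HOST=whoami.example.com" jwilder/whoami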

So anyways, I was pretty sure the guy wouldn't be able to get support for his problem, because v1 has been officially deprecated on Docker Hub and is no longer the primary registry endpoint the Docker client looks for when attempting to connect to a registry. Support for v2 was introduced in Docker 1.6, and as of version 1.9, the client prefers v2 registry endpoints over v1. So the time has come to upgrade.

Let's get started. First, pull the images we need to work with:

docker pull jwilder/nginx-proxy
docker pull registry:2.2

It's important to note that registry:latest does not point to the latest version of the registry; the latest tag points to v1! We need to make sure we're pulling v2. The docker registry is actually part of the "distribution" repository, and you can and should check there for the latest version of the v2 registry image and its documentation. The Docker documentation is generally very good and accurate, so make sure you read it and prefer that information over mine.

To configure the v2 registry, we need to create a minimal config.yml file. I usually keep all my stuff for docker under /var/docker/<container>, so sudo sh -c "mkdir -p /var/docker/registry && cd /var/docker/registry/ && vim config.yml" and put this into it:

version: 0.1
log:
    level: info
    formatter: json
    fields:
        service: registry
        environment: staging
        source: registry
http:
    addr: :5000
    host: myregistry.example.com
    secret: biglongsecretwhatever
storage:
    filesystem:
        rootdirectory: /var/lib/registry

This is the bare minimum you'll need to get your registry going. You should change http.secret to something long and random. For example you can use a bash one-liner like cat /dev/urandom | tr -dc a-zA-Z0-9 | head -c36 to get a random string. Now save this yaml file. Finally, mkdir /var/docker/registry/lib to create a directory for storing our registry images on the local host. I like to use AWS S3, and if this interests you, take a look at my last post on this subject for instructions.
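In other words, something like this on the host:

# generate a random value to paste into http.secret
cat /dev/urandom | tr -dc a-zA-Z0-9 | head -c36 ; echo
# create the directory the registry container will store its images in
sudo mkdir -p /var/docker/registry/lib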

We've got our v2 registry primed to run, but we won't run it quite yet.

Let's get the nginx-proxy image going. First mkdir /var/docker/nginx-proxy && cd /var/docker/nginx-proxy to get into our base host directory. Use mkdir vhost.d to create a directory for storing our special Nginx directives.

Now, if you want to have nginx-proxy handle your SSL certificates and authentication -- and we do, since docker will complain about a registry running without HTTPS -- you'll want to mkdir htpasswd && mkdir certs at this time as well. For illustrative purposes, let's say our registry domain is myregistry.example.com and we've already pointed DNS at the host. So we'll name our SSL certificate myregistry.example.com.crt and our key myregistry.example.com.key and drop both of those files into the /var/docker/nginx-proxy/certs directory (which we'll mount at /etc/nginx/certs, where the proxy expects to find them).

With HTTP authentication, we can just do some real basic stuff. We don't need anything fancy, since the client supports basic authentication. You can use htpasswd (from the apache2-utils package on Linux Mint or Ubuntu) to generate authentication information. Save this information in /var/docker/nginx-proxy/htpasswd/myregistry.example.com similar to the way we named our SSL data. For future reference, you can also store a default certificate and key for HTTPS requests that arrive at your nginx-proxy which aren't routable to one of your containers.
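Creating that htpasswd file looks something like this (the username is up to you):

# htpasswd lives in the apache2-utils package
sudo apt-get install apache2-utils
# -c creates the file; you'll be prompted for the password
sudo htpasswd -c /var/docker/nginx-proxy/htpasswd/myregistry.example.com someuser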

We need a couple more files dropped into /var/docker/nginx-proxy/vhost.d. The first is myregistry.example.com and looks like this:

client_max_body_size 0;
chunked_transfer_encoding on;
 
location /v2/ {
  # Do not allow connections from docker 1.5 and earlier
  # docker pre-1.6.0 did not properly set the user agent on ping, catch "Go *" user agents
  if ($http_user_agent ~ "^(docker\/1\.(3|4|5(?!\.[0-9]-dev))|Go ).*$" ) {
    return 404;
  }
 
  add_header Docker-Distribution-Api-Version "registry/2.0";
  #more_set_headers     'Content-Type: application/json; charset=utf-8';
  include               vhost.d/docker-registry.conf;
}
 
location /v1/_ping {
  auth_basic off;
  include               vhost.d/docker-registry.conf;
  add_header X-Ping     "inside /v1/_ping";
  add_header X-Ping     "INSIDE /v1/_ping";
}
 
location /v1/users {
  auth_basic off;
  include               vhost.d/docker-registry.conf;
  add_header X-Users    "inside /v1/users";
  add_header X-Users    "INSIDE /v1/users";
}

These directives do a couple of things. First, they lift the limit Nginx places on request body size; since you're going to be uploading huge image layers to your v2 registry, we need to turn that limitation off. The /v2/ location block also refuses access to client versions 1.5 and below (which only speak the v1 registry protocol anyway). Of particular interest are the location directives for the v1 endpoints, which work around docker client bugs that caused connections to a v2 registry to fail with a 404.

The second file is /var/docker/nginx-proxy/vhost.d/docker-registry.conf and looks like this:

proxy_pass                          http://myregistry.example.com;
proxy_set_header  Host              $http_host;   # required for docker client's sake
proxy_set_header  X-Real-IP         $remote_addr; # pass on real client's IP
proxy_set_header  X-Forwarded-For   $proxy_add_x_forwarded_for;
proxy_set_header  X-Forwarded-Proto $scheme;
proxy_read_timeout                  900;

This forwards the IP address of your clients to the registry for logging purposes, as well as providing a couple of required headers.

Now that we've configured our v2 registry and nginx-proxy, let's start them up! We'll begin with the nginx-proxy container (shoutout to Arthur for noticing I'd forgotten to mount vhost.d in this next command):

docker run -d \
  --name "nginx-proxy" \
  --restart "always" \
  -p 80:80 \
  -p 443:443 \
  -v /var/docker/nginx-proxy/certs:/etc/nginx/certs:ro \
  -v /var/docker/nginx-proxy/htpasswd:/etc/nginx/htpasswd:ro \
  -v /var/docker/nginx-proxy/vhost.d:/etc/nginx/vhost.d \
  -v /var/run/docker.sock:/tmp/docker.sock \
  jwilder/nginx-proxy

Next, let's start our v2 registry container:

docker run -d \
  --name="registry" \
  --restart="always" \
  -v "/var/docker/registry/config/config.yml:/etc/docker/registry/config.yml" \
  -v "/var/docker/registry/lib:/var/lib/registry" \
  -e "VIRTUAL_HOST=myregistry.example.com" \
  registry:2.2

Notice here that we're not binding our v2 registry's port to any port on the host; we want traffic to reach it only through the nginx-proxy container. You may be wondering how the nginx-proxy container knows that it should route inbound HTTPS requests to port 5000 on the registry container. The answer is that docker-gen routes to whatever port your container has exposed, either via EXPOSE in the Dockerfile or --expose in the docker run command. In our example, the v2 registry image has EXPOSE 5000 in its Dockerfile.
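With both containers up, it's worth a quick end-to-end test from a machine running a v2-capable Docker client; something along these lines, using the credentials from your htpasswd file:

# log in, then tag and push any small image to the new registry
docker login myregistry.example.com
docker pull alpine
docker tag alpine myregistry.example.com/test/alpine
docker push myregistry.example.com/test/alpine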

This brings me to one last pain point I had with the nginx-proxy image. As wonderful as it is, the documentation doesn't make it obvious that you can only proxy HTTP-based services. So, for example, you can't have the nginx-proxy image front for your SMTP server or your FTP server. Also, something else that took me a while to understand... Let's say you want to run a cAdvisor container to expose some metrics for your Prometheus server. The cAdvisor documentation has you publish its UI on tcp/8080. That's totally fine, but if you want to run this container behind your nginx-proxy container, you should not bind that port to the host. Since cAdvisor exposes port 8080 in its Dockerfile, you can simply start it with -e "VIRTUAL_HOST=cadvisor.example.com" and it will then be available at https://cadvisor.example.com/, served from behind your nginx-proxy. The nginx-proxy container gets an inbound HTTPS request on tcp/443, then routes the connection on the backend to tcp/8080 on your cAdvisor container. No --link necessary!
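As a sketch, that cAdvisor run command would look something like this (the mounts are the ones the cAdvisor docs call for, so double-check them against the current documentation; note there's no -p flag, since nginx-proxy handles the routing):

docker run -d \
  --name "cadvisor" \
  -e "VIRTUAL_HOST=cadvisor.example.com" \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:rw \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor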

A piece of advice when debugging problems: start with docker logs nginx-proxy to figure out whether the problem is with your container or with the proxy itself. You can also use docker exec -it nginx-proxy /bin/bash to drop into a shell on your nginx-proxy container and poke around. In particular, check /etc/nginx/conf.d/default.conf to make sure your container has been picked up by docker-gen and is properly exposed to Nginx.

Good luck!

Publish/Subscribe: The Five Ws (and of course, the How)

A friend of mine recently asked me about the publish/subscribe ("pubsub") programming pattern. As this is something I use in almost every project, I thought I'd be able to find a decent tutorial for him online. Something that would be helpful to someone familiar with programming, but not familiar with this pattern. As it happens, most of the pubsub documentation or tutorials out there are specific to their use in one situation: APIs. That's all fine and dandy, but the pubsub pattern is so much more powerful and applicable in so many more situations than that niche. I am pretty sure it's my favorite programming pattern of them all. Not so hard to believe once you know that I write JavaScript for a living. But, I think JavaScript programmers are more sensitive to the applicability of the pubsub pattern than programmers in other languages, because JavaScript is asynchronous.

So what exactly does that mean?

If you've ever written any JavaScript, you have probably gotten yourself in trouble with its asynchronous nature. For example, you've probably thought about doing something like this:

// read data from a file ...
var data = readDataFromFile('/path/to/file.txt');
// ... then do something with the data
console.log('your name is: ' + data.username);

But you can't do that. JavaScript's asynchronous nature means the (totally made-up) function readDataFromFile('/path/to/file.txt') doesn't hand you its result right away. The actual file read gets handed off, and the work of dealing with its result is tossed onto the event queue and executed later, after the console.log('your name is: ' + data.username); has already run. All the code in the current scope is executed in a batch, then JavaScript grabs the next thing on the event queue and executes everything in that function's scope, and so on. Asynchronous calls deliver their results through the event queue instead of returning them directly, the way function calls do in synchronous languages like Python or C. You could write a lot of JavaScript before you encounter this behavior, though; it isn't always obvious which calls are asynchronous. Usually it's the calls out to external resources, for example AJAX or file I/O.

To work around this in JavaScript, we use callbacks. Callbacks are functions that we pass as parameters to other functions, to be executed after all the work inside the first function is done. That sounds stupid, so let me illustrate. Here's an example of proper asynchronous JavaScript using callbacks:

// read data from a file ...
fs.readFile('/path/to/file.txt', function( err, data ) {
    //                           ^^^^^^^^^^^^^^^^^^^^^
	// ... then do something with the data *in a callback*
	if (err) {
    	throw new Error('problem doing stuff: ' + err);
    }
    console.log('your name is: ' + data.username);
});

Now you're probably asking yourself why I'm even talking about asynchronous programming in JavaScript. What does that have to do with the pubsub pattern? Well, after a while, you fall into what's called "callback hell" when you chain callbacks with callbacks with callbacks. Take a look at what should be a simple operation: Reading from a file, making a change to the data, then writing those changes to a file:

var filename = '/path/to/file.txt';
var savefile = '/path/to/new/file.txt';

fs.stat(filename, function(err, stat) {
	// inside first callback
	if (err) throw new Error('could not stat file: '+err);
    if (!stat.isFile()) throw new Error('path is not a file!');
    
    fs.readFile(filename, function(err, data) {
    	// inside second callback
    	if (err) throw new Error('could not read file: '+err);
        
        modifyUsername(data.username, function(name) {
        	// inside third callback
            
        	fs.stat(savefile, function(err, stat) {
            	// inside fourth callback
            	if (err) throw new Error('could not stat file: '+err);
                if (!stat.isFile()) throw new Error('path is not a file!');
                
                fs.writeFile(savefile, name, function(err) {
                	// i want to kill myself ..............
                    // and look at all the close brackets and parentheses!
                    // how embarrassing!
                });
            });
        });
    });
});
// yikes. are you sure you closed up all your functions properly?

"Holy shit. JavaScript sucks!" they will say. And so. Even the most faithful will pause to think.

Now, if only there were a way to write a blob of callback code, and then create a "trigger" that we could pull when we were ready to execute the callback blob, we could clean this mess right up. ... Well, that's right! You've figured out that my beloved pubsub can swoop in and save the day.

Here is what super-simplified pubsub calls look like:

// when you have a blob of code you want to run later:
subscribe(  "this-can-be-any-string",    functionToCall    );
// then when you're ready to execute the blob of code:
publish(  "this-can-be-any-string",   [  array, of, parameters  ]);

The first parameter to the subscribe() function is a string that we use to "index" the function we want to call; we store the function under this string as a label. Later, when we call the publish() function, we reference the stored function using that same string ("this-can-be-any-string"), and the second parameter to publish() is an array of arguments to pass to the function we're executing. The result of the two lines above ends up logically looking like this:

functionToCall(array, of, parameters);

Take a look at the pubsub object. It's very easy to read:

// here is our pubsub object
// this is how we enabled the pattern
var $pubsub = (function() {
	var cache = {};
    function _flush() { cache = {}; }
	function _pub( topic, args, scope ) {
    	if (cache[topic]) {
        	var current = cache[topic];
            for (var i=0; i<current.length; i++) {
				current[i].apply(scope || this, args || []);
            }
        }
    }
    function _sub( topic, callback ) {
    	if (!cache[topic]) {
        	cache[topic] = [];
		}
        cache[topic].push(callback);
    }
	return {
    	flush: _flush,
    	pub: _pub,
    	sub: _sub
    };
})();

Now take a look at the refactored code, which uses pubsub to escape callback hell by "subscribing" some functions to events which we later "publish":

// convenience function to DRY up our calls to fs.stat()
var fileStat = function( filename, callback ) {
	fs.stat(filename, function(err, stat) {
    	if (err) throw new Error('could not stat file: '+err);
        if (!stat.isFile()) throw new Error('path is not a file!');
        typeof(callback) === 'function' && callback();
    });
};
// now begins our list of discrete functions to execute in a specific order
// (note the $pubsub.pub() calls within each function!)
var fileRead = function( filename ) {
	fileStat(filename, function() {
        fs.readFile(filename, function(err, data) {
        	if (err) throw new Error('could not read file: '+err);
            $pubsub.pub('/username/modify', [data]);
        });
    });
};
var fileWrite = function( filename, data ) {
	fileStat(filename, function() {
        fs.writeFile(filename, data, function(err) {
        	if (err) throw new Error('could not write file: '+err);
            $pubsub.pub('/continue/process');
        });
    });
};
var modifyUsername = function( username ) {
	var newUsername = username + '_modified';
	$pubsub.pub('/file/write', ['/path/to/new/file.txt', newUsername]);
};
var moreStuff = function() {
	// do more stuff after everything else
};

// set up our subscriptions
$pubsub.sub('/file/read', fileRead);
$pubsub.sub('/file/write', fileWrite);
$pubsub.sub('/username/modify', modifyUsername);
$pubsub.sub('/continue/process', moreStuff);

// kick off the whole process with this "publish" statement
$pubsub.pub('/file/read', ['/path/to/file.txt']);

This looks a lot nicer, doesn't it? It's a bit more typing, but once you grok what's happening here, you will never want to go back to that awful callback hell. So I've spent the last few minutes answering the five Ws of the publish/subscribe pattern by showing you how to do it. If anything is still unclear, start reading again from the top and hand-copy the code into your IDE; typing it out yourself always helps things sink in. Before you know it, you'll grok the pubsub pattern and be using it to hit all kinds of nails.

Meet DOSBox, the kickass... Debugger?

One of the members of an ARG I play recently started talking about an old piece of equipment he'd purchased, which supposedly had been used by phone repair technicians to do their work. The equipment in question is an Itronix T5000, which has an in-built modem, speedy 486 processor, and 640KB of RAM. Kilobytes, folks. This was the 90s. You know, incidentally, I fondly remember having 640KB of RAM in my very first computer, and having to juggle peripherals.

Anyways, unfortunately for our friend, when he powered the device on, this is what he saw:

Our friend Mister Argent managed to offload all of the device's files to USB using an available restore feature. He just didn't know how to proceed. Fortunately for him, when it comes to binaries from the 90s, I'm your guy. Now, it's been a few years since my last encounter with something of this nature, and I hadn't realized that most of the tools we used to use for reversing no longer function on today's platforms. A simple strings of the files gave me nothing useful: interesting, sure, but nothing that would solve our primary predicament. There is a PASSWORD.DAT file in the collection, but it's clearly not plaintext, and here I am nothing more than a hobbyist, and definitely no cypherpunk. I would need to reverse this binary to get anywhere. It didn't take me long, however, to remember the only game I spend time playing these days -- which also happens to be a binary from the 90s -- and more importantly, the platform I use to play it: DOSBox.

If you've never heard of DOSBox, it is basically what you're thinking it is after my description above: An emulator for DOS applications. The thing about DOSBox that makes it special -- besides being the key to many glorious, wonderful games from the 90s that you couldn't otherwise play -- is that the creator has built in a very useful debugger.

I'm a die-hard Linux Mint user, since I can't stand Ubuntu's Unity UI almost as much as I can't stand Mark Shuttleworth. One of the nice things about Linux Mint -- besides the fact that it hasn't immediately jumped into the systemd assimilation chamber -- is that it uses Ubuntu as a base, and therefore has its repositories available for consumption. DOSBox is available in Ubuntu's default repository (and probably in other distros' default repos), but if you want to use the debugger, you've got to compile it with a special option, which means building from source. On Linux Mint/Ubuntu, you're going to need a few things in order to compile. If you've built software from source before, you probably already have build-essential, autoconf, and automake. If not:

sudo apt-get install build-essential autoconf automake

Either way, you're going to need to get the DOSBox dependencies:

sudo apt-get build-dep dosbox

When you're compiling DOSBox with its debugger enabled, you need a curses development library installed to continue. Thanks to this answer on Stack Overflow, resolving this on Linux Mint/Ubuntu is a cinch:

sudo apt-get install lib32ncurses5-dev

Next, download and extract the source (sorry for the SourceForge link):

wget "http://downloads.sourceforge.net/project/dosbox/dosbox/0.74/dosbox-0.74.tar.gz"
tar -xvf dosbox-0.74.tar.gz

Next we'll build our awesome DOSBox debugger. Something you should know here is that DOSBox actually comes with two "levels" of debugging capability: compiling with --enable-debug gets you most of the debugging features, but a few important ones require compiling with --enable-debug=heavy. Most importantly, the "heavy" debugger enables the heavycpu command, a hardcore CPU logger that makes following code a lot easier:

cd dosbox-0.74
./autogen.sh

At this point, we're going to need to modify the source a little bit to prevent some errors in the actual compilation. Thanks to this helpful post by the DOSBox author, we know exactly what we need to change in the source to prevent the error. Let's create a little patch file to do the work for us. Create a new file in the dosbox-0.74 directory:

vim ./dosbox-0.74.patch

Paste the following contents into the editor:

diff -rupN dosbox-0.74/include/dos_inc.h dosbox-0.74.patched/include/dos_inc.h
--- dosbox-0.74/include/dos_inc.h	2010-05-10 10:43:54.000000000 -0700
+++ dosbox-0.74.patched/include/dos_inc.h	2015-07-07 14:52:42.057078234 -0700
@@ -28,6 +28,8 @@
 #include "mem.h"
 #endif
 
+#include <stddef.h>
+
 #ifdef _MSC_VER
 #pragma pack (1)
 #endif
diff -rupN dosbox-0.74/src/cpu/cpu.cpp dosbox-0.74.patched/src/cpu/cpu.cpp
--- dosbox-0.74/src/cpu/cpu.cpp	2010-05-12 02:57:31.000000000 -0700
+++ dosbox-0.74.patched/src/cpu/cpu.cpp	2015-07-07 14:52:23.641077942 -0700
@@ -30,6 +30,7 @@
 #include "paging.h"
 #include "lazyflags.h"
 #include "support.h"
+#include <stddef.h>
 
 Bitu DEBUG_EnableDebugger(void);
 extern void GFX_SetTitle(Bit32s cycles ,Bits frameskip,bool paused);
diff -rupN dosbox-0.74/src/dos/dos.cpp dosbox-0.74.patched/src/dos/dos.cpp
--- dosbox-0.74/src/dos/dos.cpp	2010-05-10 10:43:54.000000000 -0700
+++ dosbox-0.74.patched/src/dos/dos.cpp	2015-07-07 14:52:11.929077757 -0700
@@ -31,6 +31,7 @@
 #include "setup.h"
 #include "support.h"
 #include "serialport.h"
+#include <stddef.h>
 
 DOS_Block dos;
 DOS_InfoBlock dos_infoblock;
diff -rupN dosbox-0.74/src/ints/ems.cpp dosbox-0.74.patched/src/ints/ems.cpp
--- dosbox-0.74/src/ints/ems.cpp	2010-05-10 10:43:54.000000000 -0700
+++ dosbox-0.74.patched/src/ints/ems.cpp	2015-07-07 14:51:59.081077554 -0700
@@ -32,6 +32,7 @@
 #include "setup.h"
 #include "support.h"
 #include "cpu.h"
+#include <stddef.h>
 
 #define EMM_PAGEFRAME	0xE000
 #define EMM_PAGEFRAME4K	((EMM_PAGEFRAME*16)/4096)

Now save the file and exit, and apply the patch you have just created:

patch -p1 < ./dosbox-0.74.patch

Finally, compile DOSBox:

./configure --enable-debug=heavy
make

I've already got DOSBox installed, and so I chose not to install over it with my debugger-enabled version. But if you don't care, go ahead and place your newly-built debugging DOSBox version into your executables directory:

sudo make install
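If you skip that last step, the freshly-built binary lives under src/ in the build directory and runs fine from there. Once it's up, you'll get a separate debugger window alongside the DOSBox one, and (in debugger-enabled builds) you can launch a program under the debugger straight from the DOSBox prompt; roughly:

# run the debugger-enabled build in place instead of installing it
./src/dosbox
# then, at the DOSBox prompt:
#   mount c /path/to/the/dumped/files
#   c:
#   debug PROGRAM.EXE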

Awesome! We've got a bitchin' debugger! The second part of this story will cover the password discovery process using our fresh-from-source DOSBox debugger.