So here's a fun surprise: I love machine learning just like everyone else! I have played around with a few ML concepts in a couple of weekend projects, but nothing serious. Earlier this year I came across a Coursera ML course which seemed like a great place to start a more formal education about the subject. Sweet! I'm a self-taught programmer, but taking university CS courses as a non-degree-seeking student helped fill the gaps in my knowledge. I see the value of formal education.

To prepare for the class, you're asked to set up a Python 2.7 environment (Anaconda), IPython, and GraphLab Create. IPython is an interactive environment for programming languages; it originally targeted Python, but it supports plenty of other languages these days. One of its cool features is built-in support for data visualizations. We're specifically concerned with the IPython "Notebook" feature set. We're also using GraphLab Create, a commercial product spun out of a CMU project and released by Dato. The CEO of Dato is one of the primary instructors of the course.

Real quick: GraphLab Create is a commercial product, as I've mentioned, but Dato offers a free student license, which is what we'll be using for the course. This post assumes you've already signed up for one of these educational licenses and received your license key (which looks something like ABCD-0123-EF45-6789-9876-54FE-3210-DCBA). You'll need both this license key and the email address you signed up with in order to continue.

Now, all this software is great (really, you'll see what I mean when you start using it), but it seems to be quite a lot of stuff that I'd rather not have installed on my filesystem if I can help it. If you have spoken to me at all in the last two and a half years, you know that I am a big fan of Docker. Probably 80% of this blog is about Docker. And so it should come as no surprise that I've got compartmentalization on my mind. I install and use just about everything Dockerized. These tools are phenomenally useful for this illuminating ML course, but I want them containerized.

I've taken the installation instructions for Anaconda Python and GraphLab Create and put them into a Dockerfile, which you'll have a chance to look at a little further down. Before I get to that, I want to point out that if you look closely at the install instructions for GraphLab Create, you'll see a mention of getting your Nvidia GPU to work with the software in order to speed things along. For machine learning in particular, having a GPU workhorse can mean the difference between hours and days (or more) of computation.

CPUs are fine for most projects and will probably work just fine for this course, but I've seen the question of CUDA support come up specifically with respect to Docker. I had heard that CUDA wasn't available to Docker containers because of the difficulty of making the Nvidia drivers available to the containerized process. Well, I took it as an opportunity for research, and it just so happens Nvidia has very recently released an application called nvidia-docker specifically for making the GPU available to Docker containers! You can follow that link for all kinds of interesting information, but suffice it to say nvidia-docker is a drop-in replacement for the docker executable, used when running images that you want to be CUDA-capable. They also offer similar functionality in a daemon plugin called nvidia-docker-plugin; you can read about the differences in the nvidia-docker documentation.
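As a quick sanity check (assuming you've already got nvidia-docker and the Nvidia drivers installed), you can verify that containers can actually see the GPU by running nvidia-smi inside Nvidia's reference CUDA image:

```shell
# pull the reference CUDA image and run nvidia-smi inside it;
# if the drivers are wired up correctly, your card will be listed
nvidia-docker run --rm nvidia/cuda nvidia-smi
```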

A quick note about nvidia-docker: I ran into a problem with the .deb I had installed per their instructions, because I'm using the latest version of Docker (1.11), and as of this writing, they haven't released an updated .deb with the working code. That meant I had to compile my own nvidia-docker binary. It's super easy (and doubly so for anyone with the technical wherewithal to be taking a machine learning course): just git clone https://github.com/NVIDIA/nvidia-docker, then cd nvidia-docker && make, then sudo make install, and you've got a working binary!
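Spelled out step by step, that build looks like this:

```shell
# build nvidia-docker from source (only needed until a .deb
# compatible with Docker 1.11 is released)
git clone https://github.com/NVIDIA/nvidia-docker
cd nvidia-docker
make
sudo make install   # installs the nvidia-docker binary
```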

Also, I ran into import errors during the ML course (specifically with matplotlib). I ended up solving this by installing python-qt4 inside the Docker container right off the bat.
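If you'd rather not pull Qt into the image at all, a common alternative workaround (not the one I used here, but worth knowing) is to force matplotlib onto a non-interactive backend before importing pyplot:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; no Qt/X11 required
import matplotlib.pyplot as plt

# render a trivial plot to a file with no display attached
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
fig.savefig('/tmp/plot-test.png')
```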

So finally, here's what my Dockerfile looks like:

FROM ubuntu:14.04
MAINTAINER curtisz <software@curtisz.com>

# get stuff
RUN apt-get update -y && \
	apt-get install -y \
		curl \
		python-qt4 && \
	rm -rf /var/lib/apt/lists/*

# get more stuff in one layer so unionfs doesn't store the 400mb file in its layers
WORKDIR /tmp
RUN curl -o /tmp/Anaconda2-4.0.0-Linux-x86_64.sh http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh && \
	chmod +x ./Anaconda2-4.0.0-Linux-x86_64.sh && \
	./Anaconda2-4.0.0-Linux-x86_64.sh -b && \
	rm ./Anaconda2-4.0.0-Linux-x86_64.sh
# make the anaconda stuff available
ENV PATH=${PATH}:/root/anaconda2/bin

## anaconda
RUN conda create -y -n dato-env python=2.7 anaconda
# "activate" would only apply to a single RUN layer, so instead put
# the environment's binaries first on PATH for all subsequent commands
ENV PATH=/root/anaconda2/envs/dato-env/bin:${PATH}
RUN conda update -y pip

## install graphlab create with creds provided in --build-arg in 'docker build' command:
ARG USER_EMAIL
ARG USER_KEY
RUN pip install --upgrade --no-cache-dir https://get.dato.com/GraphLab-Create/1.9/${USER_EMAIL}/${USER_KEY}/GraphLab-Create-License.tar.gz

## install ipython and ipython notebook
RUN conda install -y ipython-notebook

## upgrade GraphLab Create with GPU Acceleration
RUN pip install --upgrade --no-cache-dir http://static.dato.com/files/graphlab-create-gpu/graphlab-create-1.9.gpu.tar.gz

CMD ["jupyter", "notebook", "--no-browser"]

I ended up having to update this Dockerfile when Dato released GraphLab Create version 1.9. It was as easy as changing "1.8.5" to "1.9" in the two pip install URLs (the Dockerfile above already shows 1.9); everything else stayed the same. Keep this in mind if you find that you need to install a newer version of GraphLab Create. Now, you'll build the image with this command, making sure to replace the email and license key with your own details:

docker build -t=graphlab --build-arg "USER_EMAIL=genius@example.edu" --build-arg "USER_KEY=ABCD-0123-EF45-6789-9876-54FE-3210-DCBA" .

The build will take a few minutes. It downloads a few hundred megabytes of stuff. When you're done with that, you can launch IPython with the following command:

nvidia-docker run -d --name=graphlab -v "`pwd`/data:/data" --net=host graphlab:latest

Voila! You've got this whole operation running and you can access your IPython notebook by going to http://localhost:8888/ in your browser!

Please note: when it comes time to use GraphLab Create, you'll be able to browse its UI normally, because we specified --net=host in the docker run command, which shares the host's network stack with the container. The reason we do it this way is that GraphLab Create binds its server to tcp/0, meaning the operating system chooses a random high (ephemeral) port. That prevents us from targeting a specific port with an EXPOSE directive in the Dockerfile (or a -p port assignment in the docker run command). Sharing the host's network stack with the container can have security implications if you run an untrusted application in that container. The applications we're using for this course are fine; it's just something you should be aware of.
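If tcp/0 is unfamiliar: binding to port 0 asks the kernel to pick any free ephemeral port for you, which is exactly why we can't predict the port ahead of time. A minimal sketch:

```python
import socket

# bind to port 0 and let the kernel choose a free ephemeral port
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 0))
port = s.getsockname()[1]
print(port)  # a different high-numbered port each run
s.close()
```

Since the port can't be known in advance, there's nothing for EXPOSE or -p to target, hence --net=host.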

Finally, I've been asked to include a tiny Docker crash course in case it's new to you. Our particular run command also mounts the ./data/ directory into the container at /data. This means you can download notebooks and datasets for the course, put them in that directory, and they'll be accessible inside the container under /data. For example, you would use sf = graphlab.SFrame('/data/people-example.csv') to load the sample data. In your terminal, you can use docker logs graphlab to see the container's logs, and you can swap -d for -it in your docker run command if you want an interactive session so the container's output appears directly in your terminal. You can also drop into a shell on the running container with docker exec -it graphlab /bin/bash and poke around if you need to. Stopping the container happens with docker stop graphlab, and deleting it happens with docker rm graphlab. The Docker documentation is generally well-written, concise, and accurate. The source is also very approachable, as is the Docker community itself! Don't be afraid to drop by #docker on Freenode IRC if you need help!
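Collected in one place, here are the day-to-day commands from that paragraph:

```shell
docker logs graphlab                  # view the container's output
docker exec -it graphlab /bin/bash    # get a shell inside the running container
docker stop graphlab                  # stop the container
docker rm graphlab                    # remove the stopped container
```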

For your convenience, I have created a Github repository with the Dockerfile and related scripts, as well as the sample starter data provided by Coursera.

Good luck!