When IT strikes: Recovering a deleted ucp-agent system service in Docker's Universal Control Plane

So you've been using Docker's Universal Control Plane (UCP) for some time, and it's going well. Of course it's going well, UCP is a great platform. It offers you a ton of features and a yuge amount of power. And incidentally, UCP also now supports Kubernetes.

Before I continue, I'd like to include something I said about Kubernetes...

Now, Kubernetes is like other container orchestration platforms, except that it has many layers of abstraction built into it with the idea that these abstractions allow you more headroom to scale. The thing is that these abstractions make Kubernetes far, far more complex than the vast majority of organizations need to leverage. Bear with me for a minute... Kubernetes was built by Google to orchestrate its services. But your company is not Google and there are many more important problems on the road between your company and Google. You do whatever you want, but I don't spend time solving problems I don't have. And we haven't even touched on the problems you're creating by introducing so many layers of abstraction into your environment. At some point, many of the folks using Kubernetes will hit its complexity like a brick wall. They may not be able to extricate themselves because of sunk cost. Fortunately for them, UCP is in a position to serve as a lifeline for these people to automate away much of the complexity around Kubernetes and bring their environment and orchestration back under their control again. And of course UCP is excellent tooling by itself for orchestrating workloads being run on top of many nodes across a Docker Swarm.

Maybe you'll get something useful out of that. Anyway, talking about things we don't control... There are still times when circumstances are completely out of your control and chaos happens. For example, someone does you a favor by going through your UCP cluster and cleaning out old crufty services on a Friday afternoon. It's going well but suddenly, with the weekend on their mind, a catastrophic brainfart happens and they hose the ucp-agent service from your cluster. Never mind that system services are prefixed with ucp-. Never mind that every single node in your entire cluster runs a few containers with ucp-agent either in the container name or image name. What is this I don't even...

Let's start from there: It's 1630 and someone has just issued docker service rm ucp-agent from a manager node. Whoops!

So, what are the symptoms? Nothing obvious, actually. Your monitoring is green across the board. There is no outage. As it turns out, because UCP is built on top of Docker Swarm, your production payload is perfectly fine. Services will continue to be deployed even from the UCP web UI. Your monitoring won't freak out because traffic is still flowing from ingress across the overlay to your application backend service containers.

So what's the problem? Well, UCP can't monitor itself on any of your nodes, so if you need to reconfigure any of the nodes in your cluster (or add workers or managers) or UCP otherwise needs to maintain itself, it can't do that. All else being okay in the world, the first problem you'll have is your certificates expiring because the ucp-agent service is unable to spawn the ucp-reconcile process to ask for new certs. And then you'll have your outage.
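Since nothing will page you when this happens, it's worth checking for the service explicitly. Here's a minimal sketch, assuming the default `docker service ls` table layout where NAME is the second column; the helper name `has_ucp_agent` is made up for illustration:

```shell
#!/bin/sh
# has_ucp_agent: reads `docker service ls` output on stdin and succeeds
# only if a service named exactly "ucp-agent" is listed.
# Assumes NAME is the second column and the first line is the header.
has_ucp_agent() {
    awk 'NR > 1 && $2 == "ucp-agent" { found = 1 } END { exit !found }'
}

# On a manager node you would feed it the real listing:
#   docker service ls | has_ucp_agent || echo "ucp-agent is missing!"
```

You could drop something like that into cron or your monitoring of choice so the missing service gets noticed long before the certificates expire.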

So it's not an emergency, but yeesh, how do we replace this system service? We can manually recreate it. The service creation will look something like this:


# probably best to "echo" this before you try to run it on your cluster.

# UCP_VERSION is something like "2.1.4"
UCP_VERSION=$( docker ps -a | grep -oe 'ucp-controller:[0-9]\+\.[0-9]\+\.[0-9]\+' | head -n1 | tr ':' ' ' | awk '{print $2}' )

# UCP_INSTANCE_ID is something like "JGNX:AKX2:5ZG3:4SNA:MS5V:LZ74:5NSL:O6TO:UFIH:A35M:G6R3:XMFV"
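# use the version detected above instead of a hard-coded one; "tr" strips the carriage return that "-t" adds
UCP_INSTANCE_ID=$(  docker container run -it --rm --name ucp -v /var/run/docker.sock:/var/run/docker.sock docker/ucp:${UCP_VERSION} id | awk 'NR%2==0' | tr -d '\r' )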

# SWARM_PORT and CONTROLLER_PORT are self-explanatory
SWARM_PORT=$( docker inspect ucp-controller --format '{{ .Args }}' | grep -m1 -oe '--swarm-url [^-]\+' | awk '{print $2}' | tr ':' ' ' | awk '{print $3}' )
CONTROLLER_PORT=$( docker inspect ucp-controller --format '{{ .Args }}' | grep -A1 -m1 -oe '--controller-port [0-9]\+' | awk '{print $2}' )

# try to grab DNS options
DNS=$( docker inspect ucp-controller --format '{{ .HostConfig.Dns }}' | sed 's/\[//; s/\]//' )
DNS_OPT=$( docker inspect ucp-controller --format '{{ .HostConfig.DnsOptions }}' | sed 's/\[//; s/\]//' )
DNS_SEARCH=$( docker inspect ucp-controller --format '{{ .HostConfig.DnsSearch }}' | sed 's/\[//; s/\]//' )

# try to grab KV data
KV_TIMEOUT=$( docker inspect ucp-reconcile --format '{{ .Args }}' | grep -oe '"Expected":{[^}]\+' | grep -oe '"KVTimeout":[0-9]\+' | tr ':' ' ' | awk '{print $2}' )
KV_SNAPSHOT_COUNT=$( docker inspect ucp-reconcile --format '{{ .Args }}' | grep -oe '"Expected":{[^}]\+' | grep -oe '"KVSnapshotCount":[0-9]\+' | tr ':' ' ' | awk '{print $2}' )

docker service create \
        --constraint "node.platform.os==linux" \
        --env "IMAGE_VERSION=${UCP_VERSION}" \
        --env "UCP_INSTANCE_ID=${UCP_INSTANCE_ID}" \
        --env "SWARM_PORT=${SWARM_PORT}" \
        --env "SWARM_STRATEGY=spread" \
        --env "DNS=${DNS}" \
        --env "DNS_OPT=${DNS_OPT}" \
        --env "DNS_SEARCH=${DNS_SEARCH}" \
        --env "KV_TIMEOUT=${KV_TIMEOUT}" \
        --env "EXTERNAL_SERVICE_LB=" \
        --env "DEBUG=1" \
        --label "com.docker.ucp.InstanceID=${UCP_INSTANCE_ID}" \
        --label "com.docker.ucp.version=${UCP_VERSION}" \
        --mode global \
        --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
        --mount type=bind,source=/etc/docker,destination=/etc/docker \
        --name ucp-agent \
        --restart-max-attempts 0 \
        --update-delay 2s \
        --update-failure-action pause \
        --update-max-failure-ratio 0 \
        --update-parallelism 1 \
        docker/ucp-agent:${UCP_VERSION} agent

Like it says above, you should run this through echo before you blindly issue this on your cluster. Different environments use different versions of grep, and while grep -e is pretty consistent, you need to make sure that your UCP_VERSION, UCP_INSTANCE_ID, SWARM_PORT, CONTROLLER_PORT, DNS, DNS_OPT, DNS_SEARCH, KV_TIMEOUT, and KV_SNAPSHOT_COUNT values look sane. Unless you've configured special DNS settings, the DNS* options should be blank. The KV* options should be 2000 and 20000 by default respectively.
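If you'd rather not eyeball the values, here's a rough sketch of an automated sanity check. The helper name `looks_sane` is hypothetical, and the instance ID pattern (twelve colon-separated groups of four characters) is an assumption based only on the example ID shown in the script:

```shell
#!/bin/sh
# looks_sane: hypothetical helper that sanity-checks the recovered values.
# Usage: looks_sane UCP_VERSION UCP_INSTANCE_ID SWARM_PORT CONTROLLER_PORT
looks_sane() {
    # version such as 2.1.4
    printf '%s\n' "$1" | grep -Eqx '[0-9]+\.[0-9]+\.[0-9]+' \
        || { echo "bad UCP_VERSION: '$1'" >&2; return 1; }
    # instance ID: assumed to be 12 groups of 4 uppercase letters or digits
    printf '%s\n' "$2" | grep -Eqx '([A-Z0-9]{4}:){11}[A-Z0-9]{4}' \
        || { echo "bad UCP_INSTANCE_ID: '$2'" >&2; return 1; }
    # ports: plain decimal numbers
    for port in "$3" "$4"; do
        printf '%s\n' "$port" | grep -Eqx '[0-9]+' \
            || { echo "bad port: '$port'" >&2; return 1; }
    done
}

# Example:
#   looks_sane "$UCP_VERSION" "$UCP_INSTANCE_ID" "$SWARM_PORT" "$CONTROLLER_PORT" \
#       && echo "values look plausible"
```

It only checks the shape of each value, not whether it's the right one for your cluster, so still run the whole thing through echo first.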

This will manually recreate the ucp-agent service on your cluster. Depending on how long this service was gone, you might see ucp-reconcile containers start to kick off and your other UCP system component containers restart.

Configuring Trackpoint on the Lenovo Thinkpad

This is more or less just a post for myself. I always end up dumping a couple of hours into this problem whenever I get a new machine for work -- surprise! I work for Docker now -- and tonight especially I really could have used this post instead of wasting that time researching the problem all over again. I choose Thinkpads when I have a choice, because the popular alternative is stupid.

Anyway, I use Linux Mint with Cinnamon, and LM18 is the current version. It's based on Ubuntu 16.04. I've chosen a P50 and upgraded the RAM to 64GB. Everything works out of the box, including the weird dual graphics situation going on under the hood. However, I want a super sensitive Trackpoint. The sensitivity settings are under something like /sys/devices/platform/i8042/serio1/serio2/ in the sensitivity, speed and inertia files. I like to keep mine at about 255, 230, and 4, respectively. The value of 255 is maximal, btw.

Now, simply dumping my preferences into those files works for the current session. Meaning, when I reboot the machine, they are reset to their defaults. So I'm using systemd to write values into these files on boot. I've got a /etc/tmpfiles.d/tpoint.conf file with the following contents:

w /sys/devices/platform/i8042/serio1/serio2/speed - - - - 230
w /sys/devices/platform/i8042/serio1/serio2/sensitivity - - - - 255
w /sys/devices/platform/i8042/serio1/serio2/inertia - - - - 4

Now, this works great. However, after I resume (or thaw) from a suspend (or hibernate), the location of these config files changes. So what I've done is added a simple shell script to /etc/pm/sleep.d/trackpoint-fix which contains the following:

#!/bin/sh
# set sensitivity/speed of trackpoint on resume
case "${1}" in
	suspend|hibernate)
		# suspending to RAM
		sleep 0
		;;
	resume|thaw)
		# resume from suspend
		newdir=$(find /sys/devices/platform/ | grep sensitivity | sed -e "s/sensitivity//")
		echo 230 | sudo tee "${newdir}speed" > /dev/null
		echo 255 | sudo tee "${newdir}sensitivity" > /dev/null
		echo 4 | sudo tee "${newdir}inertia" > /dev/null
		;;
esac

This is pretty self-explanatory. I dig around for the new location of these configuration files and then dump my favorite values into them. That's all there is to it.
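If you want to try the find-and-write logic without touching /sys, here's a small sketch of the same idea as a function that takes the search root as an argument. The name `apply_trackpoint` and its argument order are my own invention for this example:

```shell
#!/bin/sh
# apply_trackpoint: hypothetical helper; finds the directory holding the
# "sensitivity" file under ROOT and writes the three values next to it.
# Usage: apply_trackpoint ROOT SENSITIVITY SPEED INERTIA
apply_trackpoint() {
    root=$1
    # take the first match only, in case more than one device shows up
    sens_file=$(find "$root" -name sensitivity 2>/dev/null | head -n 1)
    [ -n "$sens_file" ] || { echo "no sensitivity file under $root" >&2; return 1; }
    dir=$(dirname "$sens_file")
    echo "$2" > "$dir/sensitivity"
    echo "$3" > "$dir/speed"
    echo "$4" > "$dir/inertia"
}

# As root on the real machine this would be:
#   apply_trackpoint /sys/devices/platform 255 230 4
```

Pointing it at a scratch directory with a fake sensitivity file in it is an easy way to check the path handling before you trust it on resume.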

Dockerized IPython / Anaconda for Machine Learning

Hey! You might have seen my recent post about having Dockerized some software called GraphLab Create (together with IPython) for a machine learning course I was taking. As it happens, I've found that image so useful for other, generic ML work that I've pared it down to its IPython/Anaconda bundle only. So I'd like to introduce the super-simple but super-useful Dockerized IPython / Anaconda project!

This repo includes a couple of useful scripts: One for building the image (build.sh) and one for running the resultant image as a container (run.sh). Just run the build script and then the run script (and optionally provide a directory to mount into the container for data files) and you're all set! Note: Either your specified directory or current working directory will be mounted to /data as a volume into the container! Also, your IPython Notebook may include import statements which reference functions inside files in your new /data volume directory. This means you will need to change any path references to include /data, and specifically add the /data directory to your import path by adding this to the top of your IPython Notebook:

import sys
sys.path.insert(0, '/data')

## this relative dir won't work:
# data_dir = 'foo/dataset.1'
## so we just add /data to the front:
data_dir = '/data/foo/dataset.1' 
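As for the mount itself, the "your directory or the current one ends up at /data" behavior can be sketched in a couple of lines of shell. This is only an illustration of the idea, not the actual contents of run.sh, and the function name `data_mount` is hypothetical:

```shell
#!/bin/sh
# data_mount: hypothetical helper; prints a docker -v argument that maps
# the given directory (or the current working directory) to /data.
data_mount() {
    printf '%s:/data\n' "${1:-$PWD}"
}

# Example:
#   docker run -v "$(data_mount "$HOME/datasets")" ...
```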

Finally, I try to be readable, but take a look at my earlier post linked above if you want a breakdown of what's going on in there. Have fun!

Let's Encrypt: Nginx-Proxy Docker Companion

I've been using the fairly popular nginx-proxy reverse proxy for Docker containers, created by Jason Wilder. It's a slick, super-simple method of putting many containers on a single host that all need to share HTTP/HTTPS ports. I am also a huge fan of the Let's Encrypt project. Free SSL certificates as long as you can prove that you are the domain's operator. This is really how it should work: In my book, forcing people to pay for SSL certificates is shitty and exploitative, especially considering how incredibly important encryption is these days. And incidentally, the behavior around self-signed certificates in browsers is stupid and broken.

So anyways, I played with the Let's Encrypt stuff late last year when they went into public beta. Best Christmas present ever! So the idea is pretty simple: You tell your Let's Encrypt client (probably best to use certbot) that you want a certificate, and it talks to the Let's Encrypt server, requesting a certificate from the CA. That causes Let's Encrypt to make a curl call to your domain, requesting a specific resource that certbot creates. That resource looks something like http://www.example.com/.well-known/acme-challenge/g89SrgM4UAJGHiukm3GqQ3xMjTnpN-kZDYb27u4aTRW. That resource is just a regular file on disk. It has contents that look something like DTe7mGGhLlML7Vlh4dyNTu97OiIrIIs7xd5O0Fpmlq8.TaRs2K47il2D0K9RjmOKOx7Neuu91FdEpLp2Wo4FcNI. As long as that resource matches what Let's Encrypt is looking for, you get a free SSL certificate! And they've built some magic into certbot so that it automagically installs that cert into your webserver if it's a common one (e.g. Apache or Nginx or something).
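To make the challenge step concrete, here's a tiny local stand-in for the check described above. The real validation is an HTTP request made by Let's Encrypt's servers, so treat this as a sketch of the comparison only; the function name `challenge_matches` and its arguments are assumptions for the example:

```shell
#!/bin/sh
# challenge_matches: hypothetical local stand-in for the CA's check.
# Succeeds if WEBROOT/.well-known/acme-challenge/TOKEN exists and its
# contents equal EXPECTED.
# Usage: challenge_matches WEBROOT TOKEN EXPECTED
challenge_matches() {
    file="$1/.well-known/acme-challenge/$2"
    [ -f "$file" ] && [ "$(cat "$file")" = "$3" ]
}

# Example:
#   challenge_matches /usr/share/nginx/html "$token" "$expected" && echo "challenge file looks right"
```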

I wanted to use it on my Dockerized web frontends, which use nginx-proxy. I had spotted a couple of issues on nginx-proxy's Github page which mentioned Let's Encrypt, but I hadn't yet tried to get this working with my nginx-proxy container. Getting that to work did not seem like a trivial task. Not having the spare minutes to get Let's Encrypt working in my infrastructure, I put it on the back burner and made a mental note to check in every once in a while. Well, I completely forgot about it until I got an email recently reminding me to renew one of my SSL certs. And I'm not paying $8.99 for something that should be free, so I knew it was time to check back in with Let's Encrypt being incorporated into the nginx-proxy project.

Enter nginx-proxy-letsencrypt-companion. This is a docker container that sits coupled to your nginx-proxy container, sharing its volumes and paying attention to containers spinning up that have LETSENCRYPT_HOST and LETSENCRYPT_EMAIL environment variables set. The idea is that you start your nginx-proxy container, then start up this nginx-proxy-letsencrypt-companion container, and then start up your other containers that need Let's Encrypt certificates. The companion will request new Let's Encrypt certificates for containers that do not have current certificates and which also have those LETSENCRYPT_* environment variables set.
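If you're curious which hostnames the companion would pick up for a given container, you can pull them out of the container's environment yourself. This is a rough sketch of the idea, not how the companion is actually implemented; the helper name `letsencrypt_hosts` is made up:

```shell
#!/bin/sh
# letsencrypt_hosts: reads KEY=VALUE lines on stdin and prints each host
# listed in LETSENCRYPT_HOST, one per line.
letsencrypt_hosts() {
    sed -n 's/^LETSENCRYPT_HOST=//p' | tr ',' '\n'
}

# Against a running container (here one named example.com):
#   docker inspect --format '{{ range .Config.Env }}{{ println . }}{{ end }}' example.com \
#       | letsencrypt_hosts
```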

So here are my notes for getting this going. I ended up adding /usr/share/nginx/html as a data volume in my nginx-proxy container, and making a couple of the volumes rw instead of ro. Thus, my nginx-proxy run command looks something like this:

docker run -d \
    --name="nginx-proxy" \
    --restart="always" \
    -p 80:80 \
    -p 443:443 \
    -v "/var/docker/nginx-proxy/htpasswd:/etc/nginx/htpasswd" \
    -v "/var/docker/nginx-proxy/vhost.d:/etc/nginx/vhost.d" \
    -v "/var/docker/nginx-proxy/certs:/etc/nginx/certs" \
    -v "/var/run/docker.sock:/tmp/docker.sock" \
    -v "/usr/share/nginx/html" \
    jwilder/nginx-proxy   # image name assumed here: Jason Wilder's nginx-proxy

And the command for the brand new nginx-proxy-letsencrypt-companion container looks like this:

docker run -d \
    --name="nginx-proxy-letsencrypt-companion" \
    --restart="always" \
    -v "/var/run/docker.sock:/var/run/docker.sock:ro" \
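    --volumes-from "nginx-proxy" \
    jrcs/letsencrypt-nginx-proxy-companion   # image name assumed here; use the companion image you actually pull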

And finally, your individual containers will follow this pattern (note the environment variables mentioned above):

docker run -d \
    --name="example.com" \
    --restart="always" \
    -e "VIRTUAL_HOST=example.com,www.example.com" \
    -e "VIRTUAL_PORT=2368" \
    -e "LETSENCRYPT_HOST=example.com,www.example.com" \
    -e "LETSENCRYPT_EMAIL=contact@example.com" \
    -v /var/docker/example.com/ghost:/var/lib/ghost \
    ghost   # image name assumed from the mount path; substitute your own web container image

And there you have it! Once you get your containers up -- in that order: nginx-proxy, nginx-proxy-letsencrypt-companion, and your web container -- give Let's Encrypt a minute to phone home and get a call back, and you'll have a free SSL certificate and related miscellany there in /var/docker/nginx-proxy/certs on your host. Also, there's a reason I created a data volume in the nginx-proxy container at /usr/share/nginx/html instead of a bind-mount volume: the Let's Encrypt client in the companion container will continually update the authorization data, and I don't want to have to worry later about cleaning up a huge, sprawling directory full of those files, some of which would contain valid information and many others invalid information. Of course, you do whatever you want, as long as it's right for you. Good luck!

Private Docker v2 Registry Upgrade Notes

Recently, I was offering some help on Stack Overflow to someone asking about deleting images from a private Docker registry. Here I mean a v2 registry, which is part of the Docker Distribution project. I should advise anyone reading this that no one ever refers to a v1 registry anymore: That project is dead, even though it occupies the registry:latest image tag on Dockerhub (you want to pull registry:2 at least). The v1 registry is an old Python project, and v2 is written in Go. Anyways, the v2 registry has not had delete capabilities (via the API) since its inception. This was my initial assumption, but I took the opportunity to research the latest information.

As it turns out, the latest versions of registry (later than v2.4 I think) do have delete functionality. While the main Docker project ("docker-engine") has excellent documentation, the Distribution project's documentation has historically not been as good. It's not bad, but it's not great, either. The API documentation is not very clear on using the new delete API functionality. But it's there, along with an interesting garbage collection mechanism. That's a topic for another day, but it's the reason I wanted to upgrade to version 2.4 of the registry. I was using version 2.1 or something.

Cue an upgrade, and a couple of problems. First, in the config.yml file, in the cache section under the storage section, the layerinfo setting has been deprecated and renamed to blobdescriptor. That isn't a blocking change yet, but it will be soon, so rename it now while you have the chance.
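If you'd rather script the rename than edit by hand, a sed one-liner does it. This is a minimal sketch that assumes the key appears as `layerinfo:` at the start of a line (after indentation) in your config.yml; the wrapper name `rename_layerinfo` is hypothetical:

```shell
#!/bin/sh
# rename_layerinfo: reads a registry config on stdin and writes it to
# stdout with the "layerinfo:" key renamed to "blobdescriptor:",
# keeping the original indentation.
rename_layerinfo() {
    sed 's/^\([[:space:]]*\)layerinfo:/\1blobdescriptor:/'
}

# Example (writes a new file rather than editing in place):
#   rename_layerinfo < config.yml > config.yml.new
```

Diff the result against the original before swapping it in, just to make sure only the one line changed.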

Finally, if you're backing your registry with S3 like a sane human being, the permissions have changed and the change is not documented anywhere. Zing! I couldn't push when I fired up my new registry container. I kept getting "Retrying in X seconds" messages when pushing individual layers. I killed and deleted the container, then started up a new one with the level setting under the log section set to debug in my config.yml file. This yielded the key to the issue (notice the "s3aws: AccessDenied" message and 403 status code):

"err.code":"unknown","err.detail":"s3aws: AccessDenied: Access Denied\n\tstatus code: 403, request id: 11E0123C033B0DB5","err.message":"unknown error"

Here's what the new S3 policy needs to look like (beware copying from the documentation linked above: There is an errant comma in the documented policy):
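
 "Statement": [
    {
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:ListBucketMultipartUploads"],
        "Resource": "arn:aws:s3:::mybucket"
    },
    {
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListMultipartUploadParts", "s3:AbortMultipartUpload"],
        "Resource": "arn:aws:s3:::mybucket/*"
    }
 ]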

Just for the record, this new policy adds support for the s3:GetBucketLocation and s3:ListBucketMultipartUploads Actions on your particular bucket, and the s3:ListMultipartUploadParts and s3:AbortMultipartUpload Actions on your bucket contents.