So you've been using Docker's Universal Control Plane (UCP) for some time, and it's going well. Of course it's going well, UCP is a great platform. It offers you a ton of features and a yuge amount of power. And incidentally, UCP also now supports Kubernetes.

Before I continue, I'd like to include something I said about Kubernetes...

Now, Kubernetes is like other container orchestration platforms, except that it has many layers of abstraction built into it with the idea that these abstractions allow you more headroom to scale. The thing is that these abstractions make Kubernetes far, far more complex than the vast majority of organizations need to leverage. Bear with me for a minute... Kubernetes was built by Google to orchestrate its services. But your company is not Google and there are many more important problems on the road between your company and Google. You do what whatever you want, but I don't spend time solving problems I don't have. And we haven't even touched on the problems you're creating by introducing so many layers of abstraction into your environment. At some point, many of the folks using Kubernetes will hit its complexity like a brick wall. They may not be able to extricate themselves because of sunk cost. Fortunately for them, UCP is in a position to serve as a lifeline for these people to automate away much of the complexity around Kubernetes and bring their environment and orchestration back under their control again. And of course UCP is excellent tooling by itself for orchestrating workloads being run on top of many nodes across a Docker Swarm.

Maybe you'll get something useful out of that. Anyway, talking about things we don't control... There are still times when circumstances are completely out of your control and chaos happens. For example, someone does you a favor by going through your UCP cluster and cleaning out old crufty services on a Friday afternoon. It's going well but suddenly, with the weekend on their mind, a catastrophic brainfart happens and they hose the ucp-agent service from your cluster. Nevermind that system services are prefixed with ucp-. Nevermind that every single node in your entire cluster runs a few containers with ucp-agent either in the container name or image name. What is this I don't even...

Let's start from there: It's 1630 and someone has just issued docker service rm ucp-agent from a manager node. Whoops!

So, what are the symptoms? Nothing obvious, actually. Your monitoring is green across the board. There is no outage. As it turns out, because UCP is built on top of Docker Swarm, your production payload is perfectly fine. Services will continue to be deployed even from the UCP web UI. Your monitoring won't freak out because traffic is still flowing from ingress across the overlay to your application backend service containers.

So what's the problem? Well, UCP can't monitor itself on any of your nodes, so if you need to reconfigure any of the nodes in your cluster (or add workers or managers) or UCP otherwise needs to maintain itself, it can't do that. All else being okay in the world, the first problem you'll have is your certificates expiring because the ucp-agent service is unable to spawn the ucp-reconcile process to ask for new certs. And then you'll have your outage.

So it's not an emergency, but yeesh, how do we replace this system service? We can manually recreate it. The service creation will look something like this:

#!/bin/bash

# probably best to "echo" this before you try to run it on your cluster.

# UCP_VERSION is something like "2.1.4"
UCP_VERSION=$( docker ps -a | grep -oe 'ucp-controller:[0-9]\.[0-9]\.[0-9]' | tr ':' ' ' | awk '{print $2}' )

# UCP_INSTANCE_ID is something like "JGNX:AKX2:5ZG3:4SNA:MS5V:LZ74:5NSL:O6TO:UFIH:A35M:G6R3:XMFV"
UCP_INSTANCE_ID=$(  docker container run -it --rm --name ucp -v /var/run/docker.sock:/var/run/docker.sock docker/ucp:2.1.4 id | awk 'NR%2==0' )

# SWARM_PORT and CONTROLLER_PORT are self-explanatory
SWARM_PORT=$( docker inspect ucp-controller --format '{{ .Args }}' | grep -m1 -oe '--swarm-url [^-]\+' | awk '{print $2}' | tr ':' ' ' | awk '{print $3}' )
CONTROLLER_PORT=$( docker inspect ucp-controller --format '{{ .Args }}' | grep -A1 -m1 -oe '--controller-port [0-9]\+' | awk '{print $2}' )

# try to grab DNS options
DNS=$( docker inspect ucp-controller --format '{{ .HostConfig.Dns }}' | sed 's/\[//; s/\]//' )
DNS_OPT=$( docker inspect ucp-controller --format '{{ .HostConfig.DnsOptions }}' | sed 's/\[//; s/\]//' )
DNS_SEARCH=$( docker inspect ucp-controller --format '{{ .HostConfig.DnsSearch }}' | sed 's/\[//; s/\]//' )

# try to grab KV data
KV_TIMEOUT=$( docker inspect ucp-reconcile --format '{{ .Args }}' | grep -oe '"Expected":{[^}]\+' | grep -oe '"KVTimeout":[0-9]\+' | tr ':' ' ' | awk '{print $2}' )
KV_SNAPSHOT_COUNT=$( docker inspect ucp-reconcile --format '{{ .Args }}' | grep -oe '"Expected":{[^}]\+' | grep -oe '"KVSnapshotCount":[0-9]\+' | tr ':' ' ' | awk '{print $2}' )

docker service create \
        --constraint "node.platform.os==linux" \
        --env "IMAGE_VERSION=${UCP_VERSION}" \
        --env "UCP_INSTANCE_ID=${UCP_INSTANCE_ID}" \
        --env "SWARM_PORT=${SWARM_PORT}" \
        --env "SWARM_STRATEGY=spread" \
        --env "CONTROLLER_PORT=${CONTROLLER_PORT}" \
        --env "DNS=${DNS}" \
        --env "DNS_OPT=${DNS_OPT}" \
        --env "DNS_SEARCH=${DNS_SEARCH}" \
        --env "KV_TIMEOUT=${KV_TIMEOUT}" \
        --env "KV_SNAPSHOT_COUNT=${KV_SNAPSHOT_COUNT}" \
        --env "EXTERNAL_SERVICE_LB=" \
        --env "DEBUG=1" \
        --label "com.docker.ucp.InstanceID=${UCP_INSTANCE_ID}" \
        --label "com.docker.ucp.version=${UCP_VERSION}" \
        --mode global \
        --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
        --mount type=bind,source=/etc/docker,destination=/etc/docker \
        --name ucp-agent \
        --restart-max-attempts 0 \
        --update-delay 2s \
        --update-failure-action pause \
        --update-max-failure-ratio 0 \
        --update-parallelism 1 \
        docker/ucp-agent:${UCP_VERSION} agent

Like it says above, you should run this through echo before you blindly issue this on your cluster. Different environments use different versions of grep, and while grep -e is pretty consistent, you need to make sure that your UCP_VERSION, UCP_INSTANCE_ID, SWARM_PORT, CONTROLLER_PORT, DNS, DNS_OPT, DNS_SEARCH, KV_TIMEOUT, and KV_SNAPSHOT_COUNT values look sane. Unless you've configured special DNS settings, the DNS* options should be blank. The KV* options should be 2000 and 20000 by default respectively.

This will manually recreate the ucp-agent service on your cluster. Depending on how long this service was gone, you might see ucp-reconcile containers start to kick off and your other UCP system component containers restarted.