So you've been using Docker's Universal Control Plane (UCP) for some time, and it's going well. Of course it's going well; UCP is a great platform. It offers you a ton of features and a yuge amount of power -- and incidentally now supports Kubernetes -- but like the sudo command warns: With great power comes great responsibility. This being Planet Earth, it's just a matter of time before someone somewhere hangs the whole company with the rope they've been given. They think they're doing you a favor by cleaning out crufty old services. Never mind that system services are prefixed with ucp-. Never mind that every single node in your entire cluster runs a few containers with ucp-agent in either the container name or the image name. Which, you know, would seem to be a fairly good indicator.

Let's start from there: It's 0230 and someone has issued docker service rm ucp-agent from a manager. Oops! And they're going off-shift because now it's your problem.

So, what are the symptoms? Nothing obvious, actually. Your monitoring is green across the board. There is no outage. As it turns out, because UCP is built on top of Docker Swarm, your production payload is perfectly fine. Services will continue to be deployed, even from the UCP web UI. Your monitoring won't freak out because traffic is still flowing from ingress across the overlay to your application backend service containers.

What's the problem, then? Well, UCP can't monitor itself on any of your nodes, so if you need to reconfigure any of the nodes in your cluster, or add workers or managers, or UCP otherwise needs to maintain itself, it can't. All else being okay in the world, the first problem you'll have is your certificates expiring because the ucp-agent service is unable to spawn ucp-reconcile to ask for new certs. And then you'll have your outage.
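Since nothing is going to page you, the only way to catch this early is to ask Swarm whether the service still exists. Here's a minimal sketch of that check. The helper name has_service is made up for illustration and is not something Docker or UCP ships; it does an exact-match lookup over a list of service names, so a leftover like ucp-agent-old won't count as a hit.

```shell
#!/bin/bash

# has_service NAME: succeed if NAME appears as a whole line on stdin.
# Hypothetical helper for illustration; not part of Docker or UCP.
has_service() {
    grep -Fxq -- "$1"
}

# Intended use on a manager node (commented out so the sketch stays inert):
# if ! docker service ls --format '{{.Name}}' | has_service ucp-agent; then
#     echo "ucp-agent service is missing" >&2
# fi
```

Wire something like this into whatever already runs periodic checks on your managers, and you'll find out about the missing service well before the certificates do.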

So it's not an emergency, but yeesh, how do we replace this system service? We can manually recreate it. The service creation will look something like this:

#!/bin/bash

# probably best to "echo" this before you try to run it on your cluster.

# UCP_VERSION is something like "2.1.4"
UCP_VERSION=$( docker ps -a | grep -oe 'ucp-controller:[0-9]\+\.[0-9]\+\.[0-9]\+' | head -n1 | tr ':' ' ' | awk '{print $2}' )

# UCP_INSTANCE_ID is something like "JGNX:AKX2:5ZG3:4SNA:MS5V:LZ74:5NSL:O6TO:UFIH:A35M:G6R3:XMFV"
# (-t makes docker emit CRLF line endings, so strip the carriage returns)
UCP_INSTANCE_ID=$(  docker container run -it --rm --name ucp -v /var/run/docker.sock:/var/run/docker.sock "docker/ucp:${UCP_VERSION}" id | awk 'NR%2==0' | tr -d '\r' )

# SWARM_PORT and CONTROLLER_PORT are self-explanatory
SWARM_PORT=$( docker inspect ucp-controller --format '{{ .Args }}' | grep -m1 -oe '--swarm-url [^-]\+' | awk '{print $2}' | tr ':' ' ' | awk '{print $3}' )
CONTROLLER_PORT=$( docker inspect ucp-controller --format '{{ .Args }}' | grep -A1 -m1 -oe '--controller-port [0-9]\+' | awk '{print $2}' )

# try to grab DNS options
DNS=$( docker inspect ucp-controller --format '{{ .HostConfig.Dns }}' | sed 's/\[//; s/\]//' )
DNS_OPT=$( docker inspect ucp-controller --format '{{ .HostConfig.DnsOptions }}' | sed 's/\[//; s/\]//' )
DNS_SEARCH=$( docker inspect ucp-controller --format '{{ .HostConfig.DnsSearch }}' | sed 's/\[//; s/\]//' )

# try to grab KV data
KV_TIMEOUT=$( docker inspect ucp-reconcile --format '{{ .Args }}' | grep -oe '"Expected":{[^}]\+' | grep -oe '"KVTimeout":[0-9]\+' | tr ':' ' ' | awk '{print $2}' )
KV_SNAPSHOT_COUNT=$( docker inspect ucp-reconcile --format '{{ .Args }}' | grep -oe '"Expected":{[^}]\+' | grep -oe '"KVSnapshotCount":[0-9]\+' | tr ':' ' ' | awk '{print $2}' )

docker service create \
        --constraint "node.platform.os==linux" \
        --env "IMAGE_VERSION=${UCP_VERSION}" \
        --env "UCP_INSTANCE_ID=${UCP_INSTANCE_ID}" \
        --env "SWARM_PORT=${SWARM_PORT}" \
        --env "SWARM_STRATEGY=spread" \
        --env "CONTROLLER_PORT=${CONTROLLER_PORT}" \
        --env "DNS=${DNS}" \
        --env "DNS_OPT=${DNS_OPT}" \
        --env "DNS_SEARCH=${DNS_SEARCH}" \
        --env "KV_TIMEOUT=${KV_TIMEOUT}" \
        --env "KV_SNAPSHOT_COUNT=${KV_SNAPSHOT_COUNT}" \
        --env "EXTERNAL_SERVICE_LB=" \
        --env "DEBUG=1" \
        --label "com.docker.ucp.InstanceID=${UCP_INSTANCE_ID}" \
        --label "com.docker.ucp.version=${UCP_VERSION}" \
        --mode global \
        --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
        --mount type=bind,source=/etc/docker,destination=/etc/docker \
        --name ucp-agent \
        --restart-max-attempts 0 \
        --update-delay 2s \
        --update-failure-action pause \
        --update-max-failure-ratio 0 \
        --update-parallelism 1 \
        docker/ucp-agent:${UCP_VERSION} agent

As the comment at the top says, you should run this through echo before you blindly run it on your cluster. Different environments use different versions of grep, and while grep -e is pretty consistent, you need to make sure that your UCP_VERSION, UCP_INSTANCE_ID, SWARM_PORT, CONTROLLER_PORT, DNS, DNS_OPT, DNS_SEARCH, KV_TIMEOUT, and KV_SNAPSHOT_COUNT values look sane. Unless you've configured special DNS settings, the DNS* options should be blank. By default, KV_TIMEOUT should be 2000 and KV_SNAPSHOT_COUNT should be 20000.
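If eyeballing nine variables at 0230 sounds error-prone, a few small checks can do the squinting for you. The sketch below is only that: is_version, is_port, is_instance_id, and run are hypothetical helper names, and the instance ID pattern (twelve colon-separated groups of four characters) is an assumption inferred from the sample ID in the script comment, not a documented format. The run wrapper prints the command instead of executing it unless DRY_RUN is set to 0.

```shell
#!/bin/bash

# is_version VALUE: true for dotted versions such as "2.1.4".
is_version() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# is_port VALUE: true for integers from 1 to 65535.
is_port() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "${#1}" -le 5 ] && [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# is_instance_id VALUE: true for twelve colon-separated groups of four
# characters from A-Z and 0-9. Assumption: this mirrors the sample ID only.
is_instance_id() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Z0-9]{4}(:[A-Z0-9]{4}){11}$'
}

# run CMD ARGS...: print the command unless DRY_RUN=0, then execute it.
# Defaults to dry-run so a stray invocation cannot touch the cluster.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}
```

You'd call these right before the service creation, for example is_version "${UCP_VERSION}" || exit 1, and then put run in front of docker service create so the first pass only prints what it would do.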

This will manually recreate the ucp-agent service on your cluster. Depending on how long the service was gone, you might see ucp-reconcile containers start to kick off and your other UCP system component containers restart.
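To confirm the recreated service actually landed on every node, you can compare the running and desired task counts that docker service ls reports. This is a rough sketch, assuming the Replicas column prints as running/desired (for example 3/3); service_converged is a hypothetical helper name, not a Docker command.

```shell
#!/bin/bash

# service_converged NAME: reads "<name> <running>/<desired>" lines on stdin
# and succeeds only if NAME is listed with running == desired and desired > 0.
# Hypothetical helper for illustration.
service_converged() {
    awk -v name="$1" '
        $1 == name {
            split($2, r, "/")
            found = 1
            ok = (r[1] + 0 == r[2] + 0 && r[2] + 0 > 0)
        }
        END {
            if (found && ok) exit 0
            exit 1
        }
    '
}

# Intended use on a manager node (commented out so the sketch stays inert):
# docker service ls --format '{{.Name}} {{.Replicas}}' | service_converged ucp-agent \
#     && echo "ucp-agent is running everywhere it should"
```

Because ucp-agent is a global service, the desired count should equal the number of eligible nodes, so a mismatch here usually means at least one node hasn't picked up its task yet.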