Update and rollback without downtime in swarm mode

We want to avoid stopping our services when we update a Docker image in a cluster. If the new image fails to start, we should roll back the change quickly. Docker swarm offers both capabilities. What we should remember is that an image has many layers of dependencies.

Our application has its dependencies
The dependencies have their dependencies
SDK that compiles the application
Application platform where the application runs
Operating system

Needless to say, it’s better to keep our dependencies and operating system up to date for security reasons, but updating the service every month costs more effort than updating it twice a year. Still, frequent updates keep the service much healthier. Let’s see how Docker updates services.

You can find the complete source code here

This post is part of my Docker learning series. If you want to learn Docker in depth, I highly recommend Learn Docker in a Month of Lunches.

Create yml file for swarm from docker-compose file

In this post I give the docker stack deploy command a single merged compose file for each version. (Recent Docker versions also accept several -c options, but I use a merged file throughout.) I keep the source compose files separate because I don’t want to repeat the same configuration in different files. The compose files look like the following.

# docker-compose.yml
version: "3.7"

x-labels: &app-net
    networks:
        - app-net

services: 
    test-app:
        image: test-app:v1
        <<: *app-net

    poke-app:
        image: poke-app:v1
        environment: 
            - TEST_APP_URL=http://test-app
        <<: *app-net

# docker-compose-pro.yml
version: "3.7"

services: 
    test-app:
        ports: 
            - target: 80
              mode: host
        deploy:
            mode: global

    poke-app:
        ports:
            - "8888:80"
        deploy:
            replicas: 6

networks:
    app-net:
        name: update-rollback-network

There are two options that I haven’t used in previous Docker-related blog posts.

mode: host in the ports section means that the port is bound directly on the host machine, so the ingress network is not used for it.
mode: global in the deploy section means that exactly one container runs on every node.

Using ingress adds work because the ingress network has to route each request to one of the containers, which may be on another node. If one container per node is enough, these settings may help to improve performance. I didn’t specify the published port because a fixed host port causes a port conflict when the new container has to start while the old one is still running, which is what an update without downtime requires.

Let’s create a merged compose file from the two files with the docker-compose config command. I wrote a bat file that creates all the stack files.

cd update-rollback
# create multiple stack yaml files.
# one of them is following command
# docker-compose -f docker-compose.yml -f docker-compose-pro.yml config > stack-v1.yml
./create-stack.bat
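
For readers who are not on Windows, the bat file could be approximated by a small shell function. This is only a sketch: the function name create_stacks is hypothetical, and it assumes that every stack file is produced by merging the base file, docker-compose-pro.yml, and one version-specific override file whose name matches the stack file name. Only the stack-v1.yml command is confirmed by the comment above.

```shell
# Sketch of a POSIX shell counterpart to create-stack.bat.
# Assumption: each stack-<version>.yml is the merge of the base file,
# the production file, and docker-compose-<version>.yml.
create_stacks() {
    # Version 1 uses only the base and production files.
    docker-compose -f docker-compose.yml -f docker-compose-pro.yml \
        config > stack-v1.yml || return 1

    # Later versions add one override file each.
    for v in v2-1 v2-2 v3-bad v3-good; do
        docker-compose -f docker-compose.yml -f docker-compose-pro.yml \
            -f "docker-compose-$v.yml" config > "stack-$v.yml" || return 1
    done
}
```

Calling create_stacks inside the update-rollback directory would write one stack file per version next to the compose files.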

Remark
If one of the docker compose files contains a depends_on section, its contents will be empty in the output of docker-compose 1.27.4. That makes the output file invalid for the docker stack deploy command. This problem is tracked in a related GitHub issue.

Update services with default setting

Let’s start the service.

$ docker stack deploy -c stack-v1.yml update-rollback
Creating network update-rollback-network
Creating service update-rollback_poke-app
Creating service update-rollback_test-app

$ docker service ls
ID                  NAME                       MODE                REPLICAS            IMAGE               PORTS
4j9uwgieebye        update-rollback_poke-app   replicated          6/6                 poke-app:v1         *:8888->80/tcp
a7absegp0osq        update-rollback_test-app   global              1/1                 test-app:v1

$ docker stack ps update-rollback
ID                  NAME                                                 IMAGE               NODE                DESIRED STATE       CURRENT STATE            ERROR                       PORTS
zq4vznowm7wz        update-rollback_test-app.zwfh3t5x51nmlu0vgnyzn2j9q   test-app:v1         docker-desktop      Running             Running 33 seconds ago                               *:32774->80/tcp
w6iiaucooh7h        update-rollback_poke-app.1                           poke-app:v1         docker-desktop      Running             Running 24 seconds ago                 
lzeyrrosukih         \_ update-rollback_poke-app.1                       poke-app:v1         docker-desktop      Shutdown            Failed 31 seconds ago    "task: non-zero exit (6)"
34pb08jluc5o        update-rollback_poke-app.2                           poke-app:v1         docker-desktop      Running             Running 27 seconds ago                 
svosyp7zvi8e         \_ update-rollback_poke-app.2                       poke-app:v1         docker-desktop      Shutdown            Failed 33 seconds ago    "task: non-zero exit (6)"

I deleted lines to keep the output small. The 6 replicas of poke-app stayed up only after test-app had started, because poke-app runs a dependency check command on startup. That is why the output shows earlier poke-app tasks in the Shutdown state. The necessary services are running correctly now. Let’s update the service with the default settings. The compose file that changes the version is simple.

# docker-compose-v2-1.yml
version: "3.7"

services: 
    test-app:
        image: test-app:v2

$ docker stack deploy -c stack-v2-1.yml update-rollback
Updating service update-rollback_poke-app (id: 4j9uwgieebyej4cd9gdabcshp)
image poke-app:v1 could not be accessed on a registry to record
its digest. Each node will access poke-app:v1 independently,
possibly leading to different nodes running different
versions of the image.

Updating service update-rollback_test-app (id: a7absegp0osqbrhytb3npvtww)
image test-app:v2 could not be accessed on a registry to record
its digest. Each node will access test-app:v2 independently,
possibly leading to different nodes running different
versions of the image.

$ docker stack ps update-rollback
ID                  NAME                                                     IMAGE               NODE                DESIRED STATE       CURRENT STATE             ERROR                       PORTS
9h6madlgvy7u        update-rollback_test-app.zwfh3t5x51nmlu0vgnyzn2j9q       test-app:v2         docker-desktop      Running             Starting 1 second ago              
zq4vznowm7wz         \_ update-rollback_test-app.zwfh3t5x51nmlu0vgnyzn2j9q   test-app:v1         docker-desktop      Shutdown            Shutdown 1 second ago              
m5bf7wix4b08        update-rollback_poke-app.1                               poke-app:v1         docker-desktop      Running             Running 1 second ago               
w6iiaucooh7h         \_ update-rollback_poke-app.1                           poke-app:v1         docker-desktop      Shutdown            Failed 9 seconds ago      "task: non-zero exit (1)"
lzeyrrosukih         \_ update-rollback_poke-app.1                           poke-app:v1         docker-desktop      Shutdown            Failed 2 minutes ago      "task: non-zero exit (6)"

$  docker service ls
ID                  NAME                       MODE                REPLICAS            IMAGE               PORTS
4j9uwgieebye        update-rollback_poke-app   replicated          6/6                 poke-app:v1         *:8888->80/tcp
a7absegp0osq        update-rollback_test-app   global              1/1                 test-app:v2

test-app was updated, but the service was down during the update: poke-app tried to send a request to test-app while test-app was not ready, and a poke-app task failed with exit code 1. By default, Docker swarm shuts down the old container first and then starts the new one. It’s working now, but we wanted to avoid the downtime. To do so, we should start the new container first and then shut down the old one.
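
One way to observe such a gap from the outside is to poll the published endpoint while the update runs and count the failed requests. The function below is a minimal sketch, not part of the original project. The name probe and the PROBE_INTERVAL variable are hypothetical, and it assumes that poke-app answers plain HTTP requests on the published port 8888 and that curl is installed.

```shell
# Sketch: send a fixed number of requests and print how many failed.
# Usage: probe URL ATTEMPTS
probe() {
    url="$1"
    attempts="$2"
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl return non-zero on HTTP errors (4xx/5xx).
        if ! curl --silent --fail --max-time 2 --output /dev/null "$url"; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
        # Pause between requests; defaults to 1 second.
        sleep "${PROBE_INTERVAL:-1}"
    done
    echo "$failures"
}
```

Running something like probe http://localhost:8888/ 60 in a second terminal during docker stack deploy should print 0 if no request was lost. The URL path is an assumption.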

Start new container first

I configured it in docker-compose-v2-2.yml.

version: "3.7"

services:
    test-app:
        image: test-app:v2
        deploy:
            update_config:
                order: start-first

Let’s update it from version 1 again.

# remove the current services
$ docker stack rm update-rollback
$ docker stack deploy -c stack-v1.yml update-rollback
# wait until 6 replicas start
$ docker service ls
ID                  NAME                       MODE                REPLICAS            IMAGE               PORTS
j4wf6v1cgbna        update-rollback_poke-app   replicated          6/6                 poke-app:v1         *:8888->80/tcp
o3q6ki34flqk        update-rollback_test-app   global              1/1                 test-app:v1

$ docker stack deploy -c stack-v2-2.yml update-rollback
$ docker stack ps update-rollback
ID                  NAME                                                     IMAGE               NODE                DESIRED STATE       CURRENT STATE            ERROR                       PORTS
lv2itvmlza4t        update-rollback_test-app.zwfh3t5x51nmlu0vgnyzn2j9q       test-app:v2         docker-desktop      Running             Running 1 second ago                                 *:32777->80/tcp
cx155jq6eo0q         \_ update-rollback_test-app.zwfh3t5x51nmlu0vgnyzn2j9q   test-app:v1         docker-desktop      Shutdown            Running 1 second ago               
yzfgenbepg5v        update-rollback_poke-app.1                               poke-app:v1         docker-desktop      Running             Running 43 seconds ago             
qhwb1m35ofwn         \_ update-rollback_poke-app.1                           poke-app:v1         docker-desktop      Shutdown            Failed 49 seconds ago    "task: non-zero exit (6)"
shebw2kig9rk        update-rollback_poke-app.2                               poke-app:v1         docker-desktop      Running             Running 39 seconds ago             
bfzpl2r8f2do         \_ update-rollback_poke-app.2                           poke-app:v1         docker-desktop      Shutdown            Failed 47 seconds ago    "task: non-zero exit (6)"
rz0d2xjy4wyl        update-rollback_poke-app.3                               poke-app:v1         docker-desktop      Running             Running 39 seconds ago

There is no additional Shutdown entry with exit code 1 this time because the new container started before the old container was shut down. Good!
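
The same ordering can also be requested from the command line with the --update-order flag of docker service update. The wrapper below is only a sketch and its name update_start_first is hypothetical. Note that a later docker stack deploy will presumably apply whatever the stack file says again, so the compose file remains the better place for a permanent setting.

```shell
# Sketch: update a service image and start the new task before
# stopping the old one.
# Usage: update_start_first SERVICE IMAGE
update_start_first() {
    service="$1"   # e.g. update-rollback_test-app
    image="$2"     # e.g. test-app:v2
    docker service update \
        --update-order start-first \
        --image "$image" \
        "$service"
}
```

For example, update_start_first update-rollback_test-app test-app:v2 would correspond to the deployment above.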

Update multiple containers step by step

Next, we will update poke-app. If we update all replicas at once, downtime may occur because the new container may not work as expected. It’s better to update the replicas step by step, and if a container fails to update we should roll back. I configured this in docker-compose-v3-bad.yml and set HEALTH=BAD to make the update fail.

version: "3.7"

services:
    test-app:
        image: test-app:v2

    poke-app:
        image: poke-app:v2
        environment:
            - HEALTH=BAD
        deploy:
            update_config:
                parallelism: 2
                monitor: 60s
                failure_action: rollback
                order: start-first

parallelism: the number of replicas that are updated at once.
monitor: how long each updated container is monitored for failure after it starts. It should be longer than the total time the health check needs to report a failure. If the container becomes unhealthy within this time, the action defined in failure_action is triggered.
failure_action: the action that is triggered if the container becomes unhealthy within the monitor time. The default is pause. continue is another option, but I think it’s risky.
order: whether to stop the old container first or start the new container first. The default is stop-first.
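
To get a feeling for parallelism, it helps to count the batches an update needs: the number of replicas divided by parallelism, rounded up. The helper below is a small illustrative sketch and update_batches is a hypothetical name. It treats a parallelism of 0 as "update everything at once", which is how the Docker CLI documents that value.

```shell
# Sketch: number of update batches for a replicated service.
# Usage: update_batches REPLICAS PARALLELISM
update_batches() {
    replicas="$1"
    parallelism="$2"
    # 0 means all tasks are updated simultaneously, i.e. one batch.
    if [ "$parallelism" -eq 0 ]; then
        echo 1
        return 0
    fi
    # Integer ceiling of replicas / parallelism.
    echo $(( (replicas + parallelism - 1) / parallelism ))
}
```

With the 6 replicas of poke-app and parallelism: 2, update_batches 6 2 prints 3, so the update proceeds in three steps.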

Let’s update with the configuration.

$ docker stack deploy -c stack-v3-bad.yml update-rollback
$ docker stack ps update-rollback
ID                  NAME                                                     IMAGE               NODE                DESIRED STATE       CURRENT STATE             ERROR                          PORTS
lv2itvmlza4t        update-rollback_test-app.zwfh3t5x51nmlu0vgnyzn2j9q       test-app:v2         docker-desktop      Running             Running 8 minutes ago                                    *:32777->80/tcp
cx155jq6eo0q         \_ update-rollback_test-app.zwfh3t5x51nmlu0vgnyzn2j9q   test-app:v1         docker-desktop      Shutdown            Shutdown 7 minutes ago             
re3j916rcz2b        update-rollback_poke-app.1                               poke-app:v2         docker-desktop      Shutdown            Shutdown 12 seconds ago            
yzfgenbepg5v         \_ update-rollback_poke-app.1                           poke-app:v1         docker-desktop      Running             Running 8 minutes ago              
qhwb1m35ofwn         \_ update-rollback_poke-app.1                           poke-app:v1         docker-desktop      Shutdown            Failed 8 minutes ago      "task: non-zero exit (6)"

The update failed. The poke-app version 2 task is shut down and version 1 is still running. The rollback state looks like this.

$ docker service inspect --pretty update-rollback_poke-app

ID:             nnd66g9ytq4eozu9zq7mi5tg0
Name:           update-rollback_poke-app
Labels:
 com.docker.stack.image=poke-app:v1
 com.docker.stack.namespace=update-rollback
Service Mode:   Replicated
 Replicas:      6
UpdateStatus:
 State:         rollback_completed
 Started:       2 minutes ago
 Message:       rollback completed

I omitted the output below Message. The state confirms that the rollback completed correctly. Let’s update again with a correct configuration, HEALTH=GOOD, defined in docker-compose-v3-good.yml.

$ docker stack deploy -c stack-v3-good.yml update-rollback
$ docker service ls
ID                  NAME                       MODE                REPLICAS            IMAGE               PORTS
nnd66g9ytq4e        update-rollback_poke-app   replicated          8/6                 poke-app:v2         *:8888->80/tcp
xoieulklrhig        update-rollback_test-app   global              1/1                 test-app:v2

$ docker service ls
ID                  NAME                       MODE                REPLICAS            IMAGE               PORTS
nnd66g9ytq4e        update-rollback_poke-app   replicated          6/6                 poke-app:v2         *:8888->80/tcp
xoieulklrhig        update-rollback_test-app   global              1/1                 test-app:v2

$ docker stack ps update-rollback | grep Running
lv2itvmlza4t        update-rollback_test-app.zwfh3t5x51nmlu0vgnyzn2j9q       test-app:v2         docker-desktop      Running             Running 2 hours ago                                          *:32777->80/tcp
8wkk4cdptztt        update-rollback_poke-app.1                               poke-app:v2         docker-desktop      Running             Running 1 second ago               
z0jvq5v8a9yj         \_ update-rollback_poke-app.1                           poke-app:v1         docker-desktop      Shutdown            Running 1 second ago               
0fso1613w5or        update-rollback_poke-app.2                               poke-app:v2         docker-desktop      Running             Running 14 seconds ago             
694w4odop160         \_ update-rollback_poke-app.2                           poke-app:v1         docker-desktop      Shutdown            Running 14 seconds ago             
ej0eyzfd4xkr        update-rollback_poke-app.3                               poke-app:v2         docker-desktop      Running             Running 1 second ago               
z7lt5ozffjb3         \_ update-rollback_poke-app.3                           poke-app:v1         docker-desktop      Shutdown            Running 1 second ago               
qf6rnqjk12a1        update-rollback_poke-app.4                               poke-app:v2         docker-desktop      Running             Running 25 seconds ago             
vtkfwqhi11zr        update-rollback_poke-app.5                               poke-app:v2         docker-desktop      Running             Running 25 seconds ago             
qx3hi66v64y5        update-rollback_poke-app.6                               poke-app:v2         docker-desktop      Running             Running 14 seconds ago             
jh3v2f6v92f3         \_ update-rollback_poke-app.6                           poke-app:v1         docker-desktop      Shutdown            Running 14 seconds ago

The number of replicas is 8 at first because of the start-first policy: with parallelism 2, two new containers run alongside the six old ones. After a while the number of replicas is back to 6 and the application version is v2, so the update succeeded this time. The containers were updated at slightly different times because parallelism is 2. If we want a longer pause between batches, we can set the delay option in the update_config section.
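
Instead of running docker service ls repeatedly until the counts settle, a script can poll the UpdateStatus field that docker service inspect reported earlier. The function below is a sketch under a few assumptions: the name wait_for_update and the POLL_INTERVAL variable are hypothetical, and the state names follow the values Docker reports (completed, paused, rollback_paused, rollback_completed). A state left over from a previous update may be reported before the new update has started, so treat the result with care.

```shell
# Sketch: poll a service until its update finishes, rolls back, or pauses.
# Usage: wait_for_update SERVICE [MAX_POLLS]
# Prints the final state. Returns 0 on completed, 1 on a paused or
# rolled-back update, 2 when MAX_POLLS is exhausted.
wait_for_update() {
    service="$1"
    max_polls="${2:-60}"
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        # UpdateStatus is absent for a service that was never updated,
        # hence the guard in the template.
        state="$(docker service inspect \
            --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}' \
            "$service")"
        case "$state" in
            completed)
                echo "$state"; return 0 ;;
            paused|rollback_paused|rollback_completed)
                echo "$state"; return 1 ;;
        esac
        polls=$((polls + 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    echo "timeout"
    return 2
}
```

After docker stack deploy, calling wait_for_update update-rollback_poke-app would block until the rolling update has either finished or been rolled back.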

Rollback the service by hand

A rollback can also be done by hand in case the service doesn’t work as expected even though the health check reports it as healthy.

$ docker service update --rollback update-rollback_poke-app
update-rollback_poke-app
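
Docker also provides a dedicated docker service rollback command that does the same thing. As a sketch, the hypothetical helper below rolls a service back and then prints the image recorded in the service spec so we can check which version is active again.

```shell
# Sketch: roll a service back to its previous spec and print the image
# it is configured with afterwards.
# Usage: rollback_service SERVICE
rollback_service() {
    service="$1"
    # Discard the progress output and stop if the rollback is rejected.
    docker service rollback "$service" > /dev/null || return 1
    docker service inspect \
        --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' \
        "$service"
}
```

In this walkthrough, rollback_service update-rollback_poke-app would be expected to print poke-app:v1 after rolling back from v2.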

Conclusion

This update mechanism makes our work easier, especially when we have multiple servers, because Docker takes care of many things. The key point is to define the update config so that downtime is avoided. If a server has enough resources to start the new container alongside the old one, the start-first policy is a good choice; if not, it may not be. We should choose the best approach for our situation.