Using Docker Swarm for Production

We recently decided to run Docker Swarm in production as a very straightforward orchestrator with consistent tooling and an easy learning curve. However, I have quickly come to realise that to get best-practice deployments in place, there are certain things that cannot be done out of the box and which need additional tooling.

One really important part of production devops is being able to handle canary deployments. These simply mean deploying multiple (usually 2) versions of an application side-by-side and sending some proportion of traffic to the new version until you are comfortable that everything is working, at which point you increase the mix until 100% goes to the new version and you can eventually delete the old deployment.

Docker Swarm doesn't handle this. You can only do rolling deployments, although there is a basic workaround: a service update can wait X seconds between updating each service task, so you could, for example, do an update with 3600 seconds between tasks to get a similar effect.
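
As a rough sketch, that delay can be set either on the CLI or in the stack file's update_config; the image tag and service name here are just placeholders:

# CLI: wait an hour between each task being replaced (placeholder image/service names)
docker service update --update-delay 3600s --image registry.example.io/service1:1.0.10 service1

# Stack file equivalent under the service's deploy: section
deploy:
  update_config:
    parallelism: 1
    delay: 3600s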

There are other issues though. Service names need to be unique, so if I want to deploy a new version of a service, I can either use the workaround above and not really have control over the lifetime of the versions, or I need to use a unique name per version like service1_1_0_0, but this then causes proxy issues because the DNS name keeps changing. There is also the issue of secrets which, independent of the app version, need to be attached correctly to the application when deployed. Since we can't delete secrets that are in use, we can either 1) accept downtime while we delete and re-add them or 2) use unique names for secrets, which has similar problems to the service naming.
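
For illustration only, the versioned-secret approach might look something like this in a stack file, with hypothetical names; the service keeps referring to a stable key while the underlying secret is rotated by creating a new, versioned one:

secrets:
  service1_api_key:
    external: true
    name: service1_api_key_v2   # rotate by pointing at a newly created, versioned secret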

There is also the large issue of how we tidy up old images and containers, which can quickly add up to many GB of storage space.
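
One blunt but common approach, which I would run with care on production nodes, is Docker's built-in prune command:

# Remove stopped containers, unused networks and all unused images on this node
docker system prune --all --force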

The problem with lots of this complexity is that even if we don't need all of this functionality from day 1, we need to make sure that how we will eventually use it is practical and actually works, otherwise we might have major headaches after a few weeks of production use, realising we need to change everything!

So for maximum flexibility, I am trying to deploy services with unique names, attached to secrets with unique names as well. This led to the question of how DNS will work. It matters both for an edge proxy letting traffic into the cluster and for service-to-service communication, since a big part of microservices is service discovery. This is easy if my service is always called service1, but if it might be service1_1_0_0 or service1_1_0_1 then the stable DNS entry (and the virtual IP behind it) needs to be kept pointing at the current version.

(Makes you wonder why I don't just use Kubernetes then! I am tempted but there are other dragons there too)

Introducing Envoy Proxy

So the answer has to be a data plane. The data plane manages how data moves around the network, and I found a very impressive, fast, open-source solution in Envoy, which was developed at Lyft. I think it is always best to have a product that 1) was developed by a company for whom it has to work perfectly because of their traffic levels and 2) is supported by a commercial company that won't just turn around and decide they don't want to maintain it any more (at least not yet!), which sometimes happens with libraries maintained by 1 or 2 people.

Envoy is written in C++ and runs on a number of platforms including Docker, so far, so good.

If you have used nginx or HAProxy, Envoy provides similar functionality but a whole lot more. Not only does it support edge proxy mode, you can also install it as a sidecar in a container so that outgoing requests are routed using the same logic that routes external traffic. If, for example, I have the DNS name service1 pointing to service1_1_0_0, the entire cluster gets to make use of that, even if it resolves to a different IP address internally than it does externally.

Envoy also provides rate limiting, loads of HTTP functionality, circuit breakers, connection limits etc. etc.

One of the most obvious problems is that everything is driven from configuration and there is currently no way to split this config into separate files like you can in nginx, although you can dynamically look up certain sections of it, which means in a real system your static config might be manageable. It does, however, mean that the YAML files become very deeply indented because of the massively hierarchical configuration layout.

Definitely an example of making small steps to test it.

Basic Configuration

So what are the basics? I started with an edge proxy: accept a request for a named domain and pass it to the relevant service. I am mixing environments, so, for example, webdev-service1 will point to the development version and webstaging-service1 will point to the staging version. Production services use 3 tasks minimum whereas dev and staging only use 1, but Docker doesn't care and will work the same way. Initially, I have ignored HTTPS since this should be terminated by an upstream proxy in production.

The basics of the configuration are listeners, filters, routes and clusters. A listener is bound to an IP address and port. In most cases you are likely to use a single listener but, at the extremes, you might run out of connections on a single IP or port, so you can choose to have multiple listeners. Also, the filter chains are different for each listener, so you might want an additional listener for e.g. APIs that use a CORS filter at proxy level (easier than configuring it in hundreds of applications!)

(Basic example of static config here: https://www.envoyproxy.io/docs/envoy/latest/configuration/overview/examples)
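
To make the shape of the config concrete, here is a minimal static skeleton along the lines of the examples in the Envoy docs; the listener and route names are placeholders and I have trimmed it to the parts discussed below:

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts: []   # virtual hosts go here (see the examples below)
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters: []                  # clusters go here (see the example below)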

HTTP is handled by a filter rather than in the config itself, because you can proxy more than just HTTP with Envoy. It supports things like MySQL and MongoDB out of the box and the semantics for each are different.

The http_connection_manager filter, therefore, is where you configure things related to HTTP, HTTPS and routing. Some of these I have set from the "best practices" described on the Envoy documentation site, but the main part of this is the route_config.

The route config can contain multiple virtual hosts and each virtual host, as in nginx and HAProxy, can be matched to a specific domain and/or path selector. You can share TLS settings, so if you are terminating TLS on your proxy, you might have service1.domain1.com and service2.domain1.com sharing a listener on separate virtual hosts linked to a *.domain1.com cert.

If you are not listening on a standard port, you need to ensure that the domains list includes the port e.g.:

domains: ["proxy_test:10000"]

Also note that some forms of wildcard are supported in the domain matching. You can also provide a matching pattern for the path of the request, so that you might match the path /results separately from the path /query and direct them to separate services.

You then specify a route that points to a cluster (a load-balanced service), or you can do things like redirect or return a static response.
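
For example, a sketch of path-based routing within a virtual host might look like this (the cluster names are hypothetical):

routes:
- match: { prefix: "/results" }
  route: { cluster: results_service }   # hypothetical cluster
- match: { prefix: "/query" }
  route: { cluster: query_service }     # hypothetical cluster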

Static example:

- name: proxy_test
  domains: ["proxy_test:10000"]
  routes:
  - match: { prefix: "/" }
    direct_response: { status: 200, body: { inline_string: "Proxy OK" } }

The static example is really useful because if you are like me and still struggle sometimes to get your head around Docker networking, you want to test each step so that the number of variables is reduced for your problem solving.

Cluster example:

- name: webdev-files
  domains: ["webdev-files.example.io:10000"]
  routes:
  - match: { prefix: "/" }
    route: { cluster: webdev_files }

The last thing to set up is the clusters.

A cluster can be more complicated than what I have set up, since it supports regions and localities, something that is likely to be very important for people like Lyft who have a global presence and might prefer clusters located near to the proxy being accessed. In my case, I have 3 clusters, all for the same service, in dev, staging and production. These are largely identical but point to different Docker DNS names and ports.

- name: webdev_files
  connect_timeout: 15s
  per_connection_buffer_limit_bytes: 32768 # 32 KiB
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: webdev_files
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: tasks.files-webdev-1_0_9.
              port_value: 80

The first important note is that these are currently hard-coded for my testing. Clearly, I cannot update my service to version 1.0.10 without breaking this config. Also, the port number is randomly chosen by Docker, which is important because I might have multiple versions of a service and cannot use fixed ports for that reason!

Some of the limits are what I have taken from best practice on the Envoy docs; you will need to tweak these to suit your own use cases.

Addressing and DNS

The type is important. You can specify IP addresses statically, but strict DNS is most useful for relatively small clusters like mine. In this mode, Envoy will use asynchronous DNS lookups for the address and use the result as the definitive list of valid servers. Logical DNS is for very large clusters where you won't necessarily get all of the IPs returned from DNS each time, so once it gets a value it keeps it for existing connections and only cares about the DNS response when creating new connections.

The lb policy also allows a number of different patterns for balancing. The default and most obvious is round robin, which literally gives each request to the next host in the cluster. There are others that might work better for some workloads. For example, "weighted least request" allows you to choose the backend with the fewest current requests, optionally weighted, for example to allow some beefy servers (or newer/older versions) to take more of the workload than others.
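
As a sketch, switching the cluster above to least-request balancing is a small change (the choice_count tuning is optional):

lb_policy: LEAST_REQUEST
least_request_lb_config:
  choice_count: 2   # sample two hosts and pick the one with fewer active requests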

The hardest part of getting this working for me was understanding how to specify the service address and port. When using Docker swarm mode, all services get added to an ingress network by default and are given a DNS name that matches the service name and, importantly, a virtual IP (VIP) that is returned when you query the service name. The VIP is load balanced by Docker across all of the service's tasks. It is also important to know that any service with a published port gets that port exposed on the node, and it is this that maps the published port to the internal container port. In other words, if you access the container directly, you need to use the container port and NOT the published port.
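
To make the two ports concrete, here is a hypothetical service created with the long publish syntax; 8080 is exposed on every node via the ingress network, while the container itself still listens on 80:

docker service create --name files-webdev \
  --publish published=8080,target=80 \
  registry.example.io/files:1.0.9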

Unfortunately, when you get this wrong, you are likely to get a number of weird networking errors from Docker like "connection refused", which should really say, "there is nothing running on this port".

Another important piece of information is that the special DNS name tasks.servicename returns the individual IPs of the containers rather than the single VIP. Why is this important? I want to use Envoy to load balance; there is no value in passing a call from Envoy into the Docker load balancer to do the same thing again. In other words, I want to point Envoy at each of the running containers and not at the ingress network of the nodes. Using the tasks.servicename pattern and the container port, I can access the containers directly.
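
A quick way to see the difference is to resolve both names from a container attached to the same overlay network, assuming the image has nslookup available:

# Single load-balanced VIP
nslookup files-webdev-1_0_9
# One IP per running task
nslookup tasks.files-webdev-1_0_9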

What you will need to do is ensure you have networks created and assigned to both Envoy and the containers it is balancing, so that it can route to them. The ingress network is designed for incoming traffic, not service-to-service traffic. You can make these overlay networks internal as well, which means they cannot be routed to externally, a nice security touch.
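
As a sketch, the per-environment networks might be created like this (the names are my own placeholders); they then need to be listed on both the Envoy stack and the services it proxies:

# Overlay networks that only carry traffic between containers, not to or from the outside
docker network create --driver overlay --internal webdev-net
docker network create --driver overlay --internal webstaging-net
docker network create --driver overlay --internal webprod-net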

Deploying it to a cluster

So imagine you have a config set up; how do we actually use it? There are two main primitives in Docker Swarm: stacks and services. A service is really a single application/microservice that is replicated and can be added to networks etc. If you need to do much more than this, you will more likely need to use a stack, which is basically a docker-compose file that gets deployed with docker stack deploy.

We also need to work out how to apply our config to Envoy. This second bit is easy. The Dockerfile contents:

FROM envoyproxy/envoy-alpine:v1.18-latest AS base

COPY envoy.yaml /etc/envoy/envoy.yaml
RUN chmod go+r /etc/envoy/envoy.yaml

The base image comes in a number of flavours, but Alpine is nice and small, so if you are not adding a lot of your own stuff that might not work on Alpine, it is a good choice. The second and third lines copy our local YAML file into the Docker image and then change the permissions to world-readable, since Envoy have correctly set up permissions in their image to run envoy as the envoy user. If you prefer, you can set the owner etc. of the file so it is readable.

In terms of the compose file, we are not doing too much. I am pointing to the image that we have built and pushed to our own Docker repository. We are exposing the default ports (9901 and 10000) to the same ports on the host. I am then attaching it to all 3 environments' internal networks so it can proxy for all environments.

A key property is setting mode to global so it installs one per node, and also setting the placement constraints to the following (since my managers don't run any containers):

placement:
  constraints:
  - node.platform.os == linux
  - node.role == worker

We then specify that the networks are external so that Docker does not create them for our stack.
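
Putting that together, a sketch of the stack file might look roughly like this; the image and network names are placeholders for my own setup:

version: "3.8"
services:
  envoy:
    image: registry.example.io/envoy-edge:latest   # our image built from the Dockerfile above
    ports:
      - "9901:9901"     # admin interface
      - "10000:10000"   # listener
    networks:
      - webdev-net
      - webstaging-net
      - webprod-net
    deploy:
      mode: global
      placement:
        constraints:
          - node.platform.os == linux
          - node.role == worker

networks:
  webdev-net:
    external: true
  webstaging-net:
    external: true
  webprod-net:
    external: true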

Once this is done, we can easily deploy this on our swarm with

docker stack deploy -c ./envoy-compose.yml --with-registry-auth envoy

In our case we need --with-registry-auth to pass the registry credentials to the nodes so they can pull the private images. With stacks, you can update the compose file and run deploy again and it will update the stack. You can also docker stack rm stackname and then deploy it again if you need to.

Debugging

If you took my top tip earlier and added a static proxy test endpoint, you can test this to make sure the service is running and at least listening. Remember that you might need a hosts entry so you can use the domain name you chose, and you need to use the port you assigned to the listener in the Envoy config.

If you need to debug what is happening, the easiest way is to enable trace-level logging for Envoy, which is written to stdout by default. You can then use grep and the like to go through the logs.

To enable tracing, you can modify your docker-compose file and add a command: entry, which will overwrite the command that runs by default, and set --log-level trace. You can do this with some code like that below (I got the command by running docker container ls on one of the worker nodes and looking at the command column for Envoy):

["/docker-entrypoint.sh",  "envoy", "-c", "/etc/envoy/envoy.yaml", "--log-level", "trace"]

Once you have re-deployed the proxy and checked it definitely came up OK with your proxy test, you can point a test at a specific node and then grep its logs. For example, I kept getting errors that were reported in the browser with weird content like "docker network delayed connection error: 111" and "connection refused". These were caused by 1) using the ingress port number instead of the container one and 2) my config trying to use HTTP/2 to connect to a server that only supported HTTP/1.1.

You can call a command like the following using your own container id:

 docker container logs fcd9dcdd9a26 2>&1 | grep connection

Note that if you try and access a site then immediately run this, you should see a timestamp around the connection error, which you then might want to use to re-grep so you can get the full flow of things:

 docker container logs fcd9dcdd9a26 2>&1 | grep "12:40:41"

If you then follow these logs, as well as lots of code tracing, you should see some specific errors. For example, the HTTP/2 one was very obvious; it said, "blah, blah, blah, maybe your service doesn't support http2". You will also see what IP/port the proxy is actually trying to connect to, which gave me the clue that I was trying to connect to the ingress VIP instead of the container port.

e.g. [2021-04-22 12:56:12.458][20][debug][connection] [source/common/network/connection_impl.cc:860] [C1] connecting to 10.0.3.3:80

You should ideally set up the tracing to go to a centralised server for debugging problems after something has gone wrong, but I haven't done that yet.

Conclusion

This has been a lot of work so far but seems to be moving in the right direction. Although there are control planes that can manage Envoy for you, they all seem to be Kubernetes only, which is annoying. My next job is to work out how to read the config from a dynamic location, probably an API endpoint, and therefore how to update that API from our deployment system so that it knows when a new version has been deployed.