Designing a production microservices environment

One of the problems with many training courses is that they start simple and become more and more complicated. By the time you finish, you might well end up with something much more complicated than you actually want and, had you known that, you would have made different decisions earlier on.

With microservices, the potential size and cost are a significant risk in production, so how do you choose the correct architecture to avoid surprises?

If you choose managed container services (Azure, Google, AWS etc.) how will their price scale? Will they be cheaper with a smaller system or do they become cheaper at scale?
What is a realistic size for your production cluster on a busy day, now and in a year? Can your chosen architecture scale invisibly?
How much can you demonstrate now as a proof of concept, knowing that it will scale?

It is the last part I want to look at. It is very easy to create a Docker swarm or even a managed Kubernetes (K8S) cluster (Docker Desktop can even create a Kubernetes cluster for you). It is just as easy to deploy a web application that calls a web service over a Docker or K8S overlay network using the built-in DNS. It all works really well and makes you think you have a deployment architecture that is ready to go live, only for you to be stung later on by a difficult or expensive problem.

There are two very important aspects which you should consider and build out before you commit to your chosen architecture. Although you can change certain things quite easily later on, if you do not work out in advance how these parts will work, you might end up having to build a completely separate second cluster and swap to it at a later date, or even migrate to a completely different framework, e.g. Docker swarm to Kubernetes, which isn't as easy as it sounds.

Service Mesh

Both Docker and K8S contain overlay networks and it is relatively easy to call another service by its name and get routed automatically. However, a very common requirement for deployments is the "Canary" deployment, where you deploy a small number of new services to run alongside the old versions, transparently moving users across to the newer services over a period of time. The new ones act as a canary in the coal mine, pointing you to any problems before they affect everybody, so that you can back out the new services transparently. How would this work with routing? If you are using Kubernetes, a basic version of this is built in: a Service selects pods by label, so it can target pods that share a label but run different versions, and traffic is split roughly in proportion to the number of pods of each version. In Docker swarm, however, this is not possible without adding something external in the form of some kind of front proxy. But if you have a proxy that load balances across services then 1) how does internal traffic also hit the load-balanced endpoint, and 2) how do you update that proxy when you make a canary deployment so that the load balancer always targets the correct versions?
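To make the routing question a little more concrete, here is a minimal Python sketch of the kind of decision a front proxy or service mesh makes during a canary rollout. It is only an illustration under my own assumptions: the function name `pick_version`, the version names "stable" and "canary" and the 10% figure are all hypothetical, and a real proxy would do this through configuration rather than application code.

```python
import hashlib


def pick_version(user_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent% of users, else "stable".

    Hashing the user ID (instead of choosing randomly per request) keeps
    each user on the same version for as long as the percentage is unchanged.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # fixed bucket 0..99
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    # Hypothetical user IDs, used only to show the split.
    users = [f"user-{n}" for n in range(1000)]
    on_canary = sum(pick_version(u, 10) == "canary" for u in users)
    print(f"{on_canary} of {len(users)} users routed to the canary")
```

Raising the percentage only ever moves more users onto the canary, and setting it back to 0 returns everybody to the stable version, which is the "back out transparently" behaviour described above.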

Even if you have Kubernetes with canary deployments, you still don't have built-in rate-limiting, authorisation, circuit-breakers, TLS termination etc. without adding various other services.

A service mesh can give you all of this.
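To show what just one of those features, the circuit breaker, involves if you have to provide it yourself, here is a small Python sketch. The class name, the thresholds and the injectable `clock` parameter are my own assumptions for illustration; it is not how any particular mesh implements it.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a broken service.
                raise CircuitOpenError("circuit open, call rejected")
            # Cool-down has passed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Every service that calls another service would need something like this, plus metrics and configuration for it, which is a large part of the argument for letting a mesh do it outside the application.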

There are a number of service meshes available; Istio and Linkerd are the two best known. What you should know is that Istio has a lot more features than Linkerd but is thought to be much more resource intensive. If you are Google, would you care about a few more VMs? No. But most of us don't want a 10%+ overhead on our system just to get the service mesh.

The idea is that (in these two cases) you add a "sidecar" proxy alongside each of your containers so that all of the network traffic goes via the service mesh. This ensures correct routing of load-balanced traffic and adds value in the form of things like monitoring and TLS encryption, all of which can be handled transparently to the application.

I won't go into the details of how to install and configure them. Suffice it to say that they install as containers or services depending on the target framework; the important point is that you need to understand how they will work in your production environment. It is no use creating an all-singing, all-dancing service mesh if you have to run scripts manually to make any changes: you need to be able to plug it into your DevOps automation, and the easier the better. If you spend a load of time building e.g. a Docker swarm network and then find out that setting up a service mesh is really difficult, you might have to start again. If you build it all in from the beginning and you know how it works, you can de-risk your deployment architecture.

Monitoring and Logging

I was trying to set up Envoy proxy with Docker swarm and it was a pig. I really was trying to use baby steps! I got the basic proxy working with a static response. I then got some statically configured clusters working, but even this took time. I eventually tried to build my own web service control plane for Envoy using gRPC on .NET Core. Although the web service seemed really easy to create, whenever I tried to wire it up to Envoy it didn't work, and all I got was a really generic error message.

What I realised (again!) was how important it is to 1) Make sure you are logging useful information in your own application code and 2) Make sure you know how to read logs from your services, especially when setting them up and not knowing what you are doing.

I initially started on a swarm, trying to change things, deploy them, see a container shut down, show the container logs, try to think of a fix, change and redeploy. Of course, if the problem was in the Dockerfile for the image itself, I then had to change it locally, push it through the CI/CD mechanism and try again!

The first thing was to go back to a local compose file and try to build it back up from scratch. I worked out how to enable logging on the .NET service and finally saw the problem: the call from Envoy sent scheme: http, which was not valid on a TLS connection over HTTP/2. It turns out Envoy had a bug! I rolled back to an earlier version and tried the same thing. This time I could see the web service being called, but the returned cluster was not applied to Envoy, and Envoy was still logging problems with the gRPC call without enough detail.

If I had done the logging correctly in the first place, I would not only have saved a lot of time but would also have realised that rolling your own control plane to use with Envoy on a Docker swarm is far too risky and lacks the helpful logging etc. that might tell you what is wrong.

The same will happen at runtime. Have you ever seen random slowdowns or dropped connections and not known why? Monitoring is about what is happening now. What is the CPU/network etc. doing right now? If there are lots of users causing lots of CPU usage then that explains the slowdown. If not, perhaps it is database connections, leaked threads or memory. Monitoring can be done with Prometheus and Grafana relatively easily but, again, you need to invest time and get those things set up from day one. No-one wants to debug a problem that happened before monitoring was enabled!

Logs are kept in the background and usually sent to something like Elasticsearch or one of the SaaS logging products. Logs are for going back into a problem service at a later date to find out why Customer X got an internal server error when saving a document. Good logging requires a request ID that traces a single request and correlates with the error log and/or the error screen you show to your customer. The richer you make your logs the better, especially for random errors. Maybe the customer put a weird character into their filename or tried to save two files at the same time.
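As a sketch of the request ID idea, here is one way it could look using only the Python standard library. The handler `save_document`, the logger name "app" and the log format are hypothetical; the point is only that every log line for a request carries the same ID, and that the same ID is handed back so it can be shown on the error screen.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def build_logger(name="app"):
    logger = logging.getLogger(name)
    if not logger.handlers:  # only configure the logger once
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s"))
        logger.addHandler(handler)
        logger.addFilter(RequestIdFilter())
        logger.setLevel(logging.INFO)
    return logger


def save_document(filename):
    """Hypothetical request handler; returns (request_id, status)."""
    request_id = uuid.uuid4().hex
    token = request_id_var.set(request_id)
    logger = build_logger()
    try:
        logger.info("saving document filename=%r", filename)
        if "\x00" in filename:
            raise ValueError("invalid character in filename")
        return request_id, "saved"
    except ValueError:
        # logger.exception also records the stack trace for later digging.
        logger.exception("save failed filename=%r", filename)
        return request_id, "error"
    finally:
        request_id_var.reset(token)
```

When Customer X reports the ID from their error screen, searching the log store for that one value should return every line written while their request was being handled.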

Logs grow very large over time, so in most cases you should delete logs older than, perhaps, 2 months, or even 1 month when you are first starting out. Cloud storage is great for this if it is an option: it is very cheap and, in some cases, can delete very old data for you automatically.
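If your logs live on a local disk rather than in cloud storage with automatic expiry, the retention rule can be sketched in a few lines of Python. The function name `delete_old_logs`, the `.log` suffix and the 60-day default are assumptions for illustration, and the `dry_run` flag is there so you can see what would go before anything is removed.

```python
import os
import time


def delete_old_logs(log_dir, max_age_days=60, now=None, dry_run=False):
    """Delete *.log files in log_dir last modified over max_age_days ago.

    Returns the sorted paths that were (or, with dry_run, would be) removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60

    # Collect candidates first so nothing is deleted while still scanning.
    expired = []
    with os.scandir(log_dir) as entries:
        for entry in entries:
            if not entry.is_file() or not entry.name.endswith(".log"):
                continue
            if entry.stat().st_mtime < cutoff:
                expired.append(entry.path)

    if not dry_run:
        for path in expired:
            os.remove(path)
    return sorted(expired)
```

Something like this would normally run from a scheduled job, starting with `max_age_days=30` and relaxing it once you know how quickly your logs grow.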

Conclusion

Before you commit to an architecture, make sure you have worked out how to deploy, how to do canary deployments, how to configure your service mesh and how to handle monitoring and logging.

Once you get those correct, the rest should be relatively easy!