How to detect drained Kubernetes nodes manually

Load balancer updating

If you have some kind of all-singing, all-dancing Kubernetes monitoring tool or are using the cloud, the chances are that you are not that bothered about updating nodes. You tell your magic tool to update the node and it either tells the load balancer to remove the node from the backend set and then updates it, or it just runs up a newer version of the Linux container that the node is based on and switches over.

I am not doing this.

We have a relatively manual installation of Kubernetes with RKE as the engine (Rancher, if you don't know). This works OK except for one main thing: the load balancer is not controlled by Rancher and has no API for us to automate taking nodes out of the load-balanced set when restarting nodes for updates.

The load balancer does have a healthcheck function, which is how it decides whether the node (or whatever it is) is online. It has a few options but is really quite basic. I assumed that this must be really simple to solve, since surely there is something I can call on each node to find out whether the node is drained (or schedulable or something similar) but, apparently not!

What doesn't work

I found an endpoint on kubelet, the per-node agent, that sounds like it could be useful, but it only returns "OK" if the node is running and doesn't reflect its availability. Of course, that is normally OK since, if the engine is running, it will redirect network calls to other nodes, but it is of no use at all for us to decide whether to take the node out of the LB set.

Although the K8S API does allow us to query the state of a node, there are a number of problems. Firstly, if we called it directly from the load balancer, we would need to authenticate, which our load balancer cannot do. Secondly, the load balancer can only send the same request to every node, so we couldn't call e.g. /status/node1 on node1 and /status/node2 on node2.

The Daemonset

So then I got to thinking about running some kind of Pod on each node and using this as the healthcheck endpoint. Definitely sounded doable. I eventually worked out that I should NOT use a service with the daemonset (which works like normal services and can pass traffic east-west even though each node runs a pod!) but instead I should use a hostPort, which allows me to directly access each pod via the node IP address. The yaml looks like this:

 kind: DaemonSet   
 spec:   
  updateStrategy:   
   type: RollingUpdate   
  selector:   
   matchLabels:   
    Octopus.Kubernetes.DeploymentName: node-healthcheck   
  template:   
   metadata:   
    labels:   
     Octopus.Kubernetes.SelectionStrategyVersion: "SelectionStrategyVersion2"   
     Octopus.Project.Id: #{Octopus.Project.Id}   
     Octopus.Action.Id: #{Octopus.Action.Id}   
     Octopus.Deployment.Id: #{Octopus.Deployment.Id}   
     Octopus.RunbookRun.Id: ""   
     Octopus.Step.Id: #{Octopus.Step.Id}   
     Octopus.Environment.Id: #{Octopus.Environment.Id}   
     Octopus.Deployment.Tenant.Id: "untenanted"   
     Octopus.Kubernetes.DeploymentName: "node-healthcheck"   
   spec:   
    containers:   
    - name: node-healthcheck
      image: acme.azurecr.io/acme.nodehealthcheck:#{Octopus.Release.Number}
      ports:
      - name: http
        containerPort: 80
        hostPort: 8080
        protocol: TCP
      env:
      - valueFrom:
         fieldRef:
          fieldPath: spec.nodeName
        name: "MY_NODE_NAME"
    imagePullSecrets:   
    - name: octopus-feedcred-feeds-azure-container-registry   
    dnsPolicy: Default   
    hostNetwork: false   
    serviceAccountName: print-region   
  revisionHistoryLimit: 1   
 apiVersion: apps/v1   
 metadata:   
  name: node-healthcheck   
  labels:   
   Octopus.Kubernetes.SelectionStrategyVersion: "SelectionStrategyVersion2"   
   Octopus.Project.Id: #{Octopus.Project.Id}   
   Octopus.Action.Id: #{Octopus.Action.Id}   
   Octopus.Deployment.Id: #{Octopus.Deployment.Id}   
   Octopus.RunbookRun.Id: ""   
   Octopus.Step.Id: #{Octopus.Step.Id}   
   Octopus.Environment.Id: #{Octopus.Environment.Id}   
   Octopus.Deployment.Tenant.Id: "untenanted"   
   Octopus.Kubernetes.DeploymentName: "node-healthcheck"

You will notice all of the Octopus labels, which are simply there to make sure that when we deploy with Octopus, it will match this to previous deployments etc. There might also be values that are set to their defaults and are not needed, but that is because I copied this YAML from a previous deployment log, since Octopus doesn't seem to support hostPort, which is très annoying.

Although there is quite a lot in this file, which I will discuss later, the important parts are the DaemonSet kind, so that it runs one pod per node, and hostPort under ports, which allows us to access the pod directly via each node. That way the load balancer can be sure that it is hitting the pod on that specific node.

Authentication

The image in our specific deployment is held on Azure Container Registry, so you will see an imagePullSecrets entry that references the username/password secret each node uses to pull the image. Also, we inject the Octopus release number since, in our case, it always follows the image version, so we can easily use the Octopus variable for the tag.

The Downward API

We are still left with a challenge. How can I make sure that the pod running on e.g. node1 only returns the status of node1? By default, the pod is unaware of its node context, deliberately, to avoid people breaking the elastic nature of Kubernetes by tying pods to nodes directly. However, you can now expose some basics via the downward API, which can be seen in action in the env section of the yaml above.

Instead of simply NAME=VALUE, the downward API allows you to specify the "fieldPath" source of the environment variable, in our case spec.nodeName, and assign that to a variable of our choice. In my case, I just used MY_NODE_NAME.

The Web Service

So we are partially there. Now how can I get my pod, whatever it is, to read the node name, query the kubernetes API for the status of that node and return a suitable status?

For random reasons, I chanced upon a great blog post about making a tiny web server in Docker, after all, I was doing something really simple and didn't exactly want 200MB images.

There were two great ideas which I would not have automatically done myself. The first was to use a compiled language, Go in this example, to avoid the need for large web server libraries or installations (I wouldn't need all the wonderful glory of nginx for such a simple and local web service). The second was to use the scratch base image and only copy in the executable for the Go server, since it wouldn't need much else. I did initially change scratch to Alpine while I was debugging so mine is a less impressive 45MB and that goes down to about 40MB with scratch, which is probably related to the Kubernetes stuff I need to import. Before that, it was closer to 20MB so I might be able to optimise that a bit.

Anyway, I started with a simple Go web server that I could work out the Dockerfile and stuff for. I can't find the original article but it looks very similar to the Go help docs for http. This example shows all of the Kubernetes stuff I added later as well but, as with all of my forays into the unknown, each of these steps was small and tested, e.g. basic build, return a file, add a simple API handler etc.

 package main  
 import (  
      "context"  
      "fmt"  
      "log"  
      "os"  
      "net/http"  
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"  
      "k8s.io/client-go/kubernetes"  
      "k8s.io/client-go/rest"  
 )  
 func main() {  
      config, err := rest.InClusterConfig()  
      if err != nil {  
           panic(err.Error())  
      }  
      // creates the clientset  
      clientset, err := kubernetes.NewForConfig(config)  
      if err != nil {  
           panic(err.Error())  
      }  
      fileServer := http.FileServer(http.Dir("./"))  
      http.Handle("/", fileServer)  
      http.HandleFunc("/nodename", func(w http.ResponseWriter, r *http.Request) {  
           fmt.Fprintf(w, "Node: %q", os.Getenv("MY_NODE_NAME"))  
      })  
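      http.HandleFunc("/nodecheck", func(w http.ResponseWriter, r *http.Request) {  
           // Look up the node this pod runs on (name injected via the downward API)  
           node, err := clientset.CoreV1().Nodes().Get(context.TODO(), os.Getenv("MY_NODE_NAME"), metav1.GetOptions{})  
           if err != nil {  
                // Report the failure to the caller instead of panicking inside the handler  
                log.Printf("node lookup failed: %v", err)  
                http.Error(w, "Down", http.StatusServiceUnavailable)  
                return  
           }  
           // A drained (cordoned) node is marked unschedulable  
           if !node.Spec.Unschedulable {  
                fmt.Fprint(w, "OK")  
           } else {  
                fmt.Fprint(w, "Down")  
           }  
      })  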
      fmt.Printf("Starting server at port 80\n")  
      if err := http.ListenAndServe(":80", nil); err != nil {  
           log.Fatal(err)  
      }  
 }

You can see that I left the default file server in there so I can do a simple check that the pod is running and that each node has one. I also added /nodename, which returns the environment variable that Kubernetes sets via the downward API. This was super helpful because I assumed a daemonset service would simply answer from the pod on whichever node you hit, and this node name helped me see that this wasn't true and that I needed to use hostPort instead!

/nodecheck is the bit I will use for my healthcheck. It uses the Kubernetes Go client library (which isn't super-well documented but I got there relatively easily!). By using the environment variable, I can query a named node and if it is not unschedulable, I return OK, otherwise I return Down. On a cluster with RBAC enabled, the pod's service account needs permission to read nodes for this call to succeed. As with all small steps, I deployed this to prove that a) it could access the K8S API and b) that it correctly detected drained nodes (which it did).

Dockering it together

The blog about the tiny web server gave me most of my clues but we are simply doing a 2-stage build. Build from the golang image and then take the statically built output and copy it into a nearly empty image with an html file for luck and that's about it.

 FROM scratch as base  
 EXPOSE 80  
   
 ### build stage ###  
 FROM golang:1.16 AS builder  
 WORKDIR /app  
 RUN go mod init healthcheck   
 RUN go get k8s.io/client-go@latest  
 COPY server.go .  
 RUN go get -d ./...  
 RUN go build \  
  -ldflags "-linkmode external -extldflags -static" \  
  -a server.go  
   
 ### run stage ###  
 FROM base as final  
 COPY --from=builder /app/server ./server  
 COPY index.html index.html  
 CMD ["./server"]

There were some gotchas. Go changed the way that modules are specified and initialised (module mode became the default in Go 1.16) but I ended up working it out somehow! There are some flags to go build which tell it to statically link everything, which makes it nice and portable but is probably why my image has ballooned.

So now I can get the LB to call the pod and only mark the node as healthy if it returns "OK". Simples!