People can get puzzled when they need to choose one of the many available networking solutions for Kubernetes.
Most of these solutions include container network interface (CNI) plug-ins. CNI plug-ins are the cornerstone of Kubernetes networking, and understanding them is essential to making an informed decision about which networking solution to choose. It is also useful to know some details about the internals of your preferred networking solution. This way, you will be able to decide which Kubernetes networking features you need, analyze networking performance, security, and reliability, and troubleshoot low-level issues.
Some basics behind CNI
When Kubernetes starts up your pod (a logical group of containers), one of the first things it does is create an "infra container" – a container whose network namespace (among other namespaces) is shared by all the containers in the pod.
This means that any networking elements you create in that infra container will be available to all the containers in the pod. It also means that the networking stays stable as containers come and go within the pod.
If you have a running Kubernetes cluster (with some pods running), you can perform a docker ps and see containers based on the gcr.io/google_containers/pause-amd64 image, running a command that looks like /pause. In theory, this container is lightweight enough that it "shouldn't really die" and remains available to all the containers within its pod.
As Kubernetes creates this infra container, it also calls an executable specified in the /etc/cni/net.d/*.conf files.
If you want even more detail, you can check out the CNI specification itself.
The official documentation outlines a number of requirements that any CNI plug-in implementation should support. Rephrasing it in a slightly different way, a CNI plug-in must provide at least the following two things:
- Connectivity: making sure that a Pod gets its default eth0 interface with an IP reachable from the root network namespace of the hosting Node.
- Reachability: making sure that Pods on different Nodes can reach each other directly (without NAT).
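To make the contract concrete, here is a minimal sketch of how the container runtime drives a CNI plug-in: it exports variables such as CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, and CNI_IFNAME, pipes the network configuration into the plug-in's stdin, and expects a JSON result on stdout. The fake_plugin function and the result values are purely illustrative, not taken from the real plug-in:

```shell
# fake_plugin is a stand-in for a real CNI executable such as
# /opt/cni/bin/bash-cni; it only demonstrates the calling convention.
fake_plugin() {
  config=$(cat)   # the network configuration JSON arrives on stdin
  echo "handling CNI_COMMAND=$CNI_COMMAND for container $CNI_CONTAINERID" >&2
  # A successful ADD must print a result JSON on stdout:
  echo '{"cniVersion":"0.4.0","ips":[{"version":"4","address":"10.244.0.2/24","gateway":"10.244.0.1"}]}'
}

# Invoke it the way kubelet (via the container runtime) would:
CNI_COMMAND=ADD CNI_CONTAINERID=demo CNI_NETNS=/var/run/netns/demo \
CNI_IFNAME=eth0 fake_plugin <<< '{"cniVersion":"0.4.0","name":"mynet","type":"bash-cni"}'
```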
Configuring a CNI plug-in
While it’s not required, you will probably want a Kubernetes environment of your own where you can experiment with deploying the plug-ins. In my case, I used two EC2 instances on AWS, one master node and one worker node, both of type t2.medium:
master [ip-10-0-0-210]; worker [ip-10-0-10-51]
Using kubeadm to deploy Kubernetes
We will be using kubeadm to configure and run the Kubernetes components on our EC2 instances. You can follow the script from the GitHub repo to set up the cluster. Once done, check which subnets have been allocated from the pod network range to the master and worker nodes:
$ kubectl describe node ip-10-0-0-210 | grep PodCIDR
PodCIDR: 10.244.0.0/24
$ kubectl describe node ip-10-0-10-51| grep PodCIDR
PodCIDR: 10.244.1.0/24
As you can see from the output, the whole pod network range (10.244.0.0/20) has been divided into smaller subnets, and each node received its own subnet. This means that the master node can use any of the 10.244.0.0–10.244.0.255 IPs for its Pods, and the worker node uses the 10.244.1.0–10.244.1.255 IPs.
At this point, kubectl get nodes shows both nodes in the "NotReady" state, because no CNI plug-in has been installed yet:
NAME            STATUS     ROLES           AGE   VERSION
ip-10-0-0-210   NotReady   control-plane   34m   v1.31.5
ip-10-0-10-51   NotReady   <none>          20m   v1.31.5
Step 1 : Creation of the plugin configuration file
The first thing you should do is create the plug-in configuration. Save the following file as /etc/cni/net.d/10-bash-cni-plugin.conf
{
    "cniVersion": "0.4.0",
    "name": "mynet",
    "type": "bash-cni",
    "network": "10.244.0.0/20",
    "subnet": "<node-cidr-range>"
}
This must be done on both the master and worker nodes. Don’t forget to replace <node-cidr-range> with 10.244.0.0/24 for the master and 10.244.1.0/24 for the worker. It is also very important that you put the file into the /etc/cni/net.d/ folder.
kubelet uses /etc/cni/net.d/ to discover CNI plug-ins.
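Since the file differs between nodes only in the subnet value, you can render it per node with a small helper like the one below (render_conf is a hypothetical name; pipe its output through sudo tee /etc/cni/net.d/10-bash-cni-plugin.conf on each node):

```shell
# Print the plug-in configuration with the given node CIDR substituted.
render_conf() {
  cat <<EOF
{
    "cniVersion": "0.4.0",
    "name": "mynet",
    "type": "bash-cni",
    "network": "10.244.0.0/20",
    "subnet": "${1}"
}
EOF
}

render_conf "10.244.0.0/24"   # use 10.244.1.0/24 on the worker
```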
The first three parameters in the configuration (cniVersion, name, and type) are mandatory:
- cniVersion is used to determine the CNI specification version used by the plug-in
- name is just the network name
- type refers to the file name of the CNI plug-in executable
The network and subnet parameters are our custom parameters; they are not mentioned in the CNI specification, and later we will see how exactly they are used by the bash-cni network plug-in.
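As a sketch of how such a custom parameter can be consumed, here is one way a Bash plug-in might pull subnet out of the configuration it receives on stdin (the real bash-cni script may parse it differently; parse_subnet is an illustrative name):

```shell
# Extract the value of the "subnet" field from the CNI config JSON on stdin,
# using only grep and sed so no extra tools are required.
parse_subnet() {
  grep -o '"subnet": *"[^"]*"' | sed 's/.*": *"//; s/"$//'
}

echo '{"cniVersion":"0.4.0","name":"mynet","subnet":"10.244.0.0/24"}' | parse_subnet   # prints 10.244.0.0/24
```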
Step 2 : Preparation of a network bridge on both master and worker VMs
A network bridge is a device that aggregates network packets from multiple network interfaces; it is analogous to a network switch. The bridge can also have its own MAC and IP address, so each container sees the bridge as just another device plugged into the same network. We reserve the 10.244.0.1 IP address for the bridge on the master VM and 10.244.1.1 for the bridge on the worker VM. The following commands create and configure a bridge named cni0:
$ sudo brctl addbr cni0
$ sudo ip link set cni0 up
$ sudo ip addr add <bridge-ip>/24 dev cni0
These commands create the bridge, enable it, and assign an IP address to it. The last command also implicitly creates a route, so that all traffic destined for the pod CIDR range local to the current node is redirected to the cni0 network interface.
Step 3 : Creation of the plugin binary
The plug-in executable must be placed in the /opt/cni/bin/ folder, its name must exactly match the type parameter in the plug-in configuration (bash-cni), and its contents can be found in the GitHub repo. After you put the plug-in in the correct folder, don’t forget to make it executable by running sudo chmod +x bash-cni. This should be done on both the master and worker VMs.
The /opt/cni/bin folder stores all the CNI plug-ins.
A CNI plug-in executable handles the following main commands:
- ADD: invoked by the container runtime to create a new network interface for a container.
- DEL: deletes a network interface when a container is removed.
- CHECK: checks whether a given configuration is valid and operational.
- VERSION: reports the versions of the CNI specification that the plug-in supports.
- GC (garbage collection): cleans up unused resources or stale configurations. (GC only appears in newer versions of the specification; version 0.4.0, used here, requires only ADD, DEL, CHECK, and VERSION.)
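Inside the plug-in, this typically boils down to a dispatch on the CNI_COMMAND environment variable. The following skeleton only illustrates the shape (the real script in the repo differs; the echoed strings are placeholders for the actual logic):

```shell
# Dispatch on the verb the runtime passes via CNI_COMMAND.
cni_dispatch() {
  case "$CNI_COMMAND" in
    ADD)     echo "create veth pair, move one end into the pod netns, assign an IP" ;;
    DEL)     echo "free the allocated IP and remove the interface" ;;
    CHECK)   echo "verify the interface and IP are still in place" ;;
    VERSION) echo '{"cniVersion":"0.4.0","supportedVersions":["0.3.0","0.3.1","0.4.0"]}' ;;
    GC)      echo "remove allocations for containers that no longer exist" ;;
    *)       echo "unknown CNI_COMMAND: $CNI_COMMAND" >&2; return 1 ;;
  esac
}

CNI_COMMAND=VERSION cni_dispatch
```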
Step 4 : Testing the plugin
Now, if you execute the kubectl get nodes command, both nodes should move to the "Ready" state. So, let’s try to deploy an application and see how it works. But before we can do this, we should "untaint" the master node. By default, the scheduler will not put any pods on the master node, because it is "tainted." Since we want to test cross-node container communication, we need to deploy some pods on the master as well as on the worker. The taint can be removed using the following command.
$ kubectl taint nodes ip-10-0-0-210 node-role.kubernetes.io/control-plane:NoSchedule-
ip-10-0-0-210 untainted
NOTE : Make sure you install nmap on both of your nodes.
Next, let’s use this test deployment to validate the CNI plug-in.
$ kubectl apply -f https://raw.githubusercontent.com/Imlucky883/cni-plugin/main/manifests/master-deploy.yaml
Here, we are deploying four simple pods: two go on the master and the remaining two on the worker. (Pay attention to how we use the nodeName property to tell each pod where it should be deployed.) The two pods on the master run NGINX, while the two pods on the worker run BusyBox with a sleep command. Now, let’s run kubectl get pod to make sure that all pods are healthy and then get the pods’ IP addresses. In my case, the pods were stuck in the Pending state; after describing the pods, I found that the error was with the CNI plug-in.
I was getting the below error when I checked the error log of the plugin at /var/log/bash-cni.log
Error: any valid address is expected rather than "(10.244.0.1)".
CNI command: DEL
stdin: {"cniVersion":"0.4.0","name":"mynet","network":"10.244.0.0/20","subnet":"10.244.0.0/24","type":"bash-cni"}
CNI command: ADD
stdin: {"cniVersion":"0.4.0","name":"mynet","network":"10.244.0.0/20","subnet":"10.244.0.0/24","type":"bash-cni"}
Allocated container IP: 10.244.0.61
After searching online for more than an hour, I found that the way nmap lists IP addresses can include extra characters, such as parentheses. I added tr -d '()' to the plug-in script, which fixed the issue, and IPs were then allocated to the Pods. Below is the line that I changed in /opt/cni/bin/bash-cni:
all_ips=($(nmap -sL $subnet | grep "Nmap scan report" | awk '{print $NF}' | tr -d '()'))
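To see why the fix works: awk '{print $NF}' keeps the last field of lines like "Nmap scan report for host (10.244.0.61)", which still carries the parentheses, and tr -d '()' strips them:

```shell
# The last awk field from an nmap "scan report" line, still wrapped in
# parentheses, and the cleaned-up version the plug-in actually needs:
raw='(10.244.0.61)'
clean=$(echo "$raw" | tr -d '()')
echo "$clean"   # prints 10.244.0.61
```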
Now when I check the Pod status, all the pods have successfully been allocated an IP.
$ kubectl get pods -o wide
NAME                                  READY   STATUS    RESTARTS   AGE   IP           NODE            NOMINATED NODE   READINESS GATES
busybox-deployment-787d986855-dxq7z   1/1     Running   0          22s   10.244.1.2   ip-10-0-10-51   <none>           <none>
busybox-deployment-787d986855-jxtx9   1/1     Running   0          22s   10.244.1.3   ip-10-0-10-51   <none>           <none>
nginx-deployment-6b874d4659-bmnk9     1/1     Running   0          30m   10.244.0.3   ip-10-0-0-210   <none>           <none>
nginx-deployment-6b874d4659-pkn2n     1/1     Running   0          30m   10.244.0.2   ip-10-0-0-210   <none>           <none>
Now the objective is to test pod-to-pod, pod-to-node, and pod-to-external connectivity.
$ kubectl exec -it busybox-deployment-787d986855-dxq7z -- sh
/ # ping 10.0.10.51 # can ping own host
PING 10.0.10.51 (10.0.10.51): 56 data bytes
64 bytes from 10.0.10.51: seq=0 ttl=64 time=0.070 ms
64 bytes from 10.0.10.51: seq=1 ttl=64 time=0.071 ms
/ # ping 10.0.0.210 # can’t ping a different host
PING 10.0.0.210 (10.0.0.210): 56 data bytes
/ # ping 10.244.1.3 # can ping a pod on the same host
PING 10.244.1.3 (10.244.1.3): 56 data bytes
64 bytes from 10.244.1.3: seq=0 ttl=64 time=0.092 ms
64 bytes from 10.244.1.3: seq=1 ttl=64 time=0.055 ms
/ # ping 10.244.0.2 # can’t ping a pod on a different host
PING 10.244.0.2 (10.244.0.2): 56 data bytes
/ # ping 108.177.121.113 # can’t ping any external address
PING 108.177.121.113 (108.177.121.113): 56 data bytes
As you can see, the only things that actually work are pod-to-pod communication on the same host and pod-to-own-host communication.
Case 1 : Can’t ping external address
The fact that our container can’t reach the Internet should be no surprise. Our containers are located in a private subnet (10.244.0.0/24).
In order to fix this, we should set up network address translation (NAT) on the host VM. NAT is a mechanism that replaces the source IP address of an outgoing packet with the IP address of the host VM. The translation is remembered in the kernel’s connection-tracking table, so when the response arrives at the host VM, the original address is restored and the packet is forwarded to the container network interface. You can easily set up NAT using the following two commands:
$ sudo iptables -t nat -A POSTROUTING -s 10.244.0.0/24 ! -o cni0 -j MASQUERADE # on master
$ sudo iptables -t nat -A POSTROUTING -s 10.244.1.0/24 ! -o cni0 -j MASQUERADE # on worker
Pay attention that here we are NATing only packets with a source IP belonging to the local pod subnet that are not being sent out via the cni0 bridge. The ! -o cni0 condition in the iptables rule ensures that NAT is not applied to traffic meant to stay within the cluster (i.e., traffic sent to other pods via the cni0 bridge). Only traffic leaving the cluster for external destinations is NATed.
Case 2 : Can’t ping pod on the other host
Pods on different hosts can’t talk to each other. If you think about it, this makes perfect sense. If we send a request from the 10.244.0.2 Pod to the 10.244.1.3 Pod, we never specified that the request should be routed through the 10.0.10.51 host. Usually, in such cases, we can rely on the ip route command to set up some additional routes. If we carried out this experiment on bare-metal servers directly connected to each other, we could do something like this:
$ sudo ip route add 10.244.1.0/24 via 10.0.10.51 dev enX0 # run on master
$ sudo ip route add 10.244.0.0/24 via 10.0.0.210 dev enX0 # run on worker
On AWS, these routes alone are not enough: by default, EC2 drops packets whose source or destination IP does not match the instance’s own, so you also need to disable the source/destination check on both instances (or add routes for the pod CIDRs to the VPC route table).
Conclusion
We have done a great job writing our own CNI plug-in and running it in a Kubernetes cluster. In the real world, nobody writes a CNI plug-in from scratch like we did; everybody uses the default CNI plug-ins instead and delegates the job to them. We could rewrite our plug-in to do so as well.
- We can replace our IP management with one of the host-local IPAM plug-ins.
- We can also replace the rest of our plug-in with the bridge CNI plug-in, which works in essentially the same way as our own.
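For reference, a delegating configuration built from the standard bridge and host-local plug-ins might look roughly like this on the master (a sketch based on the upstream plug-ins’ documented fields; it would replace our 10-bash-cni-plugin.conf, with the subnet adjusted per node):

```json
{
    "cniVersion": "0.4.0",
    "name": "mynet",
    "type": "bridge",
    "bridge": "cni0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.244.0.0/24"
    }
}
```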
There are a couple of things that I would like to highlight :-
- When you first initialize the cluster, you will notice that all the system component pods are in the Running state even though the nodes are in the "NotReady" state. This is because the networking for these components is not managed by CNI; they share the network interface with the host VM (hostNetwork: true).
Github Project
https://github.com/Imlucky883/cni-plugin
References :-
https://github.com/containernetworking/cni#running-the-plugins