I’ve been working on documenting one of the popular use cases for Cilium: high-performance networking. Cilium can replace kube-proxy and leverages eBPF to achieve a faster network path.
There are actually a couple of implementations of kube-proxy: one based on iptables (a 20+ year-old networking and security Linux utility, and the default option) and one based on the better-performing IPVS. From what I’ve read and seen, eBPF performs better than either, but I think the difference is especially evident between eBPF and the iptables-based kube-proxy.
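If you want to confirm which proxy mode your own cluster runs, kube-proxy usually logs it at startup and often exposes it in its ConfigMap. Here is a rough sketch of that check; the label selector and ConfigMap name are assumptions and vary by distribution, so adjust them for your cluster:
# Hypothetical check -- selector and ConfigMap name differ between distributions.
kubectl -n kube-system logs -l component=kube-proxy --tail=100 | grep -i proxier

# Many clusters also expose the mode in the kube-proxy ConfigMap:
kubectl -n kube-system get configmap kube-proxy -o yaml | grep -i mode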
To demonstrate why Cilium provides better performance, I had to look inside my nodes to really understand the impact of iptables in a non-Cilium environment.
I deployed a cluster in AKS and accessed my nodes, following the Azure docs.
I quickly hit a roadblock though:
nicovibert:~$ kubectl debug node/aks-nodepool1-20100607-vmss000000 -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0
Creating debugging pod node-debugger-aks-nodepool1-20100607-vmss000000-28fw8 with container debugger on node aks-nodepool1-20100607-vmss000000.
If you don't see a command prompt, try pressing enter.
root@aks-nodepool1-20100607-vmss000000:/#
root@aks-nodepool1-20100607-vmss000000:/# iptables -t nat -L
bash: iptables: command not found
Even when using chroot /host, I got similar results:
root@aks-nodepool1-20100607-vmss000000:/# chroot /host
# iptables -L
iptables v1.6.1: can't initialize iptables table `filter': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.
# sudo iptables -L
iptables v1.6.1: can't initialize iptables table `filter': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.
A Google search led me in the right direction, and this particular post provided me with the answer: kubectl debug starts a privileged container on the node, but it does not give it the capability I needed to run iptables: NET_ADMIN.
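To see exactly what the debug pod was granted, you can inspect its securityContext; the pod name below is just the one from my output above:
# Inspect what the debugger container was actually granted (pod name taken from
# the kubectl debug output above).
kubectl get pod node-debugger-aks-nodepool1-20100607-vmss000000-28fw8 \
  -o jsonpath='{.spec.containers[0].securityContext}'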
The workaround was to use kubectl-exec to get shell access to the AKS nodes. This tool/script creates a pod with a privileged container on the node and uses nsenter to execute a shell in the Kubernetes node. By default, the additional capabilities added to the container are limited to SYS_PTRACE, which I didn’t need for my use case, so I replaced it with NET_ADMIN.
# nsenter JSON overrides (SYS_PTRACE replaced by NET_ADMIN)
OVERRIDES="$(cat <<EOT
{
  "spec": {
    "nodeName": "$NODE",
    "hostPID": true,
    "containers": [
      {
        "securityContext": {
          "privileged": true,
          "capabilities": {
            "add": [ "NET_ADMIN" ]
          }
        },
        "image": "$IMAGE",
        "name": "nsenter",
        "stdin": true,
        "stdinOnce": true,
        "tty": true,
        "command": [ "nsenter", "--target", "1", "--mount", "--uts", "--ipc", "--net", "--pid", "--", "bash", "-l" ]
      }
    ]
  }
}
EOT
)"
I launched the tool and I was good to go: I could now use the iptables command to visualize the impact of kube-proxy.
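The service.yaml manifest that the loop below mutates isn’t shown in this post; a minimal version along these lines would work (the app: nginx selector is an assumption, so point it at whatever Deployment you’re running):
# Minimal service.yaml for the yq loop below; the "app: nginx" selector is an
# assumption -- match it to your own Deployment's labels.
cat <<EOF > service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
EOF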
nicovibert:~$ for x in {1..2}; do yq -i ' .metadata.name = "nginx-svc-'$x'" ' service.yaml | kubectl apply -f service.yaml ;done
service/nginx-svc-1 created
service/nginx-svc-2 created
nicovibert:~$ kubectl get svc
NAME          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kubernetes    ClusterIP   10.0.0.1       <none>        443/TCP   2d5h
nginx-svc     ClusterIP   10.0.205.134   <none>        80/TCP    2d5h
nginx-svc-1   ClusterIP   10.0.122.159   <none>        80/TCP    17s
nginx-svc-2   ClusterIP   10.0.93.31     <none>        80/TCP    16s
nicovibert:~$ kubectl get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-nodepool1-20100607-vmss000000   Ready    agent   23h   v1.23.8
aks-nodepool1-20100607-vmss000001   Ready    agent   23h   v1.23.8
nicovibert:~$ kubectl-exec aks-nodepool1-20100607-vmss000000
Kuberetes client version is 1.25. Generator will not be used since it is deprecated.
creating pod "aks-nodepool1-20100607-vmss000000-exec-20552" on node "aks-nodepool1-20100607-vmss000000"
If you don't see a command prompt, try pressing enter.
root@aks-nodepool1-20100607-vmss000000:/# iptables -L
Chain INPUT (policy ACCEPT)
target                  prot opt source               destination
KUBE-NODEPORTS          all  --  anywhere             anywhere             /* kubernetes health check service ports */
KUBE-EXTERNAL-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes externally-visible service portals */
KUBE-FIREWALL           all  --  anywhere             anywhere

Chain FORWARD (policy ACCEPT)
target                  prot opt source               destination
KUBE-FORWARD            all  --  anywhere             anywhere             /* kubernetes forwarding rules */
KUBE-SERVICES           all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
KUBE-EXTERNAL-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes externally-visible service portals */
DROP                    tcp  --  anywhere             168.63.129.16        tcp dpt:http

Chain OUTPUT (policy ACCEPT)
target                  prot opt source               destination
KUBE-SERVICES           all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
KUBE-FIREWALL           all  --  anywhere             anywhere

Chain KUBE-EXTERNAL-SERVICES (2 references)
target                  prot opt source               destination

Chain KUBE-FIREWALL (2 references)
target                  prot opt source               destination
DROP                    all  --  anywhere             anywhere             /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
DROP                    all  --  !127.0.0.0/8         127.0.0.0/8          /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT

Chain KUBE-FORWARD (1 references)
target                  prot opt source               destination
DROP                    all  --  anywhere             anywhere             ctstate INVALID
ACCEPT                  all  --  anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT                  all  --  anywhere             anywhere             /* kubernetes forwarding conntrack rule */ ctstate RELATED,ESTABLISHED

Chain KUBE-KUBELET-CANARY (0 references)
target                  prot opt source               destination

Chain KUBE-NODEPORTS (1 references)
target                  prot opt source               destination

Chain KUBE-PROXY-CANARY (0 references)
target                  prot opt source               destination

Chain KUBE-SERVICES (2 references)
target                  prot opt source               destination
root@aks-nodepool1-20100607-vmss000000:/#
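One thing worth noting: plain iptables -L only shows the filter table. The per-service DNAT and load-balancing rules that kube-proxy programs live in the nat table, which is where the rule count really grows:
# The Service load-balancing rules live in the nat table, not the filter table.
iptables -t nat -L KUBE-SERVICES | head -20

# Or dump the nat table and look at the KUBE-SVC / KUBE-SEP chains directly:
iptables-save -t nat | grep -E 'KUBE-(SVC|SEP)' | head -20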
Let’s create 100 services instead of just 2 (inspired by a script found in this great post):
nicovibert:~$ for x in {1..100}; do yq -i ' .metadata.name = "nginx-svc-'$x'" ' service.yaml | kubectl apply -f service.yaml ;done
service/nginx-svc-1 unchanged
service/nginx-svc-2 unchanged
service/nginx-svc-3 created
service/nginx-svc-4 created
service/nginx-svc-5 created
service/nginx-svc-6 created
service/nginx-svc-7 created
service/nginx-svc-8 created
service/nginx-svc-9 created
[...]
service/nginx-svc-89 created
service/nginx-svc-90 created
service/nginx-svc-91 created
service/nginx-svc-92 created
service/nginx-svc-93 created
service/nginx-svc-94 created
service/nginx-svc-95 created
service/nginx-svc-96 created
service/nginx-svc-97 created
service/nginx-svc-98 created
service/nginx-svc-99 created
service/nginx-svc-100 created
Let’s check iptables now. It’s pretty insane how many internal rules are created for each service.
root@aks-nodepool1-20100607-vmss000000:/# iptables-save | grep -c KUBE-SEP
432
root@aks-nodepool1-20100607-vmss000000:/# iptables-save | grep -c KUBE-SVC
423
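Roughly speaking, kube-proxy creates a KUBE-SVC chain per service (port) and a KUBE-SEP chain per backend endpoint, plus the rules that wire them together, so the count grows with both services and endpoints. To zoom in on a single service’s share (nginx-svc-42 is just one of the 100 created above):
# Rules generated for one of the services created above; the exact output
# depends on your cluster's endpoints.
iptables-save -t nat | grep nginx-svc-42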
I’ll keep the rest of my observations for an upcoming kube-proxy replacement blog post on isovalent.com.
Thanks for reading.