CGroups, Kubernetes and Reliable Nodes

Occasionally, especially under high load, I have noticed Kubernetes nodes go offline. They exhibit very strange behavior, such as terminal sessions freezing when issuing simple commands like ls or heavier ones like top. For a while I have suspected the underlying issue was improper cgroup configuration. I have tried a few times to configure kubelet to work properly with cgroups, and things have gotten better, but I felt like I was still missing a link. Those changes resolved things like ssh sessions failing, but not the freezing of commands issued within them. At best, I have a cogent argument for the cause but no better way to deal with the resource starvation.

Recently I stumbled across Prevent resource starvation of critical System and Kubernetes Services, which at least speaks to proper configuration. Effectively, the article's essence is:

CGroups are the kernel mechanism for resource allocation, which kubelet uses to manage pod resource usage.
A properly configured Kubernetes node will have the following cgroup slices:
- /system.slice – Created by systemd itself.
- /podruntime.slice – Should contain kubelet and the container runtime, which is containerd for me. This must be created by the user.
- /kubepods.slice – Where workloads are actually scheduled; it is controlled by kubelet.
Kubelet should be configured with the following, as hinted at by the design document:
- systemReservedCGroup should be /system.slice
- kubeletCGroups should be /podruntime.slice
- runtimeCGroups should be /podruntime.slice, since the container runtime lives alongside kubelet rather than in /kubepods.slice where the pods run
The following additional configuration needs to occur on your host to make this viable:
- /etc/systemd/system/podruntime.slice needs to contain the following:

[Unit]
Description=Limited resources slice for Kubernetes services
Documentation=man:systemd.special(7)
DefaultDependencies=no
Before=slices.target
Requires=-.slice
After=-.slice

- /etc/systemd/system/kubelet.service.d/10-cgroup.conf needs to contain the following:

[Service]
CPUAccounting=true
MemoryAccounting=true
Slice=podruntime.slice

- /etc/systemd/system/containerd.service.d/10-cgroup.conf needs to contain the following:

[Service]
Slice=podruntime.slice
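
To review these three files together before touching a real node, they can be written into a scratch directory first. This is only a sketch: the stage_units helper and the scratch directory are my own illustration and not something the article provides, while the file contents are copied from the snippets above.

```python
import tempfile
from pathlib import Path

# File contents copied from the snippets above; paths are relative to a staging root.
FILES = {
    "etc/systemd/system/podruntime.slice": (
        "[Unit]\n"
        "Description=Limited resources slice for Kubernetes services\n"
        "Documentation=man:systemd.special(7)\n"
        "DefaultDependencies=no\n"
        "Before=slices.target\n"
        "Requires=-.slice\n"
        "After=-.slice\n"
    ),
    "etc/systemd/system/kubelet.service.d/10-cgroup.conf": (
        "[Service]\n"
        "CPUAccounting=true\n"
        "MemoryAccounting=true\n"
        "Slice=podruntime.slice\n"
    ),
    "etc/systemd/system/containerd.service.d/10-cgroup.conf": (
        "[Service]\n"
        "Slice=podruntime.slice\n"
    ),
}


def stage_units(root):
    """Write the slice unit and both drop-ins below root; return the written paths.

    Hypothetical helper: it only stages files for review and never touches /etc.
    """
    written = []
    for relative_path, content in FILES.items():
        target = Path(root) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        written.append(target)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for path in stage_units(scratch):
            print(path.relative_to(scratch))
```

Once the files are copied into place under /etc/systemd/system, I would expect to need a systemctl daemon-reload followed by a restart of kubelet and containerd before either service actually moves into the new slice.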

The debugging tools were very helpful, in particular systemd-cgls, which lists the cgroup hierarchy, and the files under /sys/fs/cgroup. With these I was able to verify kubelet was running under /system.slice, which is definitely not the intended slice. The files within /sys/fs/cgroup do not state their units in most places. I stumbled across a Facebook post with a helpful reference, which is a summary of information from the Linux kernel. All memory values are in bytes and CPU values are in microseconds (1000 microseconds to 1 millisecond).
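
The same check can be scripted. The sketch below makes two assumptions of mine that the post does not spell out: that the slice a process runs in can be read from /proc/<pid>/cgroup, and that on a cgroup v2 host the CPU limit lives in a file named cpu.max holding "<quota> <period>" in microseconds. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional, Set


def top_level_slices(proc_cgroup_text: str) -> Set[str]:
    """Return the first path component of each entry in a /proc/<pid>/cgroup dump."""
    slices = set()
    for line in proc_cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        path = parts[2].strip("/")
        if path:  # an empty path means the root cgroup
            slices.add(path.split("/", 1)[0])
    return slices


def cpu_max_to_cpus(cpu_max_text: str) -> Optional[float]:
    """Convert a cgroup v2 cpu.max value to a number of CPUs, or None if unlimited."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None
    # Both numbers are microseconds, so their ratio is the CPU share.
    return int(quota) / int(period)


if __name__ == "__main__":
    own_cgroup = Path("/proc/self/cgroup")
    if own_cgroup.exists():
        print(top_level_slices(own_cgroup.read_text()))
```

For kubelet, the input would come from /proc/<kubelet pid>/cgroup; a result of {'system.slice'} matches what I saw with systemd-cgls.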

Next steps

It looks like I need to move the pod runtime, both kubelet and containerd, to run under /podruntime.slice and increase the reserved resources for the base system daemons. I am also considering the ramifications of providing reasonable minimums for /system.slice, so I can still log into the box when kubelet dies off.