CGroups, Kubernetes and Reliable Nodes
Categories: programming
Occasionally, especially under high load, I have noticed Kubernetes nodes go offline. They exhibit very strange
behavior, such as terminal sessions freezing when issuing simple commands like ls or even more complicated ones
like top.
For a while I suspected the underlying issue was related to improper cgroup configuration. I have tried a few
times to configure kubelet to work properly with cgroups, and things have gotten better, but I felt like I was still
missing a link. Those changes resolved things like ssh sessions failing, but not the freezing of commands issued
within them. At best, I have a cogent argument but nothing better to deal with the resource starvation.
Recently I stumbled across "Prevent resource starvation of critical System and Kubernetes Services", which at least speaks to proper configuration. Effectively, the article's essence is:
- CGroups are the kernel mechanism for resource allocation, which kubelet uses to manage pod resource usage.
- A properly configured Kubernetes node will have the following cgroup slices:
  - /system.slice – Created by systemd itself.
  - /podruntime.slice – Should contain kubelet and the container runtime, which is containerd for me. This must be created by the user.
  - /kubepods.slice – Where workloads are actually scheduled and controlled by kubelet.
- Kubelet should be configured with the following, as hinted at by the design document:
  - systemReservedCGroup should be /system.slice
  - kubeletCGroups should be /podruntime.slice
  - runtimeCGroups should be /kubepods.slice
- The following additional configuration needs to be in place on your host to make this viable:
/etc/systemd/system/podruntime.slice needs to contain the following:
[Unit]
Description=Limited resources slice for Kubernetes services
Documentation=man:systemd.special(7)
DefaultDependencies=no
Before=slices.target
Requires=-.slice
After=-.slice
/etc/systemd/system/kubelet.service.d/10-cgroup.conf needs to contain the following:
[Service]
CPUAccounting=true
MemoryAccounting=true
Slice=podruntime.slice
/etc/systemd/system/containerd.service.d/10-cgroup.conf needs to contain the following:
[Service]
Slice=podruntime.slice
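
To try these three files out without touching /etc directly, they can be staged in a scratch directory first. The following is a minimal sketch, assuming Python 3.9 or newer; the names FILES and stage_units are hypothetical helpers invented for illustration, and the unit contents are copied from the listings above.

```python
import tempfile
from pathlib import Path

# Unit contents copied verbatim from the listings above.
PODRUNTIME_SLICE = """\
[Unit]
Description=Limited resources slice for Kubernetes services
Documentation=man:systemd.special(7)
DefaultDependencies=no
Before=slices.target
Requires=-.slice
After=-.slice
"""

KUBELET_DROPIN = """\
[Service]
CPUAccounting=true
MemoryAccounting=true
Slice=podruntime.slice
"""

CONTAINERD_DROPIN = """\
[Service]
Slice=podruntime.slice
"""

# Keys are paths relative to /etc/systemd/system.
FILES = {
    "podruntime.slice": PODRUNTIME_SLICE,
    "kubelet.service.d/10-cgroup.conf": KUBELET_DROPIN,
    "containerd.service.d/10-cgroup.conf": CONTAINERD_DROPIN,
}


def stage_units(root: Path) -> list[Path]:
    """Write the slice and drop-in files under root and return the paths written."""
    written = []
    for relative, content in FILES.items():
        target = root / relative
        # Drop-in directories such as kubelet.service.d may not exist yet.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        written.append(target)
    return written


if __name__ == "__main__":
    # Stage into a throwaway directory (hypothetical prefix) for review.
    staging = Path(tempfile.mkdtemp(prefix="podruntime-"))
    for path in stage_units(staging):
        print(path)
```

Once the staged files are reviewed and copied into /etc/systemd/system, systemd still needs a systemctl daemon-reload and a restart of kubelet and containerd before the services move to the new slice.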
The debugging tools were very helpful, in particular systemd-cgls, which lists the cgroup hierarchy, and the files
under /sys/fs/cgroup. With these I was able to verify that kubelet was running under /system.slice, which is
definitely not the intended slice. The files within /sys/fs/cgroup appear unit-less in most places. I stumbled
across a Facebook post with a helpful reference, which is a summary of information from the Linux kernel.
All memory units are in bytes and CPU units are in microseconds (1000 microseconds to 1 millisecond).
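
Since the raw files carry no units, a small conversion helper makes the numbers easier to read. This is a minimal sketch, assuming the cgroup v2 layout, where cpu.stat holds key/value lines in microseconds, cpu.max holds a quota and period in microseconds, and memory.current holds a byte count. The function names and the sample values are made up for illustration.

```python
from typing import Optional, Tuple


def parse_cpu_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file into integers."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def parse_cpu_max(text: str) -> Tuple[Optional[int], int]:
    """Parse cpu.max ('<quota> <period>'); a quota of 'max' means unlimited."""
    quota, period = text.split()
    return (None if quota == "max" else int(quota), int(period))


def usec_to_ms(usec: int) -> float:
    """1000 microseconds make 1 millisecond."""
    return usec / 1000


def bytes_to_mib(count: int) -> float:
    """Convert a byte count to mebibytes."""
    return count / (1024 * 1024)


if __name__ == "__main__":
    # Invented sample contents, not taken from a real node.
    sample = "usage_usec 2500000\nuser_usec 1500000\nsystem_usec 1000000\n"
    stats = parse_cpu_stat(sample)
    print(usec_to_ms(stats["usage_usec"]))   # 2500.0
    print(bytes_to_mib(536870912))           # 512.0
    print(parse_cpu_max("200000 100000"))    # (200000, 100000)
```

A quota of 200000 over a period of 100000 microseconds corresponds to two CPUs' worth of time per period.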
Next steps
Looks like I need to move the pod runtime, both kubelet and containerd, to work under /podruntime.slice and
increase reserved resources for the base system daemons. I am also considering the ramifications of providing
reasonable minimums for /system.slice too, so I can log into the box when kubelet dies off.
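
To confirm the move actually took effect, the slice a process runs under can be read from /proc/<pid>/cgroup, whose lines have the form hierarchy-ID:controllers:path. The following is a minimal sketch; top_level_groups and runs_under are hypothetical helper names, and the sample lines are invented rather than captured from a node.

```python
def top_level_groups(proc_cgroup_text: str) -> set:
    """Return the first path component of every cgroup a process belongs to."""
    groups = set()
    for line in proc_cgroup_text.splitlines():
        # Each line looks like "0::/system.slice/kubelet.service".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        first = parts[2].strip("/").split("/", 1)[0]
        groups.add("/" + first)
    return groups


def runs_under(proc_cgroup_text: str, expected: str = "/podruntime.slice") -> bool:
    """True when every listed hierarchy places the process under the expected slice."""
    return top_level_groups(proc_cgroup_text) == {expected}


if __name__ == "__main__":
    before = "0::/system.slice/kubelet.service\n"      # invented sample
    after = "0::/podruntime.slice/kubelet.service\n"   # invented sample
    print(runs_under(before))  # False
    print(runs_under(after))   # True
```

On a real node the text would come from reading /proc/<pid>/cgroup for the kubelet and containerd process IDs. On a cgroup v1 host, where each controller has its own line, the strict equality check may be too strict and would need relaxing.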