Stream of Consciousness

Mark Eschbach's random writings on various topics.

Power Monitoring of the Home Lab

Categories: tech

Tags: kubernetes homelab longhorn

So the time has come: monitoring the power usage of my homelab. I will be introducing two Sonoff S31s flashed with Tasmota to better understand the draw of my cluster. These are set up to integrate with my Home Assistant instance.
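For spot checks outside of Home Assistant, Tasmota's web API can be queried directly. A minimal sketch, assuming the plug is reachable on the LAN (192.168.1.50 is a placeholder address):

# Ask Tasmota for its sensor status, which includes the ENERGY readings
# reported by the S31's power-monitoring chip.
curl -s 'http://192.168.1.50/cm?cmnd=Status%208'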

My home lab is on two plugs. The first plug feeds my two k8s nodes. To safely handle the transition I'll drain both nodes, which also cordons them: [kubectl drain --ignore-daemonsets --delete-emptydir-data <node>](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/). Longhorn issued some complaints during the drain. I am hoping the system comes back online without much issue.
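Roughly, the shutdown side looked like this sketch, with <node-1> and <node-2> standing in for the two k8s nodes:

# Evict workloads and mark each node unschedulable before pulling power.
kubectl drain --ignore-daemonsets --delete-emptydir-data <node-1>
kubectl drain --ignore-daemonsets --delete-emptydir-data <node-2>

# Confirm both nodes report SchedulingDisabled before shutting them down.
kubectl get nodes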

So the machines came back up with seemingly no problem. The hard part is that the machines report telemetry and centralize logs through the Kubernetes cluster itself, so I will not know for sure until I bring both nodes fully online. At least the operating systems came up.

kubectl uncordon <node> is the next step. Watching dmesg shows the XFS configuration of Longhorn is subject to the 2038 time issue. Based on [an Ask Ubuntu answer](https://askubuntu.com/questions/1302943/xfs-filesystem-being-mounted-at-disk-supports-timestamps-until-2038-0x7fffffff) this appears to be a trivial problem.
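It seems safe to leave alone for now. If it ever needs addressing, my understanding is that a sufficiently recent xfsprogs can upgrade the filesystem in place; a hedged sketch, with <mountpoint> and /dev/sdX as placeholders for whichever filesystem dmesg complained about:

# Check whether big timestamps are enabled on the mounted XFS filesystem;
# "bigtime=0" means the 2038 limit still applies.
xfs_info <mountpoint> | grep bigtime

# A recent xfsprogs can upgrade the feature in place,
# but only while the filesystem is unmounted.
umount <mountpoint>
xfs_admin -O bigtime=1 /dev/sdX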

Storage Eventual Consistency?

So most of the cluster came back online as expected. Longhorn did get stuck at one point: it marked molly as unschedulable, although the reason was not very clear. Originally I was thinking this was one of the many race conditions identified in the change log. After deleting pods failed, I tried upgrading from 1.5.1 to 1.6.1, which went smoothly and got a majority of pods online.
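A sketch of how to poke at this, assuming Longhorn is installed in its default longhorn-system namespace:

# Longhorn's view of each node lives in its nodes.longhorn.io custom resource;
# the conditions explain why a node is considered unschedulable.
kubectl -n longhorn-system get nodes.longhorn.io
kubectl -n longhorn-system describe nodes.longhorn.io molly

# Check whether the manager and instance-manager pods are healthy.
kubectl -n longhorn-system get pods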

Upon further investigation I discovered only some of the device nodes for Longhorn were mounting. Longhorn's first suggestion is to check that multipathd did not grab the device nodes. I guess this has been pestering Red Hat users for a while. Overall it sounds like multipathd might be a great utility in the future of Longhorn, as it supposedly helps manage pathing for iSCSI.

Resolving the issue involved editing the file /etc/multipath.conf and adding the following stanza at the bottom of the file:

blacklist {
    devnode "^sd[a-z0-9]+"
}

This required a restart of the service with systemctl restart multipathd. multipath -t provides a simple method to verify the changes have been picked up. lsblk will also show which devices multipath is managing, confirming the Longhorn device nodes were released.
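Put together, the verification looks roughly like this sketch (device names will vary):

# Restart multipathd so it re-reads /etc/multipath.conf.
systemctl restart multipathd

# Dump the effective configuration; the new blacklist stanza should appear.
multipath -t | grep -A 2 blacklist

# The Longhorn devices should no longer show an mpath holder under them.
lsblk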

etcd upgrade woes!

An instance of etcd I use within the cluster was failing to come up. The images associated with version 6.10.0 of Bitnami's Helm chart will not run on arm64. An attempt to upgrade to 10.0.2 resulted in a demand for auth.rbac.rootPassword to be set despite auth.rbac.enabled being set to false. Sadly the change also results in changing immutable StatefulSet fields.

I really need to revisit this anyway, as I am not sure having three zones for split-horizon DNS is as appropriate as I thought, especially since the new Gateway API might resolve the issue anyway. So I added the following stanza to ensure the pod is only assigned to an amd64 node:

nodeSelector:
  kubernetes.io/arch: amd64

Unfortunately chart version 6.10.0 no longer exists; the oldest version I could find was 8.11.4. That applied without a problem, and the referenced version of etcd does include an arm64 image, so I removed the nodeSelector stanza anyway.
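For reference, the upgrade ended up looking roughly like this sketch; my-etcd and values.yaml are placeholder names, and I am assuming the bitnami chart repository is already added and that the chart exposes nodeSelector at the top level of its values:

# Pin the chart version explicitly so the upgrade is repeatable; values.yaml
# carries the existing configuration (and, while testing, the nodeSelector above).
helm upgrade my-etcd bitnami/etcd --version 8.11.4 -f values.yaml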

Reflection

The power cross-over was not particularly painful. Next time I think I should do an upgrade audit prior to restarting everything; I am hoping this will shake out many of the problems ahead of time. In general, I would like to find a way to automate notifications of upgrade availability. Internal Ingress definitely needs to be revisited.