Stream of Consciousness

Mark Eschbach's random writings on various topics.

Ollama on k8s

Categories: tech

Tags: kubernetes ollama llms

Recently I’ve been playing with Ollama, so it’s time to look into deploying it on k8s. My home lab has a bunch of heterogeneous hardware, ranging from Raspberry Pis to 1-liter computers to a nearly 15-year-old server. I have a feeling that last one might be an issue, since it tends to have most software scheduled to it.

First Draft

Based on the official Ollama documentation, it looks fairly straightforward to just pull an image. Let’s quickly draft it up as a StatefulSet.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          # https://hub.docker.com/r/ollama/ollama/tags
          image: "ollama/ollama:0.13.5"
          ports:
            - containerPort: 11434
              name: inference
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  ports:
    - port: 11434
      name: inference
  selector:
    app: ollama

Ideally, we’ll have a volume mounted to avoid pulling down the models every time the pod restarts. For now let’s just see if it starts up!

Testing

Setting OLLAMA_HOST=ollama.workshop-ollama-test.svc.cluster.local:11434 in my local environment, ollama ls showed nothing. ollama pull qwen3:8b showed a nice 928Mb/s download speed!
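Spelled out as a shell session (assuming the Service DNS name is resolvable and reachable from the workstation):

```shell
# Point the local ollama CLI at the in-cluster Service
export OLLAMA_HOST=ollama.workshop-ollama-test.svc.cluster.local:11434

ollama ls            # fresh pod: no models listed yet
ollama pull qwen3:8b # downloads the model into the pod's filesystem
```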

As I suspected, the pod got scheduled to kal, my 12-core server with plenty of RAM but old CPUs. The token generation rate is abysmally low, saturating 11 of the 12 cores for 1m49s. Memory consumption was around 10GB of WSS for the entire run, while drawing around 23 watts of electrical power!
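One way to check where the pod landed and what it is using (this assumes the workshop-ollama-test namespace from the Service DNS name above, and that metrics-server is installed for kubectl top):

```shell
# Which node did the scheduler pick? ollama-0 is the pod name the StatefulSet generates
kubectl -n workshop-ollama-test get pod ollama-0 -o wide

# Per-pod CPU and memory usage; requires metrics-server in the cluster
kubectl -n workshop-ollama-test top pod ollama-0
```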

Caching models

Models are cached at /root/.ollama/models, so it is easy enough to add a volume mount to persist this data. This requires the StatefulSet to be recreated, though.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          # https://hub.docker.com/r/ollama/ollama/tags
          image: "ollama/ollama:0.13.5"
          ports:
            - containerPort: 11434
              name: inference
          volumeMounts:
            - name: model-cache
              mountPath: /root/.ollama/models
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "longhorn"
        resources:
          requests:
            storage: 32Gi
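Since volumeClaimTemplates cannot be added to an existing StatefulSet, the recreate looks something like the following (the manifest filename is hypothetical):

```shell
# Delete the old StatefulSet; the Service is a separate object and is untouched
kubectl -n workshop-ollama-test delete statefulset ollama

# Re-apply the updated manifest containing the volumeClaimTemplates section
kubectl -n workshop-ollama-test apply -f ollama.yaml
```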

Scheduling to faster devices?

Left over from when I had kubevirt installed, all of my nodes have CPU-feature labels attached. I figured I could use these to schedule onto nodes with better instruction-set extensions, specifically cpu-feature.node.kubevirt.io/avx2=true.
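A quick way to see which nodes carry the label:

```shell
# Nodes advertising the AVX2 feature label left behind by kubevirt
kubectl get nodes -l cpu-feature.node.kubevirt.io/avx2=true
```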

Looks something like this in my StatefulSet spec:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
spec:
  template:
    spec:
      containers:
        - name: ollama
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: cpu-feature.node.kubevirt.io/avx2
                    operator: In
                    values:
                      - "true"

Unsurprisingly, this resulted in a significant decrease in runtime, down to 38.0s — roughly a third of the original 1m49s. Nice!

Memory consumption sat around 5.55GB of WSS. I am willing to chalk the difference up to algorithms that scale with the total amount of RAM available to the container.

Electrical power consumption clocked in around 6 watts, roughly a quarter of the 23 watts kal drew, which is not bad! Interestingly, unlike when the pod was scheduled to kal, the power consumption has not dropped back down after the run. Fans definitely spin up on the device while the queries are running.

Conclusion

Overall, it was fairly simple to get Ollama up and running on k8s. Unfortunately, running the inference server on a CPU consumes a lot of power and produces a measly 3 tokens a second. In comparison, my current M4 laptop generates 43.84 tokens a second, nearly a 15x improvement, although its power consumption is unknown.

Changing to tinyllama:1.1b produces an amazing 23.64 tokens a second. However, the output is very interesting!
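For reference, the ollama CLI can print these token rates itself via the --verbose flag:

```shell
# Prints load/prompt/eval timings after the response, including eval rate in tokens/s
ollama run tinyllama:1.1b --verbose "Why is the sky blue?"
```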

Alternatives

  • ollama-helm is an unofficial Helm chart that looks heavily geared toward GPU support. More interesting to me is its Knative support, which I would like to look into at some point.