Ollama on k8s
Categories: tech
Tags: kubernetes ollama llms
Recently I’ve been playing with Ollama, so it’s time to look into deploying it on k8s. My home lab has a bunch of heterogeneous hardware, ranging from Raspberry Pis to 1-liter computers to a nearly 15-year-old server. I have a feeling that last one might be an issue, since it tends to have most software deployed to it.
First Draft
Based on the official Ollama documentation, it looks like it might be fairly straightforward to just pull an image. Let’s quickly draft it up as a StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          # https://hub.docker.com/r/ollama/ollama/tags
          image: "ollama/ollama:0.13.5"
          ports:
            - containerPort: 11434
              name: inference
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  ports:
    - port: 11434
      name: inference
  selector:
    app: ollama
Ideally, we’ll have a volume mounted to avoid pulling down the models every time the pod restarts. For now let’s just see if it starts up!
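A quick way to check is sketched below. The manifest filename ollama.yaml and the workshop-ollama-test namespace are assumptions here, so adjust for your cluster; /api/version is a cheap check that the Ollama API is answering.

```shell
# Sanity-check sketch; ollama.yaml and the workshop-ollama-test
# namespace are assumed names, adjust for your cluster.
NS=workshop-ollama-test
SVC_URL="http://ollama.${NS}.svc.cluster.local:11434"
echo "in-cluster endpoint: ${SVC_URL}"

# Only run the cluster steps when kubectl is actually available.
if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -n "$NS" -f ollama.yaml
  kubectl rollout status -n "$NS" statefulset/ollama
  # Hit the service from inside the cluster to confirm it answers.
  kubectl run -n "$NS" api-check --rm -i --image=curlimages/curl \
    --restart=Never -- curl -sf "http://ollama:11434/api/version"
fi
```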
Testing
Setting OLLAMA_HOST=ollama.workshop-ollama-test.svc.cluster.local:11434 in the environment variables on my local machine, ollama ls showed nothing. ollama pull qwen3:8b shows a nice 928Mb/s download speed!
As I suspected, the pod got scheduled to kal, my 12-core server with plenty of RAM but old CPUs. The token generation rate is abysmally low, saturating 11 of the 12 cores for 1m49s. Memory consumption was at around 10GB of WSS for the entire run, while consuming around 23 watts of electrical power!
Caching models
Models are cached at /root/.ollama/models. Easy enough to add a volume mount to persist this data. This requires the StatefulSet to be recreated, though.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          # https://hub.docker.com/r/ollama/ollama/tags
          image: "ollama/ollama:0.13.5"
          ports:
            - containerPort: 11434
              name: inference
          volumeMounts:
            - name: model-cache
              mountPath: /root/.ollama/models
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "longhorn"
        resources:
          requests:
            storage: 32Gi
Scheduling to faster devices?
Left over from when I had KubeVirt installed, all of my nodes have labels attached describing their CPU features. I figured I could use these to schedule on nodes with better instruction set extensions, specifically cpu-feature.node.kubevirt.io/avx2=true. It looks something like this in my StatefulSet spec:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
spec:
  template:
    spec:
      containers:
        - name: ollama
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1  # required for preferred rules
              preference:
                matchExpressions:
                  - key: cpu-feature.node.kubevirt.io/avx2
                    operator: In
                    values:
                      - "true"
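Before relying on the preference, it’s worth confirming which nodes actually carry the label. A quick sketch, assuming the KubeVirt node-labeller labels are still present on the cluster:

```shell
# List nodes advertising AVX2, per the KubeVirt CPU-feature labels.
LABEL_KEY="cpu-feature.node.kubevirt.io/avx2"
SELECTOR="${LABEL_KEY}=true"
echo "selector: ${SELECTOR}"
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -l "$SELECTOR" -o wide
fi
```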
Unsurprisingly, this resulted in a significant decrease in runtime, down to 38.0s, roughly a third of the original time. Nice!
Memory consumption was around 5.55GB of WSS. I am willing to chalk this up to algorithms that scale with the total amount of RAM available to the container.
Electrical power consumption clocked in at around 6 watts, roughly a quarter of the previous draw, which is not bad! Interestingly, unlike when the pod was scheduled to kal, the power consumption has not dropped back down. Fans definitely spin up on the device while the queries are running.
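As a sanity check on the relative savings, a quick calculation from the measured numbers above (1m49s vs 38s runtime, 23W vs 6W power draw):

```shell
# Relative savings from moving off kal:
# runtime 1m49s (109s) -> 38s, power 23W -> 6W.
runtime_reduction=$(awk 'BEGIN { printf "%.0f", (1 - 38 / 109) * 100 }')
power_reduction=$(awk 'BEGIN { printf "%.0f", (1 - 6 / 23) * 100 }')
echo "runtime reduction: ${runtime_reduction}%"   # 65%
echo "power reduction: ${power_reduction}%"       # 74%
```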
Conclusion
Overall, it was fairly simple to get Ollama up and running on k8s. Unfortunately, running the inference server on a CPU consumes a lot of power and produces a measly 3 tokens a second. In comparison, my current M4 laptop can generate 43.84 tokens a second, nearly a 15x improvement, although its power consumption is unknown.
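Taking the rounded 3 tokens/s figure at face value, the laptop’s speedup works out to:

```shell
# Ratio of the M4 laptop's generation rate to the k8s pod's.
ratio=$(awk 'BEGIN { printf "%.1f", 43.84 / 3 }')
echo "laptop is ${ratio}x faster"   # 14.6x
```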
Changing to tinyllama:1.1b produces an amazing 23.64 tokens a second. However, the output is very interesting!
Alternatives
- ollama-helm is an unofficial Helm chart which looks heavily integrated with GPU support. More interesting to me is the Knative support, which I would like to look into at some point.