2 posts tagged with "kind"

· 2 min read

Don't you just love it when you submit a PR and it turns out that no code is needed? That's exactly what happened when I tried to add GPU support to Kind.

In this blog post you will learn how to configure Kind such that it can use the GPUs on your device. Credit to @klueska for the solution.

Install the NVIDIA container toolkit by following the official install docs.

Configure NVIDIA to be the default runtime for Docker:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
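
Before moving on, it can be worth confirming that Docker picked up the change. The following is a minimal sketch, not part of the original steps: `default_runtime_is_nvidia` is a hypothetical helper name, and it assumes the runtime was registered under the name `nvidia`.

```shell
# Hypothetical helper: succeeds when Docker reports "nvidia" as its default runtime.
# Assumes the runtime is registered under the name "nvidia".
default_runtime_is_nvidia() {
  [ "$(docker info --format '{{.DefaultRuntime}}')" = "nvidia" ]
}

# Usage:
#   default_runtime_is_nvidia && echo "ok" || echo "default runtime is not nvidia"
```

If the check fails, the restart may not have happened yet or the runtime may have been registered under a different name.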

Set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml:

sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml
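
Note that the sed command only rewrites a line that already mentions the key (commented out or not); if the key is missing from the file entirely, nothing changes. A small check like the one below can catch that case. It is a sketch under assumptions: `volume_mounts_enabled` is a hypothetical helper name, and the default path is the one used in the step above.

```shell
# Hypothetical helper: succeeds only if the flag is set to true on an
# uncommented line of the given config file.
volume_mounts_enabled() {
  config_file="${1:-/etc/nvidia-container-runtime/config.toml}"
  grep -Eq '^[[:space:]]*accept-nvidia-visible-devices-as-volume-mounts[[:space:]]*=[[:space:]]*true' "$config_file"
}

# Usage:
#   volume_mounts_enabled && echo "flag is set" || echo "flag is missing"
```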

Create a Kind Cluster:

kind create cluster --name substratus --config - <<EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
  # required for GPU workaround
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
EOF

Workaround for issue with missing required file /sbin/ldconfig.real:

docker exec -ti substratus-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real

Install the K8s NVIDIA GPU operator so K8s is aware of your NVIDIA device:

helm repo add nvidia || true
helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator --set driver.enabled=false

You should now have a working Kind cluster that can access your GPU. Verify it by running a simple pod:

kubectl apply -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: ""
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
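
To check that the cluster really sees the GPU, you can look at what the node advertises as allocatable. This is a rough sketch: `node_gpu_count` is a hypothetical helper name, and it assumes the GPU operator exposes the device under the resource name `nvidia.com/gpu` on the first (and only) node.

```shell
# Hypothetical helper: prints the number of GPUs the first node advertises as allocatable.
# The dot in the resource name nvidia.com/gpu must be escaped inside JSONPath.
node_gpu_count() {
  kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'
}

# Usage:
#   node_gpu_count               # expected to be empty until the GPU has been registered
#   kubectl logs cuda-vectoradd  # output of the test pod once it has completed
```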

· 3 min read

A 13B-parameter Llama model running on a laptop with a mere RTX 2060?! Yes, it all ran surprisingly well at around 7 tokens / sec. Follow along and learn how to do this in your own environment.

My laptop setup looks like this:

  • Kind for deploying a single node K8s cluster
  • AMD Ryzen 7 (8 threads), 16 GB system memory, RTX 2060 (6GB GPU memory)
  • Llama.cpp/GGML for fast serving and loading larger models on consumer hardware

You might be wondering: how can a model with 13 billion parameters fit into a 6 GB GPU? Even in 4-bit mode you'd expect it to need about 6.5 GB, because 13 billion * 4 bytes / (32 bits / 4 bits) = 6.5 GB, which is still more than the GPU has. But thanks to Llama.cpp, we can load only part of the model into the GPU and leave the rest to the CPU. Plus, Llama.cpp can run efficiently using just the CPU.
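
The estimate boils down to parameters times bits per weight, divided by 8 bits per byte. Here is a minimal sketch of that arithmetic; `weights_gb` is a hypothetical helper name, and the result covers the weights only. Real quantized files tend to be somewhat larger because they also store scaling factors and metadata.

```shell
# Hypothetical helper: approximate weight memory in GB.
# Arguments: parameters in billions, bits per weight.
weights_gb() {
  params_billion="$1"
  bits_per_weight="$2"
  awk -v p="$params_billion" -v b="$bits_per_weight" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 13 32   # prints 52.0 (full 32-bit floats)
weights_gb 13 4    # prints 6.5 (4-bit quantization)
```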

Want to try this out yourself? Follow along for a fun ride.

Create Kind K8s cluster with GPU support

Install the NVIDIA container toolkit for Docker: Install Guide

Use the convenience script to create a Kind cluster and configure GPU support:

bash <(curl

Or inspect the script and run the steps one by one.

Install Substratus

Install the Substratus K8s operator which will orchestrate model loading and serving:

kubectl apply -f

Load the Llama 2 13b chat GGUF model

Create a Model resource to load the Llama 2 13b chat GGUF model:

apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: llama2-13b-chat-gguf
spec:
  image: substratusai/model-loader-huggingface
  params:
    name: substratusai/Llama-2-13B-chat-GGUF
    files: "model.bin"

kubectl apply -f

The model is being downloaded from HuggingFace into your Kind cluster.

Serve the model

Create a Server resource to serve the model:

apiVersion: substratus.ai/v1
kind: Server
metadata:
  name: llama2-13b-chat-gguf
spec:
  image: substratusai/model-server-llama-cpp:latest-gpu
  model:
    name: llama2-13b-chat-gguf
  params:
    n_gpu_layers: 30
  resources:
    gpu:
      count: 1

kubectl apply -f

Note that in my case 30 out of 42 layers was the most I could load into the GPU, but you might be able to load all 42 layers if you have more GPU memory.
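
To get a feel for a starting value of n_gpu_layers on your own card, you can estimate how much of the weights a given number of layers represents. This is only a back-of-the-envelope sketch: `offloaded_gb` is a hypothetical helper name, it assumes all layers are the same size, it uses the rough 6.5 GB figure for 4-bit weights, and it ignores the extra GPU memory needed for the context cache and scratch buffers.

```shell
# Hypothetical helper: approximate GB of weights placed on the GPU.
# Arguments: layers offloaded, total layers, total weight size in GB.
# Assumes every layer is the same size, which is only roughly true.
offloaded_gb() {
  awk -v n="$1" -v total="$2" -v size="$3" 'BEGIN { printf "%.1f\n", n / total * size }'
}

offloaded_gb 30 42 6.5   # prints 4.6
offloaded_gb 42 42 6.5   # prints 6.5
```

Under these assumptions, 30 layers come to roughly 4.6 GB of weights, which leaves some headroom on a 6 GB card for everything else the GPU has to hold.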

Once the model is ready, it will start serving an OpenAI-compatible API endpoint.

Expose the Server to a local port by using port forwarding:

kubectl port-forward service/llama2-13b-chat-gguf-server 8080:8080

Let's throw some prompts at it:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{ "prompt": "Who was the first president of the United States?", "stop": ["."]}'

Check out the full API docs here: http://localhost:8080/docs

You can play around with other models. For example, if you have a 24 GB GPU card, you should be able to run Llama 2 70B in 4-bit mode by using llama.cpp.

Support the project by adding a star on GitHub!