Serving Models

The Substratus Server resource lets you serve models that were loaded into Substratus. You can use one of the Substratus-provided Server images or create your own image.

Substratus provides the following images:

  • substratusai/model-server-basaran: This image can serve most HuggingFace models. It uses Basaran to provide an OpenAI-compatible API endpoint as well as a web UI with streaming support.
  • substratusai/model-server-llama-cpp: This image can serve GGML models that are supported by llama.cpp.

Tutorial: Creating a Server for falcon-7b-instruct

Prerequisites:

The falcon-7b-instruct Model must be loaded into Substratus. Run the following command to satisfy this prerequisite:

kubectl apply -f https://raw.githubusercontent.com/substratusai/substratus/main/examples/falcon-7b-instruct/base-model.yaml

Create the Server resource by running:

kubectl apply -f https://raw.githubusercontent.com/substratusai/substratus/main/examples/falcon-7b-instruct/server.yaml

The following Server resource is used:

apiVersion: substratus.ai/v1
kind: Server
metadata:
  name: falcon-7b-instruct
spec:
  image:
    name: substratusai/model-server-basaran
  model:
    name: falcon-7b-instruct
  resources:
    gpu:
      type: nvidia-l4
      count: 1

In the Server resource spec, the following things are configured:

  1. image.name: The image published by Substratus that serves the model.
  2. model.name: Refers to the name of the Model that was loaded earlier in this tutorial.
  3. resources: Specifies the resources needed to serve the model. The falcon-7b-instruct model requires GPUs to perform decently; in this case, 1 NVIDIA L4 GPU is requested.

It takes about 5 minutes to pull the container image, load the model into GPU memory, and become ready to serve requests. You can check whether the Server is ready by running:

kubectl describe server falcon-7b-instruct
# Name:         falcon-7b-instruct
# Namespace:    default
# Labels:       <none>
# Annotations:  <none>
# API Version:  substratus.ai/v1
# Kind:         Server
# Metadata:
#   Creation Timestamp:  2023-07-17T06:37:26Z
#   Generation:          1
#   Resource Version:    15962533
#   UID:                 a25eae87-c17b-40df-9e1e-7ccaff0f8a2e
# Spec:
#   Image:
#     Name:  substratusai/model-server-basaran
#   Model:
#     Name:  falcon-7b-instruct
#   Resources:
#     Cpu:   2
#     Disk:  10
#     Gpu:
#       Count:  1
#       Type:   nvidia-l4
#     Memory:   10
# Status:
#   Conditions:
#     Last Transition Time:  2023-07-17T06:42:01Z
#     Message:
#     Observed Generation:   1
#     Reason:                DeploymentReady
#     Status:                True
#     Type:                  Deployed
#   Ready:                   true
# Events:  <none>
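Instead of polling `kubectl describe` manually, you can block until the Server is ready. This is a sketch using `kubectl wait`, which works with any custom resource that exposes standard status conditions; it assumes the `Deployed` condition type shown in the output above:

```shell
# Wait until the Server reports the Deployed condition as True
# (condition type taken from the describe output above).
# Times out after 10 minutes to cover image pull and model load.
kubectl wait server/falcon-7b-instruct \
  --for=condition=Deployed \
  --timeout=600s
```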

By default Substratus creates a Kubernetes Service to expose the Server. However, this Service is of type ClusterIP, which means you cannot access it directly over the internet. So let's use Kubernetes port forwarding to access the Server.

Run the following command to forward your local port 8080 to the Server's port 8080:

kubectl port-forward service/falcon-7b-instruct-server 8080:8080

You should now be able to access the Server's web interface by going to http://localhost:8080.
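Besides the web UI, the Basaran-based image exposes an OpenAI-compatible API. A sketch of querying it through the port-forward; the endpoint path and request parameters assume Basaran's OpenAI-compatible completions API, and the prompt and token limit are illustrative:

```shell
# Send a text completion request through the local port-forward.
# Path and fields follow the OpenAI-compatible completions API;
# prompt and max_tokens are illustrative values.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "What is Kubernetes?",
        "max_tokens": 50
      }'
```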