Host Text-Generator.io on Kubernetes
Kubernetes is a convenient way to set up Text Generator while also providing autoscaling and zero-downtime deploys
Requirements
- Docker
- CUDA 11+
- NVIDIA Docker
- Kubernetes and kubectl
- A Kubernetes cluster, e.g. via AWS EKS or Google Cloud GKE (see the sketch after this list)
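If you do not have a cluster yet, one can be created with the provider's CLI; a minimal sketch, with placeholder cluster names and regions:
# Google Cloud GKE
gcloud container clusters create text-generator-cluster --zone us-central1-c
# AWS EKS (via eksctl)
eksctl create cluster --name text-generator-cluster --region us-east-1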
Quickstart
Download the Docker image file from your account; this requires a self-hosted subscription
Create a Docker image registry you own, e.g. AWS ECR or Google Container Registry, then load the container image, tag it, and push it to the registry
For example, if your repository is
us.gcr.io/PROJECT/REPO/
docker load -i text-generator.tar
docker tag text-generator-customer:v1 us.gcr.io/PROJECT/REPO/prod-app:v1
docker push us.gcr.io/PROJECT/REPO/prod-app:v1
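Depending on the registry, Docker may need to authenticate first; both commands below are the standard CLI calls, with placeholder account details:
# Google Container Registry
gcloud auth configure-docker
# AWS ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com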
Set up Kubernetes infrastructure
The service can run more cost-effectively on AWS via a g4dn.2xlarge, or on a Google Cloud machine with an NVIDIA A100 40GB GPU (24GB of VRAM is the minimum)
Create a node with a GPU and ensure it has enough disk space for 40GB of models. The storage location should ideally be a fast disk (SSD)
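On GKE, for example, a GPU node pool with SSD storage could be created roughly as follows; the cluster name, pool name and zone are placeholders, and GKE additionally needs the NVIDIA drivers installed via Google's driver DaemonSet:
gcloud container node-pools create gpu-pool \
  --cluster text-generator-cluster --zone us-central1-c \
  --machine-type a2-highgpu-1g \
  --accelerator type=nvidia-tesla-a100,count=1 \
  --num-nodes 1 --disk-type pd-ssd --disk-size 100
# install the NVIDIA drivers on GKE GPU nodes
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml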
Then create a pod to be scheduled on the node.
Example Kubernetes deployment file deploy-gpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prod-app
spec:
  replicas: 1 # note: a zero-downtime rolling deployment with 2 replicas requires a subscription that allows 2 concurrent instances
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: prod-app
  template:
    metadata:
      labels:
        app: prod-app
    spec:
      # Necessary to have enough shared memory
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
      containers:
        - name: prod-app
          image: us.gcr.io/PROJECT/REPO/prod-app:v1 # TODO: use your image name here
          imagePullPolicy: IfNotPresent
          env:
            - name: API_KEY
              value: "TEXT_GENERATOR_API_KEY" # TODO: replace with your Text Generator API key
          # Ensure that the node has a GPU
          resources:
            requests:
              cpu: 1500m
              memory: 30G
            limits:
              nvidia.com/gpu: "1"
          # Necessary to have enough shared memory
          volumeMounts:
            - mountPath: /models
              name: dshm
          livenessProbe:
            failureThreshold: 3 # 3 failures at 240s intervals allow time to recover before a restart
            httpGet:
              scheme: HTTP
              path: /liveness_check
              port: 8080
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 240
          readinessProbe:
            failureThreshold: 10 # 10 * 30s = 5 min startup time
            httpGet:
              scheme: HTTP
              path: /liveness_check
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
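Rather than hard-coding the API key in the manifest, it can come from a Kubernetes Secret; a minimal sketch, assuming a secret named text-generator-secrets:
kubectl create secret generic text-generator-secrets --from-literal=api-key=TEXT_GENERATOR_API_KEY
The env section above would then reference the secret instead of a literal value:
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: text-generator-secrets
        key: api-key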
Apply the changes in Kubernetes
kubectl apply -f deploy-gpu.yaml
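The rollout can then be checked with standard kubectl commands:
kubectl rollout status deployment/prod-app
kubectl get pods -l app=prod-app
# tail the logs to confirm the models load and the server starts listening on 8080
kubectl logs -f deployment/prod-app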
Autoscaling
Autoscaling can be set up via a Kubernetes Horizontal Pod Autoscaler, for example on AWS EKS
Example hpa.yaml file
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: prod-app-hpa
  namespace: default
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prod-app # must match the Deployment name above
  targetCPUUtilizationPercentage: 50 # tries to maintain 50 percent CPU usage: scales up if consistently over, back down if under
Apply the changes in Kubernetes
kubectl apply -f hpa.yaml
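The autoscaler reads CPU usage from the Kubernetes metrics API, so the cluster needs metrics-server running (preinstalled on GKE; on EKS it can be installed from the upstream manifest). The current scaling state can then be inspected:
# EKS: install metrics-server if it is not already present
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl get hpa prod-app-hpa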
Exposing the service to the web
This example exposes a service running on a GKE cluster so it can be queried from the outside world
This requires a Service and an Ingress (a route to the Service)
Example service.yaml file
apiVersion: v1
kind: Service
metadata:
  name: gke-gpu-service2
  labels:
    app: prod-app
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  ports:
    - name: sentiment
      port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app: prod-app
Example ingress.yaml file
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prod-ingress
spec:
  rules:
    - http:
        paths:
          - path: /*
            pathType: ImplementationSpecific
            backend:
              service:
                name: gke-gpu-service2 # this refers to the service name above
                port:
                  number: 80
Apply the changes in Kubernetes
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
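Once the load balancer has an external IP, the deployment can be smoke-tested against the health endpoint used by the probes above; the generate call is a sketch assuming the self-hosted container serves the same /api/v1/generate route as the hosted API:
kubectl get ingress prod-ingress # note the external ADDRESS
curl http://EXTERNAL_IP/liveness_check
curl -X POST http://EXTERNAL_IP/api/v1/generate \
  -H "secret: TEXT_GENERATOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Once upon a time"}'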
Guidance
If you're not sure about hosting Text Generator yourself, try the online playground to find use cases and prove out the system, then transition to self-hosting later to further save costs.
If you're using another provider like OpenAI, hosting yourself can provide large cost savings, and switching the API over is an easy migration from OpenAI