
Host Text-Generator.io on Kubernetes

Kubernetes is a convenient way to set up Text Generator with autoscaling and zero-downtime deploys

Requirements

A self-hosted Text Generator subscription (to download the Docker image), Docker, a container image registry you can push to, and a Kubernetes cluster with kubectl access and at least one GPU node (24GB of VRAM minimum).

Quickstart

Download the Docker zip file from your account; this requires a self-hosted subscription
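
For example, assuming the downloaded archive is named text-generator.zip and contains the text-generator.tar image used below:

unzip text-generator.zip # extracts text-generator.tar (archive name is an assumption; use the file from your account)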

Load the container, then tag and push it to a registry

Create a Docker image registry you own, e.g. AWS ECR or Google Container Registry, then push the image to it

For example, if your repository is

us.gcr.io/PROJECT/REPO/

docker load -i text-generator.tar
docker tag text-generator-customer:v1 us.gcr.io/PROJECT/REPO/prod-app:v1
docker push us.gcr.io/PROJECT/REPO/prod-app:v1
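
If the push is rejected with an authentication error, log Docker in to the registry first, for example:

# Google Container Registry
gcloud auth configure-docker
# AWS ECR (ACCOUNT_ID and us-east-1 are placeholders)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com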
Set up Kubernetes infrastructure

The service can run more cost-effectively on AWS via a g4dn.2xlarge, or on a Google Cloud machine with an NVIDIA A100 40GB GPU (as 24GB of VRAM is the minimum)

Create a node with a GPU on it and ensure it has enough disk space for 40G of models. The storage location should ideally be a fast disk (SSD)
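
For example, on GKE a node pool matching the Google Cloud option above could look like the following (the pool and cluster names are placeholders; a2-highgpu-1g comes with one A100 40GB). Depending on your cluster setup you may also need the NVIDIA driver DaemonSet, per GKE's GPU docs:

gcloud container node-pools create gpu-pool \
  --cluster=CLUSTER_NAME \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1 \
  --disk-type=pd-ssd \
  --disk-size=100GB \
  --num-nodes=1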

Then create a pod to be scheduled on the node.

Example Kubernetes deployment file deploy-gpu.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prod-app
spec:
  replicas: 1 # Note that doing a zero downtime rolling deployment with 2 replicas would require subscribing to run 2 instances concurrently
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: prod-app
  template:
    metadata:
      labels:
        app: prod-app
    spec:
      # Necessary to have enough shared memory
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
      containers:
        - name: prod-app
          image: us.gcr.io/PROJECT/REPO/prod-app:v1 # todo use your image name here
          imagePullPolicy: IfNotPresent
          env:
            - name: API_KEY
              value: "TEXT_GENERATOR_API_KEY"
          # Ensure that the node has a GPU
          resources:
            requests:
              cpu: 1500m
              memory: 30G
            limits:
              nvidia.com/gpu: "1"
          # Necessary to have enough shared memory
          volumeMounts:
            - mountPath: /models
              name: dshm
          livenessProbe:
            failureThreshold: 3 # 3 failures * 240s period = 12 min before restart
            httpGet:
              scheme: HTTP
              path: /liveness_check
              port: 8080
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 240
          readinessProbe:
            failureThreshold: 10 # 10*30s = 5 min startup time
            httpGet:
              scheme: HTTP
              path: /liveness_check
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
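
The API_KEY above is set as a plain value; you may prefer to keep it in a Kubernetes Secret instead. A minimal sketch (the Secret name text-generator-secret is an assumption):

kubectl create secret generic text-generator-secret --from-literal=api-key=TEXT_GENERATOR_API_KEY

Then reference it from the container spec instead of the plain value:

          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: text-generator-secret # assumed Secret name, created above
                  key: api-key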

Apply the changes in Kubernetes

kubectl apply -f deploy-gpu.yaml
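
You can then watch the rollout and confirm the pod passes its readiness probe:

kubectl rollout status deployment/prod-app
kubectl get pods -l app=prod-app
kubectl logs deployment/prod-app --tail=50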
Autoscaling

Autoscaling can be set up via a Kubernetes Horizontal Pod Autoscaler, for example on AWS EKS
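
The autoscaler reads CPU usage from the Kubernetes metrics API, so the cluster needs metrics-server (GKE ships it by default; on EKS it typically needs installing). A minimal sketch using the upstream manifest:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml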

Example hpa.yaml file

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: prod-app-hpa
  namespace: default
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prod-app # must match the Deployment name from deploy-gpu.yaml
  targetCPUUtilizationPercentage: 50 # tries to maintain 50 percent CPU usage; scales up if consistently over and back down if under

Apply the changes in Kubernetes

kubectl apply -f hpa.yaml
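
Check that the autoscaler can read metrics and is tracking the deployment:

kubectl get hpa prod-app-hpa
kubectl describe hpa prod-app-hpa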
Exposing the service to the web

This example exposes a service running on a GKE cluster so it can be queried from the outside world

This requires a Service and an Ingress (a route to the Service)

Example service.yaml file

apiVersion: v1
kind: Service
metadata:
  name: gke-gpu-service2
  labels:
    app: prod-app
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  ports:
  - name: sentiment
    port: 80
    targetPort: 8080
    protocol: TCP
  selector:
    app: prod-app

Example ingress.yaml file

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prod-ingress
spec:
  rules:
  - http:
      paths:
      - path: /*
        pathType: ImplementationSpecific
        backend:
          service:
            name: gke-gpu-service2 # this refers to the service name
            port:
              number: 80

Apply the changes in Kubernetes

kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
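
Once the ingress has an external address (load balancer provisioning can take a few minutes), you can hit the container's liveness endpoint through it:

kubectl get ingress prod-ingress # wait for an external ADDRESS
EXTERNAL_IP=$(kubectl get ingress prod-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl "http://$EXTERNAL_IP/liveness_check"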
Guidance

If you're not sure about hosting Text Generator yourself, try the online playground to find use cases and prove out the system, then transition to self-hosting later to further save costs.

If you're using another provider like OpenAI, hosting yourself can provide large cost savings, and changing over the API is an easy migration from OpenAI

Text Generator Playground