Deploying JupyterHub with Kubernetes: A Step-by-Step Guide

Author: Harsh Patel

JupyterHub is a powerful tool for deploying and managing Jupyter Notebooks at scale. With JupyterHub, you can give multiple users access to their own Jupyter Notebook servers through a single shared hub. This can be useful in a variety of settings, such as classrooms, research groups, or companies that use Jupyter Notebooks for data analysis and modeling.

Kubernetes is an open-source container orchestration platform that can help you manage and scale your JupyterHub deployment. By using Kubernetes to deploy JupyterHub, you can easily scale your deployment up or down as needed while ensuring it is highly available and resilient.

Below we’ll walk you through each step of deploying JupyterHub with Kubernetes so that you can get the most out of your deployment.

Prerequisites

Before we get started, you’ll need to have the following:

A running Kubernetes cluster
Docker Desktop installed, with kubectl available on your machine. You will be running docker and kubectl commands locally.
A DockerHub account (or another container registry where you can store your JupyterHub Docker images)
Basic knowledge of Kubernetes concepts, such as Pods, Deployments, and Services. For more information, see the Kubernetes documentation (https://kubernetes.io/docs/concepts/).

Step 1: Create a JupyterHub Configuration File

The first step is to create a configuration file that tells JupyterHub how to set itself up. You can use a default configuration file as a starting point and modify it as needed. Here’s an example configuration file:

auth:
  type: dummy

hub:
  cookie_secret: "YOUR_SECRET_KEY"
  db:
    url: postgresql://jupyterhub:jupyterhub@jupyterhub-db/jupyterhub
  service:
    type: ClusterIP
  url: http://jupyterhub:8000

proxy:
  secretToken: "YOUR_SECRET_TOKEN"

singleuser:
  image:
    name: "YOUR_JUPYTER_NOTEBOOK_IMAGE"
    tag: "latest"
  storage:
    type: none

In this configuration file, we’re using a dummy authentication system for simplicity. It accepts any combination of username and password, so it is only suitable for testing; you can switch to any authentication system that JupyterHub supports, such as OAuth or LDAP. We’re also using a PostgreSQL database for storing user information, so you’ll need to set up a PostgreSQL database separately (see Step 3). Replace YOUR_SECRET_KEY and YOUR_SECRET_TOKEN with random values of your own and keep them out of version control.
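
The YOUR_SECRET_KEY and YOUR_SECRET_TOKEN placeholders in the configuration above need random values. Here is a minimal sketch for generating them, assuming that 32 random bytes rendered as 64 hexadecimal characters are suitable for both. The helper name generate_hex_secret is hypothetical and not part of JupyterHub:

```python
import secrets


def generate_hex_secret(num_bytes: int = 32) -> str:
    """Return a cryptographically strong random value as a hex string."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Generate one independent value per placeholder.
    print("cookie_secret:", generate_hex_secret())
    print("secretToken:  ", generate_hex_secret())
```

Generate each value once and reuse it across restarts; the output is different on every run.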

Step 2: Create a Docker Image for JupyterHub

The next step is to create a Docker image for JupyterHub that includes your configuration file. Here’s an example Dockerfile:

FROM jupyterhub/jupyterhub:1.4

COPY jupyterhub_config.yaml /srv/jupyterhub/jupyterhub_config.yaml

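This Dockerfile starts with the official JupyterHub Docker image and copies your configuration file to /srv/jupyterhub/jupyterhub_config.yaml. Note that JupyterHub does not read this YAML file automatically; at runtime it loads the Python configuration file passed with --config (see Steps 4 and 5). You can build this image and push it to your container registry, where YOUR_IMAGE_NAME includes your registry namespace (for example, your DockerHub username):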

docker build -t YOUR_IMAGE_NAME .
docker push YOUR_IMAGE_NAME

Step 3: Set up a PostgreSQL Database

As mentioned earlier, we’re using a PostgreSQL database to store user information. You’ll need to set up a PostgreSQL database separately and create a user and database for JupyterHub. Here’s an example kubectl command for creating a PostgreSQL database:

kubectl create -f postgres.yaml
 
# postgres.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: jupyterhub-db
spec:
  ports:
    - name: postgresql
      port: 5432
  selector:
    app: postgres
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:13
          env:
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: jupyterhub-db-secrets
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: jupyterhub-db-secrets
                  key: password
            - name: POSTGRES_DB
              value: jupyterhub
          volumeMounts:
            - name: postgres-pvc
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: postgres-pvc
          persistentVolumeClaim:
            claimName: postgres-pvc
---
apiVersion: v1
kind: Secret
metadata:
  name: jupyterhub-db-secrets
type: Opaque
data:
  username: <base64-encoded-postgres-username>
  password: <base64-encoded-postgres-password>

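In this example, we’re using a Persistent Volume Claim to create persistent storage for our database. We’ve also created a Service and a Deployment for the PostgreSQL database, plus a Secret that holds the database credentials. The Service is named jupyterhub-db, which is the host name used in the database URL from Step 1, and the credentials in the Secret should match the ones in that URL.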

`base64-encoded-postgres-username` and `base64-encoded-postgres-password` are placeholders for the PostgreSQL username and password encoded in Base64 format. Base64 encodes binary data as ASCII text, making it possible to transmit data over text-based channels such as email, chat, and HTTP. It is an encoding, not encryption, so treat the manifest as sensitive.

To encode a PostgreSQL username in Base64, you can use a command-line tool or an online Base64 encoder. Here is an example of how to encode a PostgreSQL username using the `base64` command in a Linux terminal:

$ echo -n "postgres_username" | base64

This command will output the Base64-encoded version of the PostgreSQL username, which you can then paste into the username field of the Secret. Encode the password the same way.

Note that the -n option is used with the echo command to prevent adding a newline character at the end of the string, which could cause issues when decoding the value later on.
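
If you prefer not to depend on the shell, the same encoding can be sketched in Python. The helper names encode_secret_value and decode_secret_value are hypothetical, and the username jupyterhub is only an assumed example value:

```python
import base64


def encode_secret_value(value: str) -> str:
    """Base64-encode a string for use under `data:` in a Kubernetes Secret."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse encode_secret_value, e.g. to double-check a pasted value."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    username = "jupyterhub"  # assumed example value
    print(encode_secret_value(username))         # anVweXRlcmh1Yg==
    # A trailing newline (what echo without -n would add) changes the result.
    print(encode_secret_value(username + "\n"))  # anVweXRlcmh1Ygo=
```

Decoding the value before pasting it into the manifest is a quick way to confirm that no stray newline slipped in.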

Step 4: Deploy JupyterHub

Now that you have a JupyterHub Docker image and a PostgreSQL database set up, you can deploy JupyterHub to your Kubernetes cluster. Here’s an example kubectl command for deploying JupyterHub:

kubectl create -f jupyterhub.yaml
 
# jupyterhub.yaml
apiVersion: v1
kind: Service
metadata:
  name: jupyterhub
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: jupyterhub
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterhub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterhub
  template:
    metadata:
      labels:
        app: jupyterhub
    spec:
      containers:
        - name: jupyterhub
          image: YOUR_IMAGE_NAME
          imagePullPolicy: Always
          env:
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: jupyterhub-db-secrets
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: jupyterhub-db-secrets
                  key: password
            - name: POSTGRES_HOST
              value: jupyterhub-db
          command: ["jupyterhub"]
          args: ["--config", "/etc/jupyterhub/jupyterhub_config.py"]
          volumeMounts:
            - name: jupyterhub-cfg
              mountPath: /etc/jupyterhub/
            - name: jupyterhub-data
              mountPath: /data
      volumes:
        - name: jupyterhub-cfg
          configMap:
            name: jupyterhub-config
        - name: jupyterhub-data
          persistentVolumeClaim:
            claimName: jupyterhub-pvc

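In this example, we’ve created a Service and Deployment for JupyterHub. The Deployment specifies the Docker image to use, sets the environment variables for the PostgreSQL database connection, and mounts the configuration file and data volume. The manifest references a ConfigMap named jupyterhub-config and a Persistent Volume Claim named jupyterhub-pvc that it does not define, so both must exist before the Pod can start. The ConfigMap can be created from the jupyterhub_config.py file described in Step 5 (for example, with kubectl create configmap jupyterhub-config --from-file=jupyterhub_config.py), and the claim can follow the same pattern as postgres-pvc in Step 3.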

Step 5: Configure JupyterHub

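You’ll need to configure JupyterHub to store its state in the PostgreSQL database and to choose how users are authenticated and how their servers are spawned. To do this, create a configuration file called jupyterhub_config.py and mount it into the JupyterHub container through the jupyterhub-config ConfigMap referenced in Step 4. Here’s an example jupyterhub_config.py file: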

c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

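# Use the Postgres database to store the Hub's state (users, servers, tokens).
c.JupyterHub.db_url = 'postgresql://jupyterhub:jupyterhub@jupyterhub-db/jupyterhub'

# Use the DockerSpawner to start user containers.
# DockerSpawner ships in the separate dockerspawner package, which must be
# installed in the Hub image.
from dockerspawner import DockerSpawner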

class MyDockerSpawner(DockerSpawner):
    def _options_form_default(self):
        return '''
        <label for="cpu">CPU limit (in cores):</label>
        <input type="text" name="cpu" placeholder="1">
        <br>
        <label for="mem">Memory limit (in GB):</label>
        <input type="text" name="mem" placeholder="1">
        <br>
        <label for="gpu">GPU count:</label>
        <input type="text" name="gpu" placeholder="0">
        <br>
        <label for="image">Docker image:</label>
        <input type="text" name="image" value="jupyter/minimal-notebook">
        <br>
        <label for="name">Container name:</label>
        <input type="text" name="name" placeholder="{username}-notebook">
        '''

c.JupyterHub.spawner_class = MyDockerSpawner

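# Listen on all interfaces inside the Pod, and tell single-user servers to
# reach the Hub through the jupyterhub Service name.
c.JupyterHub.hub_ip = '0.0.0.0'
c.JupyterHub.hub_connect_ip = 'jupyterhub'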

In this example, the PAMAuthenticator authenticates users against the system accounts available inside the Hub container, while the PostgreSQL database stores the Hub’s own state. We’re also using DockerSpawner to start user containers, and we’ve customized the spawner’s options form with fields for CPU and memory limits, GPU count, Docker image, and container name.
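
The options form only renders the input fields; the submitted values still have to be read and validated before they can influence a spawn. The sketch below shows one way to do that as a plain function, assuming the form data arrives as a dict that maps field names to lists of strings (the shape JupyterHub hands to a spawner's options_from_form hook). The function name parse_spawn_options and the returned keys are hypothetical, and wiring the result into the spawner is left out:

```python
def parse_spawn_options(formdata: dict) -> dict:
    """Convert submitted form data into typed, validated spawn options."""

    def first(name: str, default: str) -> str:
        # Each field is a list of strings; fall back to the default when the
        # field is missing or was left empty.
        values = formdata.get(name) or [default]
        return values[0].strip() or default

    cpu = float(first("cpu", "1"))
    mem_gb = float(first("mem", "1"))
    gpu = int(first("gpu", "0"))
    if cpu <= 0 or mem_gb <= 0 or gpu < 0:
        raise ValueError("cpu and mem must be positive, gpu must not be negative")

    return {
        "cpu_limit": cpu,              # in cores
        "mem_limit": f"{mem_gb:g}G",   # e.g. "1G" or "0.5G"
        "gpu_count": gpu,
        "image": first("image", "jupyter/minimal-notebook"),
    }
```

Because the image field is free text, a real deployment would probably also check it against a list of allowed images before using it.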

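Note that the host name in the db_url option should match the value of the POSTGRES_HOST environment variable in the JupyterHub Deployment (jupyterhub-db, the name of the PostgreSQL Service), and the username and password should match the values stored in the jupyterhub-db-secrets Secret.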
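
One way to keep these values in sync is to build the URL from the environment variables that the Deployment already injects, instead of hard-coding the credentials. This is a sketch under that assumption; the helper name build_db_url is hypothetical, and the port and database name defaults are taken from the manifests in this guide:

```python
import os
from urllib.parse import quote


def build_db_url(env=os.environ, database: str = "jupyterhub", port: int = 5432) -> str:
    """Assemble a PostgreSQL connection URL from POSTGRES_* variables."""
    # Percent-encode the credentials so characters such as "@" or "/" in a
    # password do not break the URL.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    host = env.get("POSTGRES_HOST", "jupyterhub-db")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# In jupyterhub_config.py this could replace the hard-coded URL:
# c.JupyterHub.db_url = build_db_url()
```

With this approach the Secret remains the single place where the credentials are defined.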

Step 6: Start JupyterHub

Now that you have everything set up, you can start JupyterHub by running the following command:

kubectl apply -f jupyterhub.yaml

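This will create the Kubernetes Service and Deployment for JupyterHub if they do not exist yet, or update them if you already created them in Step 4. Once the Hub Pod is running, JupyterHub will automatically start a single-user Jupyter Notebook server for each user that logs in. Because the Service is of type NodePort, you can look up the assigned port with kubectl get service jupyterhub and open JupyterHub at that port on the address of any cluster node.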
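
To confirm that the Hub has come up, you can poll its health endpoint. The sketch below assumes that the deployment answers on the /hub/health path and that you know a node address and port for the Service; the function name wait_for_hub and the example URL are hypothetical:

```python
import time
import urllib.error
import urllib.request


def wait_for_hub(base_url, timeout=120.0, interval=5.0, fetch=None):
    """Poll <base_url>/hub/health until it returns 200 or the timeout expires."""

    def default_fetch(url):
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status

    fetch = fetch or default_fetch
    url = base_url.rstrip("/") + "/hub/health"
    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # Hub not reachable yet; retry until the deadline.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example (placeholder address and port):
# wait_for_hub("http://NODE_IP:NODE_PORT")
```

The fetch parameter exists only so the polling logic can be exercised without a live cluster.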

Conclusion

Deploying JupyterHub on Kubernetes can be a powerful way to provide a collaborative data science environment for your team or organization. With Kubernetes, you can easily scale JupyterHub to handle a large number of users and keep your environment up and running even in the face of hardware failures or other issues.

By following the steps outlined in this guide, you should be able to deploy JupyterHub on Kubernetes and start using it to enable collaborative data science in your organization. If you want to learn more about this or other data science related topics, follow our blog or contact us today.