Rob Oakes
Dec 02, 2019

Spark on Kubernetes: Jupyter and Beyond

After you've got Spark running on Kubernetes, how can you integrate the runtime with applications like Jupyter?

In many organizations, Apache Spark is the computational engine that powers big data. Spark, a general-purpose unified analytics engine built to transform, aggregate, and analyze large amounts of information, has become the de-facto brain behind large scale data processing, machine learning, and graph analysis.

When it was released, Apache Spark 2.3 introduced native support for running on top of Kubernetes. Spark 2.4 further extended the support and brought integration with the Spark shell. In a previous article, we showed the preparations and setup required to get Spark up and running on top of a Kubernetes cluster.

In this article, we'll take the next logical step and show how to run a more complex analytic environment such as Jupyter so that it can also take advantage of the cluster for data exploration, visualization, and interactive prototyping.

This is Part 2 of a larger series on how to use containers and Kubernetes for Data Science. If you have not already done so, please check out Part 1 which shows how to run Spark applications inside of Kubernetes.

Jupyter

Jupyter allows you to work interactively with a live running server and iteratively execute logic which remains persistent as long as the kernel is running. It is used to combine live-running code alongside images, data visualization, and other interactive elements such as maps. It has become a de-facto standard for exploratory data analysis and technical communication.

Jupyter Lab: Notebook Interface

Overview

In this article, we will:

  1. Extend the "driver" container from the previous article to include Jupyter and integrate its Python environment with PySpark so that notebooks can run large analytic workloads on the cluster.
  2. Configure S3Contents, a drop-in replacement for the standard filesystem-backed storage system in Jupyter. Using S3Contents is desirable because containers are fragile. They will often crash or disappear, and when that happens the content of their filesystems is lost. When running in Kubernetes, it is therefore important to provide an external storage that will remain available if the container disappears.
  3. Create an "ingress" that allows for the Jupyter instance to be accessed from outside of the cluster.

A copy of all of the Dockerfiles and other resources used in this article can be found in the Oak-Tree DataOps Examples GitLab repository. Copies of the container images are available from the Oak-Tree DataOps Container Registry.

Packaging Jupyter

As a Python application, Jupyter can be installed with either pip or conda. We will be using pip.

The container images we created previously (spark-k8s-base and spark-k8s-driver) both have pip installed. For that reason, we can extend them directly to include Jupyter and other Python libraries.

The code listing below shows a Dockerfile where we install Jupyter, S3Contents, and a small set of other common data science libraries including:

  • NumPy: A library which implements efficient N-dimensional arrays, tools for manipulating data inside of NumPy arrays, interfaces for integrating C/C++ and Fortran code with Python, and a library of linear algebra, Fourier transform, and random number capabilities.
  • Matplotlib: A popular data visualization library designed to create charts and graphs.
  • Seaborn: A set of additional tools based on matplotlib which extends the basic interfaces for creating statistical charts intended for data exploration.
  • scikit-learn: A machine learning library for Python that provides simple tools for data mining and analysis, including preprocessing and model selection, as well as implementations of classification, regression, clustering, and dimensionality reduction models.

In addition to the Data Science libraries, the Dockerfile also configures a user for Jupyter and a working directory. This is done because it is (generally) a bad idea to run a Dockerized application as root. This may seem an arbitrary concern, since the container will run inside the Kubernetes cluster under a service account with the ability to spawn other containers within its namespace. One of the things that Jupyter provides, however, is a shell interface. By running as a non-privileged user, there is some degree of isolation in case the notebook server becomes compromised.

The dependency installation is split over multiple lines in order to decrease the size of the layers. Large Docker image layers may experience timeouts or other transport issues. This makes container design something of an art. It's a good idea to keep container images as small as possible with as few layers as possible, but you still need to provide the tools to ensure that the container is useful.

FROM code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-driver


# Install Jupyter and Data Science libraries
USER root
RUN ln -sv /usr/bin/pip3 /usr/bin/pip \
	&& pip install numpy pandas 
RUN pip install notedown plotly seaborn matplotlib 
RUN pip install bokeh xlrd yellowbrick
RUN pip install scikit-learn scikit-image
RUN pip install scipy 
RUN pip install jupyterlab s3contents \
	&& mkdir -p /home/public && chmod 777 /home/public
RUN pip install py4j \
	&& ln -s /opt/spark/python/pyspark /usr/local/lib/python3.7/dist-packages/pyspark \
 	&& ln -s /opt/spark/python/pylintrc /usr/local/lib/python3.7/dist-packages/pylintrc

# Install Jupyter Spark extension
RUN pip install jupyter-spark \
	&& jupyter serverextension enable --py jupyter_spark \
	&& jupyter nbextension install --py jupyter_spark \
	&& jupyter nbextension enable --py jupyter_spark \
	&& jupyter nbextension enable --py widgetsnbextension

# Configure Jupyter User
ARG NB_USER="jovyan"
ARG NB_UID="1000"
ARG NB_GROUP="analytics"
ARG NB_GID="777"
RUN groupadd -g $NB_GID $NB_GROUP \
	&& useradd -m -s /bin/bash -N -u $NB_UID -g $NB_GID $NB_USER \
	&& mkdir -p /home/$NB_USER/work \
	&& mkdir -p /home/$NB_USER/.jupyter \
	&& chown -R $NB_USER:$NB_GROUP /home/$NB_USER


# Configure Working Directory
USER $NB_USER
WORKDIR /home/$NB_USER/work

With the Dockerfile complete, build and tag the image.

Testing the Image Locally

When the container finishes building, we will want to test it locally to ensure that the application starts.

# Build and tag the Jupyter container image
docker build -f Dockerfile.jupyter \
    -t code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter .

# Test the container image locally to ensure that it starts as expected.
# Jupyter Lab/Notebook is started using the command jupyter lab.
# We provide the --ip 0.0.0.0 so that it will bind to all interfaces.
docker run -it --rm -p 8888:8888 \
    code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter \
    jupyter lab --ip 0.0.0.0

Once the program starts, you will see an entry in the logs which says, "To access the notebook ... copy and paste one of these URLs ...". Included at the end of the URL is a "token" value that is required to authenticate to the server. Copy this to your system clipboard.

Figure: Jupyter Startup Logs Including Token Value
Included in the Jupyter startup logs will be an access URL that includes a "token". This value is required for authentication to the server.

Upon visiting the URL you will be prompted for the token value (unless you copied the entire access URL and pasted that into the navigation bar of the browser). Paste the token into the authentication box and click "Log In." You will be taken to the Jupyter Dashboard/Launcher.

Figure: Jupyter Lab Launcher

Testing the Container in Kubernetes

Once you've verified that the container image works as expected in your local environment, the next step is to validate that it also runs in Kubernetes. This involves three steps:

  • Pushing the container image to a public repository so that it can be deployed onto the cluster
  • Launching an instance inside of Kubernetes using kubectl run
  • Connecting to the container instance by mapping a port from the pod to the local environment using kubectl port-forward

To perform these steps, you will need two terminals. In the first, you will run the following two commands:

# Push the Jupyter container image to a remote registry
docker push code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter

# Start an instance of the container in Kubernetes
kubectl run jupyter-test-pod --generator=run-pod/v1 -it --rm=true \
  --image=code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter \
  --serviceaccount=spark-driver \
  --command -- jupyter lab --ip 0.0.0.0

When the server is active, note the access URL and token value. Then, in the second terminal, run kubectl port-forward to map a local port to the container.

# Forward a port in the local environment to the pod to test the runtime
kubectl port-forward pod/jupyter-test-pod 8888:8888

With the port-forward running, open a browser and navigate to the locally mapped port (8888 in the example command above). Provide the token value and click "Log In." As with the local test, you should be routed to the Jupyter dashboard. Seeing the dashboard gives some confidence that the container image works as expected, but it doesn't exercise the Spark integration.

To test Spark, we need to do two things:

  • Create a service so that executor pods are able to connect to the driver. Without a service, the executors will be unable to report their task progress to the driver and tasks will fail.
  • Open a Jupyter Notebook, and initialize a SparkContext.

First, let's create the service.

# Create a headless service named after the pod so that executors can resolve the driver
kubectl expose pod jupyter-test-pod --port 8888 --type=ClusterIP --cluster-ip=None
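
To confirm that the headless service exists and resolves to the pod, you can inspect it. This is an optional sanity check, not a required step.

# The service should show a CLUSTER-IP of "None"
kubectl get service jupyter-test-pod

# The endpoints object should list the pod's IP address
kubectl get endpoints jupyter-test-pod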

With the service in place, let's initialize the SparkContext. From the launcher, click on the "Python 3" link under "Notebook." This will start a new Python 3 kernel and open the Notebook interface.

Figure: Jupyter Lab Notebook Interface
To test the Spark connection, we need to initialize a SparkContext.

Copy the code from the listing below into the notebook and execute the cell. The code defines the parameters needed by Spark to connect to the cluster and launch worker instances. It defines the URL to the Spark master, the container image that should be used for launching workers, the location of the authentication certificate and token, the service account which should be used by the driver instance, and the driver host and port. Specific values may need to be modified for your environment. For details on the parameters, refer to Part 1 of this series.

import pyspark

conf = pyspark.SparkConf()

# Kubernetes is a Spark master in our setup. 
# It creates pods with Spark workers, orchestrates those 
# workers and returns final results to the Spark driver 
# (“k8s://https://” is NOT a typo, this is how Spark knows the “provider” type). 
conf.setMaster("k8s://https://kubernetes.default:443") 

# Worker pods are created from the base Spark docker image.
# If you use another image, specify its name instead.
conf.set(
    "spark.kubernetes.container.image", 
    "code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-base") 

# Authentication certificate and token (required to create worker pods):
conf.set(
    "spark.kubernetes.authenticate.caCertFile", 
    "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
conf.set(
    "spark.kubernetes.authenticate.oauthTokenFile", 
    "/var/run/secrets/kubernetes.io/serviceaccount/token")

# Service account which should be used for the driver
conf.set(
    "spark.kubernetes.authenticate.driver.serviceAccountName", 
    "spark-master") 

# 2 pods/workers will be created. Can be expanded for larger workloads.
conf.set("spark.executor.instances", "2") 

# The DNS alias for the Spark driver. Required by executors to report status.
conf.set(
    "spark.driver.host", "jupyter-test-pod") 

# Port which the Spark shell should bind to and to which executors will report progress
conf.set("spark.driver.port", "29413") 

# Initialize spark context, create executors
sc = pyspark.SparkContext(conf=conf)

When the cell finishes executing, add the following code to a second cell and execute that. If successful, it will verify that Jupyter, Spark, Kubernetes, and the container images are all configured correctly.

# Create a distributed data set to test the session
t = sc.parallelize(range(10))

# Calculate the approximate sum of values in the dataset
r = t.sumApprox(3)
print('Approximate sum: %s' % r)
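
If you would like independent confirmation that the executors were actually created on the cluster, you can look for them from a second terminal. The sketch below assumes the default namespace and uses the spark-role label that Spark applies to the executor pods it launches.

# List the executor pods spawned by the SparkContext (run from a separate terminal)
kubectl get pods -l spark-role=executor

# Or watch them appear and disappear as jobs run
kubectl get pods -l spark-role=executor --watch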

Cloud Native Applications

While the tests at the end of the previous section give us confidence that the Jupyter container image works as expected, we still don't have a robust "cloud native" application that we would want to deploy on a permanent basis.

  • The first major problem is that the container storage will be transient. If the container instance restarts or gets migrated to a new host, any saved notebooks will be lost.
  • The second major problem also arises in the context of a container restart. When it starts, the container looks for a token or password and generates a new random one if it is absent. This means that if the container gets migrated, the previous token will no longer work and the user will need to access the pod logs to learn the new value. That would require giving all users access to the cluster.
  • The third problem is that there is no convenient way to access the instance from outside of the cluster. Using kubectl to forward the application ports works well for testing, but there needs to be a more robust way to reach the instance for users who lack administrative Kubernetes access.

The first two problems can be mitigated by configuring resources for the container and injecting them into the pod as part of its deployment. The third problem can be solved by creating an ingress.

S3Contents: Cloud Storage for Jupyter

Of the three problems, the most complex to solve is the first: dealing with transient storage. There are a number of approaches we might take:

  • Creating a pod volume, backed by some type of PersistentVolume, that mounts when the container starts
  • Deploying the application as part of a resource that can be tied to physical storage on one of the hosts
  • Using an external storage provider such as object storage

Of these three options, using object storage is the most robust. Object storage services such as Amazon S3 and MinIO have become the de-facto hard drives for storing data in cloud native applications, machine learning, and many other areas of development. They are an ideal place to store binary and blob data when using containers because they are redundant, have high IO throughput, and can be accessed by many containers simultaneously (which facilitates high availability).

As mentioned earlier in the article, there is a file plugin called S3Contents that can be used to save Jupyter files to object storage providers which implement the Amazon S3 API. We installed the plugin as part of building the container image.

To have Jupyter use an object store, we need to inject a set of configuration parameters into the container at the time it starts. This is usually done through a file called jupyter_notebook_config.py saved in the user's Jupyter folder (~/.jupyter). The code listing below shows an example of what the resulting configuration of S3Contents might look like for MinIO.

from s3contents import S3ContentsManager

c = get_config()

# Tell Jupyter to use S3ContentsManager for all storage.
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.access_key_id = "Q3AM3UQ867SPQQA43P2F"
c.S3ContentsManager.secret_access_key = "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG"
c.S3ContentsManager.endpoint_url = "http://play.minio.io:9000"
c.S3ContentsManager.bucket = "s3contents-demo"
c.S3ContentsManager.prefix = "notebooks/test"

Injecting Configuration Values Using ConfigMaps

Kubernetes ConfigMaps can be used to store configuration information about a program in a central location. When a pod starts, this data can then be injected as environment variables or mounted as a file. This provides a convenient way of ensuring that configuration values - such as those we'll need to get the external storage in Jupyter working or the authentication token/password - are the same for every pod instance that starts.

ConfigMaps are independent objects in Kubernetes. They are created outside of pods, deployments, or stateful sets, and their data is associated by reference. After a ConfigMap is defined, it is straightforward to include the needed metadata in the pod manifest.

We will use a single ConfigMap to solve the first two problems we described above. The code listing below shows a ConfigMap which both configures an S3 contents manager for Jupyter and provides a known password to the application server at startup.

The setup of an object storage service such as Amazon S3, MinIO, or OpenStack Swift is beyond the scope of this article. For information about which parameters are needed by specific services for S3Contents, refer to the README file available in the project's GitHub repository.

apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyter-notebook-config
data:
  app_configuration.py: |
    from s3contents import S3ContentsManager
    from IPython.lib import passwd
    c = get_config()
    # Startup password (replaces the random token)
    c.NotebookApp.password = passwd("secret-password@jupyter.oak-tree.office")
    # S3 Object Storage Configuration
    c.NotebookApp.contents_manager_class = S3ContentsManager
    c.S3ContentsManager.access_key_id = "oaktree"
    c.S3ContentsManager.secret_access_key = "secret-key@object-storage.oak-tree.office"
    c.S3ContentsManager.endpoint_url = "http://object-storage:9000"
    c.S3ContentsManager.bucket = "jupyter.oak-tree.tech"
    c.S3ContentsManager.prefix = "notebooks"

The YAML below shows how to reference the ConfigMap as a volume for a pod. The manifest in the listing roughly recreates the kubectl run command used earlier with the additional configuration required to access the ConfigMap. From this point forward, the configuration of the Jupyter application has become complex enough that we will use manifests to show its structure.

The ConfigMap data will be mounted at ~/.jupyter/jupyter_notebook_config.py, the path where Jupyter expects to find its configuration and, therefore, the contents-manager settings. The fsGroup option under securityContext ensures that the mounted file can be read by a member of the analytics group (gid=777).

apiVersion: v1
kind: Pod
metadata:
  name: jupyter-test-pod
  labels:
    app: jupyter-test-pod
    environment: test

spec:
  serviceAccountName: spark-driver
  
  securityContext:
    fsGroup: 777
  
  containers:
  - name: jupyter-test-pod
    image: code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter
    imagePullPolicy: Always
    command: ["jupyter", "lab", "--ip", "0.0.0.0"]
    volumeMounts:
    - name: storage-config-volume
      mountPath: /home/jovyan/.jupyter/jupyter_notebook_config.py
      subPath: app_configuration.py
  
  volumes:
  - name: storage-config-volume
    configMap:
      name: jupyter-notebook-config
  
  restartPolicy: Always

To work correctly with Spark, the pod needs to be paired with a service in order for executors to spawn and communicate with the driver successfully. The code listing below shows what this service manifest would look like.

apiVersion: v1
kind: Service
metadata:
  name: jupyter-test-pod

spec:
  clusterIP: None
  selector:
    app: jupyter-test-pod
    environment: test
  ports:
  - protocol: TCP
    port: 8888
    targetPort: 8888

With the ConfigMap in place, you can launch a new pod/service and repeat the connection test from above. Upon stopping and re-starting the pod instance, you should notice that any notebooks or other files you add to the instance survive rather than disappear. Likewise, on authenticating to the pod, you will be prompted for a password rather than needing to supply a random token.
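
A minimal sketch of that workflow, assuming the pod and service manifests above were saved as jupyter-pod.yaml and jupyter-service.yaml (both file names are examples):

# Launch the pod and its headless service from the manifests above
kubectl apply -f jupyter-pod.yaml
kubectl apply -f jupyter-service.yaml

# Wait for the pod to become ready, then forward the notebook port as before
kubectl wait --for=condition=Ready pod/jupyter-test-pod
kubectl port-forward pod/jupyter-test-pod 8888:8888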

Enabling External Access

The final piece needed for our Spark-enabled Jupyter instance is external access. Again, while there are several options for how this might be configured, such as a load-balanced service, perhaps the most robust is a Kubernetes Ingress. An Ingress allows HTTP and HTTPS routes from outside the cluster to be forwarded to services inside the cluster.

It provides a host of benefits including:

  • Externally reachable URLs
  • Load-balanced traffic
  • SSL/TLS termination
  • Name-based virtual hosting

While the specific configuration of these options is outside the scope of this article, providing Ingress using the NGINX controller offers a far more robust way to access the Jupyter instance than kubectl port-forward.

The code listing below shows an example of a TLS-terminated Ingress controller that will forward to the pod/service created earlier. The TLS certificate is provisioned using cert-manager. For details on cert-manager, see the project's homepage.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: oaktree-jupyter-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    kubernetes.io/ingress.class: "nginx"
    ingress.kubernetes.io/force-ssl-redirect: "true"
    ingress.kubernetes.io/proxy-body-size: "800m"
    nginx.ingress.kubernetes.io/proxy-body-size: "800m"
    certmanager.k8s.io/issuer: "letsencrypt-prod"

spec:
  tls:
  - hosts:
    - jupyter.example.com
    secretName: oaktree-jupyter-tls
  rules:
  - host: jupyter.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: jupyter-test-pod
          servicePort: 8888
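
As with the other manifests, the Ingress is created with kubectl apply. The file name below is an example; the resource and secret names match those in the listing above.

# Apply the Ingress manifest
kubectl apply -f jupyter-ingress.yaml

# Verify the Ingress and the TLS secret issued by cert-manager
kubectl get ingress oaktree-jupyter-ingress
kubectl get secret oaktree-jupyter-tls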

Once it has been created and the certificates issued, the Jupyter instance should now be available outside the cluster at https://jupyter.example.com.

To Infinity and Beyond

At this point, we have configured a Jupyter instance with a full complement of Data Science libraries that is able to launch Spark applications on top of Kubernetes. It is configured to read and write its data to object storage, and it integrates with a host of powerful visualization frameworks. We've tried to make it as "Cloud Native" as possible, and it could be run in a highly available configuration if desired.

That is a powerful set of tools begging to be used and ready to go!

In future articles of this series, we'll start to put these tools into action to understand how best to work with large structured data sets, train machine learning models, work with graph databases, and analyze streaming datasets.
