Spark on Kubernetes: Jupyter and Beyond
In many organizations, Apache Spark is the computational engine that powers big data. Spark, a general-purpose unified analytics engine built to transform, aggregate, and analyze large amounts of information, has become the de-facto brain behind large scale data processing, machine learning, and graph analysis.
When it was released, Apache Spark 2.3 introduced native support for running on top of Kubernetes. Spark 2.4 further extended the support and brought integration with the Spark shell. In a previous article, we showed the preparations and setup required to get Spark up and running on top of a Kubernetes cluster.
In this article, we'll take the next logical step and show how to run more complex analytic environments such as Jupyter so that it is also able to take advantage of the cluster for data exploration, visualization, or interactive prototyping.
This is Part 2 of a larger series on how to use containers and Kubernetes for Data Science. If you have not already done so, please check out Part 1 which shows how to run Spark applications inside of Kubernetes.
Jupyter
Jupyter allows you to work interactively with a live running server and iteratively execute logic which remains persistent as long as the kernel is running. It is used to combine live-running code alongside images, data visualization, and other interactive elements such as maps. It has become a de-facto standard for exploratory data analysis and technical communication.
Overview
In this article, we will:
- Extend the "driver" container in the previous article to include Jupyter and integrate the traditional Python shell with PySpark so that it can run large analytic workloads on the cluster.
- Configure S3Contents, a drop-in replacement for the standard filesystem-backed storage system in Jupyter. Using S3Contents is desirable because containers are fragile. They will often crash or disappear, and when that happens the content of their filesystems is lost. When running in Kubernetes, it is therefore important to provide external storage that remains available if the container disappears.
- Create an "ingress" that allows for the Jupyter instance to be accessed from outside of the cluster.
A copy of all of the Dockerfiles and other resources used in this article can be found in the Oak-Tree DataOps Examples GitLab repository. Copies of the container images are available from the Oak-Tree DataOps Container Registry.
Packaging Jupyter
As a Python application, Jupyter can be installed with either pip or conda. We will be using pip.
The container images we created previously (spark-k8s-base and spark-k8s-driver) both have pip installed. For that reason, we can extend them directly to include Jupyter and other Python libraries.
The code listing below shows a Dockerfile where we install Jupyter, S3Contents, and a small set of other common data science libraries including:
- NumPy: A library which implements efficient N-dimensional arrays, tools for manipulating data inside of NumPy arrays, interfaces for integrating C/C++ and Fortran code with Python, and a library of linear algebra, Fourier transform, and random number capabilities.
- Matplotlib: A popular data visualization library designed to create charts and graphs.
- Seaborn: A set of additional tools based on matplotlib which extends the basic interfaces for creating statistical charts intended for data exploration.
- scikit-learn: A machine learning library for Python that provides simple tools for data mining and analysis, including preprocessing and model selection, as well as implementations of classification, regression, clustering, and dimensionality reduction models.
In addition to the Data Science libraries, the Dockerfile also configures a user for Jupyter and a working directory. This is done because it is (generally) a bad idea to run a Dockerized application as root. This may seem an arbitrary concern as the container will be running as a privileged user inside of the Kubernetes cluster and will have the ability to spawn other containers within its namespace. One of the things that Jupyter provides is a shell interface. By running as a non-privileged user, there is some degree of isolation in case the notebook server becomes compromised.
The dependency installation is split over multiple lines in order to decrease the size of the layers. Large Docker image layers may experience timeouts or other transport issues. This makes container design something of an art. It's a good idea to keep container images as small as possible with as few layers as possible, but you still need to provide the tools to ensure that the container is useful.
FROM code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-driver

# Install Jupyter and Data Science libraries
USER root
RUN ln -sv /usr/bin/pip3 /usr/bin/pip \
  && pip install numpy pandas
RUN pip install notedown plotly seaborn matplotlib
RUN pip install bokeh xlrd yellowbrick
RUN pip install scikit-learn scikit-image
RUN pip install scipy
RUN pip install jupyterlab s3contents \
  && mkdir -p /home/public && chmod 777 /home/public
RUN pip install py4j \
  && ln -s /opt/spark/python/pyspark /usr/local/lib/python3.7/dist-packages/pyspark \
  && ln -s /opt/spark/python/pylintrc /usr/local/lib/python3.7/dist-packages/pylintrc

# Install Jupyter Spark extension
RUN pip install jupyter-spark \
  && jupyter serverextension enable --py jupyter_spark \
  && jupyter nbextension install --py jupyter_spark \
  && jupyter nbextension enable --py jupyter_spark \
  && jupyter nbextension enable --py widgetsnbextension

# Configure Jupyter User
ARG NB_USER="jovyan"
ARG NB_UID="1000"
ARG NB_GROUP="analytics"
ARG NB_GID="777"
RUN groupadd -g $NB_GID $NB_GROUP \
  && useradd -m -s /bin/bash -N -u $NB_UID -g $NB_GID $NB_USER \
  && mkdir -p /home/$NB_USER/work \
  && mkdir -p /home/$NB_USER/.jupyter \
  && chown -R $NB_USER:$NB_GROUP /home/$NB_USER

# Configure Working Directory
USER $NB_USER
WORKDIR /home/$NB_USER/work
Build and tag the image.
Testing the Image Locally
When the container finishes building, we will want to test it locally to ensure that the application starts.
# Build and tag the Jupyter image
docker build -f Dockerfile.jupyter \
  -t code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter .

# Test the container image locally to ensure that it starts as expected.
# Jupyter Lab/Notebook is started using the command "jupyter lab".
# We provide --ip 0.0.0.0 so that it will bind to all interfaces.
docker run -it --rm -p 8888:8888 \
  code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter \
  jupyter lab --ip 0.0.0.0
Once the program starts, you will see an entry in the logs which says, "To access the notebook ... copy and paste one of these URLs ...". Included at the end of the URL is a "token" value that is required to authenticate to the server. Copy this to your system clipboard.
Upon visiting the URL you will be prompted for the token value (unless you copied the entire access URL and pasted that into the navigation bar of the browser). Paste the token into the authentication box and click "Log In." You will be taken to the Jupyter Dashboard/Launcher.
Testing the Container in Kubernetes
Once you've verified that the container image works as expected in your local environment, you need to validate that it also runs in Kubernetes. This involves three steps:
- Pushing the container image to a public repository so that it can be deployed onto the cluster
- Launching an instance inside of Kubernetes using kubectl run
- Connecting to the container instance by mapping a port from the pod to the local environment using kubectl port-forward
To perform these steps, you will need two terminals. In the first, you will run the following two commands:
# Push the Jupyter container image to a remote registry
docker push code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter

# Start an instance of the container in Kubernetes
kubectl run jupyter-test-pod --generator=run-pod/v1 -it --rm=true \
  --image=code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter \
  --serviceaccount=spark-driver \
  --command -- jupyter lab --ip 0.0.0.0
When the server is active, note the access URL and token value. Then, in the second terminal, run kubectl port-forward to map a local port to the container.
# Forward a port in the local environment to the pod to test the runtime
kubectl port-forward pod/jupyter-test-pod 8888:8888
With the port-forward running, open a browser and navigate to the locally mapped port (8888 in the example command above). Provide the token value and click "Log In." Like the local test, you should be routed to the Jupyter dashboard. Seeing the dashboard gives some confidence that the container image works as expected, but it doesn't test the Spark integration.
To test Spark, we need to do two things:
- Create a service so that executor pods are able to connect to the driver. Without a service, the executors will be unable to report their task progress to the driver and tasks will fail.
- Open a Jupyter Notebook and initialize a SparkContext.
First, let's create the service.
# Create a headless service so that executors can resolve the driver pod by name
kubectl expose pod jupyter-test-pod --port=8888 --type=ClusterIP --cluster-ip=None
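To confirm that the headless service was created and is pointing at the driver pod, you can inspect the service and its endpoints. The commands below assume the pod and service are both named jupyter-test-pod, as above.

# The headless service should show no cluster IP
kubectl get service jupyter-test-pod

# The endpoints object should list the pod's IP address
kubectl get endpoints jupyter-test-pod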
With the service in place, let's initialize the SparkContext. From the launcher, click on the "Python 3" link under "Notebook." This will start a new Python 3 kernel and open the Notebook interface.
Copy the code from the listing below into the notebook and execute the cell. The code defines the parameters needed by Spark to connect to the cluster and launch worker instances. It defines the URL to the Spark master, the container image that should be used for launching workers, the location of the authentication certificate and token, the service account which should be used by the driver instance, and the driver host and port. Specific values may need to be modified for your environment. For details on the parameters, refer to Part 1 of this series.
import pyspark

conf = pyspark.SparkConf()

# Kubernetes is a Spark master in our setup.
# It creates pods with Spark workers, orchestrates those
# workers and returns final results to the Spark driver
# ("k8s://https://" is NOT a typo, this is how Spark knows the "provider" type).
conf.setMaster("k8s://https://kubernetes.default:443")

# Worker pods are created from the base Spark docker image.
# If you use another image, specify its name instead.
conf.set(
    "spark.kubernetes.container.image",
    "code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-base")

# Authentication certificate and token (required to create worker pods):
conf.set(
    "spark.kubernetes.authenticate.caCertFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
conf.set(
    "spark.kubernetes.authenticate.oauthTokenFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/token")

# Service account which should be used for the driver
conf.set(
    "spark.kubernetes.authenticate.driver.serviceAccountName",
    "spark-master")

# 2 pods/workers will be created. Can be expanded for larger workloads.
conf.set("spark.executor.instances", "2")

# The DNS alias for the Spark driver. Required by executors to report status.
conf.set("spark.driver.host", "jupyter-test-pod")

# Port which the Spark shell should bind to and to which executors will report progress
conf.set("spark.driver.port", "29413")

# Initialize spark context, create executors
sc = pyspark.SparkContext(conf=conf)
When the cell finishes executing, add the following code to a second cell and execute that. If successful, it will verify that Jupyter, Spark, Kubernetes, and the container images are all configured correctly.
# Create a distributed data set to test the session
t = sc.parallelize(range(10))

# Calculate the approximate sum of values in the dataset
r = t.sumApprox(3)
print('Approximate sum: %s' % r)
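While the cells execute, you can watch the executor pods spin up and terminate from a terminal. The sketch below assumes the labels that Spark on Kubernetes normally applies to executor pods:

# Watch Spark executor pods being created and removed while the job runs
kubectl get pods -l spark-role=executor --watch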
Cloud Native Applications
While the tests at the end of the previous section give us confidence that the Jupyter container image works as expected, we still don't have a robust "cloud native" application that we would want to deploy on a permanent basis.
- The first major problem is that the container storage will be transient. If the container instance restarts or gets migrated to a new host, any saved notebooks will be lost.
- The second major problem also arises in the context of a container restart. When it starts, the container looks for a token or password and generates a new random one if it is absent. This means that if the container gets migrated, the previous token will no longer work and users will need to access the pod logs to learn the new value. That would require giving all users access to the cluster.
- The third problem is that there is no convenient way to access the instance from outside of the cluster. Using kubectl to forward the application ports works great for testing, but there should be a more robust way to access the resource for users who lack administrative Kubernetes access.
The first two problems can be remedied by configuring resources for the container and injecting them into the pod as part of its deployment. The third can be solved by creating an ingress.
S3Contents: Cloud Storage for Jupyter
Of the three problems, the most complex to solve is the first: dealing with transient storage. There are a number of approaches we might take:
- Creating a pod volume, backed by some type of PersistentVolume, that mounts when the container starts
- Deploying the application as part of a resource that can be tied to physical storage on one of the hosts
- Using an external storage provider such as object storage
Of these three options, using object storage is the most robust. Object storage servers such as Amazon S3 and MinIO have become the de-facto hard drives for storing data in cloud native applications, machine learning, and many other areas of development. They are an ideal place to store binary and blob information when using containers because they are redundant, have high IO throughput, and can be accessed by many containers simultaneously (which facilitates high availability).
As mentioned earlier in the article, there is a file plugin called S3Contents that can be used to save Jupyter files to object storage providers which implement the Amazon S3 API. We installed the plugin as part of building the container image.
To have Jupyter use an object store, we need to inject a set of configuration parameters into the container at the time it starts. This is usually done through a file called jupyter_notebook_config.py saved in the user's Jupyter folder (~/.jupyter). The code listing below shows an example of what the resulting S3Contents configuration might look like for MinIO.
from s3contents import S3ContentsManager

c = get_config()

# Tell Jupyter to use S3ContentsManager for all storage.
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.access_key_id = "Q3AM3UQ867SPQQA43P2F"
c.S3ContentsManager.secret_access_key = "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG"
c.S3ContentsManager.endpoint_url = "http://play.minio.io:9000"
c.S3ContentsManager.bucket = "s3contents-demo"
c.S3ContentsManager.prefix = "notebooks/test"
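Before moving to Kubernetes, you can sanity-check a configuration like this locally by mounting it into the container at the path where Jupyter looks for it. This is a minimal sketch that assumes the file is saved as jupyter_notebook_config.py in the current directory; the credentials above belong to the public MinIO playground, so substitute your own values.

# Run the image locally with the S3Contents configuration mounted into place
docker run -it --rm -p 8888:8888 \
  -v "$(pwd)/jupyter_notebook_config.py:/home/jovyan/.jupyter/jupyter_notebook_config.py" \
  code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter \
  jupyter lab --ip 0.0.0.0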
Injecting Configuration Values Using ConfigMaps
Kubernetes ConfigMaps can be used to store configuration information about a program in a central location. When a pod starts, this data can then be injected as environment variables or mounted as a file. This provides a convenient way of ensuring that configuration values - such as those we'll need to get the external storage in Jupyter working or the authentication token/password - are the same for every pod instance that starts.
ConfigMaps are independent objects in Kubernetes. They are created outside of pods, deployments, or stateful sets, and their data is associated by reference. After a ConfigMap is defined, it is straightforward to include the needed metadata in the pod manifest.
We will use a single ConfigMap to solve the first two problems we described above. The code listing below shows a ConfigMap which both configures an S3 contents manager for Jupyter and provides a known password to the application server at startup.
The setup of an object store such as Amazon S3, MinIO, or OpenStack Swift is beyond the scope of this article. For information about which parameters are needed by specific services for S3Contents, refer to the README file available in the project's GitHub repository.
apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyter-notebook-config
data:
  app_configuration.py: |
    from s3contents import S3ContentsManager
    from IPython.lib import passwd

    c = get_config()

    # Startup auth Token
    c.NotebookApp.password = passwd("secret-password@jupyter.oak-tree.office")

    # S3 Object Storage Configuration
    c.NotebookApp.contents_manager_class = S3ContentsManager
    c.S3ContentsManager.access_key_id = "oaktree"
    c.S3ContentsManager.secret_access_key = "secret-key@object-storage.oak-tree.office"
    c.S3ContentsManager.endpoint_url = "http://object-storage:9000"
    c.S3ContentsManager.bucket = "jupyter.oak-tree.tech"
    c.S3ContentsManager.prefix = "notebooks"
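Assuming the manifest above is saved to a file such as jupyter-notebook-config.yaml (the file name is arbitrary), it can be created and inspected with kubectl:

# Create (or update) the ConfigMap in the current namespace
kubectl apply -f jupyter-notebook-config.yaml

# Verify the stored configuration
kubectl describe configmap jupyter-notebook-config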
The YAML below shows how to reference the ConfigMap as a volume for a pod. The manifest in the listing roughly recreates the kubectl run command used earlier with the additional configuration required to access the ConfigMap. From this point forward, the configuration of the Jupyter application has become complex enough that we will use manifests to show its structure.
The ConfigMap data will be mounted at ~/.jupyter/jupyter_notebook_config.py, the path required by Jupyter in order to leverage the contents manager. The fsGroup option is used under the securityContext so that it can be read by a member of the analytics group (gid=777).
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-test-pod
  labels:
    app: jupyter-test-pod
    environment: test
spec:
  serviceAccountName: spark-driver
  securityContext:
    fsGroup: 777
  containers:
  - name: jupyter-test-pod
    image: code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-jupyter
    imagePullPolicy: Always
    command: ["jupyter", "lab", "--ip", "0.0.0.0"]
    volumeMounts:
    - name: storage-config-volume
      mountPath: /home/jovyan/.jupyter/jupyter_notebook_config.py
      subPath: app_configuration.py
  volumes:
  - name: storage-config-volume
    configMap:
      name: jupyter-notebook-config
  restartPolicy: Always
To work correctly with Spark, the pod needs to be paired with a service in order for executors to spawn and communicate with the driver successfully. The code listing below shows what this service manifest would look like.
apiVersion: v1
kind: Service
metadata:
  name: jupyter-test-pod
spec:
  clusterIP: None
  selector:
    app: jupyter-test-pod
    environment: test
  ports:
  - protocol: TCP
    port: 8888
    targetPort: 8888
With the ConfigMap in place, you can launch a new pod/service and repeat the connection test from above. Upon stopping and re-starting the pod instance, you should notice that any notebooks or other files you add to the instance survive rather than disappear. Likewise, on authenticating to the pod, you will be prompted for a password rather than needing to supply a random token.
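A minimal sketch of that workflow, assuming the manifests above are saved as jupyter-pod.yaml and jupyter-service.yaml (hypothetical file names):

# Launch the pod and its headless service from the manifests above
kubectl apply -f jupyter-pod.yaml -f jupyter-service.yaml
kubectl wait --for=condition=Ready pod/jupyter-test-pod
kubectl port-forward pod/jupyter-test-pod 8888:8888

# Delete and recreate the pod; notebooks stored in the object store survive the restart
kubectl delete pod jupyter-test-pod
kubectl apply -f jupyter-pod.yaml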
Enabling External Access
The final piece needed for our Spark-enabled Jupyter instance is external access. Again, while there are several options for how this might be configured, such as a load-balanced service, perhaps the most robust is a Kubernetes Ingress. Ingress allows HTTP and HTTPS routes from outside the cluster to be forwarded to services inside the cluster.
It provides a host of benefits including:
- Externally reachable URLs
- Load-balanced traffic
- SSL/TLS termination
- Name-based virtual hosting
While the specific configuration of these options is outside the scope of this article, providing Ingress using the NGINX controller offers a far more robust way to access the Jupyter instance than kubectl port-forward.
The code listing below shows an example of a TLS-terminated Ingress that will forward traffic to the pod/service created earlier. The TLS certificate is provisioned using cert-manager. For details on cert-manager, see the project's homepage.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: oaktree-jupyter-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    kubernetes.io/ingress.class: "nginx"
    ingress.kubernetes.io/force-ssl-redirect: "true"
    ingress.kubernetes.io/proxy-body-size: "800m"
    nginx.ingress.kubernetes.io/proxy-body-size: "800m"
    certmanager.k8s.io/issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - jupyter.example.com
    secretName: oaktree-jupyter-tls
  rules:
  - host: jupyter.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: jupyter-test-pod
          servicePort: 8888
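Assuming the Ingress manifest is saved as jupyter-ingress.yaml (a hypothetical file name), and that the NGINX ingress controller and the cert-manager issuer referenced in the annotations already exist in the cluster, it is applied like any other resource:

# Create the Ingress and confirm that an address and TLS secret have been assigned
kubectl apply -f jupyter-ingress.yaml
kubectl get ingress oaktree-jupyter-ingress
kubectl describe ingress oaktree-jupyter-ingress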
Once it has been created and the certificates issued, the Jupyter instance should be available outside the cluster at https://jupyter.example.com.
To Infinity and Beyond
At this point, we have configured a Jupyter instance with a full complement of Data Science libraries, able to launch Spark applications on top of Kubernetes. It is configured to read and write its data to object storage, and it integrates with a host of powerful visualization frameworks. We've tried to make it as "Cloud Native" as possible, and it could be run in a highly available configuration if desired.
That is a powerful set of tools begging to be used and ready to go!
In future articles of this series, we'll start to put these tools into action to understand how best to work with large structured data sets, train machine learning models, work with graph databases, and analyze streaming datasets.