Rob Oakes
Feb 27, 2020

JupyterHub/Kubernetes: Accessing the Spark UI via the Hub Proxy

To work efficiently with spark, users need access to the Spark UI, a web page (usually running on port 4040) that displays important information about the running Spark application. It includes a list of scheduler stages and tasks, a summary of RDD sizes and memory usage, environmental information, and information about the executors and their status. Traditionally, this interface is accessed by opening http://driver-node:4040 in a web browser. If there are multiple SparkContexts running on the same machine, the web UI will bind to successive ports.

For Jupyter applications running within Kubernetes, accessing the Spark UI is difficult. With most deployments, there is a service or ingress that proxies traffic to the user's Jupyter pod, and the configuration only allows access to the main user interface with no obvious way to access the Spark UI running on port 4040.

Workaround: Utilize jupyter-server-proxy

If the Jupyter environment is managed via JupyterHub (an application that provides support for spawning notebook servers for multiple users), there is an option that can allow you access to the Spark UI (with a little help from a plugin). There is a JupyterLab extension called jupyter-server-proxy that allows you to run arbitrary external processes in the same container as your notebook, and utilize Jupyter itself to proxy traffic to a specific port via proxy endpoint. jupyter-server-proxy supports authentication, which is provided via JupyterHub and its configurable HTTP proxy.

To make this work requires two steps:

  1. Extend the Jupyter Docker container image to include the jupyter-server-proxy extension from pip
  2. Configure the Spark driver so that it is able to work with the proxy.

Extend the Docker Image

The jupyer-server-proxy package can be installed either using pip or conda (in the source listing example, we do so using pip). After install, it needs to be activated via the jupyter serverextension enable ... command.

The code listing only contains the commands needed to deploy and configure the proxy. For a complete reference showing the creation and deployment of a complete Jupyter Image for Kubernetes, please refer to Part 2, Part 3, and Part 4 of Jupyter on Kubernetes. For a pre-configured environment, refer to the Oak-Tree Big Data Docker Lab.

# Note: the following lines should be added to an existing Dockerfile

# Install jupyter-server-proxy
RUN pip3 install jupyter-server-proxy \
  && jupyter serverextension enable --sys-prefix jupyter_server_proxy

Spark Configuration

To configure the application and Spark UI to work with the proxy, there are a number of settings which have to be provided to the runtime. These include configuration parameters passed to the SparkContext and options added to the environment. The code listing below shows an example SparkConf that provides the additional details. The example code is intended to be added to the top of a Jupyter Notebook and can be broadly re-used after modifying they options under the "General Settings" section.

Logic:

  • To make the code as reusable as possible os and socket are used to lookup (or set) information within the environment. These include parameters such as the pod's IP address.
  • For the driver to start, the Spark configuration needs to have the Spark Master (Kubernetes API), namespace, executor container image, and service account. Of special import are the Kubernetes namespace and service account, the application will not be able to start unless you have configured a service account with the admin RBAC.
# Import OS and Spark Dependencies
import os, posixpath, socket
from pyspark import SparkConf, SparkContext


# General Settings
SPARK_EXECIMAGE = 'code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark245-k8s-minio-base'
K8S_SERVICEACCOUNT = 'spark-driver'
K8S_NAMESPACE = 'jupyterhub'
SPARK_EXECUTORS = 2
    

# Set python versions explicitly using the PySpark environment variables
# to prevent the executors from using the wrong version of Pytho9
os.environ['PYSPARK_PYTHON'] = 'python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'

# Spark Configuration
conf = SparkConf()

# Set Spark Master to the local Kubernetes Driver
conf.setMaster('k8s://https://kubernetes.default.svc')

# Configure executor image and runtime options
conf.set('spark.kubernetes.container.image', SPARK_EXECIMAGE)
conf.set("spark.kubernetes.authenticate.driver.serviceAccountName", K8S_SERVICEACCOUNT) 
conf.set("spark.kubernetes.namespace", K8S_NAMESPACE)

# Spark on K8s works ONLY in client mode (driver runs on client)
conf.set('spark.submit.deployMode', 'client')

# Executor instances and settings
conf.set('spark.executor.instances', SPARK_EXECUTORS)

# Application Name
conf.setAppName('sparkui-k8s-example')

# Set IP of driver. This is always the user pod. The socket methods
# dynamically fetch the IP address.
conf.set('spark.driver.host', socket.gethostbyname(socket.gethostname()))


# Initialize the SparkContext
sc = SparkContext(conf=conf)

Access the Spark Web UI

After the Spark interface is running you can access the web interface via JupyterHub using a browser and a URL similar to the one below:

http://{{ jupyerhub-domain }}/user/{{ username }}/proxy/4040/jobs/

Example:

https://hub.example.com/user/roakes/proxy/4040/jobs

Note the trailing slash after jobs/. If not present, JupyterHub will not route the request correctly and you will get a 404 error.

Figure: Jupyter includes an optional proxy that can be installed to access other processes running  in a container on Kubernetes
When installed, jupyter-server-proxy allows you to access supporting processes like the Spark Web UI from a JupyterHub URL.
Rob Oakes Feb 27, 2020
More Articles by Rob Oakes

Loading

Unable to find related content

Comments

Loading
Unable to retrieve data due to an error
Retry
No results found
Back to All Comments