Kubernetes

Scala

Java

Spark

Cluster Computing

Big Data Processing

Extending JupyterLab With New Kernels

One of the most powerful features of Jupyter is its ability to run code from different programming languages and environments. This article looks at how to add support for Scala and Java to JupyterLab through the use of "Kernels."

Within Jupyter, kernels are separate processes that are started by the server. They are the magical piece of Jupyter that allows you to run your code in different programming languages and environments. JupyterLab (the newest iteration of the Jupyter user interface) supports scores of kernels and provides support for nearly every major programming language including Python, Java, Scala, R, C++, Julia, Matlab, Octave, Scheme, Processing, and many more. Out of the box, however, Jupyter will only run the IPython kernel.

In this article, we will extend the Jupyter/Spark Docker images created earlier in this series to support two additional kernels, Scala and Java.

Scala is a popular programming language that combines object-oriented and functional programming idioms into a concise, high-level language. It utilizes static typing to help avoid bugs in complex applications, and it provides a large ecosystem of libraries that allows for developers to productively work with many major data processing systems (such as Spark).
Java is a general purpose, strongly typed, class-based object-oriented language. It is often compiled to a bytecode instruction set and executed on top of the Java Virtual Machine (JVM). One of the most successful programming languages of all time, Java is used extensively in enterprise and data processing systems. Many of the most important data infrastructure systems including Kafka, ElasticSearch, and Spark use Java extensively as part of their core systems.

We will look at the steps required to deploy the Scala and Java runtimes, download and deploy the BeakerX kernel (which provides support for both Scala and Java), and register the kernel with JupyterLab.

This is Part 4 of an ongoing series on how to use containers and Kubernetes for Data Science. For background and context, please refer to:

Part 1 which shows how to run Spark applications inside of Kubernetes
Part 2 which demonstrates the deployment of Jupyter and base (IPython) configuration
Part 3 which introduces how MinIO (a self-hosted, S3-compatible object store) can be integrated and consumed from applications running inside of JupyterLab

BeakerX

BeakerX is a collection of kernels and extensions to Jupyter that provide a number of powerful enhancements. These include JVM support, improved Spark cluster support, polyglot programming (the practice of writing code in multiple languages in order to capture functionality and efficiency not available in a single language), interactive plotting for JVM based languages, tables, forms, and dynamic publishing.

In a single pass, installing BeakerX gives us kernels for Groovy (a JVM based scripting language), Clojure, Kotlin, and SQL in addition to Java and Scala. It also greatly enhances the support available in our JupyterLab instance for working with Spark structures. We will look at some of BeakerX's features later in the article.

Packaging BeakerX

BeakerX is available from both pip and conda, and depends only on a Java runtime environment, so we can be up and running with it quickly. As in the other articles in this series, we will use pip. We need to:

Install the beakerx package using pip
Run the beakerx command from the package, which will register the Kernels
Integrate the Jupyter Lab extensions and build the dependencies. Installing lab extensions requires npm. If you are using a Dockerfile similar to the example below, npm will already be available as part of the base image.

The example Dockerfile uses the spark-k8s-kafka image, available as part of the Oak-Tree DataOps Examples repository, as its base. spark-k8s-kafka includes a number of Data Science, GIS, and deep learning libraries that are used as part of the Oak-Tree Data Science JupyterHub environment.

Because Jupyter Lab is installed globally in the container, we switch to the root user in order to install the beakerx CLI tool and associated kernels (pip install beakerx && beakerx install). Then we activate the new Kernels by installing the beakerx-jupyterlab extension.

Next, we install Apache Maven, a build and dependency management tool for Java and Scala. If mvn is available in the container image, it is possible to install packages dynamically through the use of a special "Cell Magic," %%classpath. Finally, we set the runtime uid back to 1000 (jovyan) and update the PATH.

FROM code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark245-k8s-kafka:latest

USER root 

# Install BeakerX
RUN pip install --no-cache-dir beakerx \
  && beakerx install

# Install JupyterLab extensions for BeakerX
RUN jupyter labextension install @jupyter-widgets/jupyterlab-manager \
  && jupyter labextension install beakerx-jupyterlab 

# Install Maven
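# Refresh the package index first; the install fails if the base image has no apt lists
RUN apt-get update \
  && apt-get install -y maven \
  && rm -rf /var/lib/apt/lists/*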

# Set User back to jovyan/1000
USER 1000
ENV PATH "/opt/kafka/bin:/opt/spark/bin:/opt/maven/bin:${PATH}"

A copy of the above Dockerfile is available from the DataOps example repository as Dockerfile.hub-beakerx.

Testing the Image Locally

When the container finishes building, we will want to test it locally to ensure that everything installed correctly and that Jupyter Lab starts.

# Build and tag the image
docker build -f Dockerfile.hub-beakerx \
    -t code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-beakerx .
    
# Test the container image locally.
# Jupyter Lab/Notebook is started using the command: jupyter lab
# The --ip 0.0.0.0 option makes Jupyter listen on all interfaces inside the
# container so that it is reachable through the published port (8888).
docker run -it --rm -p 8888:8888 \
    code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-beakerx \
    jupyter lab --ip 0.0.0.0

Once the program starts, you will see the access URL for the notebook. Copy the URL (including the token) and paste it into your browser. This will open the Jupyter Dashboard/Launcher.

Figure: Jupyter Startup Logs Including Token Value — Included in the Jupyter startup logs will be an access URL that includes a "token". This value is required for authentication to the server.

If BeakerX installed correctly, you will see six additional options under the "Notebook" and "Console" sections: Scala, Java, Clojure, Groovy, Kotlin, and SQL.

Figure: Jupyter Lab Launcher — The default Jupyter Lab configuration only includes a single kernel for IPython.

Figure: Jupyter Lab Notebook Interface With Multiple Kernels — With plugins such as BeakerX installed, additional kernels become available including Java and Scala.
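
The same check can be made from a terminal with the jupyter kernelspec list command, which prints one installed kernel per line. The helper below is only a sketch and is not part of BeakerX: missing_kernels is a hypothetical function name, and the kernel names you pass to it (scala and java in the usage example) are assumptions about how BeakerX registers its kernels, so compare them against the actual listing first.

```shell
# Read a `jupyter kernelspec list` listing on standard input and print
# every expected kernel name (given as arguments) that is not in it.
missing_kernels() {
    # The first column of each line is the kernel name; the
    # "Available kernels:" header line is harmless here.
    installed="$(awk '{print $1}')"
    for name in "$@"; do
        # -x requires the whole line to match, -q suppresses output
        if ! printf '%s\n' "$installed" | grep -qx -- "$name"; then
            echo "$name"
        fi
    done
}
```

Run inside the container, jupyter kernelspec list | missing_kernels scala java should print nothing when both kernels are registered, and the name of each missing kernel otherwise.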

Testing the Container in Kubernetes

Once you have verified that the container works locally, you should also validate that it runs in Kubernetes. That involves:

Pushing the container image to a public repository for deployment
Launching the instance inside of Kubernetes using kubectl run
Connecting to the container instance by mapping a port to the pod using kubectl port-forward

Please refer to Part 2 of this series for detailed instructions and sample kubectl commands.
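
As a rough sketch of those three steps, the commands can be wrapped in small shell functions so that each one can be run and checked separately. The pod name beakerx-test is hypothetical, the image path is the one tagged in the build step above, and the --restart=Never flag is an assumption made here so that kubectl run creates a bare pod on older kubectl versions as well. Part 2 remains the reference for the commands actually used in this series.

```shell
# Assumed values: adjust the image path, pod name, and port to your environment.
IMAGE="code.oak-tree.tech:5005/courseware/oak-tree/dataops-examples/spark-k8s-beakerx"
POD_NAME="beakerx-test"   # hypothetical pod name
PORT=8888

# Step 1: push the locally built image so that the cluster can pull it.
push_image() {
    docker push "$IMAGE"
}

# Step 2: start a single pod that runs Jupyter Lab bound to all interfaces.
launch_pod() {
    kubectl run "$POD_NAME" --image="$IMAGE" --port="$PORT" --restart=Never \
        -- jupyter lab --ip 0.0.0.0
}

# Step 3: forward a local port to the pod (runs until interrupted).
forward_port() {
    kubectl port-forward "pod/${POD_NAME}" "${PORT}:${PORT}"
}
```

Calling push_image, launch_pod, and forward_port in that order should make Jupyter reachable on localhost port 8888. The access token appears in the pod's startup logs, which can be read with kubectl logs beakerx-test.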

Test Driving BeakerX

One of the best ways to get a feel for BeakerX is to see it in action. In this section we'll look at three of BeakerX's marquee features:

Example 0: Dependency Management
Example 1: Spark Integration
Example 2: Plotting and Visualization

BeakerX exposes much of its functionality in the form of "magics." Magics are handy commands built into Jupyter that make it easier to perform specific tasks. They are exposed as macros that work either on a single line (a line magic) or on a complete Jupyter cell (a cell magic). Line magics start with a single percent character, %, and cell magics start with two, %%. Jupyter includes a whole host of builtin magics that can be used to do many useful things.

These examples are adapted from the notebooks included in the BeakerX documentation, available from the project's GitHub repository.

Example 0: Dependency Management

BeakerX includes a %classpath magic which can be used to add jars to the classpath of the JVM. %classpath can be used with Maven to dynamically fetch and load packages along with their dependencies. It can also be used with jar files.

%classpath can be combined with %import (which causes a class to be visible) and %unimport (which will cause a class to be hidden).

// Example 1: Load JAR directly
%classpath add jar /path/to/jar/demo.jar

// Example 2: Load Spark SQL and Spark Kubernetes using Maven

%%classpath add mvn
org.apache.spark spark-sql_2.11 2.4.5
org.apache.spark spark-kubernetes_2.11 2.4.5

Figure: BeakerX includes a magic to dynamically add dependencies to the JVM classpath
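
If a %%classpath add mvn cell fails, it can help to confirm from a terminal in the container that Maven itself can resolve the same coordinates. The sketch below assumes mvn is on the PATH and uses the dependency:get goal of the Maven dependency plugin. The helper names to_gav and fetch_artifact are hypothetical, and how BeakerX resolves packages internally is not covered here.

```shell
# Join "group artifact version" (the form used in the cell above)
# into Maven's group:artifact:version coordinate string.
to_gav() {
    printf '%s:%s:%s\n' "$1" "$2" "$3"
}

# Ask Maven to download the artifact and its dependencies into the
# local Maven repository.
fetch_artifact() {
    mvn dependency:get -Dartifact="$(to_gav "$1" "$2" "$3")"
}
```

For example, fetch_artifact org.apache.spark spark-sql_2.11 2.4.5 requests the first package from the cell above. A failure here points at Maven or network configuration rather than at the notebook.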

Example 1: Spark Integration

The %%spark magic allows you to load GUIs to manage Spark's configuration and cluster connection, check on job status, inspect data with an interactive widget, and forward kernel interrupt messages to the cluster so that you can stop a job without leaving a notebook. It is compatible with Spark 2.x.

Manage Connection Parameters

The magic can be run all by itself in a cell, which will produce a dialog that allows you to fill out connection options to connect to your cluster.

Alternatively, it can be combined with a SparkSession.builder method call to manually configure components of the session.

// Configure session options

%%spark
SparkSession.builder()
    .appName("Spark-K8s Connection Demo")
    .master("k8s://https://kubernetes.default:443")

// Load Spark dialog and configure connection
%%spark

Figure: BeakerX includes a magic to manage and configure Spark options

Tracking Job Progress

After a job has started, a three-way progress widget will automatically appear that shows the progress of the job and how many tasks are waiting, running, completed, or canceled.

Figure: As part of its integration with Spark, BeakerX includes a job tracking widget

Example 2: Plotting and Visualization

BeakerX provides a large set of special widgets for creating charts, graphs, and visualizations. These include interactive time-series, scatter plots, histograms, heatmaps, and treemaps. The widgets remain interactive in notebooks that are exported for hosting on other platforms and include features such as zooming or exporting to other formats.

Plot widgets in BeakerX are built in to the kernels, which means they can be accessed efficiently without needing to import additional libraries. The examples below show how to create a bar chart and a line chart.

Example 1: Bar Chart

new Plot { title="Example 1: Bar Chart" }
    .add(new Bars{ y = Seq(3, 5, 2, 3, 8) })

Example 2: Line Chart

new Plot { title="Example 2: Line"}
    .add(new Line {
        x = 1 to 3
        y = 5 to 15 by 5
    })

Just Getting Started

Kernels greatly expand the analytic power of Jupyter. They are one important reason why Jupyter has become the de facto standard for many aspects of Data Science. The BeakerX collection of kernels provides many powerful features that extend the capabilities of Jupyter, allowing it to work seamlessly with JVM based languages.

Rob Oakes Feb 26, 2020

Categories

Big Data

Containers

Data Science

DevOps
