Brandon Harper
Feb 11, 2020

A Docker Based Lab for Big Data

As organizations become increasingly reliant on data services for their core missions, access to systems such as storage, cluster computing, and streaming technologies grows ever more important. Together, these systems allow for the use and analysis of data that would be very difficult to work with using other technologies.

There are three broad categories of data systems:

  • Storage. Data needs a place to live. Storage systems such as HDFS, object storage (such as Amazon S3), distributed relational databases like Galera and Citus, and NoSQL databases provide the medium from which information is consumed for analytics and machine learning. Because such systems frequently need to process data in large volumes, they also store information redundantly so that it can be read and written in parallel.
  • Compute. Compute provides the brains of Big Data. Compute systems such as Hadoop MapReduce and Spark supply the toolbox for scalable computation, which can be applied to analytics, stream processing, machine learning, and other workloads.
  • Streaming. Streaming systems like Apache Kafka provide ways to integrate Big Data technologies using a common interchange. This allows for systematic enrichment and analysis, as well as the integration of data from multiple sources. A rough sketch of how the three categories fit together follows this list.
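
To make the categories more concrete, the snippet below is a hypothetical PySpark sketch (not part of the lab that follows) in which raw events are read from object storage, aggregated on the compute cluster, and published to a Kafka topic. The bucket path, broker address, and topic name are placeholders, and the Kafka write assumes that the spark-sql-kafka connector and the S3A filesystem libraries are available.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Storage: read raw events from an S3-compatible bucket (path is a placeholder).
spark = SparkSession.builder.appName("category-demo").getOrCreate()
events = spark.read.json("s3a://example-bucket/events/")

# Compute: aggregate the events in parallel on the cluster.
counts = events.groupBy("event_type").agg(F.count("*").alias("total"))

# Streaming: publish the aggregated results to a Kafka topic for downstream
# consumers (broker address and topic name are placeholders).
(counts
    .selectExpr("to_json(struct(*)) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "event-counts")
    .save())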

Experimenting With Integrated Technologies

In this guide, we will walk you through the deployment of a Docker-based environment that provides:

  • Python 3.7, the standard Python-based data libraries, Spark (integrated with Python so that you can use the pyspark shell), TensorFlow, PyTorch, and Jupyter (along with its new integrated development environment, JupyterLab); a quick check of this stack is sketched just after this list
  • MinIO, an S3-compatible object store which is often used for cloud-native data storage
  • Apache Kafka, a high-performance distributed streaming platform used to exchange data between systems and to build the real-time streaming data pipelines that power analytics and machine learning
  • ZooKeeper, a runtime dependency of Apache Kafka
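
Once the environment is up and running (the deployment steps follow below), a quick way to confirm that the Python stack is wired together is to print the library versions from a notebook cell or Python console. This minimal check assumes nothing beyond the packages listed above being importable:

import sys
import pyspark
import tensorflow
import torch

# Print the interpreter and library versions available inside the Jupyter container.
print("Python:", sys.version.split()[0])
print("PySpark:", pyspark.__version__)
print("TensorFlow:", tensorflow.__version__)
print("PyTorch:", torch.__version__)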

The deployment files and a set of example notebooks that demonstrate the capabilities of the environment are located in the Oak-Tree PySpark public repository and have been validated to work on Ubuntu 18.04.2 LTS. In this article, you will perform the following steps to deploy the environment:

  • Installing and using Git
  • Installing and deploying Docker and Docker-Compose
  • Deploying and utilizing Spark, Jupyter, Kafka, and ZooKeeper as a set of integrated containers

Before following this guide, you will need access to an Ubuntu 18.04.2 LTS or later machine.

Task: Install the Prerequisites and Deploy JupyterLab

Code text in this guide represents terminal input and output. Any line that starts with $ is a command you must enter at the command line; lines without the $ show the resulting output. Example command and resulting output:

$ input command

output resulting from command

1. Open up a terminal and type in the following commands to install git.

$ sudo apt update

$ sudo apt install git

If you are following this guide as part of a course, the instructor will provide the password needed at the sudo password prompt. When prompted to continue [Y/n], press Y and hit Enter.

When git has finished installing, your terminal will look similar to the below image:

[Screenshot: git install output]


2. Install docker and docker-compose on your machine.

Upgrade the system packages and install docker.

$ sudo apt-get upgrade

$ sudo apt-get install docker.io


Verify that docker was installed properly.

$ sudo systemctl status docker

The output will look similar to the following:

[Screenshot: docker service status output]

Press Q to exit this output.


Install curl.

$ sudo apt-get install curl


Download docker-compose and save it as docker-compose in your home directory.

$ cd ~

$ sudo curl -L https://oak-tree.tech/documents/101/docker-compose-Linux-x86_64 -o ./docker-compose


Set executable permissions for docker-compose so that it can be used from the command line:

$ sudo chmod +x docker-compose


Test if docker-compose works:

$ ./docker-compose --version

docker-compose version 1.21.2, build a133471


Move the local file to /usr/local/bin:

$ sudo mv docker-compose /usr/local/bin


Validate docker-compose is in that directory:

$ which docker-compose

/usr/local/bin/docker-compose


Execute a docker-compose command to verify it has been installed properly.

$ docker-compose --version

docker-compose version 1.21.2, build a133471

3. Accessing JupyterLab

In the home directory, use git to clone the example files to your machine:

$ cd ~

$ git clone https://code.oak-tree.tech/courseware/oak-tree/pyspark-examples.git


Navigate inside the repository and start up the docker-compose deployment:

$ cd pyspark-examples/


Launch the services defined in the docker-compose.yaml file. This file deploys ZooKeeper, Spark, Jupyter, and Kafka instances. After the command executes, you will see a large number of log messages scroll through your terminal; these are produced by the instances as they are deployed.

$ sudo docker-compose up


Once the deployments have finished initializing, search the console output for the Jupyter URL.

You can use the Find function by pressing CTRL+SHIFT+F in your terminal.

Search for 127.0.0.1:8888 and you will find the entire Jupyter URL needed to access the environment.

[Screenshot: locating the Jupyter URL in the console output]


The complete Jupyter URL will look similar to http://127.0.0.1:8888/?token=0276e1837789712feay4982fh91274

Copy the URL by selecting it in the terminal and pressing CTRL+SHIFT+C, then paste it into a Firefox web browser. The browser will then present the JupyterLab interface.

[Screenshot: JupyterLab open in Firefox]


4. Accessing MinIO Storage

In the JupyterLab launcher, click the Terminal icon to open the in-browser terminal (the launcher also provides a Python 3 console).

You can type env and press Enter to view all of the environment variables; among them are the MinIO object storage access and secret keys. In the JupyterLab terminal, type the following command to show just the MinIO variables:

$ env | grep OBJECTS

[Screenshot: MinIO environment variables]

The OBJECTS_ENDPOINT URL can be used alongside the ACCESSID and SECRET to retrieve data from Jupyter. Several of the example notebooks show how these can be used from within Spark or by utilizing a library like boto3.
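
As a rough sketch of the boto3 approach, the snippet below builds an S3 client against the MinIO endpoint and lists the available buckets. The environment variable names used here (OBJECTS_ENDPOINT, OBJECTS_ACCESSID, and OBJECTS_SECRET) are assumptions based on the output above; confirm the exact names with env | grep OBJECTS before running it.

import os
import boto3

# Read the MinIO connection details from the container environment.
# The exact variable names are assumptions; check them with `env | grep OBJECTS`.
endpoint = os.environ["OBJECTS_ENDPOINT"]
access_key = os.environ["OBJECTS_ACCESSID"]
secret_key = os.environ["OBJECTS_SECRET"]

# Create an S3 client that points at the MinIO endpoint instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
)

# List the buckets available in the object store.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])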

You can access the MinIO storage UI from port 9000 of your localhost: http://127.0.0.1:9000. The username and password will be the ACCESSID and SECRET defined in the Jupyter environment.

[Screenshot: MinIO login page]
