# NumPy Primer
NumPy provides an implementation of a data structure called an N-dimensional array (`ndarray`).

* N-dimensional arrays can be used to model important mathematical structures such as vectors and matrices

\begin{equation*}
V =  \begin{bmatrix}
x_{1} \\
x_{2} \\
x_{3}
\end{bmatrix}
\end{equation*}

\begin{equation*}
M =  \begin{bmatrix}
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23} \\
x_{31} & x_{32} & x_{33}
\end{bmatrix}
\end{equation*}

* NumPy implements the core constructs in C, which makes them very efficient for computation
  - Standard Python containers such as lists hold references to individual, dynamically typed objects, which makes them easier to work with but requires more time and memory for computation
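
As a rough illustration of the difference, the sketch below compares a Python list with a NumPy array holding the same values. The names `values` and `arr` are illustrative only, and the size of the list container varies by platform and Python build, so the code inspects it without asserting a specific number.

```python
import sys
import numpy as np

# A Python list stores references to separately allocated int objects.
values = list(range(1000))

# A NumPy array stores the raw values in one contiguous, typed buffer.
arr = np.arange(1000, dtype=np.int64)

# Every element has the same fixed size, so the data buffer size is
# simply (number of elements) * (bytes per element).
assert arr.itemsize == 8
assert arr.nbytes == arr.size * arr.itemsize

# Size of the list container alone, excluding the int objects it points to
list_container_bytes = sys.getsizeof(values)

# Both approaches compute the same result; the NumPy version runs the
# loop in compiled code instead of the Python interpreter.
total_list = sum(values)
total_arr = int(arr.sum())
assert total_list == total_arr == 499500
```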

In [None]:
# Helper structures for working with the image data
from collections import OrderedDict

# Helpers for working with streams
# * requests is an http client for retrieving data
# * BytesIO provides a file-like object that can be provided to methods
#   expecting a file handle.
import requests
from io import BytesIO

# NumPy and plotting tools
import numpy as np
import matplotlib as plt
import matplotlib.pyplot  # makes the plt.pyplot submodule used below available
import seaborn as sns

In [None]:
# Creating vectors and matrices in Python

# Example 1: Vector (1 dimension)
v = np.array((1, 0, 1))

# Example 2: Matrix (2 dimensions)
m = np.array([
        [2, 3, 5],
        [7, 1, 6],
        [9, 10, 15]])

NumPy includes functions to generate arrays directly.

In [None]:
# Example 3: Create an array with a range of values from 0
# up to (but not including) 1000, stepping by 2.
e3 = np.arange(0, 1000, step=2)

# Example 4: Create a zero-filled three-dimensional array
# with 400 rows, 400 columns, and 3 layers. This structure
# might be used to represent a color image that is 400
# by 400 pixels.
e4 = np.zeros((400, 400, 3))

print("Example 3, Range:\n%r" % e3)
print("Example 4, Zero Matrix:\n%r" % e4)
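
Beyond `np.arange` and `np.zeros`, a few other constructors are commonly used to generate arrays. The following is a small sketch of some of them; the variable names are illustrative and not part of the examples above.

```python
import numpy as np

# Evenly spaced values with the endpoint included (unlike np.arange)
lin = np.linspace(0.0, 1.0, num=5)

# Arrays filled with a constant value
ones = np.ones((2, 3))
sevens = np.full((2, 2), 7)

# 3x3 identity matrix
ident = np.eye(3)

# np.arange excludes the stop value
evens = np.arange(0, 10, step=2)

print("linspace:\n%r" % lin)
print("arange:\n%r" % evens)
```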

In [None]:
from functools import reduce

# Shape is the core property describing the size of the array.
# It is a tuple describing the size of all of the dimensions.
# For a three dimensional array, it would have three elements:
# * number of rows
# * number of columns
# * number of layers

# Convenience functions are available on the array that 
# describe the number of dimensions and the total number of elements.
# These can also be calculated from the shape.

# ndim is the number of dimensions, equivalent to the length of shape
assert e4.ndim == len(e4.shape)

# size is the total number of elements
assert e4.size == reduce(lambda l,r: l*r , e4.shape)
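
The same relationships can be checked on a smaller array where the numbers are easy to verify by hand. This is a sketch only: `small` is a hypothetical array, and `math.prod` assumes Python 3.8 or later.

```python
import math
import numpy as np

# Hypothetical small array: 2 rows, 3 columns, 4 layers
small = np.zeros((2, 3, 4))

assert small.shape == (2, 3, 4)
assert small.ndim == 3
assert small.size == math.prod(small.shape) == 24

# dtype describes the type of each element; np.zeros defaults to float64
assert small.dtype == np.float64

# Indexing with one index per dimension selects a single element
small[1, 2, 3] = 5.0
assert small.sum() == 5.0
```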

In [None]:
# Example 5.1: Flatten a three-dimensional structure to one dimension.
# The -1 tells NumPy to determine the required length automatically.
e51 = e4.reshape((-1,))

# Example 5.2: Restore the original three-dimensional structure.
e52 = e51.reshape((400, 400, -1))

# Example 6: Create a two-dimensional structure from a one-dimensional array
e6 = np.arange(0, 12).reshape((3, 4))

# Example 7: Flatten an array
e7 = e4.flatten()
assert e51.shape == e7.shape


print('Example 5.1:\n%r' % e51)
print('Example 6:\n%r' % e6)
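
One practical difference between `reshape` and `flatten` is whether the result shares memory with the original array. The sketch below uses a hypothetical six-element array; `reshape` returns a view when the memory layout allows it (as it does here), while `flatten` always returns a copy.

```python
import numpy as np

base = np.arange(6)           # [0 1 2 3 4 5]

# -1 lets NumPy infer the missing dimension: 6 elements / 2 rows = 3 columns
grid = base.reshape((2, -1))
assert grid.shape == (2, 3)

# reshape returns a view of the same memory here, so writes through
# the view are visible in the original array
assert np.shares_memory(base, grid)
grid[0, 0] = 99
assert base[0] == 99

# flatten always returns an independent copy
flat = grid.flatten()
assert not np.shares_memory(grid, flat)
flat[0] = -1
assert grid[0, 0] == 99
```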

#### Computation Included
NumPy provides efficient implementations of low-level mathematical operations on the data.

In [None]:
# Transformations/scaling
v*3

In [None]:
# Linear algebra: matrix-vector (dot) product
m.dot(v)
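
To make the two operations concrete, the sketch below repeats them on the same `v` and `m` defined earlier and writes out the arithmetic for one row. The names `scaled`, `product`, and `elementwise` are illustrative.

```python
import numpy as np

v = np.array((1, 0, 1))
m = np.array([
    [2, 3, 5],
    [7, 1, 6],
    [9, 10, 15]])

# Scaling multiplies every element by the scalar
scaled = v * 3            # [3 0 3]

# Matrix-vector product: each output element is the dot product of
# one row of m with v, e.g. row 0: 2*1 + 3*0 + 5*1 = 7
product = m.dot(v)        # [ 7 13 24]

# The @ operator is equivalent to .dot for these shapes
assert np.array_equal(product, m @ v)

# Element-wise multiplication (*) is a different operation: v is
# broadcast across the rows of m
elementwise = m * v
```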

## Modeling Data with NumPy
Because of the concise way that data can be represented in NumPy, it forms the foundation of many types of numerical computing.

* Images
* Tabulated Data

### Images
Images are represented as a set of values organized into a table.

* The table will have a set of dimensions corresponding to its height and width.
  - For grayscale ("black and white") images, the value of a particular "cell" is the brightness of that pixel: 0 for black and a value like 255 (the maximum value of an unsigned 8-bit integer) for white.
  - For color images, there is more than one table. Each table is called a "channel".
  - In most computer systems, there are three channels corresponding to red (r), green (g), and blue (b).
* Images are encoded in a "format."
  - There are many libraries that can be used to read an image and decode it to a NumPy array.
  - `imageio` is one that is used commonly with NumPy
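
The table-and-channel description above can be sketched with a tiny synthetic image, without loading any file. The array `img` is hypothetical, and averaging the channels is just one simple way to obtain a grayscale value, not necessarily what an image library does.

```python
import numpy as np

# Hypothetical 2x3 pixel color image: height 2, width 3, 3 channels.
# uint8 gives each value the range 0 (dark) to 255 (bright).
img = np.zeros((2, 3, 3), dtype=np.uint8)

# Paint the top-left pixel pure red and the bottom-right pixel white
img[0, 0] = (255, 0, 0)
img[1, 2] = (255, 255, 255)

# Each channel is a two-dimensional table with the image's height and width
red, green, blue = img[:, :, 0], img[:, :, 1], img[:, :, 2]
assert red.shape == (2, 3)

# A simple grayscale version: the mean of the three channels per pixel
gray = img.mean(axis=2)
assert gray.shape == (2, 3)
assert gray[1, 2] == 255.0 and gray[0, 0] == 85.0
```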

In [None]:
import imageio # ImageIO handles loading of image data
from matplotlib.pyplot import imshow as plt_imshow

# When working in a Jupyter notebook, PIL is often used to allow for
# display and visualization of the image at points in a processing
# pipeline. The submodules are imported explicitly so that
# IPython.display and PIL.Image are guaranteed to be available.
import IPython.display, PIL.Image

CHANNEL_LABELS = OrderedDict((
    ('r', 'red'),
    ('g', 'green'),
    ('b', 'blue')
))

%matplotlib inline

##### Example Image 1

In [None]:
# Example 1: Retrieve an image from a remote website and create an array.
# Light colored image.

# Retrieve
r_img_e1 = requests.get(
    "https://oak-tree.tech/documents/115/resnet.wolf.jpg")
img_e1_arr = imageio.imread(BytesIO(r_img_e1.content))

# Check the dimensions of the image
type(img_e1_arr), img_e1_arr.shape

Loading and working with images:

* `BytesIO` wraps in-memory bytes in a "file-like" object
  - In Python, nearly all file input/output happens through a "file-like" object, so the downloaded bytes can be passed to any function that expects an open file
* `requests` fetches the remote data over HTTP
* The `content` of the response is used to create a stream, which is then read by ImageIO to create an array object
* The array that is created by ImageIO is 2668 pixels by 4000 pixels with three channels (RGB)
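
The file-like pattern can be tried without a network connection. In the sketch below, `payload` is a hypothetical stand-in for the bytes of an HTTP response body, and `np.save`/`np.load` play the role that the image decoder plays above.

```python
from io import BytesIO
import numpy as np

original = np.arange(6, dtype=np.uint8).reshape((2, 3))

# Write the array to an in-memory buffer instead of a file on disk
buffer = BytesIO()
np.save(buffer, original)

# The raw bytes stand in for what an HTTP response body would contain
payload = buffer.getvalue()

# Wrap the bytes in a new file-like object and read the array back
restored = np.load(BytesIO(payload))
assert np.array_equal(original, restored)
```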

In [None]:
# Example 1: Display the image data
IPython.display.display(PIL.Image.fromarray(img_e1_arr))

When working with images, we often care a great deal about the distribution of light and dark values. This distribution is usually summarized as a histogram of the pixel values in each channel.
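
Before plotting, it can help to see what a histogram is numerically. The sketch below counts pixel values with `np.histogram` on a hypothetical 4x4 single-channel image; the bin settings (one bin per 8-bit value) are an assumption chosen for illustration.

```python
import numpy as np

# Hypothetical 4x4 single-channel image: left half dark, right half bright
channel = np.zeros((4, 4), dtype=np.uint8)
channel[:, 2:] = 200

# One bin per possible 8-bit value (0..255)
counts, edges = np.histogram(channel, bins=256, range=(0, 256))

assert counts.sum() == channel.size   # every pixel lands in one bin
assert counts[0] == 8                 # eight dark pixels
assert counts[200] == 8               # eight bright pixels
```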

In [None]:
# Example 1: Calculate histograms from the NumPy arrays

# Step 1.1: Flatten the two-dimensional table for each channel to a vector
# and stack the vectors into one array with a row per channel
cdata1 = np.array(
    [img_e1_arr[:,:,i].flatten() for i in range(0, img_e1_arr.shape[2])])
    
# Step 1.2: Visualize and plot the distributions of pixel data

# Step 1.2.1: Create an output figure for the histograms
plt.pyplot.figure(figsize=(30, 30))

for i, (ccode, clabel) in enumerate(CHANNEL_LABELS.items()):
    
    # Step 1.2.2: Plot image channel histogram
    plt.pyplot.subplot(3, 2, i*2+1)
    plt.pyplot.title(
        "Histogram Values for %s Channel" % clabel.title(), 
        fontsize=30)
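    # cdata1[i] holds the flattened pixel values for channel i.
    # histplot replaces distplot, which is deprecated in seaborn 0.11+;
    # pass kde=True to also draw the density curve that distplot added.
    sns.histplot(cdata1[i], color=ccode)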
    
    # Step 1.2.3: Plot image channel data
    sub = plt.pyplot.subplot(3, 2, i*2+2)
    sub.imshow(img_e1_arr[:,:,i], 
        interpolation='nearest', cmap='%ss' % clabel.title())
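
The per-channel flattening in Step 1.1 can also be written as a single reshape followed by a transpose. The sketch below checks that the two forms agree on a hypothetical 2x2 image; `img` is a stand-in for the loaded array, not the real data.

```python
import numpy as np

# Hypothetical stand-in for a loaded image: 2 x 2 pixels, 3 channels
img = np.arange(12, dtype=np.uint8).reshape((2, 2, 3))

# Approach used in Step 1.1: flatten each channel separately
per_channel = np.array(
    [img[:, :, i].flatten() for i in range(img.shape[2])])

# Equivalent reshape: one row per pixel, one column per channel,
# then transpose so each row holds one channel
reshaped = img.reshape((-1, img.shape[2])).T

assert per_channel.shape == (3, 4)
assert np.array_equal(per_channel, reshaped)
```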

##### Example Image 2

In [None]:
# Example 2: Dark Colored Image
r_img_e2 = requests.get(
    'https://oak-tree.tech/documents/117/resnet.horse-bridle.jpg')
img_e2_arr = imageio.imread(BytesIO(r_img_e2.content))

# Display the image
IPython.display.display(PIL.Image.fromarray(img_e2_arr))

In [None]:
# Step 2.1: Segment the pixel data by channel
cdata2 = np.array(
    [img_e2_arr[:,:,i].flatten() for i in range(0, img_e2_arr.shape[2])])

# Step 2.2: Visualize and plot the distributions of pixel data

# Step 2.2.1: Create an output figure for the histograms
plt.pyplot.figure(figsize=(30, 30))

for i, (ccode, clabel) in enumerate(CHANNEL_LABELS.items()):
    
    # Step 2.2.2: Plot image channel histogram
    plt.pyplot.subplot(3, 2, i*2+1)
    plt.pyplot.title(
        "Histogram Values for %s Channel" % clabel.title(), 
        fontsize=30)
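    # cdata2[i] holds the flattened pixel values for channel i.
    # histplot replaces distplot, which is deprecated in seaborn 0.11+;
    # pass kde=True to also draw the density curve that distplot added.
    sns.histplot(cdata2[i], color=ccode)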
    
    # Step 2.2.3: Plot image channel data
    sub = plt.pyplot.subplot(3, 2, i*2+2)
    sub.imshow(img_e2_arr[:,:,i], 
        interpolation='nearest', cmap='%ss' % clabel.title())

### Tabular Data
The simplest form of structured data encountered in data science is tabular data, which is typically stored in a spreadsheet, a CSV (comma-separated values) file, or a database.

* A table of information containing one row per sample (or record)
* Each column contains one piece of information about the sample (a variable)
  - Columns may contain numerical values or labels
  - Possible for columns to be related to one another in a "time series"
* Low-level structures such as NumPy arrays normally encode data numerically
  - Higher level structures often have other types of data, such as strings for encoding more complex relationships
  - To be useful for machine learning, higher-level data often needs to be encoded to numerical values and represented by lower-level structures
* Working with tabular data is typically done in Pandas, but it can be useful to load data at the lower level
  - Loading directly to a NumPy array allows quick conversion to PyTorch or TensorFlow tensors for deep learning
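
As a minimal sketch of what encoding labels numerically can look like, the example below maps a hypothetical column of string labels to integer codes with `np.unique`. This is one simple scheme among several; the label values are invented for illustration.

```python
import numpy as np

# Hypothetical label column from a table
labels = np.array(['low', 'high', 'medium', 'low', 'high'])

# np.unique returns the sorted distinct labels and, with
# return_inverse=True, the index of each original value in that list
classes, codes = np.unique(labels, return_inverse=True)

# classes -> ['high' 'low' 'medium'], codes -> [1 0 2 1 0]
assert classes.tolist() == ['high', 'low', 'medium']
assert codes.tolist() == [1, 0, 2, 1, 0]

# The codes can be stored alongside numeric columns as float32
encoded = codes.astype(np.float32)

# The original labels are recoverable from the codes
assert np.array_equal(classes[codes], labels)
```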

In [None]:
import csv
from io import StringIO

# Example: Fetch Remote Data and Load to NumPy 2D Array
r_wdata = requests.get(
    'https://oak-tree.tech/documents/96/framingham.csv')

# Check success of the remote request and raise an error if the
# response has an error status code (4xx or 5xx)
if not r_wdata.ok:
    raise ValueError('Unable to retrieve %s. Status code: %s.'
        % (r_wdata.url, r_wdata.status_code))
    
# Replace \r with \n so that files with carriage-return line endings
# are split into rows correctly
wdata = StringIO(r_wdata.content.decode('utf-8').replace('\r', '\n'))

# Load data directly from data file to NumPy as 32 bit float. Encode missing or invalid (str)
# data as NaN (not a number).
wdata_arr = np.genfromtxt(wdata, dtype=np.float32, delimiter=",", skip_header=1)
wdata_arr
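
The NaN encoding can be seen on a small in-memory CSV. The column names and values below are hypothetical and chosen only to show one empty field and one non-numeric field; the behavior relies on the default settings of `np.genfromtxt`.

```python
from io import StringIO
import numpy as np

# Hypothetical CSV content with one empty field and one text value
raw = "age,bmi,glucose\n39,26.97,77\n46,,76\n48,25.34,NA\n"

table = np.genfromtxt(
    StringIO(raw), dtype=np.float32, delimiter=",", skip_header=1)

assert table.shape == (3, 3)

# The empty field and the non-numeric "NA" are both read as NaN
assert np.isnan(table[1, 1]) and np.isnan(table[2, 2])
assert int(np.isnan(table).sum()) == 2
```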

In [None]:
# Check the shape of the NumPy array against the data
wdata_csv = csv.DictReader(
    StringIO(r_wdata.content.decode('utf-8').replace('\r', '\n')),
    delimiter=',')

assert wdata_arr.shape[1] == len(wdata_csv.fieldnames)

print('NumPy Array Rows/Columns: %s' % str(wdata_arr.shape))
print('CSV Columns: %s. Column names:\n%s.' 
    % (len(wdata_csv.fieldnames), '\n'.join(wdata_csv.fieldnames)))

In [None]:
# Inspect data import to determine if the array contains missing values.
# Summing the boolean mask along axis 0 counts the NaN values per column.
np.isnan(wdata_arr).sum(axis=0)
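
Once missing values have been located, a few common follow-up steps are counting them, dropping incomplete rows, and computing statistics that skip them. The sketch below uses a small hypothetical table rather than the downloaded data.

```python
import numpy as np

# Hypothetical table with missing values
table = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0]], dtype=np.float32)

missing = np.isnan(table)

per_column = missing.sum(axis=0)     # [0 1 1]
total = int(missing.sum())           # 2

# Keep only the rows without any missing value
complete_rows = table[~missing.any(axis=1)]
assert complete_rows.shape == (1, 3)

# nan-aware reductions skip missing values
col_means = np.nanmean(table, axis=0)   # [4.  6.5 6. ]
```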