Rob Oakes
Mar 31, 2020

Modeling Data With NumPy: Tabulated/Structured Data

The simplest form of information encountered in many types of data analytics is that of "tabulated" or structured data. This is the type of information that is found sitting in a spreadsheet, a structured file (like comma-separated values or CSV), or in a database. It usually has a "schema" that describes names of columns, their values, what type of data they represent (string, number, boolean, date/time, and so forth), and what they mean.

Tabulated data usually follows a set of rules:

  • Tables typically contain one row per sample (or record).
  • Columns contain one piece of information about a metric of interest (variable).
    • The values encoded in the columns have numerical values or labels.
    • It is possible for rows to be related to one another, or to otherwise be part of a "time series."

Within Python, tabulated data is often analyzed using Pandas or Spark. This is because both higher-level libraries provide support for more sophisticated structures such as date times and re-encoding strings to categorical relationships. Even so, there can be value in loading and working with tabulated data directly in NumPy.

Data that will be part of machine learning workflows, for example, often needs to be encoded directly to numerical values and represented with lower-level structures before it can be used to create models. Using NumPy and directly managing the loading and cleaning process provides direct control, which may be beneficial when combining tabulated data with other complex data features such as image data.
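As a minimal sketch of what encoding to numerical values can look like, the snippet below maps a column of string labels to integer codes with np.unique. The label values and the variable name smoker are made up for illustration and are not taken from the dataset used later in this article.

```python
import numpy as np

# Hypothetical column of string labels (illustrative only)
smoker = np.array(['yes', 'no', 'no', 'yes', 'no'])

# np.unique returns the sorted unique labels; with return_inverse=True it
# also returns, for each element, the index of its label in that sorted list
labels, codes = np.unique(smoker, return_inverse=True)

print(labels)  # the distinct labels, sorted
print(codes)   # one integer code per original value
```

The integer codes can then be stored alongside other numeric columns in a single array.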

In this article, we'll look at an example of how to fetch structured text data from a remote website and load it into a NumPy array via the genfromtxt function. We will then look at how to ensure that the data was imported correctly and inspect the resulting array for missing values.

This is Part 2 in a larger series showing how NumPy can be used to model data for machine learning and other types of analytics. If you have not already done so, check out Part 1, which provides an overview of NumPy and shows how to work with image data. A Jupyter notebook with the code in the article can be downloaded here. A Docker environment with NumPy, Pandas, and other dependencies needed can be found here.

Import Dependencies and Supporting Libraries

The code listing below imports NumPy and helper utilities used in the example. These include:

  • numpy
  • requests
  • StringIO, which provides a file-like object interface for Unicode string data. File-like objects are the default interface used by Python to handle input/output (IO) operations.
  • csv, which provides classes and methods for reading CSV files.
import numpy as np

# Helpers for working with streams
# * requests is an http client for retrieving data
# * StringIO provides a file-like object that can be provided to methods
#   expecting a file handle.
import requests
from io import StringIO

# Helper methods for reading and writing CSV files
import csv
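
To illustrate what "file-like" means in practice, here is a small sketch that wraps a string in StringIO and reads it the way a file would be read. The CSV text is invented for the example. One point worth noticing is that a stream is consumed as it is read, so it has to be rewound or re-created before a second reader can use it.

```python
from io import StringIO
import csv

# Illustrative CSV text; in the article the text comes from an HTTP response
text = 'age,bmi\n39,26.97\n46,28.73\n'

stream = StringIO(text)

# Readers consume the stream, just as they would an open file
rows = list(csv.reader(stream))

# After reading, the stream is positioned at the end, so nothing is left
leftover = stream.read()

# Rewind (or create a new StringIO) before reading the data again
stream.seek(0)
header = stream.readline().strip()
```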

Retrieving and Loading Data

There are two functions that can be used to import text data within NumPy:

  • loadtxt(file_name, dtype=float): load input data from a text file (or file-like object) and convert it to an array with a particular type. Files loaded with loadtxt cannot have missing values or values that cannot be converted to the requested type. If such values are present, loadtxt will raise a ValueError.
  • genfromtxt(file_name): works similarly to loadtxt, but will gracefully handle missing values or non-numerical values. It can use different delimiters, skip the header row of a CSV or tab-separated file (TSV), and attempt to detect the data type of the array.
    • If dtype=None is passed to the function, it will attempt to determine the content type of each column individually.
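
The difference between the two functions can be seen in a short sketch. The inline data below is invented for illustration: the first string has an empty field, and the second mixes a numeric column with a text column.

```python
import numpy as np
from io import StringIO

# Illustrative data: the second row has an empty (missing) middle field
text = '1,2,3\n4,,6\n'

# loadtxt cannot convert the empty field and raises ValueError
try:
    np.loadtxt(StringIO(text), delimiter=',')
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

# genfromtxt fills the missing field with NaN instead
arr = np.genfromtxt(StringIO(text), delimiter=',')

# With dtype=None, each column gets its own type and the result is a
# structured array with auto-generated field names (f0, f1, ...)
mixed = np.genfromtxt(StringIO('1,a\n2,b\n'), delimiter=',',
                      dtype=None, encoding='utf-8')
```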

While both methods are similar, it's not always known if structured text documents will be complete or have inconsistencies. For this reason, genfromtxt is more often used. The code listing below shows how to:

  • Retrieve a CSV file from the web and ensure that the request was successful. If the request has a non-200 status code (something other than OK), the code raises a ValueError and stops execution. When working with remote data sources, it is a good idea to ensure that a request was successful prior to processing the data.
  • Decode the UTF-8 byte stream into a string and initialize a StringIO instance so that the data can be read by genfromtxt. To prevent errors during parsing, every carriage return (\r) in the string is replaced with a newline character (\n). Carriage returns are an artifact of Windows programs, which end lines with both \r and \n. The convention on Unix and Linux systems is to use only the newline.
  • Convert the file to a two-dimensional array using np.genfromtxt.
    • In the method call, we specify the data type as 32 bit float (np.float32). Any values that cannot be converted to float will be encoded as NaN (not a number).
# Example: Fetch Remote Data and Load to NumPy 2D Array
r_wdata = requests.get(
    'https://oak-tree.tech/documents/96/framingham.csv')

# Check success of remote request, throw an error if
# the request has anything other than a 200 status code (OK)
if not r_wdata.ok:
    raise ValueError('Unable to retrieve %s. Status code: %s.'
        % (r_wdata.url, r_wdata.status_code))
    
# Replace \r with \n, prevent errors on Windows
wdata = StringIO(r_wdata.content.decode('utf-8').replace('\r', '\n'))

# Load data directly from data file to NumPy as 32 bit float. 
# Encode missing or invalid (str) data as NaN (not a number).
wdata_arr = np.genfromtxt(wdata, dtype=np.float32, delimiter=",", skip_header=1)
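
The same decode, normalize, and parse steps can be tried without a network connection. The sketch below uses a small, made-up byte string with Windows line endings in place of the downloaded file; the column names and values are assumptions for illustration only.

```python
import numpy as np
from io import StringIO

# Illustrative payload with Windows line endings (\r\n), a header row,
# one empty field, and one non-numeric field
raw = b'age,bmi,glucose\r\n39,26.97,77\r\n46,,76\r\n48,25.34,unknown\r\n'

# Same decoding and newline normalization as the listing above
text = raw.decode('utf-8').replace('\r', '\n')

# The blank lines produced by the replacement are skipped by genfromtxt;
# the empty and non-numeric fields become NaN
sample_arr = np.genfromtxt(
    StringIO(text), dtype=np.float32, delimiter=',', skip_header=1)

print(sample_arr.shape)
print(sample_arr)
```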

Check Integrity of the Imported File

When loading files from remote sources, it is often good to check the integrity of the data after load. This usually involves two steps:

  1. Check that the file was parsed correctly with the correct options, all columns were converted to cells, and that the resulting dimensions match what is expected.
  2. Inspect the data to determine if there are any missing or incomplete values. Within NumPy, missing data might be caused by incorrectly coded values (the use of a string label within what is otherwise a numerically encoded column, for example), or data that is genuinely missing/null.

Check File Dimensions and Shape

A quick way to check the shape of the file is to partially import the data using an alternative library or method. Python provides a number of ways to work with CSV data, but the most convenient is probably the csv module.

The code listing below parses the header of the CSV file using a DictReader. It then compares the number of imported columns to the column headers from the DictReader.

# Check the shape of the NumPy array against the data
# (the file is comma-separated, so only the delimiter needs to be specified)
wdata_csv = csv.DictReader(
    StringIO(r_wdata.content.decode('utf-8').replace('\r', '\n')),
    delimiter=',')
    
assert wdata_arr.shape[1] == len(wdata_csv.fieldnames)

print('NumPy Array Rows/Columns: %s' % str(wdata_arr.shape))
print('CSV Columns: %s. Column names:\n%s.' 
    % (len(wdata_csv.fieldnames), '\n'.join(wdata_csv.fieldnames)))
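
The listing above compares only the number of columns. A similar check can be made for rows; the following is a minimal sketch that uses invented inline data in place of the downloaded file and counts records with csv.reader.

```python
import csv
import numpy as np
from io import StringIO

# Illustrative CSV text standing in for the downloaded file
text = 'age,bmi,glucose\n39,26.97,77\n46,28.73,76\n48,25.34,70\n'

sample_arr = np.genfromtxt(StringIO(text), delimiter=',', skip_header=1)

# Parse the same text with csv.reader, dropping any blank lines
csv_rows = [row for row in csv.reader(StringIO(text)) if row]
header, records = csv_rows[0], csv_rows[1:]

# Both dimensions should agree between the two parsers
assert sample_arr.shape == (len(records), len(header))
```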

Checking for Null Values (NaN)

Within NumPy, np.nan is used to represent a missing or invalid value. It will commonly appear in both NumPy arrays and Pandas data frames when importing a dataset with blank cells.

A count of the missing values within this dataset can be obtained using the np.isnan function available in the NumPy module and Python's built-in sum function.

  • np.isnan creates a boolean array the same shape as the input array, with True values corresponding to the locations of nan values.
  • When this is passed to sum, the rows of the boolean array are added together, so the number of True values in each column is tallied and reported in a new vector.
# Inspect data import to determine if the array contains missing values
sum(np.isnan(wdata_arr))
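
For comparison, the same per-column tally can be written with NumPy's own reductions, which also make it easy to count missing values across the whole array or per row. The small array below is made up for illustration.

```python
import numpy as np

# Illustrative array with three missing values
sample = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, np.nan],
                   [7.0, np.nan, 9.0]], dtype=np.float32)

mask = np.isnan(sample)

nan_per_column = mask.sum(axis=0)       # NaN count for each column
nan_total = int(mask.sum())             # NaN count for the whole array
rows_with_nan = mask.any(axis=1).sum()  # rows containing at least one NaN

print(nan_per_column, nan_total, rows_with_nan)
```

Passing axis=0 gives the same per-column counts as the built-in sum, while stating the direction of the reduction explicitly.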