Modeling Data with NumPy: Tabulated/Structured Data
The simplest form of information encountered in many types of data analytics is "tabulated" or structured data. This is the type of information found sitting in a spreadsheet, a structured file (such as comma-separated values, or CSV), or a database. It usually has a "schema" that describes the names of the columns, what type of data they contain (string, number, boolean, date/time, and so forth), and what the values mean.
Tabulated data usually follows a set of rules:
- Tables typically contain one row per sample (or record).
- Columns contain one piece of information about a metric of interest (a variable).
- The values encoded in the columns are either numerical or categorical labels.
- Rows may be related to one another, or may otherwise form part of a "time series."
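To make these rules concrete, the sketch below builds a toy table as a NumPy structured array. The column names and values are made up for illustration:

```python
import numpy as np

# A toy "table": one row per record, with named and typed columns
records = np.array(
    [(63, 1, 130.0),
     (47, 0, 118.5),
     (55, 1, 141.2)],
    dtype=[('age', 'i4'), ('smoker', 'i4'), ('sys_bp', 'f4')])

print(records['age'])   # access a column (variable) by name
print(records[0])       # access a single row (record)
```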
Within Python, tabulated data is often analyzed using Pandas or Spark, since both of these higher-level libraries provide support for more sophisticated structures, such as date/times, and for re-encoding strings as categorical values. Even so, there can be value in loading and working with tabulated data directly in NumPy.
Data that will be part of machine learning workflows, for example, often needs to be encoded as numerical values and represented with lower-level structures before it can be used to create models. Using NumPy to manage the loading and cleaning process directly provides fine-grained control, which may be beneficial when combining tabulated data with other complex data features such as image data.
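For instance, re-encoding a column of string labels as integer codes is a one-liner in NumPy. A minimal sketch, using a hypothetical column of labels:

```python
import numpy as np

# Hypothetical column of string labels to be encoded numerically
labels = np.array(['male', 'female', 'female', 'male', 'female'])

# np.unique returns the distinct categories; return_inverse=True
# also returns the integer code of each original entry
categories, codes = np.unique(labels, return_inverse=True)

print(categories)   # ['female' 'male']
print(codes)        # [1 0 0 1 0]
```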
In this article, we'll look at an example of how to fetch structured text data from a remote website and load it into a NumPy array via the `genfromtxt` method. We will then look at how to ensure that the data imported correctly and inspect the resulting array for missing values.
This is Part 2 in a larger series showing how NumPy can be used to model data for machine learning and other types of analytics. If you have not already done so, check out Part 1, which provides an overview of NumPy and shows how to work with image data. A Jupyter notebook with the code in the article can be downloaded here. A Docker environment with NumPy, Pandas, and other dependencies needed can be found here.
Import Dependencies and Supporting Libraries
The code listing below imports NumPy and helper utilities used in the example. These include:
- `numpy`
- `requests`, an HTTP client used to retrieve data over the web
- `StringIO`, which provides a file-like object interface for Unicode string data. File-like objects are the default interface used by Python to handle input/output (IO) operations.
- `csv`, which provides classes and methods for reading CSV files.
```python
import numpy as np

# Helpers for working with streams
# * requests is an http client for retrieving data
# * StringIO provides a file-like object that can be provided to methods
#   expecting a file handle.
import requests
from io import StringIO

# Helper methods for reading and writing CSV files
import csv
```
Retrieving and Loading Data
There are two methods that can be used to import text data within `numpy`:
- `loadtxt(file_name, dtype=float)`: loads input data from a text file (or file-like object) and converts it to an array of a particular type. Files loaded with `loadtxt` cannot have missing or non-numerical values. If present, `loadtxt` will throw a `ValueError`.
- `genfromtxt(file_name)`: works similarly to `loadtxt`, but will gracefully handle missing or non-numerical values. It can use different delimiters, skip the header row of a CSV or tab-separated (TSV) file, and attempt to decode the data type of the array.
  - If `dtype=None` is passed, the method will attempt to determine the content type of each column individually.
While both methods are similar, it's not always known whether a structured text document will be complete or contain inconsistencies. For this reason, `genfromtxt` is more often used.
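The difference can be seen with a small in-memory example (the data below is made up for illustration):

```python
import numpy as np
from io import StringIO

# Toy CSV with a header row and a missing value in the second record
data = "a,b,c\n1,2,3\n4,,6\n"

# loadtxt raises a ValueError on the empty field
try:
    np.loadtxt(StringIO(data), delimiter=',', skiprows=1)
except ValueError as err:
    print('loadtxt failed: %s' % err)

# genfromtxt substitutes NaN for the missing value
print(np.genfromtxt(StringIO(data), delimiter=',', skip_header=1))
# [[ 1.  2.  3.]
#  [ 4. nan  6.]]
```

The code listing below shows how to: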
- Retrieve a CSV file from the web and ensure that the request was successful. If the request has a non-200 status code (something other than OK), the code raises a `ValueError` and stops execution. When working with remote data sources, it is a good idea to ensure that a request was successful prior to processing the data.
- Decode the byte stream into a UTF-8 encoded string and initialize a `StringIO` instance so that the data can be read by `genfromtxt`. To prevent errors during parsing of the file, a search operation is executed on the string which finds all carriage return (`\r`) instances and replaces them with newline characters (`\n`). Carriage returns are an artifact of some Windows programs that incorporate both `\r` and `\n` into files; the convention on Unix and Linux systems is to use just the newline.
- Convert the file to a two-dimensional array using `np.genfromtxt`.
  - In the method call, we specify the data type as 32-bit float (`np.float32`). Any values that cannot be converted to float will be encoded as NaN (not a number).
```python
# Example: Fetch Remote Data and Load to NumPy 2D Array
r_wdata = requests.get(
    'https://oak-tree.tech/documents/96/framingham.csv')

# Check success of remote request, throw an error if
# the request has anything other than a 200 status code (OK)
if not r_wdata.ok:
    raise ValueError('Unable to retrieve %s. Status code: %s.'
                     % (r_wdata.url, r_wdata.status_code))

# Replace \r with \n, prevent errors on Windows
wdata = StringIO(r_wdata.content.decode('utf-8').replace('\r', '\n'))

# Load data directly from data file to NumPy as 32 bit float.
# Encode missing or invalid (str) data as NaN (not a number).
wdata_arr = np.genfromtxt(wdata, dtype=np.float32,
                          delimiter=",", skip_header=1)
```
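Before moving on, it can be worth eyeballing the result. A quick, informal spot check:

```python
# Informal spot check: dimensions and the first few rows
print(wdata_arr.shape)
print(wdata_arr[:3])
```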
Check Integrity of the Imported File
When loading files from remote sources, it is good practice to check the integrity of the data after it has been loaded. This usually involves two steps:
- Check that the file was parsed with the correct options, that all columns were converted, and that the resulting dimensions match what is expected.
- Inspect the data to determine if there are any missing or incomplete values. Within NumPy, missing data might be caused by incorrectly coded values (the use of a string label within what is otherwise a numerically encoded column, for example), or by data that is genuinely missing/null.
Check File Dimensions and Shape
A quick way to check the shape of the file is to partially import the data using an alternative library or method. Python provides a number of ways to work with CSV data, but the most convenient is probably the `csv` module.
The code listing below parses the header of the CSV file using a `DictReader`. It then compares the number of imported columns in the NumPy array to the column headers from the `DictReader`.
```python
# Check the shape of the NumPy array against the data
wdata_csv = csv.DictReader(
    StringIO(r_wdata.content.decode('utf-8').replace('\r', '\n')),
    delimiter=',', dialect=csv.excel_tab)

assert wdata_arr.shape[1] == len(wdata_csv.fieldnames)

print('NumPy Array Rows/Columns: %s' % str(wdata_arr.shape))
print('CSV Rows/Columns: %s. Column names:\n%s.'
      % (len(wdata_csv.fieldnames), '\n'.join(wdata_csv.fieldnames)))
```
Checking for Null Values (`NaN`)
Within NumPy, `np.nan` is used to represent a missing or invalid value. It will commonly appear in both NumPy arrays and Pandas data frames when importing a dataset with blank cells.
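One subtlety worth noting: `NaN` does not compare equal to anything, including itself, which is why `np.isnan` (rather than `==`) must be used to find missing entries:

```python
print(np.nan == np.nan)   # False: NaN is never equal to itself
print(np.isnan(np.nan))   # True: np.isnan is the reliable test
```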
A count of the missing values within this dataset can be obtained using the `np.isnan` method available in the NumPy module and the built-in `sum` function.
- `np.isnan` creates a "boolean" array the same size as the input matrix, with `True` values corresponding to the locations of `nan` values.
- When the boolean array is passed to `sum`, the `True` values in each column are tallied and reported in a new vector.
```python
# Inspect data import to determine if the array contains missing values
sum(np.isnan(wdata_arr))
```
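An equivalent tally can be produced with the array's own `sum(axis=0)`. Pairing the counts with the column names captured by the `DictReader` above makes the report easier to read (a small sketch reusing `wdata_csv.fieldnames`):

```python
# Label per-column NaN counts with the CSV column names
nan_counts = np.isnan(wdata_arr).sum(axis=0)
for name, count in zip(wdata_csv.fieldnames, nan_counts):
    print('%s: %d missing' % (name, count))
```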