NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size and in dimensions.
NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.
We will see throughout this notebook several reasons for NumPy's superior efficiency compared to Python's native data types, particularly lists. A crucial difference is how elements in a NumPy array are stored compared to list elements.
This is a trade-off: lists allow a container to store heterogeneous data types, while NumPy provides homogeneous data storage.
See this paragraph from your PDS textbook:
# import numpy library
import numpy as np
np.array()
is the core building block for creating arrays.
# from a list of int elements
np.array([1, 4, 2, 5, 3])
# you can specify the type
np.array([1, 2, 3, 4], dtype='float32')
# unlike lists, NumPy arrays can explicitly be multidimensional
multi_array = np.array([range(i, i+3) for i in [2, 4, 6]])
multi_array
We will see throughout this notebook why NumPy's arrays are more efficient than Python's built-in data structures, like lists.
A primary difference to keep in mind is how elements are stored: lists allow a container to hold heterogeneous data types, while NumPy provides homogeneous data storage.
# lists support heterogeneous data types. Python needs to store type information somewhere for every element!
list_ = ["beep", "false", False, 1, 1.2]
list_
# NumPy arrays only support homogeneous data types. The type information is stored once for the whole array!
numpy_boolean = np.array([[True, 0], [True, "TRUE"], [False, True]], dtype=bool)
numpy_boolean
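To see this homogeneity in action, here is a quick sketch of what NumPy does when you hand it mixed types: rather than erroring, it "upcasts" every element to one common dtype (the example values here are arbitrary).

```python
import numpy as np

# mixing ints and a float: NumPy upcasts everything to float
mixed_float = np.array([1, 2, 3.5])
print(mixed_float.dtype)  # float64

# mixing numbers and a string: everything becomes a string
mixed_str = np.array([1, 2.5, "three"])
print(mixed_str.dtype)    # a unicode string dtype, e.g. '<U32'
```

This is why the `dtype=bool` example above coerces `0`, `"TRUE"`, and `1` all into booleans: one array, one type.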
See this paragraph from your PDS textbook:
NumPy also offers a set of distinct methods to create arrays from scratch, instead of converting from a list. Some options:
numpy.arange()
will create arrays with regularly incrementing values.
numpy.linspace()
will create arrays with a specified number of elements, spaced equally between the specified beginning and end values.
numpy.zeros()
will create an array filled with 0 values with the specified shape.
numpy.ones()
will create an array filled with 1 values.
# arange: an incremental sequence
np.arange(0, 10)
# array of 10 equally spaced values between 1 and 5
np.linspace(1,5,10)
# array filled with 0 values with the specified shape.
np.zeros((3,3))
# an array filled with 1 values
np.ones((3,3))
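Two more creation routines in the same family are worth knowing; a quick sketch (the fill value 3.14 is just an arbitrary example):

```python
import numpy as np

# np.full: an array filled with an arbitrary constant, with the specified shape
print(np.full((2, 3), 3.14))

# np.eye: an identity matrix (1s on the diagonal, 0s elsewhere)
print(np.eye(3))
```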
NumPy allows for the generation of numbers from several known mathematical distributions. Some examples below:
numpy.random.random()
will create an array of uniformly distributed random values between 0 and 1.
numpy.random.normal()
will create an array of normally distributed random values with mean 0 and standard deviation 1.
numpy.random.randint()
will create an array of random integers from a pre-defined interval.
Other options that should be self-explanatory:
numpy.random.poisson()
numpy.random.binomial()
numpy.random.uniform()
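A quick sketch of those three "self-explanatory" options (the distribution parameters here are arbitrary examples):

```python
import numpy as np

# Poisson counts with rate lambda = 3
print(np.random.poisson(3, size=5))

# Binomial draws: 10 trials, success probability 0.5
print(np.random.binomial(10, 0.5, size=5))

# Uniform draws from the half-open interval [-1, 1)
print(np.random.uniform(-1, 1, size=5))
```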
# from a random sequence between 0 and 1
np.random.random((2, 5))
# from a normal distribution
np.random.normal(0, 1, (3, 3))
# random integers from a pre-defined interval
np.random.randint(0, 10, (3, 3))
ndarray.ndim
: gives the number of dimensions
ndarray.shape
: gives the size of each dimension
ndarray.size
: gives the total number of elements in the array
# generate a 3-d array from a nested list
array_3d = np.array([
    [  # first block along the first dimension
        [1, 2, 3, 4],
        [2, 3, 4, 1],
        [-1, 1, 2, 1]],
    [  # second block along the first dimension
        [1, 2, 3, 4],
        [2, 3, 4, 1],
        [-1, 1, 2, 1]]])
# information
print("ndim: ", array_3d.ndim)
print("shape:", array_3d.shape)
print("size: ", array_3d.size)
You can reshape an array, as long as the new dimensions hold the same total number of elements!
# reshape to a 4x6 2-d array
array_3d.reshape(4, 6)
# or a 6x4 2-d array
array_3d.reshape(6, 4)
## but the new shape must match the total size -- this raises an error
array_3d.reshape(6, 6)
## transpose an array. A very common operation in matrix algebra.
array_3d.transpose()
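One reshape convenience worth knowing: you can pass -1 for one dimension and NumPy will infer it from the total size. A minimal sketch (using a fresh array so it stands alone):

```python
import numpy as np

array_3d = np.arange(24).reshape(2, 3, 4)

# -1 tells NumPy to infer that dimension from the total number of elements
flat = array_3d.reshape(-1)      # shape (24,)
two_d = array_3d.reshape(4, -1)  # shape (4, 6)
print(flat.shape, two_d.shape)
```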
Numpy indexing is quite similar to list indexing in Python. And we covered lists and indexing last week.
In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired numerical index.
M[element_index]
For n-dimensional arrays, you can access elements with a tuple for row and column index.
M[row, column]
You can use the :
shortcut for slicing.
# create a 5x5 2-d array of random integers
X = np.random.randint(0, 100, (5, 5))
X
# index first row
X[0]
# index first column
X[:,0]
# index a specific cell
X[0,0]
# slice rows and columns
X[0:3,0:3]
# last row
X[-1,:]
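Array slices also accept a step, just like list slicing. A short sketch, using a small deterministic array so the output is easy to follow:

```python
import numpy as np

X = np.arange(25).reshape(5, 5)

# every other row
print(X[::2, :])

# reverse the row order
print(X[::-1, :])
```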
As we just saw, numpy
makes your life easier for accessing elements in rectangular data -- compared to nested lists.
In the same vein, numpy
uses its easy indexing scheme to facilitate reassignment of values.
# Start by creating an array
X = np.zeros(50).reshape(10,5)
X
# Reassign data values by referencing positions
X[0,0] = 999
X
# Reassign whole ranges of values
X[0,:] = 999
X
# by row
X[:,0] = 999
X
# Reassignment using boolean values.
D = np.random.randn(50).reshape(10,5).round(1)
D
# reassignment
D[D > 0] = 1
D[D <= 0] = 0
D
# Using the "ifelse()-like" np.where method
D = np.random.randn(50).reshape(10,5).round(1) # Generate some random numbers again
D # Before
np.where(D>0,1,0) # After
# np.select allows for element-wise selection and reassignment, just like case_when from R
# basic usage: np.select(conditions, choices, default=0)
# create conditions
conditions = [D < 0, D == 0, D > 0]
# element wise reassignment
choices = [-1, 0, 1]
# run np.select
np.select(conditions, choices, default='unknown')
We can easily stack and grow numpy arrays. These are the main functions for concatenating arrays:
np.concatenate([array,array],axis=0)
: concatenate by rows
np.concatenate([array,array],axis=1)
: concatenate by columns
The same behavior can be achieved with np.vstack([array,array])
or np.hstack([array,array]).
# create arrays
X = np.random.randint(0, 100, (5, 2))
Y = np.random.randint(0, 100, (5, 2))
# rbind
np.concatenate([X,Y],axis=0)
# cbind
np.concatenate([X,Y],axis=1)
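To confirm the equivalence mentioned above, here is a quick sketch showing that np.vstack and np.hstack produce exactly the same results as the corresponding np.concatenate calls:

```python
import numpy as np

X = np.random.randint(0, 100, (5, 2))
Y = np.random.randint(0, 100, (5, 2))

# vstack stacks by rows, same as concatenate with axis=0
print(np.array_equal(np.vstack([X, Y]), np.concatenate([X, Y], axis=0)))  # True

# hstack stacks by columns, same as concatenate with axis=1
print(np.array_equal(np.hstack([X, Y]), np.concatenate([X, Y], axis=1)))  # True
```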
A subtle but interesting point about numpy arrays concerns the default behavior of slicing. When we slice an array we do not copy the array; rather, we get a "view" of the array.
Why does this matter? Any change to the view will affect the original array.
Solution: Make a copy.
As noted in the reading for this week:
One important—and extremely useful—thing to know about array slices is that they return views rather than copies of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies
We need to use the .copy()
method from numpy to create a new, independent array.
# from lists
x = [1, 2, 3]
y = x[:]  # slicing a list is enough to make a copy
# modify
y[0] = 100
# print
print(y, x)
# for arrays
X = np.random.randint(0, 100, (1, 5))
# slice
X_sub = X[:3]
# modify
X_sub[0][0] = 1000
# print
print(X, X_sub)
# need to copy
# for arrays
X = np.random.randint(0, 100, (1, 5))
# slice.copy()
X_sub = X[:3].copy()
# modify
X_sub[0][0] = 1000
# print
print(X, X_sub)
A critical reason for numpy's popularity among data scientists is its efficiency. NumPy provides an easy-to-use and flexible interface to optimized computation with arrays of data. The key to making it fast is to use built-in (or easy to implement) vectorized operations.
What are vectorized functions? A vectorized function allows for efficient processing of an entire array or collection of data elements in a single operation. In plain English, it applies a particular operation in one shot over a sequence of objects. Vectorized functions are efficient because they allow us to avoid looping through entire collections of data.
Let's compare the performance of a vectorized function and a loop, using an example from your reading for this week.
import numpy as np
rng = np.random.default_rng(seed=1701)
def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        # notice the explicit loop over every element
        output[i] = 1.0 / values[i]
    return output
values = rng.integers(1, 10, size=5)
compute_reciprocals(values)
# simple implementation
big_array = rng.integers(1, 100, size=1000000)
%timeit -n 1000 compute_reciprocals(big_array)
# vectorized implementation
%timeit -n 1000 (1.0 / big_array) # the `/` here dispatches to np.divide, which is a vectorized function
NumPy provides built-in vectorized routines as methods for np.arrays
. This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.
These built-in vectorized methods are called ufuncs
(or "universal functions"). NumPy comes baked in with a large number of those vectorized operations. See here for a detailed list.
The google colab notebook from your reading also provides an in-depth coverage of universal functions in numpy
. Check it out!
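As a taste of what ufuncs cover, here is a brief sketch: the familiar arithmetic operators on arrays are themselves ufuncs under the hood, and mathematical functions like exp and log are applied element-wise.

```python
import numpy as np

x = np.arange(1, 5)

# the `+` operator on arrays dispatches to the np.add ufunc
print(np.array_equal(x + 2, np.add(x, 2)))  # True

# math ufuncs apply element-wise across the whole array
print(np.exp(x))
print(np.log(x))
```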
We can take advantage of numpy's vectorized approach and very easily vectorize our own user-defined functions.
Consider the following function that yields a different string depending on whether input a
is larger or smaller than input b
.
def bigsmall(a, b):
    if a > b:
        return "A is larger"
    else:
        return "B is larger"
bigsmall(5,6)
# Create a vectorized version of the function
vec_bigsmall = np.vectorize(bigsmall)
vec_bigsmall
The vectorization here brings two main advantages:
# Advantage 1: avoid writing loops. The plain function fails on a list (comparing a list to a number raises a TypeError)
bigsmall([0,2,5,7,0],4)
# vectorize
vec_bigsmall([0,2,5,7,0],4)
# Advantage 2: vectorized means faster
# write a function to run element-wise
def bigsmall_el_wise(a_collection, b):
    container = []
    for a in a_collection:
        if a > b:
            container.append("A is larger")
        else:
            container.append("B is larger")
    return container
# Generating some random data
a_collection = np.random.rand(1000000)
b = 0.5
%timeit -n 1000 vec_bigsmall(a_collection, b)
%timeit -n 1000 bigsmall_el_wise(a_collection, b)
Broadcasting makes it possible for operations to be performed on arrays of mismatched shapes.
Broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.
For example, say we have a numpy array of dimensions (5,1)
Now say we wanted to add 5 to the values in this array:
$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + 5 $$
Broadcasting "pads" the scalar 5 (which has shape (1,1)) and stretches it so that it has the same dimensions as the larger array with which the computation is being performed.
$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + \begin{bmatrix} 5\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\end{bmatrix} $$
$$ \begin{bmatrix} 1 + 5\\2 + 5\\3 + 5\\4 + 5\\5 + 5\end{bmatrix} $$
$$ \begin{bmatrix} 6\\7\\8\\9\\10\end{bmatrix} $$
A = np.array([1,2,3,4,5])
A + 5
By 'broadcast', we mean that the smaller array is made to match the size of the larger array in order to allow for element-wise manipulations.
A general rule of thumb: all corresponding dimensions of the two arrays must be equal, or one of them must be 1.
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays (from reading):
If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
np.arange(3) + 5
np.ones((3,3)) + np.arange(3)
np.arange(3).reshape(3,1) + np.arange(3)
Example of dimensional disagreement.
np.ones((4,7))
np.ones((4,7)) + np.zeros((5, 9))  # incompatible shapes: this raises an error
np.ones((4,7)) + np.zeros((1, 7))  # compatible: the dimension of size 1 is stretched to 4
M = np.ones((3, 2))
M
a = np.arange(3)
a
M + a
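As a practical use of these rules, a common data-science pattern is centering each column of a data matrix by subtracting its column mean. A sketch (the (10, 3) shape is just an arbitrary example):

```python
import numpy as np

X = np.random.random((10, 3))

# column means have shape (3,); broadcasting stretches them across all 10 rows
X_centered = X - X.mean(axis=0)

# each column of the centered matrix now has (near-)zero mean
print(X_centered.mean(axis=0).round(10))
```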
Numpy provides a data class for missing values (i.e. nan
== "Not a Number", see here)
Y = np.random.randint(1,10,25).reshape(5,5) + .0
Y
Y[Y > 5] = np.nan
Y
type(np.nan)
# scan for missing values
np.isnan(Y)
~np.isnan(Y) # are not NAs
When we have missing values, we'll run into issues when computing across the data matrix.
np.mean(Y)
To get around this, we need to use special versions of the methods that compensate for the existence of nan
.
np.nanmean(Y)
np.nanmean(Y,axis=0)
# Mean impute the missing values
Y[np.where(np.isnan(Y))] = np.nanmean(Y)
Y
Out of the box, numpy arrays can only handle one data class at a time. But most of the time we will work with heterogeneous data types -- a spreadsheet with name, age, gender, address, etc.
This short section shows you how to use NumPy's structured arrays
to get around this limitation.
Let's start by creating some lists. Imagine these are columns in your data frame.
# lists
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
# nest these lists
nested_list = [name, age, weight]
nested_list
# convert to a numpy array
array_nested_list = np.array(nested_list).T
array_nested_list
# see data type - all data treated as strings.
array_nested_list.dtype
In case you wish to preserve the data type of each variable, you could use structured arrays. These are almost like a less flexible dictionary.
You need to follow three steps:
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i', 'f')})
# see the skeleton of the structure
data
# add information
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
# then you can access fields pretty much like in a dictionary
data["name"]
Though it is possible to deal with heterogeneous data using numpy, there is a lot of overhead in constructing such a data object.
As such, we'll use Pandas series and DataFrames to deal with heterogeneous data.
!jupyter nbconvert _week_4_numpy.ipynb --to html --template classic