NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size and in dimensions.
NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.
We will see throughout this notebook several reasons for NumPy's superior efficiency compared to Python's native data types, particularly lists. A crucial difference is how elements in a NumPy array are stored compared to list elements.
This is a trade-off: lists allow a container to store heterogeneous data types, while NumPy provides homogeneous data storage.
See this paragraph from your PDS textbook:
# import numpy library
import numpy as np
np.array()
is the core building block for creating arrays.
# from a list of int elements
np.array([1, 4, 2, 5, 3])
# you can specify the type
np.array([1, 2, 3, 4], dtype='float32')
# unlike lists, NumPy arrays can explicitly be multidimensional
multi_array = np.array([range(i, i+3) for i in [2, 4, 6]])
multi_array
We will see throughout this notebook why NumPy's arrays are more efficient than Python's built-in data structures, like lists.
A primary difference to keep in mind is how elements are stored: lists allow a container to hold heterogeneous data types, while NumPy provides homogeneous data storage.
# lists support heterogeneous data types. Python needs to store type information somewhere for every element!
list_ = ["beep", "false", False, 1, 1.2]
list_
# NumPy arrays only support homogeneous data types. The type information is stored once for the whole array!
numpy_boolean = np.array([[True, 0], [True, "TRUE"], [False, True]], dtype=bool)
numpy_boolean
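To see this homogeneity in action, here is a quick sketch of what NumPy does when you hand it mixed types: rather than erroring, it "upcasts" every element to one common dtype (the example values here are arbitrary).

```python
import numpy as np

# mixing ints and a float: NumPy upcasts everything to float
mixed_float = np.array([1, 2, 3.5])
print(mixed_float.dtype)  # float64

# mixing numbers and a string: everything becomes a string
mixed_str = np.array([1, 2.5, "three"])
print(mixed_str.dtype)    # a unicode string dtype, e.g. '<U32'
```

This is why the `dtype=bool` example above coerces `0`, `"TRUE"`, and `1` all into booleans: one array, one type.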
See this paragraph from your PDS textbook:
NumPy also offers a set of distinct methods to create arrays from scratch, instead of converting from a list. Some options:
numpy.arange()
will create arrays with regularly incrementing values.
numpy.linspace()
will create arrays with a specified number of elements, spaced equally between the specified beginning and end values.
numpy.zeros()
will create an array filled with 0 values with the specified shape.
numpy.ones()
will create an array filled with 1 values.
# arange: an incremental sequence
np.arange(0, 10)
# array of 10 equally spaced values between 1 and 5
np.linspace(1,5,10)
# array filled with 0 values with the specified shape.
np.zeros((3,3))
# an array filled with 1 values
np.ones((3,3))
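Two more creation routines in the same family are worth knowing; a quick sketch (the fill value 3.14 is just an arbitrary example):

```python
import numpy as np

# np.full: an array filled with an arbitrary constant, with the specified shape
print(np.full((2, 3), 3.14))

# np.eye: an identity matrix (1s on the diagonal, 0s elsewhere)
print(np.eye(3))
```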
NumPy allows for the generation of numbers from several known mathematical distributions. Some examples below:
numpy.random.random()
will create an array of uniformly distributed random values between 0 and 1.
numpy.random.normal()
will create an array of normally distributed random values with mean 0 and standard deviation 1.
numpy.random.randint()
will create an array of random integers from a pre-defined interval.
Other options that should be self-explanatory:
numpy.random.poisson()
numpy.random.binomial()
numpy.random.uniform()
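A quick sketch of those three "self-explanatory" options (the distribution parameters here are arbitrary examples):

```python
import numpy as np

# Poisson counts with rate lambda = 3
print(np.random.poisson(3, size=5))

# Binomial draws: 10 trials, success probability 0.5
print(np.random.binomial(10, 0.5, size=5))

# Uniform draws from the half-open interval [-1, 1)
print(np.random.uniform(-1, 1, size=5))
```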
# from a random sequence between 0 and 1
np.random.random((2, 5))
# from a normal distribution
np.random.normal(0, 1, (3, 3))
# random integers from a pre-defined interval
np.random.randint(0, 10, (3, 3))
ndarray.ndim
: gives the number of dimensions
ndarray.shape
: gives the size of each dimension
ndarray.size
: gives the total number of elements in the array
# generate a 3-d array from a nested list
array_3d = np.array([
    [  # first block along the first dimension
        [1, 2, 3, 4],
        [2, 3, 4, 1],
        [-1, 1, 2, 1]],
    [  # second block along the first dimension
        [1, 2, 3, 4],
        [2, 3, 4, 1],
        [-1, 1, 2, 1]]])
# information
print("ndim: ", array_3d.ndim)
print("shape:", array_3d.shape)
print("size: ", array_3d.size)
You can reshape an array, as long as the new dimensions hold the same total number of elements!
# reshape to a 4x6 2-d array
array_3d.reshape(4, 6)
# or a 6x4 2-d array
array_3d.reshape(6, 4)
## but the new shape must match the total size -- this raises an error
array_3d.reshape(6, 6)
## transpose an array. A very common operation in matrix algebra.
array_3d.transpose()
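One reshape convenience worth knowing: you can pass -1 for one dimension and NumPy will infer it from the total size. A minimal sketch (using a fresh array so it stands alone):

```python
import numpy as np

array_3d = np.arange(24).reshape(2, 3, 4)

# -1 tells NumPy to infer that dimension from the total number of elements
flat = array_3d.reshape(-1)      # shape (24,)
two_d = array_3d.reshape(4, -1)  # shape (4, 6)
print(flat.shape, two_d.shape)
```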
Numpy indexing is quite similar to list indexing in Python. And we covered lists and indexing last week.
In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired numerical index.
M[element_index]
For n-dimensional arrays, you can access elements with a tuple for row and column index.
M[row, column]
You can use the :
shortcut for slicing.
# create a 5x5 2-d array of random integers
X = np.random.randint(0, 100, (5, 5))
X
# index first row
X[0]
# index first column
X[:,0]
# index a specific cell
X[0,0]
# slice rows and columns
X[0:3,0:3]
# last row
X[-1,:]
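Array slices also accept a step, just like list slicing. A short sketch, using a small deterministic array so the output is easy to follow:

```python
import numpy as np

X = np.arange(25).reshape(5, 5)

# every other row
print(X[::2, :])

# reverse the row order
print(X[::-1, :])
```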
As we just saw, numpy
makes your life easier for accessing elements in rectangular data -- compared to nested lists.
In the same vein, numpy
uses its easy indexing scheme to facilitate reassignment of values.
# Start by creating an array
X = np.zeros(50).reshape(10,5)
X
# Reassign data values by referencing positions
X[0,0] = 999
X
# Reassign whole ranges of values
X[0,:] = 999
X
# by row
X[:,0] = 999
X
# Reassignment using boolean values.
D = np.random.randn(50).reshape(10,5).round(1)
D
# reassignment
D[D > 0] = 1
D[D <= 0] = 0
D
# Using the "ifelse()-like" np.where method
D = np.random.randn(50).reshape(10,5).round(1) # Generate some random numbers again
D # Before
np.where(D>0,1,0) # After
# np.select allows for element-wise selection and reassignment, just like case_when from R
# basic usage: np.select(conditions, choices, default=0)
# create conditions
conditions = [D < 0, D == 0, D > 0]
# element wise reassignment
choices = [-1, 0, 1]
# run np.select
np.select(conditions, choices, default='unknown')
We can easily stack and grow numpy arrays. These are the main functions for concatenating arrays:
np.concatenate([array,array],axis=0)
: concatenate by rows
np.concatenate([array,array],axis=1)
: concatenate by columns
The same behavior can be achieved with np.vstack([array,array])
or np.hstack([array,array]).
# create arrays
X = np.random.randint(0, 100, (5, 2))
Y = np.random.randint(0, 100, (5, 2))
# rbind
np.concatenate([X,Y],axis=0)
# cbind
np.concatenate([X,Y],axis=1)
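To confirm the equivalence mentioned above, here is a quick sketch showing that np.vstack and np.hstack produce exactly the same results as the corresponding np.concatenate calls:

```python
import numpy as np

X = np.random.randint(0, 100, (5, 2))
Y = np.random.randint(0, 100, (5, 2))

# vstack stacks by rows, same as concatenate with axis=0
print(np.array_equal(np.vstack([X, Y]), np.concatenate([X, Y], axis=0)))  # True

# hstack stacks by columns, same as concatenate with axis=1
print(np.array_equal(np.hstack([X, Y]), np.concatenate([X, Y], axis=1)))  # True
```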
A subtle but interesting point about numpy arrays concerns the default behavior of slicing. When we slice an array we do not copy the array; rather, we get a "view" of the array.
Why does this matter? Any change to the view will affect the original array.
Solution: Make a copy.
As noted in the reading for this week:
One important—and extremely useful—thing to know about array slices is that they return views rather than copies of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies
We need to use the .copy()
method from numpy to create a new, independent array.
# from lists
x = [1, 2, 3]
y = x[:]  # slicing a list is enough to make a copy
# modify
y[0] = 100
# print
print(y, x)
# for arrays
X = np.random.randint(0, 100, (1, 5))
# slice
X_sub = X[:3]
# modify
X_sub[0][0] = 1000
# print
print(X, X_sub)
# need to copy
# for arrays
X = np.random.randint(0, 100, (1, 5))
# slice.copy()
X_sub = X[:3].copy()
# modify
X_sub[0][0] = 1000
# print
print(X, X_sub)
A critical reason for numpy's popularity among data scientists is its efficiency. NumPy provides an easy-to-use and flexible interface to optimized computation with arrays of data. The key to making it fast is to use built-in (or easy to implement) vectorized operations.
What are vectorized functions? A vectorized function allows for efficient processing of an entire array or collection of data elements in a single operation. In plain English, it applies a particular operation in one shot over a sequence of objects. Vectorized functions are efficient because they allow us to avoid looping through entire collections of data.
Let's compare the performance of a vectorized function and a loop, using an example from your reading for this week.
import numpy as np
rng = np.random.default_rng(seed=1701)
def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        # notice the explicit loop over every element
        output[i] = 1.0 / values[i]
    return output
values = rng.integers(1, 10, size=5)
compute_reciprocals(values)
# simple implementation
big_array = rng.integers(1, 100, size=1000000)
%timeit -n 1000 compute_reciprocals(big_array)
# vectorized implementation
%timeit -n 1000 (1.0 / big_array) # the `/` here dispatches to np.divide, which is a vectorized function
NumPy provides built-in vectorized routines as methods for np.arrays
. This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.
These built-in vectorized methods are called ufuncs
(or "universal functions"). NumPy comes baked in with a large number of those vectorized operations. See here for a detailed list.
The google colab notebook from your reading also provides an in-depth coverage of universal functions in numpy
. Check it out!
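As a taste of what ufuncs cover, here is a brief sketch: the familiar arithmetic operators on arrays are themselves ufuncs under the hood, and mathematical functions like exp and log are applied element-wise.

```python
import numpy as np

x = np.arange(1, 5)

# the `+` operator on arrays dispatches to the np.add ufunc
print(np.array_equal(x + 2, np.add(x, 2)))  # True

# math ufuncs apply element-wise across the whole array
print(np.exp(x))
print(np.log(x))
```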
We can take advantage of numpy's vectorized approach and very easily vectorize our own user-defined functions.
Consider the following function that yields a different string depending on whether input a
is larger or smaller than input b
.
def bigsmall(a, b):
    if a > b:
        return "A is larger"
    else:
        return "B is larger"
bigsmall(5,6)
# Create a vectorized version of the function
vec_bigsmall = np.vectorize(bigsmall)
vec_bigsmall
The vectorization here brings two main advantages:
# Advantage 1: avoid writing loops. The plain function fails on a list (comparing a list to a number raises a TypeError)
bigsmall([0,2,5,7,0],4)
# vectorize
vec_bigsmall([0,2,5,7,0],4)
# Advantage 2: vectorized means faster
# write a function to run element-wise
def bigsmall_el_wise(a_collection, b):
    container = []
    for a in a_collection:
        if a > b:
            container.append("A is larger")
        else:
            container.append("B is larger")
    return container
# Generating some random data
a_collection = np.random.rand(1000000)
b = 0.5
%timeit -n 1000 vec_bigsmall(a_collection, b)
%timeit -n 1000 bigsmall_el_wise(a_collection, b)
Broadcasting makes it possible for operations to be performed on arrays of mismatched shapes.
Broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.
For example, say we have a numpy array of dimensions (5,1)
Now say we wanted to add 5 to the values in this array:
$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + 5 $$
Broadcasting "pads" the scalar 5 (which has shape (1,1)) and stretches it so that it has the same dimensions as the larger array with which the computation is being performed.
$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + \begin{bmatrix} 5\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\end{bmatrix} $$
$$ \begin{bmatrix} 1 + 5\\2 + 5\\3 + 5\\4 + 5\\5 + 5\end{bmatrix} $$
$$ \begin{bmatrix} 6\\7\\8\\9\\10\end{bmatrix} $$
A = np.array([1,2,3,4,5])
A + 5
By 'broadcast', we mean that the smaller array is made to match the size of the larger array in order to allow for element-wise manipulations.
A general rule of thumb: all corresponding dimensions of the two arrays must be equal, or one of them must be 1.
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays (from reading):
If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
np.arange(3) + 5
np.ones((3,3)) + np.arange(3)
np.arange(3).reshape(3,1) + np.arange(3)
Example of dimensional disagreement.
np.ones((4,7))
np.ones((4,7)) + np.zeros((5, 9))  # incompatible shapes: this raises an error
np.ones((4,7)) + np.zeros((1, 7))  # compatible: the dimension of size 1 is stretched to 4
M = np.ones((3, 2))
M
a = np.arange(3)
a
M + a
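As a practical use of these rules, a common data-science pattern is centering each column of a data matrix by subtracting its column mean. A sketch (the (10, 3) shape is just an arbitrary example):

```python
import numpy as np

X = np.random.random((10, 3))

# column means have shape (3,); broadcasting stretches them across all 10 rows
X_centered = X - X.mean(axis=0)

# each column of the centered matrix now has (near-)zero mean
print(X_centered.mean(axis=0).round(10))
```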
Numpy provides a data class for missing values (i.e. nan
== "Not a Number", see here)
Y = np.random.randint(1,10,25).reshape(5,5) + .0
Y
Y[Y > 5] = np.nan
Y
type(np.nan)
# scan for missing values
np.isnan(Y)
~np.isnan(Y) # are not NAs
When we have missing values, we'll run into issues when computing across the data matrix.
np.mean(Y)
To get around this, we need to use special versions of the methods that compensate for the existence of nan
.
np.nanmean(Y)
np.nanmean(Y,axis=0)
# Mean impute the missing values
Y[np.where(np.isnan(Y))] = np.nanmean(Y)
Y
Out of the box, numpy arrays can only handle one data class at a time. But most of the time we will work with heterogeneous data types -- a spreadsheet with name, age, gender, address, etc.
This short section shows you how to use NumPy's structured arrays
to get around this limitation.
Let's start by creating some lists. Imagine these are columns in your data frame.
# lists
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
# nest these lists
nested_list = [name, age, weight]
nested_list
# convert to a numpy array
array_nested_list = np.array(nested_list).T
array_nested_list
# see data type - all data treated as strings.
array_nested_list.dtype
In case you wish to preserve the data type of each variable, you could use structured arrays. These are almost like a less flexible dictionary.
You need to follow three steps:
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i', 'f')})
# see the skeleton of the structure
data
# add information
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
# then you can access fields pretty much like in a dictionary
data["name"]
Though it is possible to deal with heterogeneous data using numpy, there is a lot of overhead in constructing such a data object.
As such, we'll use Pandas series and DataFrames to deal with heterogeneous data.
!jupyter nbconvert _week_4_numpy.ipynb --to html --template classic