<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'>Introduction to Numpy <br><br>
Tiago Ventura</font></center></h1>

---

## Learning goals

In this notebook, we will cover: 

- Introduction to numpy
- Vectorization
- Broadcasting


### Numpy

NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size and in dimensions. 

NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.

### Why are numpy arrays more efficient?

Throughout this notebook we will see several reasons why NumPy is more efficient than Python's native data types, particularly lists. A crucial difference is how the elements of a numpy array are stored compared to list elements. 

- Numpy leans toward less flexibility and more efficiency. 
- Lists give you more flexibility and less efficiency. 

This is a trade-off between a container that can store **heterogeneous** data types, as lists do, and the **homogeneous** data storage provided by numpy. 

See this paragraph from your PDS textbook: 

<div class="alert alert-block alert-info"> 

At the implementation level, the array essentially contains a single pointer to one contiguous block of data. The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier. Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type. Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.

</div>    
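
To make the storage difference concrete, here is a rough sketch that compares the memory used by a list and by an array holding the same 1,000 integers. The variable names are illustrative, and the list figure relies on `sys.getsizeof`, which is specific to CPython, so treat the numbers as approximate.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers, and every element is a separate Python int object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array stores the raw values in one contiguous buffer: 8 bytes per int64.
array_bytes = np_array.nbytes

print("list (approx.):", list_bytes, "bytes")
print("array buffer:  ", array_bytes, "bytes")
```

On a typical CPython installation the list total should come out several times larger than the array buffer.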

## Basics of Numpy

In [1]:
# import numpy library
import numpy as np

#### Creating array from Python lists

`np.array()` is the core building block for creating arrays.

- input: list
- output: numpy array

In [2]:
# from a list of int elements
np.array([1, 4, 2, 5, 3])

array([1, 4, 2, 5, 3])

In [3]:
# you can specify the type
np.array([1, 2, 3, 4], dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

In [4]:
# unlike lists, NumPy arrays can explicitly be multidimensional
multi_array = np.array([range(i, i+3) for i in [2, 4, 6]])
multi_array

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

### Numpy Arrays vs Built-in Lists

In [6]:
# lists support heterogeneous data types, so type information must be stored for every element!
list_ =  ["beep", "false", False, 1, 1.2]
list_

# numpy arrays only support homogeneous data types: the type is stored once for the whole array!
numpy_boolean = np.array([[True, 0], [True, "TRUE"], [False, True]], dtype=bool)
numpy_boolean


In [8]:
# without defining the data type
numpy_boolean = np.array([[True, 0], [True, "TRUE"], [False, True]])

# will create a string type
numpy_boolean

array([['True', '0'],
       ['True', 'TRUE'],
       ['False', 'True']], dtype='<U21')
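
The coercion above follows from the single-dtype rule: when no `dtype` is given, NumPy picks one type that can represent every element. A minimal sketch (variable names are illustrative):

```python
import numpy as np

ints_and_floats = np.array([1, 2, 3.5])   # integers are upcast to float
with_string = np.array([1, 2.5, "a"])     # every element becomes a string

print(ints_and_floats.dtype)              # float64
print(with_string.dtype.kind)             # 'U' means unicode string
```

Checking `.dtype` right after building an array is a cheap way to catch this kind of silent conversion.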

#### Creating Arrays from Scratch

Numpy also offers a set of functions to create arrays from scratch, instead of converting from a list. Some options: 

- `numpy.arange()` will create arrays with regularly incrementing values
- `numpy.linspace()` will create arrays with a specified number of elements, and spaced equally between the specified beginning and end values.
- `numpy.zeros()` will create an array filled with 0 values with the specified shape.
- `numpy.ones()` will create an array filled with 1 values


In [9]:
# arange: an incrementing sequence
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
# array with equally spaced intervals
np.linspace(1,5,10) 

array([1.        , 1.44444444, 1.88888889, 2.33333333, 2.77777778,
       3.22222222, 3.66666667, 4.11111111, 4.55555556, 5.        ])

In [11]:
# array filled with 0 values with the specified shape.
np.zeros((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [12]:
# an array filled with 1 values
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
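
`np.arange()` and `np.linspace()` are easy to mix up. The short sketch below (illustrative variable names) shows the difference: with `arange` you choose the step and the stop value is excluded, while with `linspace` you choose the number of points and the stop value is included.

```python
import numpy as np

# arange: fixed step, stop value excluded
by_step = np.arange(0, 1, 0.25)

# linspace: fixed number of points, stop value included
by_count = np.linspace(0, 1, 5)

print(by_step)    # 4 values, ending at 0.75
print(by_count)   # 5 values, ending at 1.0
```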

#### Random numbers with numpy.random

Numpy allows for the generation of random numbers from several well-known probability distributions. Some examples below: 

- `numpy.random.random()` will create an array of uniformly distributed random values between 0 and 1
- `numpy.random.normal()` will create an array of normally distributed random values (by default, with mean 0 and standard deviation 1)
- `numpy.random.randint()` will create an array of random integers from a pre-defined interval

Other options that should be self-explanatory: 

- `numpy.random.poisson()`
- `numpy.random.binomial()`
- `numpy.random.uniform()`

In [13]:
# from a random sequence between 0 and 1
np.random.random((2, 5))

array([[0.48692744, 0.03309944, 0.59179774, 0.11386525, 0.3140595 ],
       [0.57067889, 0.81819492, 0.91066097, 0.65898677, 0.42020137]])

In [14]:
# from a normal distribution
np.random.normal(0, 1, (3, 3))

array([[ 0.57500904,  0.56405735, -0.40547543],
       [ 1.40098955,  1.65582837,  1.73618222],
       [ 1.23867901,  0.43744013, -0.29132206]])

In [15]:
# random integers from a pre-defined interval
np.random.randint(0, 10, (3, 3))

array([[4, 7, 3],
       [2, 9, 6],
       [1, 4, 3]])
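
Because these draws are random, your output will differ from the arrays printed above. If you need reproducible results, one option is to create a seeded generator with `np.random.default_rng()`. The sketch below uses an arbitrary seed of 42 and illustrative variable names.

```python
import numpy as np

# Two generators created with the same seed produce the same draws.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

draw_a = rng_a.integers(0, 10, size=(3, 3))
draw_b = rng_b.integers(0, 10, size=(3, 3))

print(np.array_equal(draw_a, draw_b))
```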

### Retrieving attributes from Arrays

- `array.ndim`: returns the number of dimensions
- `array.shape`: returns the size of each dimension
- `array.size`: returns the total number of elements in the array

In [16]:
# generate a 3-d array from a nested list
array_3d = np.array([ # first element of 1 dimension
                    [ 
                    [1,2,3,4],
                    [2,3,4,1],
                    [-1,1,2,1]],
                    [# second element of 1 dimension
                    [1,2,3,4],
                    [2,3,4,1],
                    [-1,1,2,1]]])

# information
print("ndim: ", array_3d.ndim)
print("shape:", array_3d.shape)
print("size: ", array_3d.size)

ndim:  3
shape: (2, 3, 4)
size:  24


### Reshaping Arrays

You can reshape an array, as long as the new dimensions hold exactly the same number of elements!

In [17]:
# new 2d array
array_3d.reshape(4, 6)

array([[ 1,  2,  3,  4,  2,  3],
       [ 4,  1, -1,  1,  2,  1],
       [ 1,  2,  3,  4,  2,  3],
       [ 4,  1, -1,  1,  2,  1]])

In [18]:
# or another  array
array_3d.reshape(6, 4)

array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  1],
       [-1,  1,  2,  1],
       [ 1,  2,  3,  4],
       [ 2,  3,  4,  1],
       [-1,  1,  2,  1]])

In [19]:
## but you need to provide the proper dimension
array_3d.reshape(6, 6)

ValueError: cannot reshape array of size 24 into shape (6,6)

In [22]:
## transpose an array. Very common property in matrix operations. 
array_3d.reshape(6, 4).transpose()

array([[ 1,  2, -1,  1,  2, -1],
       [ 2,  3,  1,  2,  3,  1],
       [ 3,  4,  2,  3,  4,  2],
       [ 4,  1,  1,  4,  1,  1]])
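
A small convenience worth knowing: you can pass `-1` for one dimension and let NumPy infer it from the total size. A minimal sketch with an illustrative 24-element array:

```python
import numpy as np

arr = np.arange(24)

# -1 asks NumPy to work out the missing dimension (here, 6 rows)
inferred = arr.reshape(-1, 4)

print(inferred.shape)     # (6, 4)
print(inferred.T.shape)   # (4, 6); .T is shorthand for .transpose()
```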

### Array Indexing and Slicing

Numpy indexing is quite similar to list indexing in Python. And we covered lists and indexing last week.

In a one-dimensional array, you can access the ith value (**counting from zero**) by specifying the desired numerical index. 

```
M[element_index]
```

For two-dimensional arrays, you can access elements with a comma-separated pair of row and column indices. 

```
M[row, column]
```

You can use the `:` shortcut for slicing.

In [23]:
# create a 5x5 two-dimensional array
X = np.random.randint(0, 100, (5, 5))
X

array([[ 9, 72, 91, 96, 88],
       [71, 66, 85, 34, 63],
       [47, 14, 46, 15, 20],
       [74, 80, 54,  4,  7],
       [26, 89, 85, 36, 92]])

In [24]:
# index first row 
X[0] 

array([ 9, 72, 91, 96, 88])

In [25]:
# index first column
X[:,0]

array([ 9, 71, 47, 74, 26])

In [26]:
# index a specific cell 
X[0,0] 

9

In [27]:
# slice rows and columns
X[0:3,0:3] 

array([[ 9, 72, 91],
       [71, 66, 85],
       [47, 14, 46]])

In [28]:
# last row
X[-1,:] 

array([26, 89, 85, 36, 92])
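
Slices also accept a step, written `start:stop:step`, just as with lists. The sketch below uses a deterministic 5x5 array (an illustrative stand-in for the random `X` above) so the results are easy to verify by eye.

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)

# rows 1 to 3, every other column (columns 0, 2, 4)
every_other_col = grid[1:4, ::2]

# last column, read from the bottom row to the top row
last_col_reversed = grid[::-1, -1]

print(every_other_col)
print(last_col_reversed)
```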

### Reassignment

As we just saw, `numpy` makes it easier to access elements of rectangular data than nested lists do. 

In the same vein, `numpy` uses its simple indexing scheme to facilitate the reassignment of values. 

<div class="alert alert-block alert-danger"> 

**Importance:**
    
Using numpy for reassignment will be at the core of your data wrangling work with pandas!
    
</div>

In [29]:
# Start by creating an array
X = np.zeros(50).reshape(10,5)
X

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [30]:
# Reassign data values by referencing positions
X[0,0] = 999
X

array([[999.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.]])

In [32]:
# Reassign whole ranges of values (here, the entire first row)
X[0,:] = 999
X

array([[999., 999., 999., 999., 999.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.]])

In [33]:
# by column: reassign the entire first column
X[:,0] = 999
X

array([[999., 999., 999., 999., 999.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.]])

In [34]:
# Reassignment using boolean values. 
D = np.random.randn(50).reshape(10,5).round(1)
D

array([[ 3.1,  0.3,  1.2,  1.7,  1.4],
       [ 0.4, -0.1,  1. ,  0.5, -1.3],
       [-0. ,  0.3,  1.4,  0.6, -0.7],
       [ 1.3, -0.1, -0.1, -0.4,  0.7],
       [-1.4, -0.7, -0.5,  0.6, -2. ],
       [ 0.4,  0.8, -0.5,  1.1, -0.4],
       [ 0.4, -0.1, -1.2, -0. ,  1.3],
       [-0.2, -0.4,  2. ,  0.1,  0.3],
       [ 0.9,  1.5, -0.4,  1.4, -0.3],
       [ 0.6,  1.3, -0.7,  1.1,  0. ]])

In [35]:
# reassignment
D[D > 0] = 1
print(D)

[[ 1.   1.   1.   1.   1. ]
 [ 1.  -0.1  1.   1.  -1.3]
 [-0.   1.   1.   1.  -0.7]
 [ 1.  -0.1 -0.1 -0.4  1. ]
 [-1.4 -0.7 -0.5  1.  -2. ]
 [ 1.   1.  -0.5  1.  -0.4]
 [ 1.  -0.1 -1.2 -0.   1. ]
 [-0.2 -0.4  1.   1.   1. ]
 [ 1.   1.  -0.4  1.  -0.3]
 [ 1.   1.  -0.7  1.   0. ]]


In [36]:
D[D <= 0] = 0
D

array([[1., 1., 1., 1., 1.],
       [1., 0., 1., 1., 0.],
       [0., 1., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.],
       [1., 1., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 1., 1.],
       [1., 1., 0., 1., 0.],
       [1., 1., 0., 1., 0.]])

In [37]:
# Using np.where, an "ifelse()"-like function
D = np.random.randn(50).reshape(10,5).round(1) # Generate some random numbers again
D # Before 
np.where(D>0,1,0) # After

array([[0, 1, 1, 1, 1],
       [1, 0, 0, 0, 0],
       [1, 1, 0, 1, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0],
       [0, 1, 1, 0, 0],
       [1, 1, 0, 0, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 1, 1, 1]])

In [38]:
# np.select allows for element-wise conditional reassignment, just like case_when from R
# basic usage: np.select(conditions, choices, default=0)

# create conditions
conditions = [D < 0, D == 0, D > 0]

# element wise reassignment
choices = [-1, 0, 1]

# run np.select
np.select(conditions, choices, default='unknown')

array([['-1', '1', '1', '1', '1'],
       ['1', '-1', '-1', '-1', '-1'],
       ['1', '1', '-1', '1', '-1'],
       ['-1', '0', '1', '1', '-1'],
       ['-1', '-1', '1', '-1', '0'],
       ['-1', '1', '1', '-1', '-1'],
       ['1', '1', '-1', '-1', '1'],
       ['1', '1', '-1', '-1', '-1'],
       ['0', '-1', '-1', '1', '-1'],
       ['-1', '1', '1', '1', '1']], dtype='<U21')
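
Notice that the result above is an array of strings: the string default `'unknown'` forces every choice into a common string type. If you want to keep a numeric result, one option is a numeric default. The sketch below uses made-up values and an arbitrary sentinel of 99; the `nan` entry matches none of the conditions, so it receives the default.

```python
import numpy as np

values = np.array([-1.5, 0.0, 2.0, np.nan])

conditions = [values < 0, values == 0, values > 0]
choices = [-1, 0, 1]

# a numeric default keeps the output numeric
signs = np.select(conditions, choices, default=99)

print(signs)
print(signs.dtype)
```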

### Concatenating and Splitting Arrays

We can easily stack and grow numpy arrays. These are the main functions for concatenating arrays: 

- `np.concatenate([array,array],axis=0)`: concatenate by rows
- `np.concatenate([array,array],axis=1)`: concatenate by columns 

The same behavior can be achieved with `np.vstack([array,array])` (stacks rows, like `axis=0`) or `np.hstack([array,array])` (stacks columns, like `axis=1`).

In [39]:
# create arrays
X = np.random.randint(0, 100, (5, 2))
Y = np.random.randint(0, 100, (5, 2))

In [40]:
# rbind
np.concatenate([X,Y],axis=0)

array([[86, 30],
       [28, 17],
       [87, 21],
       [99, 47],
       [69, 14],
       [29, 87],
       [91, 36],
       [40, 79],
       [44, 10],
       [ 5, 68]])

In [41]:
# cbind
np.concatenate([X,Y],axis=1)

array([[86, 30, 29, 87],
       [28, 17, 91, 36],
       [87, 21, 40, 79],
       [99, 47, 44, 10],
       [69, 14,  5, 68]])
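
As a quick check of the `vstack`/`hstack` shortcuts, the sketch below (with two small illustrative arrays) confirms that they give the same result as `np.concatenate` along the matching axis.

```python
import numpy as np

left = np.array([[1, 2], [3, 4]])
right = np.array([[5, 6], [7, 8]])

stacked_rows = np.vstack([left, right])   # same as axis=0
stacked_cols = np.hstack([left, right])   # same as axis=1

print(np.array_equal(stacked_rows, np.concatenate([left, right], axis=0)))
print(np.array_equal(stacked_cols, np.concatenate([left, right], axis=1)))
```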

### View vs Copy in Array

A subtle but important point about numpy arrays concerns the default behavior of slicing. When we slice an array,
we **_do not copy the array_**; rather, we get a "**view**" of the array.

**Why does this matter?** Any change made to the view will also affect the original array.
    
**Solution**: Make a copy. 

As noted in the [reading for this week](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html): 

> One important—and extremely useful—thing to know about array slices is that they return views rather than copies of the array data. **This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies**

We need to use the `.copy()` method from numpy to create a new array

In [42]:
# from lists
x = [1, 2, 3]
y=x[:] # slice is enough

# modify
y[0]=100

#print
print(y, x)

[100, 2, 3] [1, 2, 3]


In [43]:
# for arrays
X = np.random.randint(0, 100, (1, 5))

# slice
X_sub = X[:3]

# modify
X_sub[0][0] = 1000

# print
print(X, X_sub)

[[1000   32   15   22   20]] [[1000   32   15   22   20]]


In [44]:
# need to copy
# for arrays
X = np.random.randint(0, 100, (1, 5))

# slice.copy()
X_sub = X[:3].copy()

# modify
X_sub[0][0] = 1000

# print
print(X, X_sub)

[[50 59 68 73 14]] [[1000   59   68   73   14]]
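
If you are unsure whether two arrays refer to the same underlying data, `np.shares_memory()` can tell you. A minimal sketch with illustrative names:

```python
import numpy as np

original = np.arange(10)
view = original[:3]               # a slice is a view on the same data
independent = original[:3].copy() # a copy has its own data

print(np.shares_memory(original, view))          # True
print(np.shares_memory(original, independent))   # False
```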


## Vectorization (or ufunc in Numpy)

A critical reason for `numpy`'s popularity among data scientists is its efficiency. NumPy provides an easy-to-use and flexible interface to optimized computation with arrays of data. The key to making it fast is to use built-in (or easy to implement) vectorized operations. 


**What are vectorized functions?** A vectorized function processes an entire array or collection of data elements in a **single operation**. In plain English, it applies a particular operation in one shot over a sequence of objects. Vectorized functions are efficient because they let us avoid writing explicit Python loops over entire collections of data. 

Let's compare the performance of a vectorized function and a loop, using an example from your [reading for this week](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html).

#### Comparing Vectorized and Scalar Functions

In [8]:
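# Let's try to divide 1 by a list of numbers.
rng = np.random.default_rng(seed=1701)  # random number generator used in this section
values = list(rng.integers(1, 10, size=5))

# error: Python lists do not support element-wise division
1.0 / values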

TypeError: unsupported operand type(s) for /: 'float' and 'list'

In [None]:
# solution: write a non-vectorized function
import numpy as np
rng = np.random.default_rng(seed=1701)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        # notice the loop
        output[i] = 1.0 / values[i]
    return output
        
values = rng.integers(1, 10, size=5)
compute_reciprocals(values)

In [5]:
# but it works with an array, because numpy allows for vectorized functions
# numpy
values = rng.integers(1, 10, size=5)

# no error: the division is applied element-wise
1.0/values

array([0.2  , 0.2  , 1.   , 0.125, 0.25 ])

 ### <span style="color:red"> ALERT: What just happened?</span>

NumPy provides built-in vectorized routines that operate on whole `ndarray` objects. This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.


These built-in vectorized functions are called `ufuncs` (or "universal functions"). NumPy comes with a large number of these vectorized operations. [See here for a detailed list.](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html) 

The [google colab notebook](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.03-Computation-on-arrays-ufuncs.ipynb) from your reading also provides in-depth coverage of universal functions in `numpy`. Check it out!

#### Check difference in performance

In [51]:
# simple implementation
big_array = rng.integers(1, 100, size=1000)
type(big_array)

numpy.ndarray

In [47]:
# using the loop version
%timeit -n 1000 compute_reciprocals(big_array)

802 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [44]:
# vectorized implementation
%timeit -n 1000 (1.0 / big_array) # notice that `/` here calls np.divide, which is a vectorized function (ufunc). 

3.22 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
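
`%timeit` is an IPython magic, so it only works inside a notebook or IPython session. In a plain Python script, a comparable measurement can be sketched with the standard-library `timeit` module. The function and variable names below are illustrative, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

def reciprocals_loop(values):
    # explicit Python loop, one element at a time
    out = np.empty(len(values))
    for i in range(len(values)):
        out[i] = 1.0 / values[i]
    return out

data = np.random.default_rng(0).integers(1, 100, size=1000)

runs = 100
loop_time = timeit.timeit(lambda: reciprocals_loop(data), number=runs)
vec_time = timeit.timeit(lambda: 1.0 / data, number=runs)

print(f"loop:       {loop_time / runs * 1e6:.1f} µs per call")
print(f"vectorized: {vec_time / runs * 1e6:.1f} µs per call")
```

Both versions return the same values; only the place where the loop runs (Python versus compiled code) differs.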


### Building vectorized functions

We can take advantage of numpy's vectorized approach and very easily vectorize our own user-defined functions. 

Consider the following function that yields a different string when input `a` is larger/smaller than input `b`.

In [52]:
def bigsmall(a,b):
    if a > b:
        return "A is larger"
    else:
        return "B is larger"

In [53]:
bigsmall(5,6)

'B is larger'

In [54]:
# Create a vectorized version of the function
vec_bigsmall = np.vectorize(bigsmall)
vec_bigsmall 

<numpy.vectorize at 0x106197cd0>

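The vectorization here brings two main advantages: 

- Advantage 1: it allows us to apply the function to a collection without writing an explicit loop. 
- Advantage 2: it can run faster than a hand-written Python loop. Note, however, that NumPy's documentation describes `np.vectorize` as primarily a convenience that is implemented essentially as a for loop, so it will not match the speed of true ufuncs.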

In [57]:
# Advantage 1. Avoid the loops
bigsmall([0,2,5,7,0],4)

## error because the function is not vectorized

TypeError: '>' not supported between instances of 'list' and 'int'

In [58]:
# vectorize
vec_bigsmall([0,2,5,7,0],4)

array(['B is larger', 'B is larger', 'A is larger', 'A is larger',
       'B is larger'], dtype='<U11')

In [59]:
# Advantage II: compare the speed against a hand-written loop
# write a function to run element-wise
def bigsmall_el_wise(a_collection, b):
    container = []
    for a in a_collection:
        if a > b:
            container.append("A is larger")
        else:
            container.append("B is larger")
    return container

In [60]:
# Generating some random data (note: the timings below were run on `big_array` from the earlier cell)
a_collection = np.random.rand(1000000)
b = 0.5

In [61]:
%timeit -n 1000 vec_bigsmall(big_array, b)

156 µs ± 9.58 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [62]:
%timeit -n 1000 bigsmall_el_wise(big_array, b)

734 µs ± 7.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
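
For a simple two-way condition like this one, a built-in vectorized function can often replace `np.vectorize` altogether. The sketch below is one possible alternative, not the approach used above: it re-defines a compact `bigsmall` for self-containment and reproduces its output with `np.where`, which evaluates the comparison on the whole array at once.

```python
import numpy as np

def bigsmall(a, b):
    return "A is larger" if a > b else "B is larger"

a_values = np.array([0, 2, 5, 7, 0])

# convenience wrapper: calls bigsmall once per element
via_vectorize = np.vectorize(bigsmall)(a_values, 4)

# built-in alternative: the comparison runs on the whole array
via_where = np.where(a_values > 4, "A is larger", "B is larger")

print(via_where)
print(np.array_equal(via_vectorize, via_where))
```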


## Broadcasting

**Broadcasting** makes it possible for operations to be performed on arrays of mismatched shapes.

Broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.

For example, say we have a numpy array of dimensions (5,1)

$$ 
\begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix}
$$

Now say we want to add 5 to every value in this array:

$$ 
\begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + 5
$$

Broadcasting "pads" the scalar 5 (which is treated like an array of shape (1,1)) and stretches it so that it has the same shape as the larger array involved in the computation.

$$ 
\begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + \begin{bmatrix} 5\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\end{bmatrix}
$$

$$ 
\begin{bmatrix} 1 + 5\\2 + 5\\3 + 5\\4 + 5\\5 + 5\end{bmatrix} 
$$

$$ 
\begin{bmatrix} 6\\7\\8\\9\\10\end{bmatrix} 
$$

In [41]:
A = np.array([1,2,3,4,5])
A + 5

array([ 6,  7,  8,  9, 10])

By 'broadcast', we mean that the smaller array is made to match the size of the larger array in order to allow for element-wise manipulations.

### How it works:

- The shapes of the two arrays are compared dimension by dimension. 
- Dimensions are considered in reverse order, starting with the trailing dimensions and working forward. 
- Conceptually, the smaller array is stretched by copying its elements. However, and this is key, no actual copies are made, which makes the method computationally and memory efficient.

A general **rule of thumb**: two corresponding dimensions are compatible when they are equal or when one of them is 1.

## Rules of Broadcasting

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays (from [reading](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html)). Learning these rules is not really important. You only need to get the intuition of what is going on!


### Rule 1
> If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.

### Rule 2

> If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.

### Rule 3 

> If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
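
You can check these rules without building any arrays. The sketch below uses `np.broadcast_shapes()`, which is available in NumPy 1.20 and later (an assumption about your installed version); the shapes are chosen to mirror the examples in this section.

```python
import numpy as np

# Rule 1 + Rule 2: (3,) is padded to (1, 3), then stretched to (3, 3)
shape_a = np.broadcast_shapes((3, 3), (3,))

# Both arrays are stretched: (3, 1) and (1, 3) become (3, 3)
shape_b = np.broadcast_shapes((3, 1), (3,))

# Rule 3: trailing sizes 2 and 3 disagree and neither is 1
try:
    np.broadcast_shapes((3, 2), (3,))
    compatible = True
except ValueError:
    compatible = False

print(shape_a, shape_b, compatible)
```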

#### Example 1

In [63]:
np.arange(3) + 5

array([5, 6, 7])

$$
\texttt{np.arange(3)} = \begin{bmatrix} 0&1&2\end{bmatrix}
$$

<br> 

$$
\texttt{5}  = \begin{bmatrix} 5 \end{bmatrix}
$$

<br> 

$$
\begin{bmatrix} 0&1&2\end{bmatrix} + \begin{bmatrix} 5 & \color{lightgrey}{5} & \color{lightgrey}{5}\end{bmatrix} = \begin{bmatrix} 5 & 6 & 7\end{bmatrix} 
$$

#### Example 2

In [62]:
np.ones((3,3)) + np.arange(3)

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

$$
\texttt{np.ones((3,3)) = }\begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
$$

<br>

$$
\texttt{np.arange(3)} = \begin{bmatrix} 0 & 1 & 2\end{bmatrix} 
$$

<br>

$$
\begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} + 
\begin{bmatrix} 0 & 1 & 2\\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2} \\  \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2}\end{bmatrix}  = 
\begin{bmatrix} 1 & 2 & 3\\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix} 
$$

#### Example 3

In [64]:
np.arange(3).reshape(3,1) + np.arange(3)

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

$$
\texttt{np.arange(3).reshape(3,1)} = \begin{bmatrix} 0 \\ 1 \\ 2\end{bmatrix} 
$$

<br>

$$
\texttt{np.arange(3)} = \begin{bmatrix} 0 & 1 & 2\end{bmatrix} 
$$

<br>

$$
\begin{bmatrix} 0 & \color{lightgrey}{0} & \color{lightgrey}{0} \\ 1 & \color{lightgrey}{1} & \color{lightgrey}{1} \\  2 & \color{lightgrey}{2} & \color{lightgrey}{2}\end{bmatrix} +
\begin{bmatrix} 0 & 1 & 2\\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2} \\  \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2}\end{bmatrix}  =
\begin{bmatrix} 0 & 1 & 2\\ 1 &2&3 \\ 2& 3 & 4\end{bmatrix} 
$$


#### Example 4

Example of dimensional disagreement.

In [45]:
np.ones((4,7)) 

array([[1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.]])

In [46]:
np.ones((4,7))  + np.zeros( (5,9) )

ValueError: operands could not be broadcast together with shapes (4,7) (5,9) 

In [57]:
np.ones((4,7))  + np.zeros( (1,7) )

array([[1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.]])

Another example

In [66]:
M = np.ones((3, 2))
M

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

In [67]:
a = np.arange(3)
a

array([0, 1, 2])

In [68]:
print(M.shape)
print(a.shape)

(3, 2)
(3,)


In [70]:
# raises an error: the trailing dimensions (2 and 3) do not match
M + a

ValueError: operands could not be broadcast together with shapes (3,2) (3,) 
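
One way to make these two arrays compatible is to give `a` an explicit trailing dimension of size 1, so that its shape becomes (3, 1) and can be stretched across the columns of `M`. A minimal sketch:

```python
import numpy as np

M = np.ones((3, 2))
a = np.arange(3)

# a[:, np.newaxis] has shape (3, 1), which broadcasts against (3, 2)
result = M + a[:, np.newaxis]

print(result.shape)
print(result)
```

Here each row of `M` is shifted by the matching element of `a`.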

# Miscellaneous (Not Covered in Class)

### Missing Values

NumPy represents missing values with `np.nan` (i.e. `nan` == "Not a Number", see [here](https://en.wikipedia.org/wiki/NaN)), which is a floating-point value.

In [75]:
Y = np.random.randint(1,10,25).reshape(5,5) + .0
Y

array([[4., 7., 2., 9., 5.],
       [2., 7., 6., 1., 4.],
       [2., 4., 8., 8., 8.],
       [6., 4., 7., 4., 6.],
       [2., 6., 2., 4., 2.]])

In [76]:
Y[Y > 5] = np.nan
Y

array([[ 4., nan,  2., nan,  5.],
       [ 2., nan, nan,  1.,  4.],
       [ 2.,  4., nan, nan, nan],
       [nan,  4., nan,  4., nan],
       [ 2., nan,  2.,  4.,  2.]])

In [77]:
type(np.nan)

float

In [78]:
# scan for missing values
np.isnan(Y)

array([[False,  True, False,  True, False],
       [False,  True,  True, False, False],
       [False, False,  True,  True,  True],
       [ True, False,  True, False,  True],
       [False,  True, False, False, False]])

In [79]:
~np.isnan(Y) # are not NAs

array([[ True, False,  True, False,  True],
       [ True, False, False,  True,  True],
       [ True,  True, False, False, False],
       [False,  True, False,  True, False],
       [ True, False,  True,  True,  True]])

When we have missing values, we'll run into issues when computing across the data matrix.

In [80]:
np.mean(Y)

nan

To get around this, we need to use NaN-aware versions of these functions, which ignore `nan` values when computing.

In [81]:
np.nanmean(Y)

3.0

In [82]:
np.nanmean(Y,axis=0)

array([2.5       , 4.        , 2.        , 3.        , 3.66666667])

In [83]:
# Mean impute the missing values
Y[np.where(np.isnan(Y))] = np.nanmean(Y)
Y

array([[4., 3., 2., 3., 5.],
       [2., 3., 3., 1., 4.],
       [2., 4., 3., 3., 3.],
       [3., 4., 3., 4., 3.],
       [2., 3., 2., 4., 2.]])
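
The cell above fills every gap with the overall mean. A common variation is to impute each column with its own mean, which combines `np.nanmean(..., axis=0)`, `np.where`, and broadcasting. The sketch below uses a small made-up array rather than `Y`.

```python
import numpy as np

data_with_gaps = np.array([[1.0, np.nan],
                           [3.0, 4.0],
                           [np.nan, 8.0]])

# one mean per column, ignoring nan values
col_means = np.nanmean(data_with_gaps, axis=0)

# where a value is missing, take the column mean (broadcast across rows)
imputed = np.where(np.isnan(data_with_gaps), col_means, data_with_gaps)

print(col_means)
print(imputed)
```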

### Structured Data: NumPy’s Structured Arrays

Out of the box, numpy arrays can only handle one data type at a time. Most of the time, though, we work with heterogeneous data: think of a spreadsheet with name, age, gender, address, etc.

This short section shows you how to use NumPy's structured arrays to get around this limitation. 

Let's start by creating some lists. Imagine these are columns in your dataframe.

In [65]:
# lists
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

In [67]:
# nest these lists
nested_list = [name, age, weight]
nested_list

[['Alice', 'Bob', 'Cathy', 'Doug'], [25, 45, 37, 19], [55.0, 85.5, 68.0, 61.5]]

In [71]:
# convert to a numpy array
array_nested_list = np.array(nested_list).T
array_nested_list

array([['Alice', '25', '55.0'],
       ['Bob', '45', '85.5'],
       ['Cathy', '37', '68.0'],
       ['Doug', '19', '61.5']], dtype='<U32')

In [70]:
# see data type - all data treated as strings. 
array_nested_list.dtype


dtype('<U32')

If you wish to preserve the data type of each variable, you can use structured arrays. These behave almost like a less flexible dictionary. 

You need to follow three steps: 

- Create an empty structure with a pre-defined size
- Provide names for the 'columns'
- Provide types for the columns

In [95]:
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                             'formats':('U10', 'i', 'f')})

In [96]:
# see the skeleton of the structure
data

array([('', 0, 0.), ('', 0, 0.), ('', 0, 0.), ('', 0, 0.)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f4')])

In [97]:
# add information
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]


In [98]:
# then you can access fields pretty much like dictionary keys
data["name"]

array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')
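
Fields of a structured array can also be combined with boolean masks, which hints at the kind of filtering we will later do in pandas. The sketch below rebuilds the same four records under an illustrative name (`people`) so that it runs on its own.

```python
import numpy as np

people = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                            'formats': ('U10', 'i4', 'f8')})
people['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
people['age'] = [25, 45, 37, 19]
people['weight'] = [55.0, 85.5, 68.0, 61.5]

# names of everyone younger than 30
under_30 = people[people['age'] < 30]['name']

print(under_30)
```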

Though possible to deal with heterogeneous data frames using numpy, there is a lot of overhead to constructing a data object. 

**As such, we'll use Pandas series and DataFrames to deal with heterogeneous data.**

