<h1><center> PPOL564 - Data Science I: Foundations<br><br><font color='grey'> Working with Nested Lists </font> </center></h1>

## Learning Goals

In this notebook, we will cover:

- Opening a CSV file as a nested list
- Working with rectangular data as nested lists

This is mostly code I will go over during the lecture. 

In [3]:
# "Batteries included" modules from the standard library
import csv # read a .csv file row by row
import os  # interact with the operating system (files, paths)


# Read in the gapminder data 

# use `with` to open the file so it is closed automatically
with open("gapminder.csv", mode="rt", newline="") as file:
    # iterate through the reader with a list comprehension
    data = [row for row in csv.reader(file)]
    # result: a nested list (one inner list of strings per row)

#### What does the data look like?

In [4]:
# it is a nested list. 
data

[['country', 'lifeExp', 'gdpPercap'],
 ['Guinea_Bissau', '39.21', '652.157'],
 ['Bolivia', '52.505', '2961.229'],
 ['Austria', '73.103', '20411.916'],
 ['Malawi', '43.352', '575.447'],
 ['Finland', '72.992', '17473.723'],
 ['North_Korea', '63.607', '2591.853'],
 ['Malaysia', '64.28', '5406.038'],
 ['Hungary', '69.393', '10888.176'],
 ['Congo', '52.502', '3312.788'],
 ['Morocco', '57.609', '2447.909'],
 ['Germany', '73.444', '20556.684'],
 ['Ecuador', '62.817', '5733.625'],
 ['Kuwait', '68.922', '65332.91'],
 ['New_Zealand', '73.989', '17262.623'],
 ['Mauritania', '52.302', '1356.671'],
 ['Uganda', '47.619', '810.384'],
 ['Equatorial Guinea', '42.96', '2469.167'],
 ['Croatia', '70.056', '9331.712'],
 ['Indonesia', '54.336', '1741.365'],
 ['Canada', '74.903', '22410.746'],
 ['Comoros', '52.382', '1314.38'],
 ['Montenegro', '70.299', '7208.065'],
 ['Slovenia', '71.601', '14074.582'],
 ['Trinidad and Tobago', '66.828', '7866.872'],
 ['Poland', '70.177', '8416.554'],
 ['Lesotho', '50.007', 

## Indexing Nested Lists

Notice something important here: because we read the data through an iterator, the code doesn't know that the first row is the header of the CSV. The header is simply stored as row 0 of the nested list.

In [6]:
# accessing the header
print(data[0])

# Row 0 holds the column names; every row with index > 0 holds data.
print(data[100])

['country', 'lifeExp', 'gdpPercap']
['Burundi', '44.817', '471.663']
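
One way to keep the header apart from the data at read time is to consume the first row before collecting the rest. The sketch below is a minimal illustration, not the lecture's code: it assumes a small in-memory CSV (`sample_csv`, a hypothetical stand-in for `gapminder.csv`) so that it runs without the file.

```python
import csv
import io

# Hypothetical in-memory stand-in for gapminder.csv (first rows only).
sample_csv = io.StringIO(
    "country,lifeExp,gdpPercap\n"
    "Guinea_Bissau,39.21,652.157\n"
    "Bolivia,52.505,2961.229\n"
)

reader = csv.reader(sample_csv)
header = next(reader)  # consume the first row as the column names
rows = list(reader)    # everything left in the iterator is data

print(header)   # ['country', 'lifeExp', 'gdpPercap']
print(rows[0])  # ['Guinea_Bissau', '39.21', '652.157']
```

With this approach the data rows start at index 0, so no later `pop` is needed.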


### Indexing by columns

In [8]:
# Indexing Columns - Remember this is a nested list

# Referencing a column data value
d = data[100] # First select the row
print(d)

['Burundi', '44.817', '471.663']


In [9]:
# Then reference the column
d[1] 

'44.817'

In [10]:
# doing the above all in one step
data[100][1]

'44.817'

In [12]:
# The key is to keep track of the column names.
# pop(0) removes the header row from data and returns it.
cnames = data.pop(0)
print(cnames)

['country', 'lifeExp', 'gdpPercap']


In [13]:
# We can now reference this column name list to pull out the columns we're interested in.
ind = cnames.index("lifeExp") # .index() "looks up" the position of a value in a list.
# The header was popped, so every row shifted up by one: old row 100 is now row 99.
data[99][ind]

'44.817'
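
As an aside, the standard library can do the name-to-position bookkeeping for us. The sketch below uses `csv.DictReader`, which is not used elsewhere in this notebook, on a hypothetical in-memory sample (`sample_csv`) rather than the real file.

```python
import csv
import io

# Hypothetical in-memory stand-in for gapminder.csv (first rows only).
sample_csv = io.StringIO(
    "country,lifeExp,gdpPercap\n"
    "Guinea_Bissau,39.21,652.157\n"
    "Bolivia,52.505,2961.229\n"
)

# DictReader uses the first row as keys, so each record is a dict.
records = list(csv.DictReader(sample_csv))

# Look a value up by column name instead of by position.
print(records[1]["lifeExp"])  # prints 52.505 (the value is still a string)
```

The trade-off is that each row becomes a dictionary rather than a list, so the positional indexing shown above no longer applies.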

## Accessing an entire column

If I want to extract all the values of a particular column, I need to loop through every row and pull out the *j*-th element of each sublist. 

In [14]:
# Looping through each row pulling out the relevant data value
life_exp = []
for row in data:
    life_exp.append(float(row[ind]))

In [15]:
# Same idea, but as a list comprehension 
life_exp = [float(row[ind]) for row in data]
print(life_exp)

[39.21, 52.505, 73.103, 43.352, 72.992, 63.607, 64.28, 69.393, 52.502, 57.609, 73.444, 62.817, 68.922, 73.989, 52.302, 47.619, 42.96, 70.056, 54.336, 74.903, 52.382, 70.299, 71.601, 66.828, 70.177, 50.007, 74.014, 60.721, 52.681, 44.401, 67.708, 59.304, 73.733, 52.341, 58.859, 59.696, 66.644, 66.526, 47.903, 69.744, 65.866, 51.499, 46.78, 68.749, 49.002, 67.431, 73.646, 59.03, 71.511, 46.381, 71.22, 43.581, 49.834, 44.544, 71.045, 53.491, 48.401, 61.346, 41.482, 72.739, 68.433, 57.48, 40.38, 43.413, 58.679, 42.476, 47.771, 46.774, 51.221, 64.953, 45.996, 68.291, 61.554, 56.243, 50.626, 58.443, 52.663, 54.598, 48.436, 37.479, 65.409, 57.896, 53.322, 75.565, 73.923, 74.827, 59.633, 53.166, 62.2, 65.606, 74.663, 55.89, 48.986, 58.637, 57.921, 43.24, 66.581, 76.511, 40.989, 44.817, 67.802, 70.181, 60.967, 74.37, 48.78, 45.999, 73.642, 60.329, 65.001, 44.476, 56.729, 63.898, 48.129, 73.478, 54.882, 61.785, 36.769, 70.696, 47.912, 66.809, 69.06, 74.203, 75.648, 74.349, 44.559, 43.867, 68.551

In [16]:
# Make this code more flexible: look up the column position by name
var_name = "gdpPercap"
j = cnames.index(var_name) # look up the position once, outside the loop
out = [row[j] for row in data]
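
The same pattern can be wrapped in a small function so that any column can be pulled out by name. This is a minimal sketch: `get_column` and its `cast` parameter are hypothetical names introduced here, and the three rows are copied from the output above so the block runs on its own.

```python
def get_column(rows, header, name, cast=str):
    """Return every value of column `name`, converted with `cast`."""
    j = header.index(name)  # look the position up once
    return [cast(row[j]) for row in rows]


header = ["country", "lifeExp", "gdpPercap"]
rows = [
    ["Guinea_Bissau", "39.21", "652.157"],
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
]

life_exp = get_column(rows, header, "lifeExp", cast=float)
print(life_exp)  # [39.21, 52.505, 73.103]

# Once the values are floats we can do arithmetic on them.
print(sum(life_exp) / len(life_exp))
```

Passing `cast=float` matters because `csv.reader` returns every field as a string.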

## Motivating Numpy

All of the above seems like a lot of work for handling rectangular data in Python. And it is. Unsurprisingly, there are more modern and convenient tools for working with data frames in Python. 

A first step toward working with data frames in Python is to use `numpy` to convert nested lists into `arrays`. 

**If you are coming from R, think of numpy arrays as matrices.**

We will see more of numpy soon. For now, let's briefly see how numpy works and the speed boost it gives when accessing data in Python.
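
One detail worth keeping in mind when converting: a numpy array holds a single data type, so a nested list of strings becomes an array of strings. The sketch below illustrates this on three rows copied from the gapminder output (the variable names are illustrative).

```python
import numpy as np

rows = [
    ["Guinea_Bissau", "39.21", "652.157"],
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
]

arr = np.array(rows)
print(arr.shape)       # (3, 3): 3 rows, 3 columns
print(arr.dtype.kind)  # U: every element is a Unicode string

# Numeric work needs an explicit conversion of the numeric columns.
numeric = arr[:, 1:].astype(float)
print(numeric.mean(axis=0))  # column means of lifeExp and gdpPercap
```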


In [25]:
# Read in the gapminder data 
with open("gapminder.csv", mode="rt", newline="") as file:
    data = [row for row in csv.reader(file)]

# let's remove the header row (pop returns the removed item)
data.pop(0)

['country', 'lifeExp', 'gdpPercap']

In [17]:
# Numpy offers an efficiency boost, especially when indexing
import numpy as np

# Convert to a numpy array
data_np = np.array(data)
data_np

array([['Guinea_Bissau', '39.21', '652.157'],
       ['Bolivia', '52.505', '2961.229'],
       ['Austria', '73.103', '20411.916'],
       ['Malawi', '43.352', '575.447'],
       ['Finland', '72.992', '17473.723'],
       ['North_Korea', '63.607', '2591.853'],
       ['Malaysia', '64.28', '5406.038'],
       ['Hungary', '69.393', '10888.176'],
       ['Congo', '52.502', '3312.788'],
       ['Morocco', '57.609', '2447.909'],
       ['Germany', '73.444', '20556.684'],
       ['Ecuador', '62.817', '5733.625'],
       ['Kuwait', '68.922', '65332.91'],
       ['New_Zealand', '73.989', '17262.623'],
       ['Mauritania', '52.302', '1356.671'],
       ['Uganda', '47.619', '810.384'],
       ['Equatorial Guinea', '42.96', '2469.167'],
       ['Croatia', '70.056', '9331.712'],
       ['Indonesia', '54.336', '1741.365'],
       ['Canada', '74.903', '22410.746'],
       ['Comoros', '52.382', '1314.38'],
       ['Montenegro', '70.299', '7208.065'],
       ['Slovenia', '71.601', '14074.582'],
      

### Slicing data with numpy

In [18]:
# simple slicing of rows and columns of your 2d array
# array[rows, columns]
data_np[:,2]

array(['652.157', '2961.229', '20411.916', '575.447', '17473.723',
       '2591.853', '5406.038', '10888.176', '3312.788', '2447.909',
       '20556.684', '5733.625', '65332.91', '17262.623', '1356.671',
       '810.384', '2469.167', '9331.712', '1741.365', '22410.746',
       '1314.38', '7208.065', '14074.582', '7866.872', '8416.554',
       '780.553', '16245.209', '3477.21', '1200.416', '680.133',
       '3484.779', '12013.579', '13969.037', '1044.582', '5613.844',
       '4469.453', '4898.398', '1854.731', '675.368', '6384.055',
       '7269.216', '1153.82', '1569.275', '6197.645', '3163.352',
       '6703.289', '14160.936', '4426.026', '13920.011', '2697.833',
       '17425.382', '1488.309', '817.559', '648.343', '6283.259',
       '3675.582', '1835.01', '3009.288', '675.669', '10863.164',
       '3255.367', '1017.713', '542.278', '673.093', '20261.744',
       '604.814', '1335.595', '1165.454', '11529.865', '4768.942',
       '1358.199', '7300.17', '2844.856', '3074.031', '1533.12
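
The `array[rows, columns]` pattern extends to row slices and rectangular blocks. Here is a small sketch on three rows copied from the output above; the variable names are illustrative.

```python
import numpy as np

arr = np.array([
    ["Guinea_Bissau", "39.21", "652.157"],
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
])

first_row = arr[0, :]  # one row, all columns
gdp = arr[:, 2]        # all rows, third column
block = arr[:2, 1:]    # first two rows, last two columns

print(first_row)
print(gdp)
print(block.shape)     # (2, 2)
```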

### Data operations are easier and faster

In [19]:
%%timeit -r 10 -n 100000
out1 = []
for row in data:
    out1.append(row[ind])

3.02 µs ± 75.5 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)


In [20]:
%%timeit -r 10 -n 100000
out2 = [row[ind] for row in data]

2.46 µs ± 40.7 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)


In [21]:
%%timeit -r 10 -n 100000
out3 = data_np[:,ind]

98.3 ns ± 1.95 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
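
Part of the gap comes from what each approach does: the loop and the list comprehension build a new Python list, while basic numpy slicing returns a view on the existing array without copying the data. The sketch below reproduces the comparison outside Jupyter with the standard `timeit` module. It uses synthetic rows (hypothetical values, not the gapminder data), and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Synthetic stand-in for the gapminder rows (hypothetical values).
rows = [[f"country_{i}", str(40 + i % 40), str(500 + i)] for i in range(140)]
arr = np.array(rows)
j = 1  # column position to extract

from_list = [row[j] for row in rows]
from_view = arr[:, j]

# Both approaches return the same values.
print(from_list == from_view.tolist())  # True

# Basic slicing gives a view whose base is the original array (no copy).
print(from_view.base is arr)  # True

t_list = timeit.timeit(lambda: [row[j] for row in rows], number=10_000)
t_np = timeit.timeit(lambda: arr[:, j], number=10_000)
print(f"list comprehension: {t_list:.4f} s, numpy slice: {t_np:.4f} s")
```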


