In this notebook, we will cover:
This is mostly code I will go over during the lecture.
# Batteries included Functions
import csv # convert a .csv to a nested list
import os # library for managing our operating system.
# Read in the gapminder data
# with to open the connection
with open("gapminder.csv",mode="rt") as file:
    # iterate through the connection with list comprehension
    data = [row for row in csv.reader(file)]
# output: a nested list, with one inner list per row of the file
data
Notice something important here: because we read the file row by row through an iterator, the code has no way of knowing that the first row is the header of the csv. The header is stored as just another row in the nested list.
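If you would rather have the header handled at read time, the standard library's `csv.DictReader` treats the first row as the field names and yields one dictionary per data row. A minimal sketch, using a small in-memory stand-in for the file (the three columns and their values are made up for illustration and are not taken from gapminder.csv):

```python
import csv
import io

# Hypothetical stand-in for an open file; the values are illustrative only.
sample = io.StringIO("country,year,lifeExp\nA,2000,70.5\nB,2000,65.0\n")

# DictReader consumes the first row as the header automatically.
reader = csv.DictReader(sample)
records = [row for row in reader]

print(reader.fieldnames)      # the column names
print(records[0]["lifeExp"])  # values are still strings, not numbers
```

The rest of this notebook sticks with the nested list, since the point is to practice indexing it by hand.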
# accessing the header
print(data[0])
# Row 0 holds the column names; every row after it holds data.
print(data[100])
# Indexing Columns - Remember this is a nested list
# Referencing a column data value
d = data[100] # First select the row
print(d)
# Then reference the column
d[1]
# doing the above all in one step
data[100][1]
# The key is to keep in mind the column names
cnames = data.pop(0)
print(cnames)
# We can now use this list of column names to pull out the columns we're interested in.
ind = cnames.index("lifeExp") # .index() lets us "look up" the position of a value in a list.
data[99][ind]
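One thing to keep in mind with `.index()`: it raises a `ValueError` when the value is not in the list, which is easy to trigger with a misspelled column name. A small sketch of a lookup with a friendlier error message; the helper `column_index` and the header row below are hypothetical, for illustration only:

```python
# Hypothetical header row, for illustration only.
cnames = ["country", "year", "lifeExp"]

def column_index(names, name):
    """Return the position of `name` in `names`, with a clearer error if absent."""
    try:
        return names.index(name)
    except ValueError:
        # Re-raise with the list of valid names so typos are easy to spot.
        raise KeyError(f"no column named {name!r}; available: {names}") from None

print(column_index(cnames, "lifeExp"))
```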
If I want to extract all the values of a particular column, I need to loop through every row and pull out the j-th element of each sublist.
# Looping through each row pulling out the relevant data value
life_exp = []
for row in data:
    life_exp.append(float(row[ind]))
# Same idea, but as a list comprehension
life_exp = [float(row[ind]) for row in data]
print(life_exp)
# Make this code more flexible with list comprehensions
var_name = "gdpPercap"
out = [row[cnames.index(var_name)] for row in data]
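The same idea can be wrapped in a small function so that the column name and the type conversion are both arguments. This is only a sketch: `get_column` and the toy rows are hypothetical names and values, not part of the lecture code. It looks the index up once, before the loop, rather than once per row.

```python
def get_column(rows, names, name, cast=str):
    """Return one column of a nested list, converting each value with `cast`."""
    ind = names.index(name)  # look up the position once, not once per row
    return [cast(row[ind]) for row in rows]

# Hypothetical header and rows, for illustration only.
cnames = ["country", "year", "lifeExp"]
rows = [["A", "2000", "70.5"], ["B", "2000", "65.0"]]

print(get_column(rows, cnames, "lifeExp", cast=float))
print(get_column(rows, cnames, "country"))
```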
All of the above seems like a lot of work just to handle rectangular data in Python. And it is. So, of course, there are more modern and easier ways to work with data frames in Python.
A first approach to make working with data frames in Python easier is to use numpy to convert nested lists into arrays.
If you are coming from R, think of numpy arrays as matrices.
We will see more of numpy soon. But let's briefly look at how numpy works and the speed boost it gives when accessing data in Python.
# Read in the gapminder data
with open("gapminder.csv",mode="rt") as file:
    data = [row for row in csv.reader(file)]
# let's remove the first list (the header row)
data.pop(0)
# Numpy offers an efficiency boost, especially when indexing
import numpy as np
# Convert to a numpy array
data_np = np.array(data)
data_np
# simple slicing of rows and columns of your 2d array
# array[rows, columns]
data_np[:,2]
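Because `csv.reader` returns every field as a string, an array built from the nested list holds strings too, so a column has to be converted before doing any arithmetic on it. A minimal sketch with made-up rows (not values from gapminder.csv):

```python
import numpy as np

# Hypothetical rows, for illustration; every field is a string, as csv.reader returns.
rows = [["A", "2000", "70.5"], ["B", "2000", "65.0"]]
arr = np.array(rows)

print(arr.shape)       # (rows, columns)
print(arr.dtype.kind)  # 'U' means the array holds unicode strings

# Slice one column with array[rows, columns], then convert it to floats.
life = arr[:, 2].astype(float)
print(life.mean())
```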
%%timeit -r 10 -n 100000
out1 = []
for row in data:
    out1.append(row[ind])
%%timeit -r 10 -n 100000
out2 = [row[ind] for row in data]
%%timeit -r 10 -n 100000
out3 = data_np[:,ind]
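The `%%timeit` magic only works inside Jupyter/IPython. Outside a notebook, roughly the same comparison can be sketched with the standard library's `timeit` module. The data here is synthetic, and the function names and the column index are assumptions for illustration; the actual timings depend on your machine.

```python
import timeit
import numpy as np

# Synthetic nested list of strings standing in for the csv rows.
rows = [[str(i), str(i * 2), str(i * 3)] for i in range(1000)]
arr = np.array(rows)
ind = 2  # hypothetical column position

def with_loop():
    out = []
    for row in rows:
        out.append(row[ind])
    return out

def with_comprehension():
    return [row[ind] for row in rows]

def with_numpy():
    # Basic slicing returns a view of the array, so no per-row work is done.
    return arr[:, ind]

for fn in (with_loop, with_comprehension, with_numpy):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds:.4f} s for 1000 calls")
```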