PPOL564 - Data Science I: Foundations

Working with Nested Lists

Learning Goals

In this notebook, we will cover:

  • Opening a CSV file as a nested list
  • Working with rectangular data as nested lists

This is mostly code I will go over during the lecture.

In [3]:
# Standard-library ("batteries included") modules
import csv # read a .csv file row by row
import os  # tools for interacting with the operating system


# Read in the gapminder data 

# Use `with` to open the file; it is closed automatically when the block ends
with open("gapminder.csv",mode="rt") as file:
    # Iterate over the rows with a list comprehension
    data = [row for row in csv.reader(file)]
    # Result: a nested list, one inner list per row

What does the data look like?

In [4]:
# it is a nested list. 
data
Out[4]:
[['country', 'lifeExp', 'gdpPercap'],
 ['Guinea_Bissau', '39.21', '652.157'],
 ['Bolivia', '52.505', '2961.229'],
 ['Austria', '73.103', '20411.916'],
 ['Malawi', '43.352', '575.447'],
 ['Finland', '72.992', '17473.723'],
 ['North_Korea', '63.607', '2591.853'],
 ['Malaysia', '64.28', '5406.038'],
 ['Hungary', '69.393', '10888.176'],
 ['Congo', '52.502', '3312.788'],
 ['Morocco', '57.609', '2447.909'],
 ['Germany', '73.444', '20556.684'],
 ['Ecuador', '62.817', '5733.625'],
 ['Kuwait', '68.922', '65332.91'],
 ['New_Zealand', '73.989', '17262.623'],
 ['Mauritania', '52.302', '1356.671'],
 ['Uganda', '47.619', '810.384'],
 ['Equatorial Guinea', '42.96', '2469.167'],
 ['Croatia', '70.056', '9331.712'],
 ['Indonesia', '54.336', '1741.365'],
 ['Canada', '74.903', '22410.746'],
 ['Comoros', '52.382', '1314.38'],
 ['Montenegro', '70.299', '7208.065'],
 ['Slovenia', '71.601', '14074.582'],
 ['Trinidad and Tobago', '66.828', '7866.872'],
 ['Poland', '70.177', '8416.554'],
 ['Lesotho', '50.007', '780.553'],
 ['Italy', '74.014', '16245.209'],
 ['Tunisia', '60.721', '3477.21'],
 ['Kenya', '52.681', '1200.416'],
 ['Gambia', '44.401', '680.133'],
 ['Bosnia and Herzegovina', '67.708', '3484.779'],
 ['Libya', '59.304', '12013.579'],
 ['Greece', '73.733', '13969.037'],
 ['Ghana', '52.341', '1044.582'],
 ['Peru', '58.859', '5613.844'],
 ['Turkey', '59.696', '4469.453'],
 ['Reunion', '66.644', '4898.398'],
 ['Sri_Lanka', '66.526', '1854.731'],
 ['Cambodia', '47.903', '675.368'],
 ['Bulgaria', '69.744', '6384.055'],
 ['Lebanon', '65.866', '7269.216'],
 ['Togo', '51.499', '1153.82'],
 ['Yemen', '46.78', '1569.275'],
 ['Jamaica', '68.749', '6197.645'],
 ['Swaziland', '49.002', '3163.352'],
 ['Chile', '67.431', '6703.289'],
 ['Israel', '73.646', '14160.936'],
 ['Algeria', '59.03', '4426.026'],
 ['Czech_Republic', '71.511', '13920.011'],
 ['Djibouti', '46.381', '2697.833'],
 ['Singapore', '71.22', '17425.382'],
 ['Nigeria', '43.581', '1488.309'],
 ['Bangladesh', '49.834', '817.559'],
 ['DRC', '44.544', '648.343'],
 ['Cuba', '71.045', '6283.259'],
 ['Namibia', '53.491', '3675.582'],
 ['Sudan', '48.401', '1835.01'],
 ['Syria', '61.346', '3009.288'],
 ['Rwanda', '41.482', '675.669'],
 ['Puerto Rico', '72.739', '10863.164'],
 ['Albania', '68.433', '3255.367'],
 ['Vietnam', '57.48', '1017.713'],
 ['Mozambique', '40.38', '542.278'],
 ['Mali', '43.413', '673.093'],
 ['Saudi Arabia', '58.679', '20261.744'],
 ['Liberia', '42.476', '604.814'],
 ['Madagascar', '47.771', '1335.595'],
 ['Chad', '46.774', '1165.454'],
 ['Gabon', '51.221', '11529.865'],
 ['Mauritius', '64.953', '4768.942'],
 ['Zambia', '45.996', '1358.199'],
 ['Romania', '68.291', '7300.17'],
 ['Dominican Republic', '61.554', '2844.856'],
 ['Egypt', '56.243', '3074.031'],
 ['Senegal', '50.626', '1533.122'],
 ['Oman', '58.443', '12138.562'],
 ['Zimbabwe', '52.663', '635.858'],
 ['Botswana', '54.598', '5031.504'],
 ["Cote d'Ivoire", '48.436', '1912.825'],
 ['Afghanistan', '37.479', '802.675'],
 ['Mexico', '65.409', '7724.113'],
 ['Sao Tome and Principe', '57.896', '1382.782'],
 ['Myanmar', '53.322', '439.333'],
 ['Switzerland', '75.565', '27074.334'],
 ['United Kingdom', '73.923', '19380.473'],
 ['Japan', '74.827', '17750.87'],
 ['El Salvador', '59.633', '4431.847'],
 ['India', '53.166', '1057.296'],
 ['Thailand', '62.2', '3045.966'],
 ['Bahrain', '65.606', '18077.664'],
 ['Australia', '74.663', '19980.596'],
 ['Mongolia', '55.89', '1692.805'],
 ['Nepal', '48.986', '782.729'],
 ['Iran', '58.637', '7376.583'],
 ['Honduras', '57.921', '2834.413'],
 ['Guinea', '43.24', '776.067'],
 ['Venezuela', '66.581', '10088.516'],
 ['Iceland', '76.511', '20531.422'],
 ['Somalia', '40.989', '1140.793'],
 ['Burundi', '44.817', '471.663'],
 ['Panama', '67.802', '5754.827'],
 ['Costa Rica', '70.181', '5448.611'],
 ['Philippines', '60.967', '2174.771'],
 ['Denmark', '74.37', '21671.825'],
 ['Benin', '48.78', '1155.395'],
 ['Eritrea', '45.999', '541.003'],
 ['Belgium', '73.642', '19900.758'],
 ['West Bank and Gaza', '60.329', '3759.997'],
 ['South_Korea', '65.001', '8217.318'],
 ['Ethiopia', '44.476', '509.115'],
 ['Guatemala', '56.729', '4015.403'],
 ['Colombia', '63.898', '4195.343'],
 ['Cameroon', '48.129', '1774.634'],
 ['United States', '73.478', '26261.151'],
 ['Pakistan', '54.882', '1439.271'],
 ['China', '61.785', '1488.308'],
 ['Sierra Leone', '36.769', '1072.819'],
 ['Slovak Republic', '70.696', '10415.531'],
 ['Tanzania', '47.912', '849.281'],
 ['Paraguay', '66.809', '3239.607'],
 ['Argentina', '69.06', '8955.554'],
 ['Spain', '74.203', '14029.826'],
 ['Netherlands', '75.648', '21748.852'],
 ['France', '74.349', '18833.57'],
 ['Niger', '44.559', '781.077'],
 ['Central African Republic', '43.867', '958.785'],
 ['Serbia', '68.551', '9305.049'],
 ['Iraq', '56.582', '7811.809'],
 ['Uruguay', '70.782', '7100.133'],
 ['Angola', '37.883', '3607.101'],
 ['Sweden', '76.177', '19943.126'],
 ['Nicaragua', '58.349', '3424.656'],
 ['South Africa', '53.993', '7247.431'],
 ['Burkina Faso', '44.694', '843.991'],
 ['Haiti', '50.165', '1620.739'],
 ['Norway', '75.843', '26747.307'],
 ['Taiwan', '70.337', '10224.807'],
 ['Portugal', '70.42', '11354.092'],
 ['Jordan', '59.786', '3128.121'],
 ['Ireland', '73.017', '15758.606'],
 ['Brazil', '62.239', '5829.317']]

Indexing Nested Lists

Notice something important here: because we read the file with a plain row iterator, Python does not know that the first row is the header of the CSV. It is stored as just another row.

In [6]:
# accessing the header
print(data[0])

# Row 0 holds the column names; every row with index > 0 is data.
print(data[100])
['country', 'lifeExp', 'gdpPercap']
['Burundi', '44.817', '471.663']
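
Since `csv.reader` returns an iterator, one way to keep the header apart from the data is to pull the first row off with `next()` before collecting the rest. The sketch below is only an illustration: it uses `io.StringIO` with three lines copied from the output above as a stand-in for `gapminder.csv`, and the names `sample`, `header` and `rows` are made up for this example.

```python
import csv
import io

# Stand-in for gapminder.csv: three lines copied from the output above
sample = io.StringIO(
    "country,lifeExp,gdpPercap\n"
    "Guinea_Bissau,39.21,652.157\n"
    "Bolivia,52.505,2961.229\n"
)

reader = csv.reader(sample)
header = next(reader)           # consume the first row: the column names
rows = [row for row in reader]  # everything left in the iterator is data

print(header)
print(rows[0])
```

With this approach row 0 of `rows` is already the first country, so there is no header row to skip when indexing.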

Indexing by columns

In [8]:
# Indexing Columns - Remember this is a nested list

# Referencing a column data value
d = data[100] # First select the row
print(d)
['Burundi', '44.817', '471.663']
In [9]:
# Then reference the column
d[1] 
Out[9]:
'44.817'
In [10]:
# doing the above all in one step
data[100][1]
Out[10]:
'44.817'
In [12]:
# The key is to keep track of the column names.
# pop(0) removes the header row from data and returns it.
cnames = data.pop(0)
print(cnames)
['country', 'lifeExp', 'gdpPercap']
In [13]:
# We can now use this list of column names to pull out the columns we're interested in.
# Note: the header is gone, so the Burundi row moved from index 100 to index 99.
ind = cnames.index("lifeExp") # .index() looks up the position of a value in a list
data[99][ind]
Out[13]:
'44.817'
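
If several columns are needed, calling `.index()` each time gets repetitive. A possible alternative, sketched here under the assumption that the column names are unique, is to build a small dictionary that maps each name to its position. The name `col` is hypothetical, and the two data rows are copied from the output above.

```python
cnames = ["country", "lifeExp", "gdpPercap"]
data = [
    ["Guinea_Bissau", "39.21", "652.157"],
    ["Bolivia", "52.505", "2961.229"],
]

# Map every column name to its position in a row
col = {name: i for i, name in enumerate(cnames)}

# Look up a cell by row number and column name
value = data[1][col["lifeExp"]]
print(col)
print(value)
```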

Accessing an entire column

If I want to extract all the values of a particular column, I need to loop through the rows and pull the j-th element out of each sublist.

In [14]:
# Looping through each row pulling out the relevant data value
life_exp = []
for row in data:
    life_exp.append(float(row[ind]))
In [15]:
# Same idea, but as a list comprehension 
life_exp = [float(row[ind]) for row in data]
print(life_exp)
[39.21, 52.505, 73.103, 43.352, 72.992, 63.607, 64.28, 69.393, 52.502, 57.609, 73.444, 62.817, 68.922, 73.989, 52.302, 47.619, 42.96, 70.056, 54.336, 74.903, 52.382, 70.299, 71.601, 66.828, 70.177, 50.007, 74.014, 60.721, 52.681, 44.401, 67.708, 59.304, 73.733, 52.341, 58.859, 59.696, 66.644, 66.526, 47.903, 69.744, 65.866, 51.499, 46.78, 68.749, 49.002, 67.431, 73.646, 59.03, 71.511, 46.381, 71.22, 43.581, 49.834, 44.544, 71.045, 53.491, 48.401, 61.346, 41.482, 72.739, 68.433, 57.48, 40.38, 43.413, 58.679, 42.476, 47.771, 46.774, 51.221, 64.953, 45.996, 68.291, 61.554, 56.243, 50.626, 58.443, 52.663, 54.598, 48.436, 37.479, 65.409, 57.896, 53.322, 75.565, 73.923, 74.827, 59.633, 53.166, 62.2, 65.606, 74.663, 55.89, 48.986, 58.637, 57.921, 43.24, 66.581, 76.511, 40.989, 44.817, 67.802, 70.181, 60.967, 74.37, 48.78, 45.999, 73.642, 60.329, 65.001, 44.476, 56.729, 63.898, 48.129, 73.478, 54.882, 61.785, 36.769, 70.696, 47.912, 66.809, 69.06, 74.203, 75.648, 74.349, 44.559, 43.867, 68.551, 56.582, 70.782, 37.883, 76.177, 58.349, 53.993, 44.694, 50.165, 75.843, 70.337, 70.42, 59.786, 73.017, 62.239]
In [16]:
# Make this code more flexible: look the column up by name
var_name = "gdpPercap"
j = cnames.index(var_name) # find the column position once, not once per row
out = [row[j] for row in data]
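
The same pattern can be wrapped in a small function so that the column name and the type conversion become arguments. This is a minimal sketch, not part of the lecture code: `get_column` and its parameters are hypothetical names, and the two rows are copied from the output above.

```python
def get_column(rows, names, var_name, cast=str):
    """Return one column of a nested list, converting each value with `cast`."""
    j = names.index(var_name)  # position of the requested column
    return [cast(row[j]) for row in rows]


cnames = ["country", "lifeExp", "gdpPercap"]
data = [
    ["Guinea_Bissau", "39.21", "652.157"],
    ["Bolivia", "52.505", "2961.229"],
]

gdp = get_column(data, cnames, "gdpPercap", cast=float)
countries = get_column(data, cnames, "country")
print(gdp)
print(countries)
```

Passing `cast=float` matters because `csv.reader` returns every value as a string, including the numeric ones.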

Motivating Numpy

All of the above seems like too much work for handling rectangular data in Python. And it is. Fortunately, there are more modern and convenient tools for working with data frames in Python.

A first step toward working with data frames in Python is to use numpy to convert nested lists into arrays.

If you are coming from R, think of numpy arrays as matrices.

We will see more of numpy soon. For now, let's briefly see how numpy works and the speed boost it gives when accessing data in Python.

In [25]:
# Read in the gapminder data 
with open("gapminder.csv",mode="rt") as file:
    data = [row for row in csv.reader(file)]

# Let's remove the first list (the header row)
data.pop(0)
Out[25]:
['country', 'lifeExp', 'gdpPercap']
In [26]:
# Numpy offers an efficiency boost, especially when indexing
import numpy as np

# Convert to a numpy array
data_np = np.array(data)
data_np
Out[26]:
array([['Guinea_Bissau', '39.21', '652.157'],
       ['Bolivia', '52.505', '2961.229'],
       ['Austria', '73.103', '20411.916'],
       ['Malawi', '43.352', '575.447'],
       ['Finland', '72.992', '17473.723'],
       ['North_Korea', '63.607', '2591.853'],
       ['Malaysia', '64.28', '5406.038'],
       ['Hungary', '69.393', '10888.176'],
       ['Congo', '52.502', '3312.788'],
       ['Morocco', '57.609', '2447.909'],
       ['Germany', '73.444', '20556.684'],
       ['Ecuador', '62.817', '5733.625'],
       ['Kuwait', '68.922', '65332.91'],
       ['New_Zealand', '73.989', '17262.623'],
       ['Mauritania', '52.302', '1356.671'],
       ['Uganda', '47.619', '810.384'],
       ['Equatorial Guinea', '42.96', '2469.167'],
       ['Croatia', '70.056', '9331.712'],
       ['Indonesia', '54.336', '1741.365'],
       ['Canada', '74.903', '22410.746'],
       ['Comoros', '52.382', '1314.38'],
       ['Montenegro', '70.299', '7208.065'],
       ['Slovenia', '71.601', '14074.582'],
       ['Trinidad and Tobago', '66.828', '7866.872'],
       ['Poland', '70.177', '8416.554'],
       ['Lesotho', '50.007', '780.553'],
       ['Italy', '74.014', '16245.209'],
       ['Tunisia', '60.721', '3477.21'],
       ['Kenya', '52.681', '1200.416'],
       ['Gambia', '44.401', '680.133'],
       ['Bosnia and Herzegovina', '67.708', '3484.779'],
       ['Libya', '59.304', '12013.579'],
       ['Greece', '73.733', '13969.037'],
       ['Ghana', '52.341', '1044.582'],
       ['Peru', '58.859', '5613.844'],
       ['Turkey', '59.696', '4469.453'],
       ['Reunion', '66.644', '4898.398'],
       ['Sri_Lanka', '66.526', '1854.731'],
       ['Cambodia', '47.903', '675.368'],
       ['Bulgaria', '69.744', '6384.055'],
       ['Lebanon', '65.866', '7269.216'],
       ['Togo', '51.499', '1153.82'],
       ['Yemen', '46.78', '1569.275'],
       ['Jamaica', '68.749', '6197.645'],
       ['Swaziland', '49.002', '3163.352'],
       ['Chile', '67.431', '6703.289'],
       ['Israel', '73.646', '14160.936'],
       ['Algeria', '59.03', '4426.026'],
       ['Czech_Republic', '71.511', '13920.011'],
       ['Djibouti', '46.381', '2697.833'],
       ['Singapore', '71.22', '17425.382'],
       ['Nigeria', '43.581', '1488.309'],
       ['Bangladesh', '49.834', '817.559'],
       ['DRC', '44.544', '648.343'],
       ['Cuba', '71.045', '6283.259'],
       ['Namibia', '53.491', '3675.582'],
       ['Sudan', '48.401', '1835.01'],
       ['Syria', '61.346', '3009.288'],
       ['Rwanda', '41.482', '675.669'],
       ['Puerto Rico', '72.739', '10863.164'],
       ['Albania', '68.433', '3255.367'],
       ['Vietnam', '57.48', '1017.713'],
       ['Mozambique', '40.38', '542.278'],
       ['Mali', '43.413', '673.093'],
       ['Saudi Arabia', '58.679', '20261.744'],
       ['Liberia', '42.476', '604.814'],
       ['Madagascar', '47.771', '1335.595'],
       ['Chad', '46.774', '1165.454'],
       ['Gabon', '51.221', '11529.865'],
       ['Mauritius', '64.953', '4768.942'],
       ['Zambia', '45.996', '1358.199'],
       ['Romania', '68.291', '7300.17'],
       ['Dominican Republic', '61.554', '2844.856'],
       ['Egypt', '56.243', '3074.031'],
       ['Senegal', '50.626', '1533.122'],
       ['Oman', '58.443', '12138.562'],
       ['Zimbabwe', '52.663', '635.858'],
       ['Botswana', '54.598', '5031.504'],
       ["Cote d'Ivoire", '48.436', '1912.825'],
       ['Afghanistan', '37.479', '802.675'],
       ['Mexico', '65.409', '7724.113'],
       ['Sao Tome and Principe', '57.896', '1382.782'],
       ['Myanmar', '53.322', '439.333'],
       ['Switzerland', '75.565', '27074.334'],
       ['United Kingdom', '73.923', '19380.473'],
       ['Japan', '74.827', '17750.87'],
       ['El Salvador', '59.633', '4431.847'],
       ['India', '53.166', '1057.296'],
       ['Thailand', '62.2', '3045.966'],
       ['Bahrain', '65.606', '18077.664'],
       ['Australia', '74.663', '19980.596'],
       ['Mongolia', '55.89', '1692.805'],
       ['Nepal', '48.986', '782.729'],
       ['Iran', '58.637', '7376.583'],
       ['Honduras', '57.921', '2834.413'],
       ['Guinea', '43.24', '776.067'],
       ['Venezuela', '66.581', '10088.516'],
       ['Iceland', '76.511', '20531.422'],
       ['Somalia', '40.989', '1140.793'],
       ['Burundi', '44.817', '471.663'],
       ['Panama', '67.802', '5754.827'],
       ['Costa Rica', '70.181', '5448.611'],
       ['Philippines', '60.967', '2174.771'],
       ['Denmark', '74.37', '21671.825'],
       ['Benin', '48.78', '1155.395'],
       ['Eritrea', '45.999', '541.003'],
       ['Belgium', '73.642', '19900.758'],
       ['West Bank and Gaza', '60.329', '3759.997'],
       ['South_Korea', '65.001', '8217.318'],
       ['Ethiopia', '44.476', '509.115'],
       ['Guatemala', '56.729', '4015.403'],
       ['Colombia', '63.898', '4195.343'],
       ['Cameroon', '48.129', '1774.634'],
       ['United States', '73.478', '26261.151'],
       ['Pakistan', '54.882', '1439.271'],
       ['China', '61.785', '1488.308'],
       ['Sierra Leone', '36.769', '1072.819'],
       ['Slovak Republic', '70.696', '10415.531'],
       ['Tanzania', '47.912', '849.281'],
       ['Paraguay', '66.809', '3239.607'],
       ['Argentina', '69.06', '8955.554'],
       ['Spain', '74.203', '14029.826'],
       ['Netherlands', '75.648', '21748.852'],
       ['France', '74.349', '18833.57'],
       ['Niger', '44.559', '781.077'],
       ['Central African Republic', '43.867', '958.785'],
       ['Serbia', '68.551', '9305.049'],
       ['Iraq', '56.582', '7811.809'],
       ['Uruguay', '70.782', '7100.133'],
       ['Angola', '37.883', '3607.101'],
       ['Sweden', '76.177', '19943.126'],
       ['Nicaragua', '58.349', '3424.656'],
       ['South Africa', '53.993', '7247.431'],
       ['Burkina Faso', '44.694', '843.991'],
       ['Haiti', '50.165', '1620.739'],
       ['Norway', '75.843', '26747.307'],
       ['Taiwan', '70.337', '10224.807'],
       ['Portugal', '70.42', '11354.092'],
       ['Jordan', '59.786', '3128.121'],
       ['Ireland', '73.017', '15758.606'],
       ['Brazil', '62.239', '5829.317']], dtype='<U24')

Slicing data with numpy

In [27]:
# simple slicing of rows and columns of your 2d array
# array[rows, columns]
data_np[:,2]
Out[27]:
array(['652.157', '2961.229', '20411.916', '575.447', '17473.723',
       '2591.853', '5406.038', '10888.176', '3312.788', '2447.909',
       '20556.684', '5733.625', '65332.91', '17262.623', '1356.671',
       '810.384', '2469.167', '9331.712', '1741.365', '22410.746',
       '1314.38', '7208.065', '14074.582', '7866.872', '8416.554',
       '780.553', '16245.209', '3477.21', '1200.416', '680.133',
       '3484.779', '12013.579', '13969.037', '1044.582', '5613.844',
       '4469.453', '4898.398', '1854.731', '675.368', '6384.055',
       '7269.216', '1153.82', '1569.275', '6197.645', '3163.352',
       '6703.289', '14160.936', '4426.026', '13920.011', '2697.833',
       '17425.382', '1488.309', '817.559', '648.343', '6283.259',
       '3675.582', '1835.01', '3009.288', '675.669', '10863.164',
       '3255.367', '1017.713', '542.278', '673.093', '20261.744',
       '604.814', '1335.595', '1165.454', '11529.865', '4768.942',
       '1358.199', '7300.17', '2844.856', '3074.031', '1533.122',
       '12138.562', '635.858', '5031.504', '1912.825', '802.675',
       '7724.113', '1382.782', '439.333', '27074.334', '19380.473',
       '17750.87', '4431.847', '1057.296', '3045.966', '18077.664',
       '19980.596', '1692.805', '782.729', '7376.583', '2834.413',
       '776.067', '10088.516', '20531.422', '1140.793', '471.663',
       '5754.827', '5448.611', '2174.771', '21671.825', '1155.395',
       '541.003', '19900.758', '3759.997', '8217.318', '509.115',
       '4015.403', '4195.343', '1774.634', '26261.151', '1439.271',
       '1488.308', '1072.819', '10415.531', '849.281', '3239.607',
       '8955.554', '14029.826', '21748.852', '18833.57', '781.077',
       '958.785', '9305.049', '7811.809', '7100.133', '3607.101',
       '19943.126', '3424.656', '7247.431', '843.991', '1620.739',
       '26747.307', '10224.807', '11354.092', '3128.121', '15758.606',
       '5829.317'], dtype='<U24')
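
Note that the slice still holds strings (`dtype='<U24'`), because a numpy array stores a single type and the country names force everything to text. To do arithmetic, a numeric column can be converted with `.astype(float)`. The sketch below uses three rows copied from the output above in place of the full data; the name `gdp` is made up for this example.

```python
import numpy as np

# Three rows copied from the output above, standing in for the full data
data_np = np.array([
    ["Guinea_Bissau", "39.21", "652.157"],
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
])

# Slice the gdpPercap column, then convert the strings to floats
gdp = data_np[:, 2].astype(float)

print(gdp.dtype)
print(gdp.mean())
print(gdp.max())
```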

Data operations are easier and faster

In [12]:
%%timeit -r 10 -n 100000
out1 = []
for row in data:
    out1.append(row[ind])
2.56 µs ± 48.8 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
In [13]:
%%timeit -r 10 -n 100000
out2 = [row[ind] for row in data]
2.03 µs ± 39.4 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
In [15]:
%%timeit -r 10 -n 100000
out3 = data_np[:,ind]
117 ns ± 57.7 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
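
One caveat when reading these timings: basic slicing in numpy returns a view of the existing array rather than a new copy of the values, whereas the loop and the list comprehension both build a new list. That likely accounts for part of the gap. The sketch below repeats the comparison with the standard `timeit` module so it can run outside Jupyter. It uses only three rows copied from the output above, the function names are hypothetical, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Three rows copied from the output above, standing in for the full data
data = [
    ["Guinea_Bissau", "39.21", "652.157"],
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
]
data_np = np.array(data)
ind = 1  # position of the lifeExp column


def with_loop():
    return [row[ind] for row in data]  # builds a new list


def with_view():
    return data_np[:, ind]  # basic slicing: a view, no values copied


def with_copy():
    return data_np[:, ind].copy()  # forces the values to be copied


n = 100_000
for fn in (with_loop, with_view, with_copy):
    seconds = timeit.timeit(fn, number=n)
    print(f"{fn.__name__}: {seconds / n * 1e9:.0f} ns per call")

# A view shares memory with the original array; a copy does not
print(np.shares_memory(with_view(), data_np))
print(np.shares_memory(with_copy(), data_np))
```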