PPOL 5203 - Data Science I: Foundations

Week 4: From Nested Lists to Dataframes - Numpy and Intro to Pandas

Author

Professor: Tiago Ventura

Published

October 8, 2024

Where we are ….

We started with the basics of being a data scientist
Then we moved over to the primitives of Python as your main DS tool:
- OOP, Native Data Types in Python
- Python Libraries, Loops, Functions, Generators, Comprehensions, ...
Today we start our journey working with tabular data - a favorite of social scientists!

Plans for Today

File Management in Python (Most pythonic way to load data in Python)
Data as Nested Lists: Motivating Numpy
Numpy
Intro to Pandas
- Series
- Accessing Pandas elements
- Creating DataFrames
Discuss your final project.

File Management: How do we read files from our computer into our Python Environment?

Connection management functions:
- open(), iterate over, and close()
Reading/writing files
Using the with statement to manage connections.

Summary of file management

open(): opens a connection to a file on our system.
- open() returns a file object; in text mode its type is *_io.TextIOWrapper*
- This object is an iterator. We need to iterate over it to convert its contents into a Python object.
close(): closes the connection.
write(): writes to a file on your system, also line by line. It is a method of the file object returned by open().
with: a statement that wraps open and close and lets you give the connection an alias (as).
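
A minimal sketch of the workflow summarized above: write a file, read it back by hand, then read it again using with. The file name example.csv and its contents are made up for this illustration and are not part of the course data.

```python
import csv
import os
import tempfile

# Hypothetical file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# Writing: open a connection in write mode, write line by line, then close it.
file = open(path, mode="wt")
file.write("country,lifeExp\n")
file.write("Brazil,62.239\n")
file.close()

# Reading by hand: open, iterate over the connection, close.
file = open(path, mode="rt")
lines = [line.strip() for line in file]
file.close()

# Reading with `with`: the connection is closed automatically
# when the indented block ends.
with open(path, mode="rt") as file:
    rows = [row for row in csv.reader(file)]

print(lines)
print(rows)
print(file.closed)
```

The with version is usually preferred because the connection is closed even if an error happens while reading.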

Example:

# Load libraries
import csv # convert a .csv to a nested list
import os  # library for managing our operating system. 

# Read in the gapminder data 
with open("gapminder.csv", mode="rt") as file:
    data = [row for row in csv.reader(file)]

TLDR:

Most often we will use high-level functions from Pandas to load data into Python objects.
Why are we learning these tools then?
- Very pythonic ~ you will see them in other people’s code
- No direct equivalent in R or Stata
- Important when working with non-tabular data - text, json, images, etc.
Reading: Check Section 3.3 of Python for Data Analysis to learn more about the topics covered in the notebook.
Notebook for File Management

From Nested Data to Dataframes: Motivating Numpy

Motivating Numpy

So far, all the data structures we saw are geared towards unidimensional data.

string: sequence of characters
list: sequence of heterogeneous elements
dictionaries: key-value combinations.

Tabular data

Nested Lists

# Read in the gapminder data 
import csv
with open("../lecture_notes/week-04/gapminder.csv",mode="rt") as file:
    data = [row for row in csv.reader(file)]

Quiz

Look at this tabular data organized as a nested list. What is “wrong” here?

# let's see the data
print(data)

[['country', 'lifeExp', 'gdpPercap'], ['Guinea_Bissau', '39.21', '652.157'], ['Bolivia', '52.505', '2961.229'], ['Austria', '73.103', '20411.916'], ['Malawi', '43.352', '575.447'], ['Finland', '72.992', '17473.723'], ['North_Korea', '63.607', '2591.853'], ['Malaysia', '64.28', '5406.038'], ['Hungary', '69.393', '10888.176'], ['Congo', '52.502', '3312.788'], ['Morocco', '57.609', '2447.909'], ['Germany', '73.444', '20556.684'], ['Ecuador', '62.817', '5733.625'], ['Kuwait', '68.922', '65332.91'], ['New_Zealand', '73.989', '17262.623'], ['Mauritania', '52.302', '1356.671'], ['Uganda', '47.619', '810.384'], ['Equatorial Guinea', '42.96', '2469.167'], ['Croatia', '70.056', '9331.712'], ['Indonesia', '54.336', '1741.365'], ['Canada', '74.903', '22410.746'], ['Comoros', '52.382', '1314.38'], ['Montenegro', '70.299', '7208.065'], ['Slovenia', '71.601', '14074.582'], ['Trinidad and Tobago', '66.828', '7866.872'], ['Poland', '70.177', '8416.554'], ['Lesotho', '50.007', '780.553'], ['Italy', '74.014', '16245.209'], ['Tunisia', '60.721', '3477.21'], ['Kenya', '52.681', '1200.416'], ['Gambia', '44.401', '680.133'], ['Bosnia and Herzegovina', '67.708', '3484.779'], ['Libya', '59.304', '12013.579'], ['Greece', '73.733', '13969.037'], ['Ghana', '52.341', '1044.582'], ['Peru', '58.859', '5613.844'], ['Turkey', '59.696', '4469.453'], ['Reunion', '66.644', '4898.398'], ['Sri_Lanka', '66.526', '1854.731'], ['Cambodia', '47.903', '675.368'], ['Bulgaria', '69.744', '6384.055'], ['Lebanon', '65.866', '7269.216'], ['Togo', '51.499', '1153.82'], ['Yemen', '46.78', '1569.275'], ['Jamaica', '68.749', '6197.645'], ['Swaziland', '49.002', '3163.352'], ['Chile', '67.431', '6703.289'], ['Israel', '73.646', '14160.936'], ['Algeria', '59.03', '4426.026'], ['Czech_Republic', '71.511', '13920.011'], ['Djibouti', '46.381', '2697.833'], ['Singapore', '71.22', '17425.382'], ['Nigeria', '43.581', '1488.309'], ['Bangladesh', '49.834', '817.559'], ['DRC', '44.544', '648.343'], ['Cuba', '71.045', 
'6283.259'], ['Namibia', '53.491', '3675.582'], ['Sudan', '48.401', '1835.01'], ['Syria', '61.346', '3009.288'], ['Rwanda', '41.482', '675.669'], ['Puerto Rico', '72.739', '10863.164'], ['Albania', '68.433', '3255.367'], ['Vietnam', '57.48', '1017.713'], ['Mozambique', '40.38', '542.278'], ['Mali', '43.413', '673.093'], ['Saudi Arabia', '58.679', '20261.744'], ['Liberia', '42.476', '604.814'], ['Madagascar', '47.771', '1335.595'], ['Chad', '46.774', '1165.454'], ['Gabon', '51.221', '11529.865'], ['Mauritius', '64.953', '4768.942'], ['Zambia', '45.996', '1358.199'], ['Romania', '68.291', '7300.17'], ['Dominican Republic', '61.554', '2844.856'], ['Egypt', '56.243', '3074.031'], ['Senegal', '50.626', '1533.122'], ['Oman', '58.443', '12138.562'], ['Zimbabwe', '52.663', '635.858'], ['Botswana', '54.598', '5031.504'], ["Cote d'Ivoire", '48.436', '1912.825'], ['Afghanistan', '37.479', '802.675'], ['Mexico', '65.409', '7724.113'], ['Sao Tome and Principe', '57.896', '1382.782'], ['Myanmar', '53.322', '439.333'], ['Switzerland', '75.565', '27074.334'], ['United Kingdom', '73.923', '19380.473'], ['Japan', '74.827', '17750.87'], ['El Salvador', '59.633', '4431.847'], ['India', '53.166', '1057.296'], ['Thailand', '62.2', '3045.966'], ['Bahrain', '65.606', '18077.664'], ['Australia', '74.663', '19980.596'], ['Mongolia', '55.89', '1692.805'], ['Nepal', '48.986', '782.729'], ['Iran', '58.637', '7376.583'], ['Honduras', '57.921', '2834.413'], ['Guinea', '43.24', '776.067'], ['Venezuela', '66.581', '10088.516'], ['Iceland', '76.511', '20531.422'], ['Somalia', '40.989', '1140.793'], ['Burundi', '44.817', '471.663'], ['Panama', '67.802', '5754.827'], ['Costa Rica', '70.181', '5448.611'], ['Philippines', '60.967', '2174.771'], ['Denmark', '74.37', '21671.825'], ['Benin', '48.78', '1155.395'], ['Eritrea', '45.999', '541.003'], ['Belgium', '73.642', '19900.758'], ['West Bank and Gaza', '60.329', '3759.997'], ['South_Korea', '65.001', '8217.318'], ['Ethiopia', '44.476', '509.115'], 
['Guatemala', '56.729', '4015.403'], ['Colombia', '63.898', '4195.343'], ['Cameroon', '48.129', '1774.634'], ['United States', '73.478', '26261.151'], ['Pakistan', '54.882', '1439.271'], ['China', '61.785', '1488.308'], ['Sierra Leone', '36.769', '1072.819'], ['Slovak Republic', '70.696', '10415.531'], ['Tanzania', '47.912', '849.281'], ['Paraguay', '66.809', '3239.607'], ['Argentina', '69.06', '8955.554'], ['Spain', '74.203', '14029.826'], ['Netherlands', '75.648', '21748.852'], ['France', '74.349', '18833.57'], ['Niger', '44.559', '781.077'], ['Central African Republic', '43.867', '958.785'], ['Serbia', '68.551', '9305.049'], ['Iraq', '56.582', '7811.809'], ['Uruguay', '70.782', '7100.133'], ['Angola', '37.883', '3607.101'], ['Sweden', '76.177', '19943.126'], ['Nicaragua', '58.349', '3424.656'], ['South Africa', '53.993', '7247.431'], ['Burkina Faso', '44.694', '843.991'], ['Haiti', '50.165', '1620.739'], ['Norway', '75.843', '26747.307'], ['Taiwan', '70.337', '10224.807'], ['Portugal', '70.42', '11354.092'], ['Jordan', '59.786', '3128.121'], ['Ireland', '73.017', '15758.606'], ['Brazil', '62.239', '5829.317']]

Here comes numpy …

Python has no native data structure to work with tabular data (!!!!).
Numpy:
- Introduces arrays (numerical matrices) to the Python world.
- Optimizes for mathematical operations with matrices.
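
A small sketch of the difference, using three rows copied from the nested list printed above. The variable names are illustrative only.

```python
import numpy as np

# A short excerpt of the nested list (header plus three rows).
data = [["country", "lifeExp", "gdpPercap"],
        ["Bolivia", "52.505", "2961.229"],
        ["Austria", "73.103", "20411.916"],
        ["Malawi", "43.352", "575.447"]]

# Nested list: getting one column takes a loop, skipping the header,
# and converting each string to a number by hand.
life_exp_list = [float(row[1]) for row in data[1:]]

# Numpy: store the numeric columns in a 2D array and slice by column.
values = np.array([row[1:] for row in data[1:]], dtype=float)
life_exp = values[:, 0]

print(values.shape)
print(life_exp.mean())
```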

source: Python Programming for Data Science

Why should you learn Numpy? Holds Python together!

Efficiency

Numpy leans toward less flexibility and more efficiency.
Lists give you more flexibility and less efficiency.
Allows for easy vectorization of functions
Broadcasting for working with arrays with different dimensions
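
A minimal sketch of vectorization and broadcasting, using a few values from the gapminder excerpt. This is one way to illustrate the two ideas, not the only one.

```python
import numpy as np

gdp = np.array([2961.229, 20411.916, 575.447])

# Vectorization: one expression is applied to every element, no explicit loop.
log_gdp = np.log(gdp)

# Broadcasting: subtract a (2,) array of column means from a (3, 2) array.
# Numpy stretches the smaller array across the rows of the larger one.
values = np.array([[52.505, 2961.229],
                   [73.103, 20411.916],
                   [43.352, 575.447]])
centered = values - values.mean(axis=0)

print(log_gdp.shape)
print(centered.mean(axis=0))  # approximately zero in each column
```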

Materials

Coding:

Pandas

Motivation

Numpy offers great flexibility and efficiency when dealing with data matrices.
Really efficient for mathematical operations.
Pretty bad for data wrangling tasks ~> a numpy array holds a single data type
The pandas package was designed to solve this limitation by providing data structures to deal with rectangular & heterogeneous data.
Main Data Structures: pd.Series and pd.DataFrame
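
A short sketch of the limitation: when strings and numbers are mixed in one numpy array, everything is coerced to a single type, while a pandas DataFrame keeps one type per column. The two rows are taken from the gapminder excerpt.

```python
import numpy as np
import pandas as pd

# Mixing strings and numbers in one array coerces everything to strings.
arr = np.array([["Bolivia", 52.505], ["Austria", 73.103]])
print(arr.dtype.kind)  # "U" means unicode strings

# A DataFrame keeps one data type per column.
df = pd.DataFrame({"country": ["Bolivia", "Austria"],
                   "lifeExp": [52.505, 73.103]})
print(df.dtypes)
```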

Pandas Series

A pandas series is a one-dimensional labeled array.

Capable of holding different data types (e.g. integer, boolean, strings, etc.).
It holds two key components:
- index: names in the axis
- values: values in the series
A pandas series is nothing but a column in an Excel sheet or an R data.frame (with an index)
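
A minimal sketch of the two components, with illustrative values from the gapminder excerpt:

```python
import pandas as pd

s = pd.Series([52.505, 73.103, 43.352],
              index=["Bolivia", "Austria", "Malawi"])

print(s.index)       # the labels
print(s.values)      # the underlying numpy array
print(s["Austria"])  # access a value by its label
```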

Pandas Series Constructor

import pandas as pd

s = pd.Series(["Argentina", "France", "Germany","Spain", "Italy", "Brazil"],
                 index=[2022, 2018, 2014, 2010, 2006, 2002])
print(s)

2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object

You can pass any of these to the constructor:
- list
- dictionaries
- scalar values
- ndarray
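
A quick sketch of each input type. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

from_list = pd.Series(["Argentina", "France", "Germany"])
from_dict = pd.Series({2022: "Argentina", 2018: "France", 2014: "Germany"})
from_scalar = pd.Series(0, index=["a", "b", "c"])
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

print(from_list.index.tolist())  # default positional index
print(from_dict.index.tolist())  # dictionary keys become the index
print(from_scalar.tolist())      # the scalar is repeated for each label
print(from_array.dtype)
```

Note that a scalar needs an explicit index to produce more than one element.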

Pandas DataFrames

A pandas DataFrame is a two dimensional, relational data structure with the capacity to handle heterogeneous data types.

relational: each column value contained within a row entry corresponds with the same observation.
two dimensional: a matrix data structure
heterogeneous: different data types can be contained across each column series.

Constructor

import pandas as pd
# create a simple series
series = pd.Series(["Argentina", "France", "Germany","Spain", "Italy", "Brazil"],
                 index=[2022, 2018, 2014, 2010, 2006, 2002])
# create the dataframe
df = pd.DataFrame(series)
print(df)

              0
2022  Argentina
2018     France
2014    Germany
2010      Spain
2006      Italy
2002     Brazil

We will discuss:
- using lists of dictionaries to build dataframes row-wise
- using a dictionary of lists to build dataframes column-wise
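
A minimal sketch of both approaches, building the same small DataFrame twice from values in the gapminder excerpt:

```python
import pandas as pd

# Row-wise: a list of dictionaries, one dictionary per row.
rows = [{"country": "Bolivia", "lifeExp": 52.505},
        {"country": "Austria", "lifeExp": 73.103}]
df_rows = pd.DataFrame(rows)

# Column-wise: a dictionary of lists, one list per column.
cols = {"country": ["Bolivia", "Austria"],
        "lifeExp": [52.505, 73.103]}
df_cols = pd.DataFrame(cols)

# Both constructions give the same DataFrame.
print(df_rows.equals(df_cols))
```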

Pandas DataFrames vs R Dataframes

Important concepts:

Creating Dataframes
- row-wise: using a list of dictionaries
- column-wise: using a dictionary of lists
Indexing for accessing Data Frames in Python
- No implicit indexing: d[1, 2] will throw an error.
- .iloc[] = use the numerical index position to call to locations in the DataFrame.
- .loc[] = use the labels to call to the location in the data frame.
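
A short sketch of the three cases. The host column is added here for illustration and does not appear in the earlier examples.

```python
import pandas as pd

df = pd.DataFrame({"winner": ["Argentina", "France", "Germany"],
                   "host": ["Qatar", "Russia", "Brazil"]},
                  index=[2022, 2018, 2014])

by_position = df.iloc[1, 0]        # second row, first column
by_label = df.loc[2018, "winner"]  # row labeled 2018, column "winner"

# A tuple inside plain brackets is treated as a column key, so this fails.
try:
    df[1, 0]
except KeyError:
    print("df[1, 0] raises KeyError")

print(by_position, by_label)
```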

Notebook for Pandas

Final Project

What is it? A data science project, applying concepts learned throughout the course.
Involves collecting data, cleaning and analyzing it, and presenting your findings
The project is composed of three parts:
- a 2-page project proposal (which should be discussed and approved by me),
- an in-class presentation,
- a 10-page project report.

Deadlines and Logistics

Requirement	Due	Length	Percentage
Project Proposal	November 15	2 pages	5%
Presentation	December 10	10-15 minutes	10%
Project Report	December 17	10 pages	25%

Groups of three students. You pick your groups.
Before November 8, you should have met with me to discuss your proposal.
At least one hour before our meeting, send me a draft of your proposal.
Send me an email with your group and when you are going to my office hours.
Our meeting should be before October 26

More information: https://tiagoventura.github.io/ppol5203/finalproject.html

In-class Exercise: your homework.