In this notebook we cover pandas objects: the Series and the DataFrame.
import pandas as pd
import numpy as np
pandas
Objects¶Recall that numpy
offers great flexibility and efficiency when dealing with data matrices (compared to manipulating data represented as nested lists). However, as we saw, numpy
is limited in its capacity to deal with heterogeneous data types in a single data matrix. This is a real limitation given the nature of most social science datasets.
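A quick sketch of that limitation: when a numpy array mixes types, every element is upcast to a single common dtype, so numeric operations are lost.

```python
import numpy as np

# mixing types in a numpy array forces everything into one common dtype
# (here, a unicode string dtype)
arr = np.array([1, 2, "three"])
print(arr.dtype)         # a unicode string dtype
print(arr[0] + arr[1])   # "12" -- the numbers were cast to strings
```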
The pandas
package was designed to deal with this issue. Built on top of the numpy
library, Pandas retains and expands upon numpy
's functionality.
The fundamental data constructs in pandas
are:
pd.Series
pd.DataFrame
pd.Series
¶A pandas
series is a one-dimensional labeled array.
It is capable of holding different data types (e.g. integer, boolean, strings, etc.).
The axis in a series works as an "index" --- similar to a list or numpy
array --- however, we can use other data types to serve as the index, which allows for some powerful ways of manipulating the array.
At its core, a Pandas
Series
is nothing but a column in an Excel sheet or an R
data.frame
.
To construct a pandas Series
, we can use the pd.Series()
constructor.
Notice: The input to create a pandas series is a list or an array.
If you are migrating from R
, this is a bit of a difference, since in R you can use a simple vector to create a dataframe column. Python does not have vectors as R
does, only lists and arrays. These are the basic inputs to create a pd.Series().
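To illustrate, both inputs give the same result (a minimal sketch with made-up values):

```python
import numpy as np
import pandas as pd

# both a plain list and a numpy array are valid inputs to pd.Series
s_from_list = pd.Series([1, 2, 3])
s_from_array = pd.Series(np.array([1, 2, 3]))
print(s_from_list.equals(s_from_array))  # True: same values, same implicit index
```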
import pandas as pd
s = pd.Series(["Argentina", "France", "Germany","Spain", "Italy", "Brazil"],
index=[2022, 2018, 2014, 2010, 2006, 2002])
type(s)
print(s)
# notice a series can hold mixed data types, and without an explicit index it gets an implicit 0-based integer index
s_no_index = pd.Series(["Argentina", True, "Germany","Spain", "Italy", "Brazil"])
s_no_index
The Series
combines a sequence of values with an explicit sequence of index labels.
Series
index¶Indexing with a Pandas series comes in three flavors: the explicit index, the implicit positional index, and boolean masking.
# index
print(s.index)
# implicit index
print(s_no_index.index)
s
# explicit index
s[2002]
# implicit position
s[:2]
# masking with a boolean
arr_mask = s.index>2016
s[s.index>2016]
# with an or condition -- which requires an extra set of parentheses
s[((s.index==2018) | (s.index==2022))]
s_no_index = pd.Series(["Argentina", True, "Germany","Spain", "Italy", "Brazil"])
s_no_index
s_no_index[:2]
Series
values¶s.values
# it is a numpy array
print(type(s.values))
print(type(s_no_index.values))
DataFrame
¶A pandas
DataFrame
is a two-dimensional, relational data structure with the capacity to handle heterogeneous data types.
Put simply, a DataFrame
is a collection of pandas series where each index position corresponds to the same observation.
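To make that concrete (a minimal sketch with made-up data): pulling a column out of a DataFrame gives back a Series that shares the DataFrame's row index.

```python
import pandas as pd

# toy frame to show that each column is itself a pd.Series
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
col = df["x"]
print(type(col))                    # pandas Series
print(col.index.equals(df.index))   # True: the column shares the row index
```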
dict()
¶As input, we need to feed in a dictionary, where the keys are the column names and the values are the relational data input.
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c','d','e','f']}
my_dict
pd.DataFrame(my_dict)
Data must be relational. If the dimensions do not align, an error will be thrown.
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c']}
pd.DataFrame(my_dict)
When constructing a DataFrame from scratch, using the dict constructor can help ease typing.
pd.DataFrame(dict(A = [1,2,3],
B = ['a','b','c']))
list()
¶Likewise, we can simply input a list, and the DataFrame
will apply a 0-based integer index (for both rows and columns) by default.
my_list = [4,4,5,6,7]
pd.DataFrame(my_list)
The same holds if we feed in a nested list structure.
nested_list = np.random.randint(1,10,25).reshape(5,5).tolist()
nested_list
pd.DataFrame(nested_list)
To overwrite the default indexing protocol, we can provide a list of column names to correspond to each column index position.
col_names = [f'Var{i}' for i in range(1,6)]
col_names
D = pd.DataFrame(nested_list,
columns=col_names)
D
pd.Series()
¶And of course, you can use a series to construct a dataframe
# create a series
area = pd.Series({'California': 423967,
'Texas': 695662,
'Florida': 170312,
'New York': 141297,
'Pennsylvania': 119280})
# see areas
print(area)
type(area)
pd.DataFrame(area, columns=["area"])
# transpose the dataframe (swap rows and columns)
pd.DataFrame(area, columns=["area"]).T
In real cases, your data will hardly come from ONE list or ONE dictionary.
Imagine you are scraping data from the web, and at every iteration you want to add a new row to your dataframe. You have nested data, in which your columns repeat over every iteration.
For these cases, there are in general two abstract approaches to go from nested data (lists or dictionaries) to a Pandas DataFrame
.
In this approach, your data goes from a dictionary of lists to a DataFrame
.
Your input is organized by columns!
# create a dictionary of lists
dict_ = {"names":["Darrow", "Adrius", "Sevro", "Virginia", "Victra"],
"nickname":["The Reaper", "Jakal", "Goblin", "Mustang", "NaN"],
"house":["Mars", "Venus", "Mars", "Mercury", "Jupiter"],
"color": ["Red", "Gold", "Gold", "Gold", "Gold"],
"when_they_died":["alive", "mutant", "alive", "alive", "alive"],
"gender": ["M", "M", "M", "F", "F"]}
# create a dataframe
pd.DataFrame(dict_)
In this approach, your data goes from a list of dictionaries to a DataFrame.
Your input is organized by observations!
# create a list of dictionaries
list_ = [{"names":"Darrow", "nickname":"The Reaper", "house":"Mars", "color":"red"}, # obs 1
{"names":"Adrius", "nickname":"Jakal", "house":"Venus", "color":"gold"}, # obs 2
{"names":"Sevro", "nickname":"Goblin", "house":"Mars", "color":"gold"}, # obs 3
{"names":"Virginia", "nickname":"Mustang", "house":"Mercury", "color":"gold"}, # obs 4
{"names":"Victra", "nickname":"NaN", "house":"Jupiter", "color":"gold"} # obs 5
]
# this looks like a json, which is a very common way to store large datasets!
list_
# create a dataframe
D = pd.DataFrame(list_)
D
How would this work in practice? Imagine, for example, a case in which you are scraping multiple websites. You would write a loop, and at each iteration you would add a new element to your list, using your dictionary keys to save the tags. For example:
# create a container
empty_list =[]
# write a for loop
for l in list_:
# grab your entries
unique_entry = {}
unique_entry["names"]=l["names"]
unique_entry["nickname"]=l["nickname"]
unique_entry["house"]=l["house"]
unique_entry["color"]=l["color"]
# append to your list
empty_list.append(unique_entry)
# convert to a dataframe
#empty_list
pd.DataFrame(empty_list)
The same building approach can be done with a list of lists + a list of names.
# list of lists + list of names
list_names= ["names","nickname", "house", "color"]
list_of_lists = [
["Darrow","The Reaper", "Mars", "red"], # obs 1
["Adrius", "Jakal", "Venus","gold"], # obs 2
["Sevro", "Goblin", "Mars", "gold"], # obs 3
["Virginia", "Mustang", "Mercury", "gold"], # obs 3
["Victra", "NaN","Jupiter", "gold"]
]
# create a dataframe
pd.DataFrame(list_of_lists, columns=list_names)
Keep these approaches in mind. The conversion of nested data to DataFrames
will be useful when you are collecting your own data, particularly when using web scraping and loops.
DataFrame
index¶Important: This is where most people coming from R get confused with Pandas in Python.
Unlike with a numpy
array, we cannot simply call a row index position (implicit index). When doing this, Pandas looks for columns as index.
**No implicit index with DataFrames!!!**
D[1,2]
This is because the internal index to a DataFrame
refers to the column index. This might seem odd at first, but if we think back to the behavior of Python dictionaries (which a DataFrame fundamentally is under the hood), we'll recall that the key is the default indexing feature (the immutable keys provide for efficient lookups in the dictionary object).
D['names']
Always remember that there are 2 indices in a DataFrame
that we must keep track of: the row index
and the column
index.
D
# Row index
D.index
# column index
D.columns
Memorize this for Pandas:
indexing refers to columns,
slicing refers to rows:
# index
D["color"]
# multiple columns
D[["names", "house"]]
# slice
D[0:2]
# slice with a modified index
D_index_name = D.set_index("names")
D_index_name["Darrow":"Sevro"]
We can use boolean masks as in series to access elements of the DataFrames
D[D["house"]=="Mars"]
To access the indices in a DataFrame
, we are better off relying on two built-in methods:
.iloc[]
= use the numerical index position to call locations in the DataFrame
. (The i
is short for index
.)
.loc[]
= use the labels to call the location in the data frame.
# numerical position
D.iloc[:3,0:2] # D.iloc[row position,column position]
# index labels
D.loc[:3,['names','house']]
A few things to note about .loc[]
: .loc[]
treats the index as a labeled feature rather than a numerical one.
D.loc[:,'names':'house']
# boolean masks also work here
D.loc[D["house"]=="Mars", :]
We can redefine the row indices.
dict_
D2 = pd.DataFrame(dict_,
index=["d","a", "s", "v", "v2"])
D2
# We can use the named indices for look up (and as with numpy, column rearrangement).
D2.loc[["d","a","v"],"names":"house"]
# notice, we cannot use the numbers with loc anymore
D2.loc[1:2,"names":"house"]
Yes... you can also use .loc and .iloc with a Series.
# get our old series
s
# iloc
s.iloc[0]
# loc
s.loc[2022]
We can redefine the index as a way to keep our unit of observation consistent, clean, and easy to use.
dat = D.set_index('names')
dat
dat.loc['Darrow',:]
Reverting the index back to its original 0-based form is straightforward with the .reset_index()
method.
dat = dat.reset_index()
dat
We can also set a hierarchical (multi-level) index using more than one column.
dat = D.set_index(keys=['names', 'house'])
dat
We can see that the index is composed of two levels.
dat.index
Under the hood, the hierarchical indices are actually tuples.
dat.loc[("Darrow","Mars"),:]
We can use boolean lookups on the level values
dat.loc[dat.index.get_level_values('house')=="Jupiter",:]
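An alternative to the boolean lookup above is the .xs() method, which selects every row matching one value of a named level. A sketch with a hypothetical frame mirroring the example:

```python
import pandas as pd

# hypothetical two-level index mirroring the names/house example
idx = pd.MultiIndex.from_tuples([("Darrow", "Mars"), ("Victra", "Jupiter")],
                                names=["names", "house"])
df = pd.DataFrame({"color": ["red", "gold"]}, index=idx)

# .xs() pulls all rows where the "house" level equals "Jupiter"
print(df.xs("Jupiter", level="house"))
```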
Finally, we can easily sort and order the index.
dat.sort_index()
As before, if we wish to revert the index back to a 0-based integer, we can with .reset_index()
# inplace to save in the same object
dat.reset_index(inplace=True)
dat
As seen, pandas
can keep track of column features using the column index. We can access the column index at any time via the .columns
attribute.
dat.columns
Or we can simply recover the column names using the list()
constructor (recall that a DataFrame
behaves like a dict
)
list(dat)
Overwriting column names: below let's set all of the columns to upper case. Note that we can invoke a .str
method that gives access to all of the string data type methods.
dat.columns = dat.columns.str.upper()
dat.columns
dat
But note that the column index is not mutable. Recall that values are mutable in a dictionary, but the keys are not.
dat.columns[dat.columns == "HOUSE"] = "house"
We either have to replace all the keys (as we do above), or use the .rename()
method to rename a specific data feature by passing it a dict
with the new renaming convention.
dat.rename(columns = {'old_name':'new_name'})
dat.rename(columns={"NAMES":"name"},
inplace=True) # Makes the change in-place rather than making a copy
dat.columns
Similar to row indices, we can generate hierarchical column indices as well. (As we'll see this will be the default column index output when aggregating variables next time).
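A minimal sketch of what such a hierarchical column index looks like (the column names here are made up):

```python
import pandas as pd

# hierarchical column index built from (group, statistic) tuples
cols = pd.MultiIndex.from_tuples([("stats", "mean"), ("stats", "sd")])
df = pd.DataFrame([[1.0, 0.5], [2.0, 0.7]], columns=cols)

# the top level selects a whole group of columns at once
print(df["stats"])
```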