PPOL 5203 Data Science I: Foundations

Introduction to Pandas

Tiago Ventura


In this Notebook we cover

  • Pandas data objects and how to construct them.
  • Using indexes for Pandas data objects.
In [38]:
import pandas as pd
import numpy as np

pandas Objects

Recall that numpy offers great flexibility and efficiency when dealing with data matrices (compared to manipulating data represented as nested lists). However, as we saw, numpy is limited in its capacity to handle heterogeneous data types in a single data matrix. This is a serious limitation given the nature of most social science datasets.

The pandas package was designed to deal with this issue. Built on top of the numpy library, Pandas retains and expands upon numpy's functionality.
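To see the contrast concretely, here is a minimal sketch (the data is made up for illustration): mixing types in a numpy array coerces everything to one dtype, while pandas keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# numpy coerces mixed inputs to a single dtype (here, unicode strings)
arr = np.array([1, "Texas", True])
print(arr.dtype.kind)  # 'U' -- every value became a string

# pandas stores one dtype per column, so heterogeneous data survives intact
df = pd.DataFrame({"id": [1, 2], "state": ["Texas", "Ohio"], "voted": [True, False]})
print(df.dtypes)
```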

The fundamental data constructs in pandas are:

  • the pd.Series
  • the pd.DataFrame.

Series

  • A pandas series is a one-dimensional labeled array.

  • It is capable of holding different data types (e.g. integer, boolean, strings, etc.).

  • The axis in a series works as an "index" --- similar to a list or numpy array --- however, we can use other data types (e.g., years, strings) as the index, which allows for some powerful ways of manipulating the array.

At its core, a pandas Series is nothing but a column in an Excel sheet or an R data.frame.

Constructor

To construct a pandas Series, we can use the pd.Series() constructor.

Notice: The input to create a pandas series is a list or an array.

If you are migrating from R, this is a bit of a difference, since in R you can use simple vectors to create a dataframe column. Python does not have vectors as R does, only lists and arrays. These are the basic inputs to create a pd.Series().

In [39]:
import pandas as pd
s = pd.Series(["Argentina", "France", "Germany","Spain", "Italy", "Brazil"],
                 index=[2022, 2018, 2014, 2010, 2006, 2002])
type(s)
print(s)
2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object
In [40]:
# notice: a series holds a single dtype, so mixed inputs fall back to object,
# and without an explicit index it gets an implicit integer index
s_no_index = pd.Series(["Argentina", True, "Germany","Spain", "Italy", "Brazil"])
s_no_index
Out[40]:
0    Argentina
1         True
2      Germany
3        Spain
4        Italy
5       Brazil
dtype: object

Series Elements

The Series combines a sequence of

  • values with an explicit sequence of
  • indices, which we can access with the values and index attributes.

Series index

Indexing with pandas Series comes in three flavors:

  • explicit index
  • implicit (deprecated)
  • mask indexing
In [41]:
# index
print(s.index)

# implicit index
print(s_no_index.index)
Int64Index([2022, 2018, 2014, 2010, 2006, 2002], dtype='int64')
RangeIndex(start=0, stop=6, step=1)
In [42]:
s
Out[42]:
2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object
In [43]:
# explicit index
s[2002]
Out[43]:
'Brazil'
In [44]:
# implicit position
s[:2]
Out[44]:
2022    Argentina
2018       France
dtype: object
In [45]:
# masking with a boolean
arr_mask = s.index>2016
In [46]:
s[s.index>2016]
Out[46]:
2022    Argentina
2018       France
dtype: object
In [47]:
# with an "or" condition -- which requires an extra set of parentheses
s[((s.index==2018) | (s.index==2022))]
Out[47]:
2022    Argentina
2018       France
dtype: object
In [48]:
s_no_index = pd.Series(["Argentina", True, "Germany","Spain", "Italy", "Brazil"])
s_no_index
Out[48]:
0    Argentina
1         True
2      Germany
3        Spain
4        Italy
5       Brazil
dtype: object
In [49]:
s_no_index[:2]
Out[49]:
0    Argentina
1         True
dtype: object

Series values

In [50]:
s.values
Out[50]:
array(['Argentina', 'France', 'Germany', 'Spain', 'Italy', 'Brazil'],
      dtype=object)
In [51]:
# it is a numpy array
print(type(s.values))
print(type(s_no_index.values))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>

DataFrame

A pandas DataFrame is a two dimensional, relational data structure with the capacity to handle heterogeneous data types.

  • "relational" = each column value contained within a row entry corresponds with the same observation.
  • "two dimensional" = a matrix data structure (no $N$-dimensional arrays). The data construct can be accessed through row/column indices.
  • "heterogeneous" = different data types can be contained within each column series. This means, for example, string, integer, and boolean values can coexist in the same data structure and retain the specific properties of their data type class.

Put simply, a DataFrame is a collection of pandas series where each index position corresponds to the same observation.

**`Important:`** Each column in a pandas DataFrame is a Series
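We can verify this directly with a toy DataFrame (hypothetical data, just for illustration):

```python
import pandas as pd

# a small DataFrame with two columns
df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})

# pulling out one column gives back a pd.Series...
col = df["A"]
print(type(col))  # <class 'pandas.core.series.Series'>

# ...and that Series carries the DataFrame's row index with it
print(col.index.equals(df.index))  # True
```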

How to create a DataFrame?

Constructor

To create a pandas DataFrame, we call the pd.DataFrame() constructor.

Construction using dict()

As input, we need to feed in a dictionary, where the keys are the column names and the values are the relational data input.

In [52]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c','d','e','f']}
my_dict
Out[52]:
{'A': [1, 2, 3, 4, 5, 6],
 'B': [2, 3, 1, 0.3, 4, 1],
 'C': ['a', 'b', 'c', 'd', 'e', 'f']}
In [53]:
pd.DataFrame(my_dict)
Out[53]:
A B C
0 1 2.0 a
1 2 3.0 b
2 3 1.0 c
3 4 0.3 d
4 5 4.0 e
5 6 1.0 f

Data must be relational. If the dimensions do not align, an error will be thrown.

In [54]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c']}
pd.DataFrame(my_dict)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[54], line 2
      1 my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c']}
----> 2 pd.DataFrame(my_dict)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:664, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    658     mgr = self._init_mgr(
    659         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    660     )
    662 elif isinstance(data, dict):
    663     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 664     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    665 elif isinstance(data, ma.MaskedArray):
    666     import numpy.ma.mrecords as mrecords

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:118, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    115 if verify_integrity:
    116     # figure out the index, if necessary
    117     if index is None:
--> 118         index = _extract_index(arrays)
    119     else:
    120         index = ensure_index(index)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:666, in _extract_index(data)
    664 lengths = list(set(raw_lengths))
    665 if len(lengths) > 1:
--> 666     raise ValueError("All arrays must be of the same length")
    668 if have_dicts:
    669     raise ValueError(
    670         "Mixing dicts with non-Series may lead to ambiguous ordering."
    671     )

ValueError: All arrays must be of the same length

When constructing a DataFrame from scratch, using the dict constructor can help ease typing.

In [55]:
pd.DataFrame(dict(A = [1,2,3],
                  B = ['a','b','c']))
Out[55]:
A B
0 1 a
1 2 b
2 3 c

Construction using list()

Likewise, we can simply input a list, and the DataFrame will assign a 0-based integer index by default.

In [56]:
my_list = [4,4,5,6,7]
pd.DataFrame(my_list)
Out[56]:
0
0 4
1 4
2 5
3 6
4 7

The same holds if we feed in a nested list structure.

In [57]:
nested_list = np.random.randint(1,10,25).reshape(5,5).tolist()
nested_list
Out[57]:
[[3, 1, 2, 9, 6],
 [2, 4, 8, 4, 6],
 [1, 3, 5, 1, 9],
 [7, 8, 5, 9, 3],
 [9, 9, 8, 3, 4]]
In [58]:
pd.DataFrame(nested_list)
Out[58]:
0 1 2 3 4
0 3 1 2 9 6
1 2 4 8 4 6
2 1 3 5 1 9
3 7 8 5 9 3
4 9 9 8 3 4

To overwrite the default indexing protocol, we can provide a list of column names to correspond to each column index position.

In [59]:
col_names = [f'Var{i}' for i in range(1,6)]
col_names
Out[59]:
['Var1', 'Var2', 'Var3', 'Var4', 'Var5']
In [60]:
D = pd.DataFrame(nested_list,
                 columns=col_names)
D
Out[60]:
Var1 Var2 Var3 Var4 Var5
0 3 1 2 9 6
1 2 4 8 4 6
2 1 3 5 1 9
3 7 8 5 9 3
4 9 9 8 3 4

Construction using a pd.Series()

And of course, you can use a Series to construct a DataFrame.

In [61]:
# create a series
area = pd.Series({'California': 423967, 
                  'Texas': 695662,
                  'Florida': 170312, 
                  'New York': 141297,
                  'Pennsylvania': 119280})
# see areas
print(area)
type(area)
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64
Out[61]:
pandas.core.series.Series
In [62]:
pd.DataFrame(area, columns=["area"])
Out[62]:
area
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
In [63]:
# transpose your dataframe: the index becomes the columns
pd.DataFrame(area, columns=["area"]).T
Out[63]:
California Texas Florida New York Pennsylvania
area 423967 695662 170312 141297 119280

From Nested Data to DataFrames

In real cases, your data will hardly come from ONE list or ONE dictionary.

Imagine you are scraping data from the web, and at every iteration you want to add a new row to your dataframe. You have nested data, in which your columns repeat over every iteration.

For these cases, there are in general two abstract approaches to go from nested data (lists or dictionaries) to a pandas DataFrame.

  • Column-Wise: From a dictionary of lists
  • Row-Wise: From a list of dictionaries

Column-Wise

In this approach, your data goes from a dictionary of lists to a DataFrame.

Your input is organized by columns!

  • Keys -> Column names
  • Lists (values) -> Values for every observation i in your data.
In [64]:
# create a dictionary of lists
dict_ = {"names":["Darrow", "Adrius", "Sevro", "Virginia", "Victra"],
         "nickname":["The Reaper", "Jakal", "Goblin", "Mustang", "NaN"], 
         "house":["Mars", "Venus", "Mars", "Mercury", "Jupiter"], 
         "color": ["Red", "Gold", "Gold", "Gold", "Gold"], 
         "when_they_died":["alive", "mutant", "alive", "alive", "alive"], 
        "gender": ["M", "M", "M", "F", "F"]}

# create a dataframe
pd.DataFrame(dict_)
Out[64]:
names nickname house color when_they_died gender
0 Darrow The Reaper Mars Red alive M
1 Adrius Jakal Venus Gold mutant M
2 Sevro Goblin Mars Gold alive M
3 Virginia Mustang Mercury Gold alive F
4 Victra NaN Jupiter Gold alive F

Row-Wise

In this approach, your data goes from a list of dictionaries to a DataFrame.

Your input is organized by observations

  • Keys of each dictionary -> Column names
  • Values of each dictionary -> Value for observation i in column j of your data.
In [65]:
# create a list of dictionaries
list_ = [{"names":"Darrow", "nickname":"The Reaper", "house":"Mars", "color":"red"}, # obs 1
         
         {"names":"Adrius", "nickname":"Jakal", "house":"Venus", "color":"gold"}, # obs 2
         
         {"names":"Sevro", "nickname":"Goblin", "house":"Mars", "color":"gold"}, # obs 3
         
         {"names":"Virginia", "nickname":"Mustang", "house":"Mercury", "color":"gold"}, # obs 4
         
         {"names":"Victra", "nickname":"NaN", "house":"Jupiter", "color":"gold"} # obs 5
        ] 

# this looks like a json, which is a very common way to store large datasets!
list_
Out[65]:
[{'names': 'Darrow',
  'nickname': 'The Reaper',
  'house': 'Mars',
  'color': 'red'},
 {'names': 'Adrius', 'nickname': 'Jakal', 'house': 'Venus', 'color': 'gold'},
 {'names': 'Sevro', 'nickname': 'Goblin', 'house': 'Mars', 'color': 'gold'},
 {'names': 'Virginia',
  'nickname': 'Mustang',
  'house': 'Mercury',
  'color': 'gold'},
 {'names': 'Victra', 'nickname': 'NaN', 'house': 'Jupiter', 'color': 'gold'}]
In [66]:
# create a dataframe
D = pd.DataFrame(list_)
D
Out[66]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

How would this work in practice? Imagine, for example, a case in which you are scraping multiple websites. You would write a loop, and at each iteration you would add a new element to your list, using your dictionary keys to save the tags. For example:

In [67]:
# create a container
empty_list =[]

# write a for loop
for l in list_:
    # grab your entries
    unique_entry = {}
    unique_entry["names"]=l["names"]
    unique_entry["nickname"]=l["nickname"]
    unique_entry["house"]=l["house"]
    unique_entry["color"]=l["color"]
    # append to your list
    empty_list.append(unique_entry)
    

# convert to a dataframe
#empty_list
pd.DataFrame(empty_list)
    
Out[67]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

The same building approach can be done with a list of lists + a list of names.

In [68]:
# list of lists + list of names
list_names= ["names","nickname", "house", "color"]
list_of_lists = [
                 ["Darrow","The Reaper", "Mars", "red"], # obs 1
                 ["Adrius", "Jakal", "Venus","gold"], # obs 2
                 ["Sevro", "Goblin", "Mars", "gold"], # obs 3
                 ["Virginia", "Mustang", "Mercury", "gold"], # obs 4
                 ["Victra", "NaN","Jupiter", "gold"] # obs 5
                ]
In [69]:
# create a dataframe
pd.DataFrame(list_of_lists, columns=list_names)
Out[69]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

Keep these approaches in mind. The conversion of nested data to DataFrames will be useful when you are collecting your own data, particularly when using web scraping and loops.

DataFrame index

Important: This is where most people coming from R get confused with pandas in Python

Unlike with a numpy array, we cannot simply call a row index position (implicit index). When we do this, pandas looks for the value among the column names.

**No implicit indexing with DataFrames!!!**

In [70]:
D[1,2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: (1, 2)

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[70], line 1
----> 1 D[1,2]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: (1, 2)

Explanation:

This is because the internal index of a DataFrame refers to the column index. This might seem odd at first, but if we think back to the behavior of Python dictionaries (which a DataFrame fundamentally is under the hood), we'll recall that the key is the default indexing feature (the immutable keys provide for efficient lookups in the dictionary object).

In [71]:
D['names']
Out[71]:
0      Darrow
1      Adrius
2       Sevro
3    Virginia
4      Victra
Name: names, dtype: object

Dataframes have two indices

Always remember that there are 2 indices in a DataFrame that we must keep track of: the row index and the column index.

In [72]:
D
Out[72]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold
In [73]:
# Row index
D.index
Out[73]:
RangeIndex(start=0, stop=5, step=1)
In [74]:
# column index
D.columns
Out[74]:
Index(['names', 'nickname', 'house', 'color'], dtype='object')

Slice and Index in Pandas

Memorize this for Pandas:

  • indexing refers to columns,

  • slicing refers to rows:

In [75]:
# index
D["color"]
Out[75]:
0     red
1    gold
2    gold
3    gold
4    gold
Name: color, dtype: object
In [76]:
# multiple columns
D[["names", "house"]]
Out[76]:
names house
0 Darrow Mars
1 Adrius Venus
2 Sevro Mars
3 Virginia Mercury
4 Victra Jupiter
In [77]:
# slice
D[0:2]
Out[77]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
In [78]:
# slice with a modified, non-positional index
# ALSO WORKS WITH SLICE!!!
D_index_name = D.set_index("names")
D_index_name["Darrow":"Sevro"]
Out[78]:
nickname house color
names
Darrow The Reaper Mars red
Adrius Jakal Venus gold
Sevro Goblin Mars gold

Accessing with Booleans

We can use boolean masks, as with Series, to access elements of a DataFrame.

In [79]:
D[D["house"]=="Mars"]
Out[79]:
names nickname house color
0 Darrow The Reaper Mars red
2 Sevro Goblin Mars gold

Better methods for accessing elements: loc and iloc

To access the indices in a DataFrame, we are better off relying on two built-in methods:

  • .iloc[] = use the numerical index position to call locations in the DataFrame. (The i is short for integer.)
  • .loc[] = use the labels to call locations in the DataFrame.
In [ ]:
# numerical position
D.iloc[:3,0:2] # D.iloc[row position,column position]
In [ ]:
# index labels
D.loc[:3,['names','house']]

A few things to note about .loc[]

  • .loc[] calls all named index positions. Above we get back all the requested rows, rather than stopping one below the max value as a numerical range does. This is because .loc[] treats the index as a labeled feature rather than a numerical one, and label slices are inclusive of both endpoints.
  • selecting ranges from labeled indices works the same as with numerical indices. That is, we can make calls to all variables in between (see below).
In [ ]:
D.loc[:,'names':'house']
In [ ]:
# boolean masks also work here
D.loc[D["house"]=="Mars", :]

We can redefine the row indices.

In [ ]:
dict_
In [ ]:
D2 = pd.DataFrame(dict_,
                  index=["d","a", "s", "v", "v2"])
D2
In [ ]:
# We can use the named indices for look up (and as with numpy, column rearrangement).
D2.loc[["d","a","v"],"names":"house"]
In [ ]:
# notice, we cannot use the numbers with loc anymore
D2.loc[1:2,"names":"house"]

And Series?

Yes... you can also use .loc and .iloc with Series.

In [ ]:
# get our old series
s
In [ ]:
# iloc
s.iloc[0]
In [ ]:
# loc
s.loc[2022]

Setting the index as the unit of observation

We can redefine the index as a way to keep our unit of observation consistent, clean, and easy to use.

In [ ]:
dat = D.set_index('names')
dat
In [ ]:
dat.loc['Darrow',:]

Reverting the index back to its original 0-based form is straightforward with the .reset_index() method.

In [ ]:
dat = dat.reset_index()
dat

Hierarchical (multi-) index

In [ ]:
dat = D.set_index(keys=['names', 'house'])
dat

We can see that the index is composed of two levels.

In [ ]:
dat.index

Under the hood, the hierarchical indices are actually tuples.

In [ ]:
dat.loc[("Darrow","Mars"),:]

We can use boolean lookups on the level values

In [ ]:
dat.loc[dat.index.get_level_values('house')=="Jupiter",:]

Finally, we can easily sort and order the index.

In [ ]:
dat.sort_index()

As before, if we wish to revert the index back to a 0-based integer, we can with .reset_index()

In [ ]:
# inplace to save in the same object
dat.reset_index(inplace=True)
In [ ]:
dat

Column Indices ("column names")

As seen, pandas keeps track of column features using a column index. We can access the column index at any time using the .columns attribute.

In [ ]:
dat.columns

Or we can simply retrieve the column names by passing the DataFrame to the list() constructor (recall that a DataFrame is really a dict under the hood).

In [ ]:
list(dat)

Overwriting column names: below, let's set all of the column names to upper case. Note that we can invoke the .str accessor, which gives access to all of the string data type methods.

In [ ]:
dat.columns = dat.columns.str.upper()
dat.columns
In [ ]:
dat

But note that the column index is not mutable. Recall that values are mutable in a dictionary, but the keys are not.

In [ ]:
dat.columns[dat.columns == "HOUSE"] = "house"

We either have to replace all the keys (as we do above), or use the .rename() method to rename a specific data feature by passing it a dict with the new renaming convention.

data.rename(columns = {'old_name':'new_name'})
In [ ]:
dat.rename(columns={"NAMES":"name"},
             inplace=True) # Makes the change in-place rather than making a copy
dat.columns

Similar to row indices, we can generate hierarchical column indices as well. (As we'll see, this will be the default column index output when aggregating variables next time.)
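As a small preview (a sketch with made-up numbers), a hierarchical column index can be built from tuples of (group, variable) labels:

```python
import pandas as pd

# two-level column index: (group, variable)
cols = pd.MultiIndex.from_tuples([("scores", "mean"), ("scores", "sd"), ("info", "n")])
df = pd.DataFrame([[3.2, 0.5, 10], [2.8, 0.7, 12]], columns=cols)
print(df)

# indexing by the top level returns all columns nested under it
print(df["scores"])
```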

That's it for today! Next week we will see more and more of pandas DataFrames.

In [ ]:
!jupyter nbconvert _week_5a_intro_to_pandas.ipynb --to html --template classic