PPOL 5203 Data Science I: Foundations

Introduction to Pandas

Tiago Ventura


In this Notebook we cover

  • Pandas data objects and how to construct them.
  • Using indexes for Pandas data objects.
In [42]:
import pandas as pd
import numpy as np

pandas Objects

Recall that numpy offers great flexibility and efficiency when dealing with data matrices (compared to manipulating data represented as nested lists). However, as we saw, numpy is limited in its capacity to deal with heterogeneous data types in a single data matrix. This is a limitation given the nature of most social science datasets.

The pandas package was designed to deal with this issue. Built on top of the numpy library, Pandas retains and expands upon numpy's functionality.

The fundamental data constructs in pandas are:

  • the pd.Series
  • the pd.DataFrame.

Series

  • A pandas series is a one-dimensional labeled array.

  • It is capable of holding different data types (e.g. integer, boolean, strings, etc.).

  • The axis in a series works as an "index" --- similar to a list or numpy array --- however, we can use other data types as the index, which allows for some powerful ways of manipulating the array.

At its core, a Pandas Series is essentially a column in an Excel sheet or an R data.frame.

Constructor

To construct a pandas Series, we can use the pd.Series() constructor.

Notice: The input to create a pandas series is a list or an array.

If you are migrating from R, this is a bit of a difference, since in R you can use simple vectors to create a dataframe column. Python does not have vectors as R does, only lists and arrays. These are the basic inputs for creating a pd.Series().

In [44]:
import pandas as pd
s = pd.Series(["Argentina", "France", "Germany","Spain", "Italy", "Brazil"],
                 index=[2022, 2018, 2014, 2010, 2006, 2002])
type(s)
print(s)
2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object
In [45]:
# without an explicit index, the series gets implicit 0-based indexes; mixed types become dtype 'object'
s_no_index = pd.Series(["Argentina", True, "Germany","Spain", "Italy", "Brazil"])
s_no_index
Out[45]:
0    Argentina
1         True
2      Germany
3        Spain
4        Italy
5       Brazil
dtype: object

Series Elements

The Series combines a sequence of

  • values with an explicit sequence of
  • indices, which we can access with the values and index attributes.
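As a minimal sketch with made-up data, both attributes can be inspected directly:

```python
import pandas as pd

# a small illustrative series: values plus an explicit index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s.values)   # the underlying numpy array of values
print(s.index)    # the explicit index labels
```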

Series index

Indexing with a pandas series comes in three flavors:

  • explicit index
  • implicit positional index (deprecated)
  • mask indexing
In [75]:
# index
print(s.index)

# implicit index
print(s_no_index.index)
Int64Index([2022, 2018, 2014, 2010, 2006, 2002], dtype='int64')
RangeIndex(start=0, stop=6, step=1)
In [47]:
s
Out[47]:
2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object
In [52]:
# explicit index
s[2002]
Out[52]:
'Brazil'
In [64]:
# implicit position
s[:2]
Out[64]:
2022    Argentina
2018       France
dtype: object
In [71]:
# masking with a boolean
arr_mask = s.index>2016
In [74]:
s[s.index>2016]
Out[74]:
2022    Argentina
2018       France
dtype: object
In [9]:
# with an or condition -- which requires an extra set of parentheses
s[((s.index==2018) | (s.index==2022))]
Out[9]:
2022    Argentina
2018       France
dtype: object
In [77]:
s_no_index = pd.Series(["Argentina", True, "Germany","Spain", "Italy", "Brazil"])
s_no_index
Out[77]:
0    Argentina
1         True
2      Germany
3        Spain
4        Italy
5       Brazil
dtype: object
In [79]:
s_no_index[:2]
Out[79]:
0    Argentina
1         True
dtype: object

Series values

In [80]:
s.values
Out[80]:
array(['Argentina', 'France', 'Germany', 'Spain', 'Italy', 'Brazil'],
      dtype=object)
In [84]:
# it is a numpy array
print(type(s.values))
print(type(s_no_index.values))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>

DataFrame

A pandas DataFrame is a two dimensional, relational data structure with the capacity to handle heterogeneous data types.

  • "relational" = each column value contained within a row entry corresponds with the same observation.
  • "two dimensional" = a matrix data structure (no $N$-dimensional arrays). The data construct can be accessed through row/column indices.
  • "heterogeneous" = different data types can be contained within each column series. This means, for example, string, integer, and boolean values can coexist in the same data structure and retain the specific properties of their data type class.

Put simply, a DataFrame is a collection of pandas series where each index position corresponds to the same observation.

**`Important:`** Each column in a pandas DataFrame is a Series
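To make this concrete, here is a minimal sketch with made-up data showing that selecting a column returns a Series:

```python
import pandas as pd

# a small illustrative DataFrame
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# selecting a single column returns a pandas Series
col = df["A"]
print(type(col))
print(col.index.equals(df.index))  # the column shares the DataFrame's row index
```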

How to create a DataFrame?

Constructor

To create a pandas DataFrame, we call the pd.DataFrame() constructor.

Construction using dict()

As input, we need to feed in a dictionary, where the keys are the column names and the values are the relational data input.

In [86]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c','d','e','f']}
my_dict
Out[86]:
{'A': [1, 2, 3, 4, 5, 6],
 'B': [2, 3, 1, 0.3, 4, 1],
 'C': ['a', 'b', 'c', 'd', 'e', 'f']}
In [87]:
pd.DataFrame(my_dict)
Out[87]:
A B C
0 1 2.0 a
1 2 3.0 b
2 3 1.0 c
3 4 0.3 d
4 5 4.0 e
5 6 1.0 f

Data must be relational. If the dimensions do not align, an error will be thrown.

In [88]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c']}
pd.DataFrame(my_dict)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[88], line 2
      1 my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c']}
----> 2 pd.DataFrame(my_dict)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:664, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    658     mgr = self._init_mgr(
    659         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    660     )
    662 elif isinstance(data, dict):
    663     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 664     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    665 elif isinstance(data, ma.MaskedArray):
    666     import numpy.ma.mrecords as mrecords

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:118, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    115 if verify_integrity:
    116     # figure out the index, if necessary
    117     if index is None:
--> 118         index = _extract_index(arrays)
    119     else:
    120         index = ensure_index(index)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:666, in _extract_index(data)
    664 lengths = list(set(raw_lengths))
    665 if len(lengths) > 1:
--> 666     raise ValueError("All arrays must be of the same length")
    668 if have_dicts:
    669     raise ValueError(
    670         "Mixing dicts with non-Series may lead to ambiguous ordering."
    671     )

ValueError: All arrays must be of the same length

When constructing a DataFrame from scratch, using the dict constructor can help ease typing.

In [89]:
pd.DataFrame(dict(A = [1,2,3],
                  B = ['a','b','c']))
Out[89]:
A B
0 1 a
1 2 b
2 3 c

Construction using list()

Likewise, we can simply input a list, and the DataFrame will assign a 0-based integer index by default.

In [90]:
my_list = [4,4,5,6,7]
pd.DataFrame(my_list)
Out[90]:
0
0 4
1 4
2 5
3 6
4 7

The same holds if we feed in a nested list structure.

In [91]:
nested_list = np.random.randint(1,10,25).reshape(5,5).tolist()
nested_list
Out[91]:
[[8, 2, 8, 4, 8],
 [7, 5, 3, 9, 6],
 [3, 1, 8, 9, 3],
 [4, 8, 4, 9, 3],
 [6, 6, 2, 3, 9]]
In [92]:
pd.DataFrame(nested_list)
Out[92]:
0 1 2 3 4
0 8 2 8 4 8
1 7 5 3 9 6
2 3 1 8 9 3
3 4 8 4 9 3
4 6 6 2 3 9

To overwrite the default indexing protocol, we can provide a list of column names to correspond to each column index position.

In [94]:
col_names = [f'Var{i}' for i in range(1,6)]
col_names
Out[94]:
['Var1', 'Var2', 'Var3', 'Var4', 'Var5']
In [96]:
D = pd.DataFrame(nested_list,
                 columns=col_names)
D
Out[96]:
Var1 Var2 Var3 Var4 Var5
0 8 2 8 4 8
1 7 5 3 9 6
2 3 1 8 9 3
3 4 8 4 9 3
4 6 6 2 3 9

Construction using a pd.Series()

And of course, you can use a series to construct a dataframe.

In [97]:
# create a series
area = pd.Series({'California': 423967, 
                  'Texas': 695662,
                  'Florida': 170312, 
                  'New York': 141297,
                  'Pennsylvania': 119280})
# see areas
print(area)
type(area)
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64
Out[97]:
pandas.core.series.Series
In [98]:
pd.DataFrame(area, columns=["area"])
Out[98]:
area
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
In [99]:
# transpose the dataframe (rows become columns)
pd.DataFrame(area, columns=["area"]).T
Out[99]:
California Texas Florida New York Pennsylvania
area 423967 695662 170312 141297 119280

From Nested Data to DataFrames

In real cases, your data will hardly come from ONE list or ONE dictionary.

Imagine you are scraping data from the web, and at every iteration you want to add a new row to your dataframe. You have nested data, in which the same columns repeat at every iteration.

For these cases, there are, in general, two approaches to go from nested data (lists or dictionaries) to a Pandas DataFrame.

  • Column-Wise: From a dictionary of lists
  • Row-Wise: From a list of dictionaries

Column-Wise

In this approach, your data goes from a dictionary of lists to a DataFrame.

Your input is organized by columns!

  • Keys -> Columns Names
  • Lists (values) -> Values for every observation i in your data.
In [103]:
# create a dictionary of lists
dict_ = {"names":["Darrow", "Adrius", "Sevro", "Virginia", "Victra"],
         "nickname":["The Reaper", "Jakal", "Goblin", "Mustang", "NaN"], 
         "house":["Mars", "Venus", "Mars", "Mercury", "Jupiter"], 
         "color": ["Red", "Gold", "Gold", "Gold", "Gold"], 
         "when_they_died":["alive", "mutant", "alive", "alive", "alive"], 
        "gender": ["M", "M", "M", "F", "F"]}

# create a dataframe
pd.DataFrame(dict_)
Out[103]:
names nickname house color when_they_died gender
0 Darrow The Reaper Mars Red alive M
1 Adrius Jakal Venus Gold mutant M
2 Sevro Goblin Mars Gold alive M
3 Virginia Mustang Mercury Gold alive F
4 Victra NaN Jupiter Gold alive F

Row-Wise

In this approach, your data goes from a list of dictionaries to a DataFrame.

Your input is organized by observations

  • Keys of each dictionary -> Columns names
  • Values of each dictionary -> Values for every observation i in column j in your data.
In [106]:
# create a list of dictionaries
list_ = [{"names":"Darrow", "nickname":"The Reaper", "house":"Mars", "color":"red"}, # obs 1
         
         {"names":"Adrius", "nickname":"Jakal", "house":"Venus", "color":"gold"}, # obs 2
         
         {"names":"Sevro", "nickname":"Goblin", "house":"Mars", "color":"gold"}, # obs 3
         
         {"names":"Virginia", "nickname":"Mustang", "house":"Mercury", "color":"gold"}, # obs 4
         
         {"names":"Victra", "nickname":"NaN", "house":"Jupiter", "color":"gold"} # obs 5
        ] 

# this looks like a json, which is a very common way to store large datasets!
list_
Out[106]:
[{'names': 'Darrow',
  'nickname': 'The Reaper',
  'house': 'Mars',
  'color': 'red'},
 {'names': 'Adrius', 'nickname': 'Jakal', 'house': 'Venus', 'color': 'gold'},
 {'names': 'Sevro', 'nickname': 'Goblin', 'house': 'Mars', 'color': 'gold'},
 {'names': 'Virginia',
  'nickname': 'Mustang',
  'house': 'Mercury',
  'color': 'gold'},
 {'names': 'Victra', 'nickname': 'NaN', 'house': 'Jupiter', 'color': 'gold'}]
In [107]:
# create a dataframe
D = pd.DataFrame(list_)
D
Out[107]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

How would this work in practice? Imagine, for example, a case in which you are scraping multiple websites. You would write a loop, and at each iteration you would add a new element to your list, using your dictionary keys to save the tags. For example:

In [109]:
# create a container
empty_list =[]

# write a for loop
for l in list_:
    # grab your entries
    unique_entry = {}
    unique_entry["names"]=l["names"]
    unique_entry["nickname"]=l["nickname"]
    unique_entry["house"]=l["house"]
    unique_entry["color"]=l["color"]
    # append to your list
    empty_list.append(unique_entry)
    

# convert to a dataframe
#empty_list
pd.DataFrame(empty_list)
    
Out[109]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

The same building approach can be done with a list of lists + a list of names.

In [110]:
# list of lists + list of names
list_names= ["names","nickname", "house", "color"]
list_of_lists = [
                 ["Darrow","The Reaper", "Mars", "red"], # obs 1
                 ["Adrius", "Jakal", "Venus","gold"], # obs 2
                 ["Sevro", "Goblin", "Mars", "gold"], # obs 3
                 ["Virginia", "Mustang", "Mercury", "gold"], # obs 4
                 ["Victra", "NaN","Jupiter", "gold"] # obs 5
                ]
In [111]:
# create a dataframe
pd.DataFrame(list_of_lists, columns=list_names)
Out[111]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

Keep these approaches in mind. The conversion of nested data to DataFrames will be useful when you are collecting your own data, particularly when using web scraping and loops.

DataFrame index

Important: This is where most people coming from R get confused with Pandas in Python.

Unlike with a numpy array, we cannot simply call a row index position (implicit index). When we do, Pandas looks the key up in the column index.

**No implicit row index with DataFrames!!!**

In [112]:
D[1,2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: (1, 2)

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[112], line 1
----> 1 D[1,2]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: (1, 2)

Explanation:

This is because the default index of a DataFrame refers to the column index. This might seem odd at first, but if we think back to the behavior of Python dictionaries (which a DataFrame fundamentally is under the hood), we'll recall that keys are the default indexing feature (as the immutable keys provide for efficient lookups in the dictionary object).

In [113]:
D['names']
Out[113]:
0      Darrow
1      Adrius
2       Sevro
3    Virginia
4      Victra
Name: names, dtype: object

Dataframes have two indices

Always remember that there are 2 indices in a DataFrame that we must keep track of: the row index and the column index.

In [115]:
D
Out[115]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold
In [116]:
# Row index
D.index
Out[116]:
RangeIndex(start=0, stop=5, step=1)
In [117]:
# column index
D.columns
Out[117]:
Index(['names', 'nickname', 'house', 'color'], dtype='object')

Slice and Index in Pandas

Memorize this for Pandas:

  • indexing refers to columns,

  • slicing refers to rows:

In [120]:
# index
D["color"]
Out[120]:
0     red
1    gold
2    gold
3    gold
4    gold
Name: color, dtype: object
In [121]:
# multiple columns
D[["names", "house"]]
Out[121]:
names house
0 Darrow Mars
1 Adrius Venus
2 Sevro Mars
3 Virginia Mercury
4 Victra Jupiter
In [122]:
# slice
D[0:2]
Out[122]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
In [124]:
# slice with a modified index
D_index_name = D.set_index("names")
D_index_name["Darrow":"Sevro"]
Out[124]:
nickname house color
names
Darrow The Reaper Mars red
Adrius Jakal Venus gold
Sevro Goblin Mars gold

Accessing with a Boolean

We can use boolean masks, as with series, to access elements of a DataFrame.

In [125]:
D[D["house"]=="Mars"]
Out[125]:
names nickname house color
0 Darrow The Reaper Mars red
2 Sevro Goblin Mars gold

Better methods for accessing elements: loc and iloc

To access the indices in a DataFrame, we are better off relying on two built-in methods:

  • .iloc[] = use the numerical index position to call locations in the DataFrame. (The i is short for integer location.)
  • .loc[] = use the labels to call the location in the data frame.
In [127]:
# numerical position
D.iloc[:3,0:2] # D.iloc[row position,column position]
Out[127]:
names nickname
0 Darrow The Reaper
1 Adrius Jakal
2 Sevro Goblin
In [128]:
# index labels
D.loc[:3,['names','house']]
Out[128]:
names house
0 Darrow Mars
1 Adrius Venus
2 Sevro Mars
3 Virginia Mercury

A few things to note about .loc[]

  • calls all named index positions. Above we get back all of the requested rows, including the endpoint, rather than stopping one below the stop value as numerical slicing does. This is because .loc[] treats the index as a labeled feature rather than a numerical one.
  • selecting ranges from labeled indices works the same as with numerical indices. That is, we can make calls to all variables in between (see below).
In [130]:
D.loc[:,'names':'house']
Out[130]:
names nickname house
0 Darrow The Reaper Mars
1 Adrius Jakal Venus
2 Sevro Goblin Mars
3 Virginia Mustang Mercury
4 Victra NaN Jupiter
In [131]:
# boolean masks also work here
D.loc[D["house"]=="Mars", :]
Out[131]:
names nickname house color
0 Darrow The Reaper Mars red
2 Sevro Goblin Mars gold

We can redefine the row indices.

In [132]:
dict_
Out[132]:
{'names': ['Darrow', 'Adrius', 'Sevro', 'Virginia', 'Victra'],
 'nickname': ['The Reaper', 'Jakal', 'Goblin', 'Mustang', 'NaN'],
 'house': ['Mars', 'Venus', 'Mars', 'Mercury', 'Jupiter'],
 'color': ['Red', 'Gold', 'Gold', 'Gold', 'Gold'],
 'when_they_died': ['alive', 'mutant', 'alive', 'alive', 'alive'],
 'gender': ['M', 'M', 'M', 'F', 'F']}
In [134]:
D2 = pd.DataFrame(dict_,
                  index=["d","a", "s", "v", "v2"])
D2
Out[134]:
names nickname house color when_they_died gender
d Darrow The Reaper Mars Red alive M
a Adrius Jakal Venus Gold mutant M
s Sevro Goblin Mars Gold alive M
v Virginia Mustang Mercury Gold alive F
v2 Victra NaN Jupiter Gold alive F
In [135]:
# We can use the named indices for look up (and as with numpy, column rearrangement).
D2.loc[["d","a","v"],"names":"house"]
Out[135]:
names nickname house
d Darrow The Reaper Mars
a Adrius Jakal Venus
v Virginia Mustang Mercury
In [136]:
# notice, we cannot use the numbers with loc anymore
D2.loc[1:2,"names":"house"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[136], line 2
      1 # notice, we cannot use the numbers with loc anymore
----> 2 D2.loc[1:2,"names":"house"]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexing.py:1067, in _LocationIndexer.__getitem__(self, key)
   1065     if self._is_scalar_access(key):
   1066         return self.obj._get_value(*key, takeable=self._takeable)
-> 1067     return self._getitem_tuple(key)
   1068 else:
   1069     # we by definition only have the 0th axis
   1070     axis = self.axis or 0

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexing.py:1256, in _LocIndexer._getitem_tuple(self, tup)
   1253 if self._multi_take_opportunity(tup):
   1254     return self._multi_take(tup)
-> 1256 return self._getitem_tuple_same_dim(tup)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexing.py:924, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
    921 if com.is_null_slice(key):
    922     continue
--> 924 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
    925 # We should never have retval.ndim < self.ndim, as that should
    926 #  be handled by the _getitem_lowerdim call above.
    927 assert retval.ndim == self.ndim

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexing.py:1290, in _LocIndexer._getitem_axis(self, key, axis)
   1288 if isinstance(key, slice):
   1289     self._validate_key(key, axis)
-> 1290     return self._get_slice_axis(key, axis=axis)
   1291 elif com.is_bool_indexer(key):
   1292     return self._getbool_axis(key, axis=axis)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexing.py:1324, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1321     return obj.copy(deep=False)
   1323 labels = obj._get_axis(axis)
-> 1324 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1326 if isinstance(indexer, slice):
   1327     return self.obj._slice(indexer, axis=axis)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6559, in Index.slice_indexer(self, start, end, step, kind)
   6516 """
   6517 Compute the slice indexer for input labels and step.
   6518 
   (...)
   6555 slice(1, 3, None)
   6556 """
   6557 self._deprecated_arg(kind, "kind", "slice_indexer")
-> 6559 start_slice, end_slice = self.slice_locs(start, end, step=step)
   6561 # return a slice
   6562 if not is_scalar(start_slice):

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6767, in Index.slice_locs(self, start, end, step, kind)
   6765 start_slice = None
   6766 if start is not None:
-> 6767     start_slice = self.get_slice_bound(start, "left")
   6768 if start_slice is None:
   6769     start_slice = 0

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6676, in Index.get_slice_bound(self, label, side, kind)
   6672 original_label = label
   6674 # For datetime indices label may be a string that has to be converted
   6675 # to datetime boundary according to its resolution.
-> 6676 label = self._maybe_cast_slice_bound(label, side)
   6678 # we need to look up the label
   6679 try:

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6623, in Index._maybe_cast_slice_bound(self, label, side, kind)
   6618 # We are a plain index here (sub-class override this method if they
   6619 # wish to have special treatment for floats/ints, e.g. Float64Index and
   6620 # datetimelike Indexes
   6621 # reject them, if index does not contain label
   6622 if (is_float(label) or is_integer(label)) and label not in self:
-> 6623     raise self._invalid_indexer("slice", label)
   6625 return label

TypeError: cannot do slice indexing on Index with these indexers [1] of type int

And series?

Yes... you can also use loc and iloc with a series.

In [61]:
# get our old series
s
Out[61]:
2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object
In [64]:
# iloc
s.iloc[0]
Out[64]:
'Argentina'
In [65]:
# loc
s.loc[2022]
Out[65]:
'Argentina'

Setting the index as the unit of observation

We can redefine the index so that it tracks our unit of observation: consistent, clean, and easy to use.

In [137]:
dat = D.set_index('names')
dat
Out[137]:
nickname house color
names
Darrow The Reaper Mars red
Adrius Jakal Venus gold
Sevro Goblin Mars gold
Virginia Mustang Mercury gold
Victra NaN Jupiter gold
In [138]:
dat.loc['Darrow',:]
Out[138]:
nickname    The Reaper
house             Mars
color              red
Name: Darrow, dtype: object

Reverting the index back to its original 0-based form is straightforward with the .reset_index() method.

In [140]:
dat = dat.reset_index()
dat
Out[140]:
index names nickname house color
0 0 Darrow The Reaper Mars red
1 1 Adrius Jakal Venus gold
2 2 Sevro Goblin Mars gold
3 3 Virginia Mustang Mercury gold
4 4 Victra NaN Jupiter gold
In [ ]:
 

Hierarchical (multi-) index

In [141]:
dat = D.set_index(keys=['names', 'house'])
dat
Out[141]:
nickname color
names house
Darrow Mars The Reaper red
Adrius Venus Jakal gold
Sevro Mars Goblin gold
Virginia Mercury Mustang gold
Victra Jupiter NaN gold

We can see that the index is composed of two levels.

In [142]:
dat.index
Out[142]:
MultiIndex([(  'Darrow',    'Mars'),
            (  'Adrius',   'Venus'),
            (   'Sevro',    'Mars'),
            ('Virginia', 'Mercury'),
            (  'Victra', 'Jupiter')],
           names=['names', 'house'])

Under the hood, the hierarchical indices are actually tuples.

In [143]:
dat.loc[("Darrow","Mars"),:]
Out[143]:
nickname    The Reaper
color              red
Name: (Darrow, Mars), dtype: object

We can use boolean lookups on the level values.

In [144]:
dat.loc[dat.index.get_level_values('house')=="Jupiter",:]
Out[144]:
nickname color
names house
Victra Jupiter NaN gold

Finally, we can easily sort and order the index.

In [145]:
dat.sort_index()
Out[145]:
nickname color
names house
Adrius Venus Jakal gold
Darrow Mars The Reaper red
Sevro Mars Goblin gold
Victra Jupiter NaN gold
Virginia Mercury Mustang gold

As before, if we wish to revert the index back to a 0-based integer, we can do so with .reset_index()

In [146]:
# inplace to save in the same object
dat.reset_index(inplace=True)
In [71]:
dat
Out[71]:
names house nickname color
0 Darrow Mars The Reaper red
1 Adrius Venus Jakal gold
2 Sevro Mars Goblin gold
3 Virginia Mercury Mustang gold
4 Victra Jupiter NaN gold

Column Indices ("column names")

As seen, pandas keeps track of column features using a column index. We can access the column index at any time using the .columns attribute.

In [148]:
dat.columns
Out[148]:
Index(['names', 'house', 'nickname', 'color'], dtype='object')

Or we can simply retrieve the column names by passing the dataframe to the list() constructor (recall that a DataFrame is really a dict under the hood).

In [150]:
list(dat)
Out[150]:
['names', 'house', 'nickname', 'color']

Overwriting column names: below let's set all of the column names to upper case. Note that we can invoke the .str accessor, which gives access to all of the string data type methods.

In [151]:
dat.columns = dat.columns.str.upper()
dat.columns
Out[151]:
Index(['NAMES', 'HOUSE', 'NICKNAME', 'COLOR'], dtype='object')
In [152]:
dat
Out[152]:
NAMES HOUSE NICKNAME COLOR
0 Darrow Mars The Reaper red
1 Adrius Venus Jakal gold
2 Sevro Mars Goblin gold
3 Virginia Mercury Mustang gold
4 Victra Jupiter NaN gold

But note that the column index is not mutable. Recall that values are mutable in a dictionary, but keys are not.

In [153]:
dat.columns[dat.columns == "HOUSE"] = "house"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[153], line 1
----> 1 dat.columns[dat.columns == "HOUSE"] = "house"

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:5302, in Index.__setitem__(self, key, value)
   5300 @final
   5301 def __setitem__(self, key, value):
-> 5302     raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations

We either have to replace all the keys (as we do above), or use the .rename() method to rename a specific data feature by passing it a dict with the new renaming convention.

data.rename(columns = {'old_name':'new_name'})
In [81]:
dat.rename(columns={"NAMES":"name"},
             inplace=True) # Makes the change in-place rather than making a copy
dat.columns
Out[81]:
Index(['INDEX', 'name', 'HOUSE', 'NICKNAME', 'COLOR'], dtype='object')

Similar to row indices, we can generate hierarchical column indices as well. (As we'll see, this will be the default column index output when aggregating variables next time.)
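As a preview, here is a minimal sketch, with made-up data, of how a hierarchical column index can be constructed by hand:

```python
import pandas as pd

# build a two-level column index from tuples: (group, variable)
cols = pd.MultiIndex.from_tuples(
    [("scores", "math"), ("scores", "verbal"), ("info", "house")]
)
df = pd.DataFrame(
    [[700, 650, "Mars"], [720, 680, "Venus"]],
    columns=cols,
)

# selecting a top-level label returns a sub-DataFrame of its child columns
print(df["scores"])
```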

That's it for today! Next week we will see more and more of Pandas DataFrames

In [41]:
!jupyter nbconvert _week_5a_intro_to_pandas.ipynb --to html --template classic
[NbConvertApp] Converting notebook _week_5a_intro_to_pandas.ipynb to html
[NbConvertApp] Writing 405280 bytes to _week_5a_intro_to_pandas.html