PPOL 5203 Data Science I: Foundations

Introduction to Pandas

Tiago Ventura


In this Notebook we cover

  • Pandas data objects and how to construct them.
  • Using indexes for Pandas data objects.
In [38]:
import pandas as pd
import numpy as np

pandas Objects

Recall that numpy offers great flexibility and efficiency when dealing with data matrices (compared to manipulating data represented as nested lists). However, as we saw, numpy is limited in its capacity to handle heterogeneous data types in a single data matrix. This is a serious limitation given the nature of most social science datasets.

The pandas package was designed to deal with this issue. Built on top of the numpy library, Pandas retains and expands upon numpy's functionality.
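To see the contrast concretely, here is a minimal sketch (the data is made up for illustration): mixing types in a numpy array coerces everything to one dtype, while pandas keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# numpy coerces mixed inputs to a single dtype (here, unicode strings)
arr = np.array([1, "Texas", True])
print(arr.dtype.kind)  # 'U' -- every value became a string

# pandas stores one dtype per column, so heterogeneous data survives intact
df = pd.DataFrame({"id": [1, 2], "state": ["Texas", "Ohio"], "voted": [True, False]})
print(df.dtypes)
```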

The fundamental data constructs in pandas are:

  • the pd.Series
  • the pd.DataFrame.

Series

  • A pandas series is a one-dimensional labeled array.

  • It is capable of holding different data types (e.g. integer, boolean, strings, etc.).

  • The axis in a series works as an "index" --- similar to a list or numpy array --- however, we can use other data types (e.g., years, strings) as the index, which allows for some powerful ways of manipulating the array.

At its core, a pandas Series is nothing but a column in an Excel sheet or an R data.frame.

Constructor

To construct a pandas Series, we can use the pd.Series() constructor.

Notice: The input to create a pandas series is a list or an array.

If you are migrating from R, this is a bit of a difference, since in R you can use simple vectors to create a dataframe column. Python does not have vectors as R does, only lists and arrays. These are the basic inputs to create a pd.Series().

In [39]:
import pandas as pd
s = pd.Series(["Argentina", "France", "Germany","Spain", "Italy", "Brazil"],
                 index=[2022, 2018, 2014, 2010, 2006, 2002])
type(s)
print(s)
2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object
In [40]:
# notice: a series holds a single dtype, so mixed inputs fall back to object,
# and without an explicit index it gets an implicit integer index
s_no_index = pd.Series(["Argentina", True, "Germany","Spain", "Italy", "Brazil"])
s_no_index
Out[40]:
0    Argentina
1         True
2      Germany
3        Spain
4        Italy
5       Brazil
dtype: object

Series Elements

The Series combines a sequence of

  • values with an explicit sequence of
  • indices, which we can access with the values and index attributes.

Series index

Indexing with pandas Series comes in three flavors:

  • explicit index
  • implicit (deprecated)
  • mask indexing
In [41]:
# index
print(s.index)

# implicit index
print(s_no_index.index)
Int64Index([2022, 2018, 2014, 2010, 2006, 2002], dtype='int64')
RangeIndex(start=0, stop=6, step=1)
In [42]:
s
Out[42]:
2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object
In [43]:
# explicit index
s[2002]
Out[43]:
'Brazil'
In [44]:
# implicit position
s[:2]
Out[44]:
2022    Argentina
2018       France
dtype: object
In [45]:
# masking with a boolean
arr_mask = s.index>2016
In [46]:
s[s.index>2016]
Out[46]:
2022    Argentina
2018       France
dtype: object
In [47]:
# with an "or" condition -- which requires an extra set of parentheses
s[((s.index==2018) | (s.index==2022))]
Out[47]:
2022    Argentina
2018       France
dtype: object
In [48]:
s_no_index = pd.Series(["Argentina", True, "Germany","Spain", "Italy", "Brazil"])
s_no_index
Out[48]:
0    Argentina
1         True
2      Germany
3        Spain
4        Italy
5       Brazil
dtype: object
In [49]:
s_no_index[:2]
Out[49]:
0    Argentina
1         True
dtype: object

Series values

In [50]:
s.values
Out[50]:
array(['Argentina', 'France', 'Germany', 'Spain', 'Italy', 'Brazil'],
      dtype=object)
In [51]:
# it is a numpy array
print(type(s.values))
print(type(s_no_index.values))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>

DataFrame

A pandas DataFrame is a two dimensional, relational data structure with the capacity to handle heterogeneous data types.

  • "relational" = each column value contained within a row entry corresponds with the same observation.
  • "two dimensional" = a matrix data structure (no $N$-dimensional arrays). The data construct can be accessed through row/column indices.
  • "heterogeneous" = different data types can be contained within each column series. This means, for example, string, integer, and boolean values can coexist in the same data structure and retain the specific properties of their data type class.

Put simply, a DataFrame is a collection of pandas series where each index position corresponds to the same observation.

**`Important:`** Each column in a pandas DataFrame is a Series
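We can verify this directly with a toy DataFrame (hypothetical data, just for illustration):

```python
import pandas as pd

# a small DataFrame with two columns
df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})

# pulling out one column gives back a pd.Series...
col = df["A"]
print(type(col))  # <class 'pandas.core.series.Series'>

# ...and that Series carries the DataFrame's row index with it
print(col.index.equals(df.index))  # True
```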

How to create a DataFrame?

Constructor

To create a pandas DataFrame, we call the pd.DataFrame() constructor.

Construction using dict()

As input, we need to feed in a dictionary, where the keys are the column names and the values are the relational data input.

In [52]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c','d','e','f']}
my_dict
Out[52]:
{'A': [1, 2, 3, 4, 5, 6],
 'B': [2, 3, 1, 0.3, 4, 1],
 'C': ['a', 'b', 'c', 'd', 'e', 'f']}
In [53]:
pd.DataFrame(my_dict)
Out[53]:
A B C
0 1 2.0 a
1 2 3.0 b
2 3 1.0 c
3 4 0.3 d
4 5 4.0 e
5 6 1.0 f

Data must be relational. If the dimensions do not align, an error will be thrown.

In [54]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c']}
pd.DataFrame(my_dict)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[54], line 2
      1 my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c']}
----> 2 pd.DataFrame(my_dict)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:664, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    658     mgr = self._init_mgr(
    659         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    660     )
    662 elif isinstance(data, dict):
    663     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 664     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    665 elif isinstance(data, ma.MaskedArray):
    666     import numpy.ma.mrecords as mrecords

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:118, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    115 if verify_integrity:
    116     # figure out the index, if necessary
    117     if index is None:
--> 118         index = _extract_index(arrays)
    119     else:
    120         index = ensure_index(index)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/internals/construction.py:666, in _extract_index(data)
    664 lengths = list(set(raw_lengths))
    665 if len(lengths) > 1:
--> 666     raise ValueError("All arrays must be of the same length")
    668 if have_dicts:
    669     raise ValueError(
    670         "Mixing dicts with non-Series may lead to ambiguous ordering."
    671     )

ValueError: All arrays must be of the same length

When constructing a DataFrame from scratch, using the dict constructor can help ease typing.

In [55]:
pd.DataFrame(dict(A = [1,2,3],
                  B = ['a','b','c']))
Out[55]:
A B
0 1 a
1 2 b
2 3 c

Construction using list()

Likewise, we can simply input a list, and the DataFrame will assign a 0-based integer index by default.

In [56]:
my_list = [4,4,5,6,7]
pd.DataFrame(my_list)
Out[56]:
0
0 4
1 4
2 5
3 6
4 7

The same holds if we feed in a nested list structure.

In [57]:
nested_list = np.random.randint(1,10,25).reshape(5,5).tolist()
nested_list
Out[57]:
[[3, 1, 2, 9, 6],
 [2, 4, 8, 4, 6],
 [1, 3, 5, 1, 9],
 [7, 8, 5, 9, 3],
 [9, 9, 8, 3, 4]]
In [58]:
pd.DataFrame(nested_list)
Out[58]:
0 1 2 3 4
0 3 1 2 9 6
1 2 4 8 4 6
2 1 3 5 1 9
3 7 8 5 9 3
4 9 9 8 3 4

To overwrite the default indexing protocol, we can provide a list of column names to correspond to each column index position.

In [59]:
col_names = [f'Var{i}' for i in range(1,6)]
col_names
Out[59]:
['Var1', 'Var2', 'Var3', 'Var4', 'Var5']
In [60]:
D = pd.DataFrame(nested_list,
                 columns=col_names)
D
Out[60]:
Var1 Var2 Var3 Var4 Var5
0 3 1 2 9 6
1 2 4 8 4 6
2 1 3 5 1 9
3 7 8 5 9 3
4 9 9 8 3 4

Construction using a pd.Series()

And of course, you can use a Series to construct a DataFrame.

In [61]:
# create a series
area = pd.Series({'California': 423967, 
                  'Texas': 695662,
                  'Florida': 170312, 
                  'New York': 141297,
                  'Pennsylvania': 119280})
# see areas
print(area)
type(area)
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64
Out[61]:
pandas.core.series.Series
In [62]:
pd.DataFrame(area, columns=["area"])
Out[62]:
area
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
In [63]:
# transpose your dataframe: the index becomes the columns
pd.DataFrame(area, columns=["area"]).T
Out[63]:
California Texas Florida New York Pennsylvania
area 423967 695662 170312 141297 119280

From Nested Data to DataFrames

In real cases, your data will hardly come from ONE list or ONE dictionary.

Imagine you are scraping data from the web, and at every iteration you want to add a new row to your dataframe. You have nested data, in which your columns repeat over every iteration.

For these cases, there are in general two abstract approaches to go from nested data (lists or dictionaries) to a pandas DataFrame.

  • Column-Wise: From a dictionary of lists
  • Row-Wise: From a list of dictionaries

Column-Wise

In this approach, your data goes from a dictionary of lists to a DataFrame.

Your input is organized by columns!

  • Keys -> Column names
  • Lists (values) -> Values for every observation i in your data.
In [64]:
# create a dictionary of lists
dict_ = {"names":["Darrow", "Adrius", "Sevro", "Virginia", "Victra"],
         "nickname":["The Reaper", "Jakal", "Goblin", "Mustang", "NaN"], 
         "house":["Mars", "Venus", "Mars", "Mercury", "Jupiter"], 
         "color": ["Red", "Gold", "Gold", "Gold", "Gold"], 
         "when_they_died":["alive", "mutant", "alive", "alive", "alive"], 
        "gender": ["M", "M", "M", "F", "F"]}

# create a dataframe
pd.DataFrame(dict_)
Out[64]:
names nickname house color when_they_died gender
0 Darrow The Reaper Mars Red alive M
1 Adrius Jakal Venus Gold mutant M
2 Sevro Goblin Mars Gold alive M
3 Virginia Mustang Mercury Gold alive F
4 Victra NaN Jupiter Gold alive F

Row-Wise

In this approach, your data goes from a list of dictionaries to a DataFrame.

Your input is organized by observations

  • Keys of each dictionary -> Column names
  • Values of each dictionary -> Value for observation i in column j of your data.
In [65]:
# create a list of dictionaries
list_ = [{"names":"Darrow", "nickname":"The Reaper", "house":"Mars", "color":"red"}, # obs 1
         
         {"names":"Adrius", "nickname":"Jakal", "house":"Venus", "color":"gold"}, # obs 2
         
         {"names":"Sevro", "nickname":"Goblin", "house":"Mars", "color":"gold"}, # obs 3
         
         {"names":"Virginia", "nickname":"Mustang", "house":"Mercury", "color":"gold"}, # obs 4
         
         {"names":"Victra", "nickname":"NaN", "house":"Jupiter", "color":"gold"} # obs 5
        ] 

# this looks like a json, which is a very common way to store large datasets!
list_
Out[65]:
[{'names': 'Darrow',
  'nickname': 'The Reaper',
  'house': 'Mars',
  'color': 'red'},
 {'names': 'Adrius', 'nickname': 'Jakal', 'house': 'Venus', 'color': 'gold'},
 {'names': 'Sevro', 'nickname': 'Goblin', 'house': 'Mars', 'color': 'gold'},
 {'names': 'Virginia',
  'nickname': 'Mustang',
  'house': 'Mercury',
  'color': 'gold'},
 {'names': 'Victra', 'nickname': 'NaN', 'house': 'Jupiter', 'color': 'gold'}]
In [66]:
# create a dataframe
D = pd.DataFrame(list_)
D
Out[66]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

How would this work in practice? Imagine, for example, a case in which you are scraping multiple websites. You would write a loop, and at each iteration you would add a new element to your list, using your dictionary keys to save the tags. For example:

In [67]:
# create a container
empty_list =[]

# write a for loop
for l in list_:
    # grab your entries
    unique_entry = {}
    unique_entry["names"]=l["names"]
    unique_entry["nickname"]=l["nickname"]
    unique_entry["house"]=l["house"]
    unique_entry["color"]=l["color"]
    # append to your list
    empty_list.append(unique_entry)
    

# convert to a dataframe
#empty_list
pd.DataFrame(empty_list)
    
Out[67]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

The same building approach can be done with a list of lists + a list of names.

In [68]:
# list of lists + list of names
list_names= ["names","nickname", "house", "color"]
list_of_lists = [
                 ["Darrow","The Reaper", "Mars", "red"], # obs 1
                 ["Adrius", "Jakal", "Venus","gold"], # obs 2
                 ["Sevro", "Goblin", "Mars", "gold"], # obs 3
                 ["Virginia", "Mustang", "Mercury", "gold"], # obs 4
                 ["Victra", "NaN","Jupiter", "gold"] # obs 5
                ]
In [69]:
# create a dataframe
pd.DataFrame(list_of_lists, columns=list_names)
Out[69]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold

Keep these approaches in mind. The conversion of nested data to DataFrames will be useful when you are collecting your own data, particularly when using web scraping and loops.

DataFrame index

Important: This is where most people coming from R get confused with pandas in Python

Unlike with a numpy array, we cannot simply call a row index position (implicit index). When we do this, pandas looks for the value among the column names.

**No implicit indexing with DataFrames!!!**

In [70]:
D[1,2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: (1, 2)

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[70], line 1
----> 1 D[1,2]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: (1, 2)

Explanation:

This is because the internal index of a DataFrame refers to the column index. This might seem odd at first, but if we think back to the behavior of Python dictionaries (which a DataFrame fundamentally is under the hood), we'll recall that the key is the default indexing feature (the immutable keys provide for efficient lookups in the dictionary object).

In [71]:
D['names']
Out[71]:
0      Darrow
1      Adrius
2       Sevro
3    Virginia
4      Victra
Name: names, dtype: object

Dataframes have two indices

Always remember that there are 2 indices in a DataFrame that we must keep track of: the row index and the column index.

In [72]:
D
Out[72]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
2 Sevro Goblin Mars gold
3 Virginia Mustang Mercury gold
4 Victra NaN Jupiter gold
In [73]:
# Row index
D.index
Out[73]:
RangeIndex(start=0, stop=5, step=1)
In [74]:
# column index
D.columns
Out[74]:
Index(['names', 'nickname', 'house', 'color'], dtype='object')

Slice and Index in Pandas

Memorize this for Pandas:

  • indexing refers to columns,

  • slicing refers to rows:

In [75]:
# index
D["color"]
Out[75]:
0     red
1    gold
2    gold
3    gold
4    gold
Name: color, dtype: object
In [76]:
# multiple columns
D[["names", "house"]]
Out[76]:
names house
0 Darrow Mars
1 Adrius Venus
2 Sevro Mars
3 Virginia Mercury
4 Victra Jupiter
In [77]:
# slice
D[0:2]
Out[77]:
names nickname house color
0 Darrow The Reaper Mars red
1 Adrius Jakal Venus gold
In [78]:
# slice with a modified, non-positional index
# ALSO WORKS WITH SLICE!!!
D_index_name = D.set_index("names")
D_index_name["Darrow":"Sevro"]
Out[78]:
nickname house color
names
Darrow The Reaper Mars red
Adrius Jakal Venus gold
Sevro Goblin Mars gold

Accessing with Booleans

We can use boolean masks, as with Series, to access elements of a DataFrame.

In [79]:
D[D["house"]=="Mars"]
Out[79]:
names nickname house color
0 Darrow The Reaper Mars red
2 Sevro Goblin Mars gold

Better methods for accessing elements: loc and iloc

To access the indices in a DataFrame, we are better off relying on two built-in methods:

  • .iloc[] = use the numerical index position to call locations in the DataFrame. (The i is short for integer.)
  • .loc[] = use the labels to call locations in the DataFrame.
In [ ]:
# numerical position
D.iloc[:3,0:2] # D.iloc[row position,column position]
In [ ]:
# index labels
D.loc[:3,['names','house']]

A few things to note about .loc[]

  • .loc[] calls all named index positions. Above we get back all the requested rows, rather than stopping one below the max value as a numerical range does. This is because .loc[] treats the index as a labeled feature rather than a numerical one, and label slices are inclusive of both endpoints.
  • selecting ranges from labeled indices works the same as with numerical indices. That is, we can make calls to all variables in between (see below).
In [ ]:
D.loc[:,'names':'house']
In [ ]:
# boolean masks also work here
D.loc[D["house"]=="Mars", :]

We can redefine the row indices.

In [ ]:
dict_
In [ ]:
D2 = pd.DataFrame(dict_,
                  index=["d","a", "s", "v", "v2"])
D2
In [ ]:
# We can use the named indices for look up (and as with numpy, column rearrangement).
D2.loc[["d","a","v"],"names":"house"]
In [ ]:
# notice, we cannot use the numbers with loc anymore
D2.loc[1:2,"names":"house"]

And Series?

Yes... you can also use .loc and .iloc with Series.

In [ ]:
# get our old series
s
In [ ]:
# iloc
s.iloc[0]
In [ ]:
# loc
s.loc[2022]

Setting the index as the unit of observation

We can redefine the index as a way to keep our unit of observation consistent, clean, and easy to use.

In [ ]:
dat = D.set_index('names')
dat
In [ ]:
dat.loc['Darrow',:]

Reverting the index back to its original 0-based form is straightforward with the .reset_index() method.

In [ ]:
dat = dat.reset_index()
dat

Hierarchical (multi-) index

In [ ]:
dat = D.set_index(keys=['names', 'house'])
dat

We can see that the index is composed of two levels.

In [ ]:
dat.index

Under the hood, the hierarchical indices are actually tuples.

In [ ]:
dat.loc[("Darrow","Mars"),:]

We can use boolean lookups on the level values

In [ ]:
dat.loc[dat.index.get_level_values('house')=="Jupiter",:]

Finally, we can easily sort and order the index.

In [ ]:
dat.sort_index()

As before, if we wish to revert the index back to a 0-based integer, we can with .reset_index()

In [ ]:
# inplace to save in the same object
dat.reset_index(inplace=True)
In [ ]:
dat

Column Indices ("column names")

As seen, pandas keeps track of column features using a column index. We can access the column index at any time using the .columns attribute.

In [ ]:
dat.columns

Or we can simply retrieve the column names by passing the DataFrame to the list() constructor (recall that a DataFrame is really a dict under the hood).

In [ ]:
list(dat)

Overwriting column names: below, let's set all of the column names to upper case. Note that we can invoke the .str accessor, which gives access to all of the string data type methods.

In [ ]:
dat.columns = dat.columns.str.upper()
dat.columns
In [ ]:
dat

But note that the column index is not mutable. Recall that values are mutable in a dictionary, but the keys are not.

In [ ]:
dat.columns[dat.columns == "HOUSE"] = "house"

We either have to replace all the keys (as we do above), or use the .rename() method to rename a specific data feature by passing it a dict with the new renaming convention.

data.rename(columns = {'old_name':'new_name'})
In [ ]:
dat.rename(columns={"NAMES":"name"},
             inplace=True) # Makes the change in-place rather than making a copy
dat.columns

Similar to row indices, we can generate hierarchical column indices as well. (As we'll see, this will be the default column index output when aggregating variables next time.)
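As a small preview (a sketch with made-up numbers), a hierarchical column index can be built from tuples of (group, variable) labels:

```python
import pandas as pd

# two-level column index: (group, variable)
cols = pd.MultiIndex.from_tuples([("scores", "mean"), ("scores", "sd"), ("info", "n")])
df = pd.DataFrame([[3.2, 0.5, 10], [2.8, 0.7, 12]], columns=cols)
print(df)

# indexing by the top level returns all columns nested under it
print(df["scores"])
```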

That's it for today! Next week we will see more and more of pandas DataFrames.

In [ ]:
!jupyter nbconvert _week_5a_intro_to_pandas.ipynb --to html --template classic