In this Notebook:
In this notebook, we will cover standard data wrangling methods using pandas:
import pandas as pd
import numpy as np
# set some options for pandas
pd.set_option('display.max_rows', 10)
pandas
Since you are also learning R throughout DSPP, let's provide you with an overview of the main data wrangling functions in R using the Tidyverse and in Python using pandas.
| pandas | dplyr $^\dagger$ | Description |
|---|---|---|
| .filter() | select() | select column variables/index |
| .drop() | select() | drop selected column variables/index |
| .rename() | rename() | rename column variables/index |
| .query() | filter() | row-wise subset of a data frame by the values of a column variable/index |
| .assign() | mutate() | create a new variable on the existing data frame |
| .sort_values() | arrange() | arrange all data values along a specified (set of) column variable(s)/indices |
| .groupby() | group_by() | index data frame by specific (set of) column variable(s)/index value(s) |
| .agg() | summarize() | aggregate data by specific function rules |
| .pivot_table() | spread() | cast the data from a "long" to a "wide" format |
| pd.melt() | gather() | cast the data from a "wide" to a "long" format |
| .() | %>% | piping, fluid programming, or passing one function's output to the next |
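The reshaping pair in the last two rows is worth a quick illustration before we move on. Here is a minimal, self-contained sketch of the .pivot_table()/.melt() round trip; the small data frame and its numbers are made up purely for illustration:
```python
import pandas as pd

# made-up "long" data: one row per country-year
long_df = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Italy", "Italy"],
    "year":    [1994, 2002, 1982, 2006],
    "goals":   [11, 18, 12, 12],
})

# long -> wide (similar to spread() in R)
wide_df = long_df.pivot_table(index="country", columns="year", values="goals")

# wide -> long (similar to gather() in R)
back_to_long = wide_df.reset_index().melt(id_vars="country", value_name="goals")
```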
If you want to fully embrace the tidyverse style from R in Python, you should check out the dfply module. This module offers an alternative approach to data wrangling in Python and mirrors the popular tidyverse functionality from R.
We will not cover dfply in class because I believe you should master pandas as data scientists who are fluent in Python. However, feel free to learn it and even use it in your homework and assignments.
# load worldcup datasets
wc = pd.read_csv("WorldCups.csv")
wc_matches = pd.read_csv("WorldCupMatches.csv")
# see wc data
wc.head()
# see matched data
wc_matches.head()
If you started your data science career with R, or are simultaneously learning R, it is likely you have been exposed to the tidyverse world and the famous pipe. Here it is for you:
Pipe operators, introduced by the magrittr package in R, absolutely transformed the way of writing code in R, and for good reason. In the past few years, piping has also become more prevalent in Python.
Pipe operators allow you to:
In pandas, piping allows you to chain a sequence of methods together in a single workflow. Pipe operators (.pipe()) in pandas were introduced relatively recently (version 0.16.2).
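To see .pipe() itself (as opposed to plain method chaining), here is a minimal sketch. The helper keep_decade() is a hypothetical function made up for illustration; .pipe() simply hands the current dataframe to it inside the chain:
```python
# a minimal sketch of .pipe(): pass the dataframe to a regular function inside a chain
def keep_decade(df, start):
    # hypothetical helper: keep matches from a given decade
    return df[(df["Year"] >= start) & (df["Year"] < start + 10)]

(wc_matches
 .pipe(keep_decade, start=1990)
 .filter(["Year", "Home Team Name", "Away Team Name"])
 .head())
```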
Let's see some examples using the world cup data. We want to count which country has played the most World Cup games.
# read world cup data
#wc = pd.read_csv("WorldCupMatches.csv")
wc = pd.read_csv("WorldCupMatches.csv")
# select columns
wc = wc.filter(['Year','Home Team Name','Away Team Name'])
# make it tidy
wc = wc.melt(id_vars=["Year"], var_name="var", value_name="country")
# group by
wc = wc.groupby("country")
# count occurrences
wc = wc.size()
# move index to column with new name
wc = wc.reset_index(name="n")
# sort
wc = wc.sort_values(by="n", ascending=False)
# print 10
wc.head(10)
wc = pd.read_csv("WorldCupMatches.csv")
# select columns
wc_10 = ( # encapsulate
wc.filter(['Year','Home Team Name','Away Team Name']).
melt(id_vars=["Year"], var_name="var", value_name="country").
groupby("country").
size().
reset_index(name="n").
sort_values(by="n", ascending=False)
) # close
# print 10
wc_10.head(10)
It should be clear by now, but these are some of the advantages of piping your code:
Keep in mind that pipes also have some disadvantages. In my view, especially when working in a notebook where you do not execute line by line, pipes can make code a bit harder to debug and can hide errors that would be more easily spotted by examining intermediate variables.
Functionality:
Select specific variables/column indices.
Implementation:
simple indexing
.loc() method
.filter() method (allows piping)
# load worldcup dataset
wc = pd.read_csv("WorldCups.csv")
wc_matches = pd.read_csv("WorldCupMatches.csv")
## simple index for columns
wc["Year"]
## using dot method
wc.Year
## using .loc method
wc.loc[:,"Year"]
Notice that in all cases the output comes as a pandas Series. If you would like the output to be a full data frame, or if you need to select multiple columns, you should pass a list of column names as input:
wc[["Winner","Year"]]
.loc() methods for selecting columns
It allows you to select multiple variables by slicing between column names!
# .loc for a single or multiple columns
wc.loc[: , "Year":"Winner"]
# iloc for numeric positions
wc.iloc[0:10,0:3]
.filter() method
The filter method in pandas works similarly to the select function in the Tidyverse in R. It has the following advantages:
# simple filter
wc.filter(["Year", "Winner"])
Using the like parameter: Select columns that contain a specific substring.
# like parameter. In R: data %>% select(contains("Away"))
wc_matches.filter(like="Away")
Using regex: You can also pass regex queries for selecting columns, which gives you even more flexibility.
# starts with
wc_matches.filter(regex="^Away")
# ends with
wc_matches.filter(regex="Initials$")
Functionality:
Drop specific variables/column indices
Implementation:
## select all but "Year" and "Country"
wc[wc.columns[~wc.columns.isin(["Year", "Country"])]]
# Indexing: Bit too much?!
# .isin returns a boolean.
wc.columns[~wc.columns.isin(["Year", "Country"])]
# What is going on here?
~wc.columns.isin(["Year", "Country"])
# loc methods
wc.loc[:,~wc.columns.isin(["Year", "Country"])]
# indexing -> error
bol_col = ~wc.columns.isin(["Year", "Country"])
bol_col
wc[bol_col]
# With .loc, it works
wc.loc[:,bol_col]
.drop() method: an easier way to go
wc.drop(columns=["Year"])
# see here why you need the argument columns
help(pd.DataFrame.drop)
Functionality:
Create a new variable on the existing data frame.
Implementation:
Traditional index assignment. Advantage: simple, and works directly with built-in operations and numpy.
.assign() method. Advantage: allows method chaining.
Let's see examples of both methods:
# With built in math operations
wc["av_goals_matches"] = wc["GoalsScored"]/wc["MatchesPlayed"]
wc.head()
## with numpy
wc["winner_and_hoster"] = np.where(wc.Country==wc.Winner, True, False)
wc[["Year", "Winner", "Country", "winner_and_hoster"]].head(5)
apply() method by rows
As we saw before, the apply() method allows you to apply functions row-wise or column-wise in Python. Here you are applying a certain data transformation to every row in your dataframe.
# with an apply method + function
wc["av_goals_matches"] = wc.apply(lambda x: x["GoalsScored"]/x["MatchesPlayed"], axis=1)
wc.head()
.assign() method
Allows you to create variables with method chaining.
## when the winner was also hosting the world cup?
(
wc.
# multiple variables
assign(final = wc.Winner + " vs " + wc['Runners-Up'],
winner_and_hoster_np = np.where(wc.Winner==wc.Country, True, False),
av_goals_matches = lambda x: x["GoalsScored"]/x["MatchesPlayed"],
).
##allows for methods chaining
filter(["Year", "Winner", "Country", "final",
"winner_and_hoster_np", "av_goals_matches"]).
head(5))
.assign also allows for the use of newly created variables in the same chain. To do that, you need to make use of a lambda function.
# Notice calling the recently created variable final
(wc.
assign(final = wc.Winner + " vs " + wc['Runners-Up'],
best_three = lambda x: x["final"] + " and " + x["Third"]).
filter(["best_three","final", "Country", "Third", "Runners-Up", "Winner"]).
head(5)
)
The combination of a lambda (or a normal function) and .assign() is actually a nice property that resembles the behavior of mutate in R. A lambda function (or a normal function) used with .assign passes the current state of the dataframe into the function. Because we created the variable in a previous step of the chain, the lambda function allows us to access this new variable and manipulate it in sequence. This also works for grouped data, or for chains in which we filter observations and transform the dataframe.
To make this point clear, see how doing the same operation without a lambda function will throw an error:
# will throw an error
(wc.
assign(final= wc.Winner + " vs " + wc['Runners-Up'],
best_three = wc["final"] + " and " + wc["Third"]).
filter(["best_three","final", "Country"])
)
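And, as a sketch of the claim above about filtered chains, the lambda receives the dataframe as it exists at that point in the chain, so the new column is computed only on the rows kept by .query() (the 1970 cutoff here is arbitrary, just for illustration):
```python
# lambda + assign inside a chain that first filters rows
(wc
 .query("Year >= 1970")
 .assign(av_goals_matches=lambda d: d["GoalsScored"] / d["MatchesPlayed"])
 .filter(["Year", "GoalsScored", "MatchesPlayed", "av_goals_matches"])
 .head(5)
)
```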
Functionality:
Rename columns directly or via a function.
Implementation:
# Pandas: renaming variables using the rename method
wc.rename(columns={"Year":"year"}).head(3)
{col: col.lower() for col in wc.columns}
# we can use a dictionary comprehension to apply a function to all column names
wc.rename(columns={col: col.lower() for col in wc.columns}).head(5)
# Or rename only a subset of the columns by giving them as inputs
wc.rename(columns={col: col.lower() for col in ["Year", "Country"]}).head(5)
In these practice questions, we'll hone our data manipulation skills by examining conflict event data generated by the Armed Conflict Location & Event Data Project (ACLED). The aim is to practice some of the data manipulation functions covered in the lecture.
ACLED is a "disaggregated data collection, analysis, and crisis mapping project. ACLED collects the dates, actors, locations, fatalities, and modalities of all reported political violence and protest events across Africa, South Asia, Southeast Asia, the Middle East, Central Asia and the Caucasus, Latin America and the Caribbean, and Southeastern and Eastern Europe and the Balkans." For this exercise, we'll focus just on the data pertaining to South America. For more information regarding these data, please consult the ACLED methodology.
Create three new columns:
# read
acled_sa = pd.read_csv("acled_south_america.csv")
acled_sa.columns
# 2
colnames = acled_sa.filter(like="event").columns.to_list() + acled_sa.filter(like="actor").columns.to_list()
acled_sa[colnames]
# another solution
acled_sa.filter(regex="event|actor").head(5)
# rename
acled_sa = acled_sa.rename(columns={col: col.replace("event", "event_acled_south_america") for col in acled_sa.columns})
acled_sa.head(5)
# 4
import numpy as np
(acled_sa.
assign(fatality_bin = np.where(acled_sa["fatalities"]>0, 1, 0),
full_location = acled_sa["region"]+ "-" + acled_sa["country"],
inter_recoded = np.where(acled_sa["inter1"]==acled_sa["inter1"].median(), "Median Value",
np.where(acled_sa["inter1"]>acled_sa["inter1"].median(),
"Higher Median", "Below Median"))
).head(5)
)
# solutions using np.select
median_value = acled_sa["inter1"].median()
# Define the conditions
conditions = [
acled_sa['inter1'] < median_value,
acled_sa['inter1'] == median_value,
acled_sa['inter1'] > median_value
]
# Define the output values
choices = ['below median', 'at median', 'above median']
# Use np.select
acled_sa['inter_recoded'] = np.select(conditions, choices, default='N/A')
# see
acled_sa.filter(["inter1", "inter_recoded"])
For data wrangling tasks on the rows of your data frame, we will discuss:
Functionality:
Slice the dataframe row-wise following a certain condition.
Implementation:
index-based implementation
.loc or .iloc methods
.query()
# index based implementation
wc[wc.Year<1990]
# or with multiple condition
wc[(wc.Year<1990)&(wc.Year>1940)]
# see this
(wc.Year<1990)&(wc.Year>1940)
.loc() method
# pretty much the same + column names if you wish
wc.loc[(wc.Year<1990)&(wc.Year>1940),:]
.query(): a pipe approach
As before, you can use the .query() method for a more readable and pipeable approach. Notice that inside the quotation marks you only use the column name.
wc.query('Year<1990 & Year>1940')
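One more .query() feature worth knowing: you can reference Python variables inside the query string by prefixing them with @. The cutoff variables below are made up for illustration:
```python
# reference outside variables with @
start, end = 1940, 1990
wc.query('Year < @end & Year > @start')
```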
# Pandas: drop duplicative entries for a specific variable
# notice here you are actually deleting important rows
wc.drop_duplicates("Country")
# Pandas: randomly sample N number of rows from the data
wc.sample(10)
Functionality:
Recode a value given certain conditions. This type of transformation is one of the most important in data cleaning.
Implementation:
There are many ways to recode variables in Python. We will showcase four of the most useful in my view.
First, we will see two more generalized row-wise approaches in pandas using:
map()
apply()
Then we will see two vectorized solutions using numpy:
np.where()
np.select()
The map()
function is used to substitute each value in a Series with another value.
# map + dictionary to recode to create a dummy for certain country
# create a mapping dictionary
mapping = {"Brazil":1}
# map the values
wc["brazil_winner"]= wc["Winner"].map(mapping)
# Fill missing values with a default value (e.g., 0)
wc["brazil_winner"].fillna(0, inplace=True)
# see results
wc.tail(5)
The apply() function is used when you want to apply a function along an axis of a DataFrame (either rows or columns). This function can be either a built-in function or a user-defined function.
# apply + function to recode
def get_dummies(x):
if x =="Brazil":
return 1
else:
return 0
# apply function
wc['Winner'].apply(get_dummies).head()
Notice we can make it more general by providing an argument to the function:
# apply + function to recode
def get_dummies(x, country):
if x == country:
return 1
else:
return 0
# Apply function with Uruguay now
wc["winner_uruguay"] = wc['Winner'].apply(get_dummies, country="Uruguay").head(10)
Here we are using apply on a single column, so we do not need to indicate axis=1. That's also why we are passing the column in as a Series. Notice we could do the same with assign:
# using apply inside of assign
wc.assign(winner_germany=wc['Winner'].apply(get_dummies, country="Germany"))
apply() and map()
apply() is used to apply a function along an axis of the DataFrame or on the values of a Series.
map() is used to substitute each value in a Series with another value.
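A minimal sketch of the contrast (the abbreviation mapping is made up for illustration): map() substitutes values element by element on a Series, while apply() with axis=1 hands each full row to the function.
```python
# map: element-wise substitution on a single Series
wc["Winner"].map({"Brazil": "BRA", "Italy": "ITA"}).head()

# apply: the lambda receives an entire row of the DataFrame
wc.apply(lambda row: row["GoalsScored"] / row["MatchesPlayed"], axis=1).head()
```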
np.where: an if-else approach
np.where is similar to ifelse in R: np.where(condition, true, false)
Let's see some examples:
# create a new variable
wc["winner_brazil"]=np.where(wc["Winner"]=="Brazil", 1, 0)
wc.head()
# notice we can easily use np.where with assign
wc.assign(winner_brazil_=np.where(wc["Winner"]=="Brazil", 1, 0)).head(10)
# using string methods
(wc
.assign(south_america=np.where(wc["Winner"].str.contains("Brazil|Uruguay|Argentina"), 1, 0))
.filter(["Winner", "south_america"])
.head(10)
)
np.select: for multiple conditions
np.select works like case_when in R: np.select(condition, choicelist, default)
# step one: create a list of conditions
condition = [wc["Winner"]==wc["Country"],
wc["Runners-Up"]==wc["Country"],
wc["Third"]==wc["Country"]
]
# step two: create the choice list
choice_list = [1, 2, 3]
# recode
(wc
.assign(where_is_hoster=np.select(condition, choice_list, default="4+"))
.filter(["where_is_hoster"])
)
Functionality:
How it works
“group by” is a common data wrangling process that exists in any language (R, Python, SQL) and refers to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
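To make these three steps concrete before introducing .groupby(), here is a minimal, self-contained sketch that performs them by hand on a toy data frame (made up for illustration); .groupby() automates exactly this pattern:
```python
import pandas as pd

# toy data, made up for illustration
df = pd.DataFrame({"team": ["A", "A", "B", "B"],
                   "goals": [1, 3, 2, 4]})

# 1. split: one sub-frame per group
chunks = {team: df[df["team"] == team] for team in df["team"].unique()}

# 2. apply: a function on each group independently
means = {team: chunk["goals"].mean() for team, chunk in chunks.items()}

# 3. combine: the results back into a single structure
pd.Series(means, name="mean_goals")
```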
The groupby() method in pandas splits your dataset into smaller parts. It generates an iterable where each group is broken up into a tuple (group, data).
# load worldcup dataset
wc = pd.read_csv("WorldCups.csv")
wc_matches = pd.read_csv("WorldCupMatches.csv")
#groupby object
g = wc.groupby(["Winner"])
g
The output of a groupby method is a flexible abstraction: although more complicated things are happening under the hood, in many ways it can be treated as a collection of DataFrames. And as we learned, any collection can be iterated over!
# Iteration for groups
for group, data in g:
print(group)
# iteration for data grouped
for group, data in g:
print(data.head(2))
And we can access a specific group:
g.get_group("Argentina")
The power of a grouping function (like .groupby()) shines when coupled with an aggregation operation.
An aggregation is any operation that reduces the dimensionality of the data!
pandas: .groupby() + built-in methods
# mean of all numeric variables, grouping by winners
wc.groupby(["Winner"]).mean()
or select a specific variable to perform the aggregation step on.
# With a specific input
wc.groupby(["Winner"])["GoalsScored"].mean()
Notice the results come back with the group labels as the index. You can easily convert back to a fully formatted dataframe by:
# reseting the index
wc.groupby(["Winner"])["GoalsScored"].mean().reset_index()
## as_index=false
wc.groupby(["Winner"], as_index=False)["GoalsScored"].mean()
Pandas offers many built-in methods to perform aggregations. Here is a list:
See user guide from Pandas Documentation
| Method | Functionality |
|---|---|
| .any | Compute whether any of the values in the groups are truthy |
| .all | Compute whether all of the values in the groups are truthy |
| .count | Compute the number of non-NA values in the groups |
| .cov | Compute the covariance of the groups |
| .first | Compute the first occurring value in each group |
| .idxmax | Compute the index of the maximum value in each group |
| .idxmin | Compute the index of the minimum value in each group |
| .last | Compute the last occurring value in each group |
| .max | Compute the maximum value in each group |
| .mean | Compute the mean of each group |
| .median | Compute the median of each group |
| .min | Compute the minimum value in each group |
| .nunique | Compute the number of unique values in each group |
| .prod | Compute the product of the values in each group |
| .quantile | Compute a given quantile of the values in each group |
| .sem | Compute the standard error of the mean of the values in each group |
| .size | Compute the number of values in each group |
| .skew | Compute the skew of the values in each group |
| .std | Compute the standard deviation of the values in each group |
| .sum | Compute the sum of the values in each group |
| .var | Compute the variance of the values in each group |
Let's see some examples:
# max goal by team as home team
(wc_matches
.groupby(["Home Team Name"])
["Home Team Goals"]
.max()
.reset_index()
.sort_values("Home Team Goals", ascending=False))
# number of matches as home team
(wc_matches
.groupby(["Home Team Name"], as_index=False)
["Home Team Name"]
.size()
.sort_values("size", ascending=False)
)
pandas: .groupby() + .agg()
Alternatively, we can specify a whole range of operations to aggregate by (along with specific variable columns) using the .aggregate()/.agg() method. To keep track of which operations correspond to which variable, pandas will generate a hierarchical index for the column entries.
wc.groupby(["Winner"])["GoalsScored"].agg(["mean","std","median"]).reset_index()
We can also pass user-defined functions into the aggregate() function.
def mean_add_50(x):
return np.mean(x) + 50
wc.groupby(["Winner"])["GoalsScored"].agg(["mean","std","median",mean_add_50])
agg() + .rename() provides an easy workflow to rename your newly created variables:
(wc.groupby(["Winner"])["GoalsScored"].
agg(["mean","std","median",mean_add_50]).
rename(columns={"mean": "goals_mean",
"std": "goals_std",
"median": "goals_median",
"mean_add_50":"mean_50_goals"})
)
pandas: .groupby() + .apply()
Even though I cover this in more detail in the miscellaneous notebook, this is a good moment to bring up the .apply() method from pandas one more time.
The apply method lets you apply an arbitrary function by row or by column on your data frame. As you can imagine, it also allows you to apply a function to a groupby object, returning a summarized dataset.
The function should take a DataFrame and return either a pandas object (e.g., DataFrame, Series) or a scalar.
Let's see an example with the mean_add_50 function:
(wc.groupby(["Country"])["GoalsScored"].
apply(mean_add_50).
reset_index()
)
We can also group by more than one variable (i.e. implement a multi-index on the rows).
wc.groupby(["Winner", "Year"])["GoalsScored"].mean().reset_index().head(5)
pandas: .groupby() + .transform()
Other times we want to implement data manipulations by some grouping variable but retain the structure of the original data. Put differently, our aim is not to aggregate but to perform some operation across specific groups.
To do so, we will combine .groupby()
and .transform()
methods. Let's see an example:
# create a new column
# notice you need to select the column you want to transform.
wc.groupby(["Winner"])["GoalsScored"].transform("mean")
# easily combined with assign
# create a new column
(wc.assign(goals_score_wc_mean_wc=wc.groupby(["Winner"])["GoalsScored"].
transform("mean")))
# also: very useful with lambda functions
wc.groupby("Winner")["GoalsScored"].transform(lambda x: x - x.mean())
# Pandas: sort values by a column variable (ascending)
wc.sort_values('Country').head(3)
# Pandas: sort values by a column variable (descending)
wc.sort_values('Year',ascending=False).head(3)
# Pandas: sort values by more than one column variable
wc.sort_values(['Winner', "Country"]).head(3)
And remember the cheat sheet for pandas!
Using the same ACLED data from the previous exercise, answer:
What are the different event types recorded?
How many events are recorded for each year?
What’s the most common event type in the data?
Which countries had the highest number of reported fatalities?
# 1
acled_sa = pd.read_csv("acled_south_america.csv")
acled_sa["event_type"].drop_duplicates().to_list()
# 2
(acled_sa.groupby("year", as_index=False)
["year"]
.size()
)
# 3
(acled_sa.groupby("event_type", as_index=False)
["event_type"]
.size()
.sort_values("size", ascending=False)
.reset_index(drop=True)
)
acled_sa.groupby("country", as_index=False)["fatalities"].sum().sort_values("fatalities", ascending=False)
## Add your code here
!jupyter nbconvert _week-6c-data_wrangling_pandas.ipynb --to html --template classic