\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "07e09ea3",
"metadata": {},
"source": [
"**In this Notebook:**\n",
"\n",
"In this notebook, we will cover standard data wrangling methods using pandas:\n",
"\n",
"- Selecting Methods.\n",
"- Filtering Methods.\n",
"- Grouping and Summarization.\n",
"- Recoding Variables.\n"
]
},
{
"cell_type": "markdown",
"id": "1474fc7e",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 91,
"id": "908ae4fa",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# set some options for pandas\n",
"pd.set_option('display.max_rows', 10)"
]
},
{
"cell_type": "markdown",
"id": "2aa82f65",
"metadata": {},
"source": [
"## Data Wrangling in `pandas`\n"
]
},
{
"cell_type": "markdown",
"id": "bc274882",
"metadata": {},
"source": [
"Since you are also learning R throughout DSPP, let's provide you with an overview of the main Data Wrangling Functions in R using Tidyverse and Python using `pandas`. \n",
"\n",
"**Main (tidy) Data Wrangling Functions **\n",
"\n",
"| [`pandas`](https://pandas.pydata.org/) | [`dplyr`](https://dplyr.tidyverse.org/)$^\\dagger$ | Description |\n",
"|:---------------:|:-------------:|:-----------------------------|\n",
"| `.filter()` | `select()` | select column variables/index |\n",
"| `.drop()` | `select()` | drop selected column variables/index |\n",
"| `.rename()` | `rename()` | rename column variables/index |\n",
"| `.query()` | `filter()` | row-wise subset of a data frame by a values of a column variable/index |\n",
"| `.assign()` |`mutate()` | Create a new variable on the existing data frame |\n",
"| `.sort_values()`| `arrange()` | Arrange all data values along a specified (set of) column variable(s)/indices |\n",
"| `.groupby()` | `group_by()` | Index data frame by specific (set of) column variable(s)/index value(s)|\n",
"| `.agg()` | `summarize()` | aggregate data by specific function rules |\n",
"| `.pivot_table()` | `spread()` | cast the data from a \"long\" to a \"wide\" format |\n",
"| `pd.melt()` | `gather()` | cast the data from a \"wide\" to a \"long\" format |\n",
"| `.()` | `%>%` | piping, fluid programming, or the passing one function output to the next |\n",
"\n",
"\n",
"If you want to fully embrace the tidyverse style from R in Python, you should check the [`dfply` module](https://github.com/kieferk/dfply). This modules ofers an alternative to data wrangling in Python, and mirrors the popular tidyverse functionalities from R. \n",
"\n",
"We will not cover `dfply` in class because I believe you should dominate `pandas` as data scientists that are fluent in Python. However, feel free to learn and even use in your homeworks and assignment. "
]
},
{
"cell_type": "code",
"execution_count": 93,
"id": "f77a8fd8",
"metadata": {},
"outputs": [],
"source": [
"# load worldcup datasets\n",
"wc = pd.read_csv(\"WorldCups.csv\")\n",
"wc_matches = pd.read_csv(\"WorldCupMatches.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 94,
"id": "ef562ba2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance \n",
"0 70 13 18 590.549 \n",
"1 70 16 17 363.000 \n",
"2 84 15 18 375.700 \n",
"3 88 13 22 1.045.246 \n",
"4 140 16 26 768.607 "
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# see wc data\n",
"wc.head()"
]
},
{
"cell_type": "code",
"execution_count": 95,
"id": "8f547188",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Datetime | \n",
" Stage | \n",
" Stadium | \n",
" City | \n",
" Home Team Name | \n",
" Home Team Goals | \n",
" Away Team Goals | \n",
" Away Team Name | \n",
" Win conditions | \n",
" Attendance | \n",
" Half-time Home Goals | \n",
" Half-time Away Goals | \n",
" Referee | \n",
" Assistant 1 | \n",
" Assistant 2 | \n",
" RoundID | \n",
" MatchID | \n",
" Home Team Initials | \n",
" Away Team Initials | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" 13 Jul 1930 - 15:00 | \n",
" Group 1 | \n",
" Pocitos | \n",
" Montevideo | \n",
" France | \n",
" 4 | \n",
" 1 | \n",
" Mexico | \n",
" | \n",
" 4444.0 | \n",
" 3 | \n",
" 0 | \n",
" LOMBARDI Domingo (URU) | \n",
" CRISTOPHE Henry (BEL) | \n",
" REGO Gilberto (BRA) | \n",
" 201 | \n",
" 1096 | \n",
" FRA | \n",
" MEX | \n",
"
\n",
" \n",
" 1 | \n",
" 1930 | \n",
" 13 Jul 1930 - 15:00 | \n",
" Group 4 | \n",
" Parque Central | \n",
" Montevideo | \n",
" USA | \n",
" 3 | \n",
" 0 | \n",
" Belgium | \n",
" | \n",
" 18346.0 | \n",
" 2 | \n",
" 0 | \n",
" MACIAS Jose (ARG) | \n",
" MATEUCCI Francisco (URU) | \n",
" WARNKEN Alberto (CHI) | \n",
" 201 | \n",
" 1090 | \n",
" USA | \n",
" BEL | \n",
"
\n",
" \n",
" 2 | \n",
" 1930 | \n",
" 14 Jul 1930 - 12:45 | \n",
" Group 2 | \n",
" Parque Central | \n",
" Montevideo | \n",
" Yugoslavia | \n",
" 2 | \n",
" 1 | \n",
" Brazil | \n",
" | \n",
" 24059.0 | \n",
" 2 | \n",
" 0 | \n",
" TEJADA Anibal (URU) | \n",
" VALLARINO Ricardo (URU) | \n",
" BALWAY Thomas (FRA) | \n",
" 201 | \n",
" 1093 | \n",
" YUG | \n",
" BRA | \n",
"
\n",
" \n",
" 3 | \n",
" 1930 | \n",
" 14 Jul 1930 - 14:50 | \n",
" Group 3 | \n",
" Pocitos | \n",
" Montevideo | \n",
" Romania | \n",
" 3 | \n",
" 1 | \n",
" Peru | \n",
" | \n",
" 2549.0 | \n",
" 1 | \n",
" 0 | \n",
" WARNKEN Alberto (CHI) | \n",
" LANGENUS Jean (BEL) | \n",
" MATEUCCI Francisco (URU) | \n",
" 201 | \n",
" 1098 | \n",
" ROU | \n",
" PER | \n",
"
\n",
" \n",
" 4 | \n",
" 1930 | \n",
" 15 Jul 1930 - 16:00 | \n",
" Group 1 | \n",
" Parque Central | \n",
" Montevideo | \n",
" Argentina | \n",
" 1 | \n",
" 0 | \n",
" France | \n",
" | \n",
" 23409.0 | \n",
" 0 | \n",
" 0 | \n",
" REGO Gilberto (BRA) | \n",
" SAUCEDO Ulises (BOL) | \n",
" RADULESCU Constantin (ROU) | \n",
" 201 | \n",
" 1085 | \n",
" ARG | \n",
" FRA | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Datetime Stage Stadium City \\\n",
"0 1930 13 Jul 1930 - 15:00 Group 1 Pocitos Montevideo \n",
"1 1930 13 Jul 1930 - 15:00 Group 4 Parque Central Montevideo \n",
"2 1930 14 Jul 1930 - 12:45 Group 2 Parque Central Montevideo \n",
"3 1930 14 Jul 1930 - 14:50 Group 3 Pocitos Montevideo \n",
"4 1930 15 Jul 1930 - 16:00 Group 1 Parque Central Montevideo \n",
"\n",
" Home Team Name Home Team Goals Away Team Goals Away Team Name \\\n",
"0 France 4 1 Mexico \n",
"1 USA 3 0 Belgium \n",
"2 Yugoslavia 2 1 Brazil \n",
"3 Romania 3 1 Peru \n",
"4 Argentina 1 0 France \n",
"\n",
" Win conditions Attendance Half-time Home Goals Half-time Away Goals \\\n",
"0 4444.0 3 0 \n",
"1 18346.0 2 0 \n",
"2 24059.0 2 0 \n",
"3 2549.0 1 0 \n",
"4 23409.0 0 0 \n",
"\n",
" Referee Assistant 1 \\\n",
"0 LOMBARDI Domingo (URU) CRISTOPHE Henry (BEL) \n",
"1 MACIAS Jose (ARG) MATEUCCI Francisco (URU) \n",
"2 TEJADA Anibal (URU) VALLARINO Ricardo (URU) \n",
"3 WARNKEN Alberto (CHI) LANGENUS Jean (BEL) \n",
"4 REGO Gilberto (BRA) SAUCEDO Ulises (BOL) \n",
"\n",
" Assistant 2 RoundID MatchID Home Team Initials \\\n",
"0 REGO Gilberto (BRA) 201 1096 FRA \n",
"1 WARNKEN Alberto (CHI) 201 1090 USA \n",
"2 BALWAY Thomas (FRA) 201 1093 YUG \n",
"3 MATEUCCI Francisco (URU) 201 1098 ROU \n",
"4 RADULESCU Constantin (ROU) 201 1085 ARG \n",
"\n",
" Away Team Initials \n",
"0 MEX \n",
"1 BEL \n",
"2 BRA \n",
"3 PER \n",
"4 FRA "
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# see matched data\n",
"wc_matches.head()"
]
},
{
"cell_type": "markdown",
"id": "9b152245",
"metadata": {},
"source": [
"## Piping\n",
"\n",
"If you started your data science career, or are simultaneously learning R, it is likely you were exposed to the `tidyverse` world, and the famous `pipe`. Here it is for you:\n",
"\n",
"\n",
"\n",
"\n",
"Pipe operators, introduced by tha magritt package in R, absolutely transformed the way of writing coding in R. It did for good reasons. And in the past few years, piping has also become more prevalent in Python. \n",
"\n",
"Pipe operators allows you to:\n",
"\n",
"- **Chain together data manipulations in a single operational sequence.**\n",
"\n",
"In pandas, piping allows you to chain together in a single workflow a sequence of methods. Pipe operators (`.pipe()`) in `pandas` were introduced relatively recently (version 0.16.2).\n",
"\n",
"Let's see some examples using the world cup data. We want to count which country played more world cup games."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "048aa05f",
"metadata": {},
"outputs": [],
"source": [
"# read world cup data\n",
"#wc = pd.read_csv(\"WorldCupMatches.csv\")"
]
},
{
"cell_type": "markdown",
"id": "d9a735c2",
"metadata": {},
"source": [
"#### _Method 1_: sequentially overwrite the object"
]
},
{
"cell_type": "code",
"execution_count": 96,
"id": "cfc1464e",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" n | \n",
"
\n",
" \n",
" \n",
" \n",
" 7 | \n",
" Brazil | \n",
" 108 | \n",
"
\n",
" \n",
" 39 | \n",
" Italy | \n",
" 83 | \n",
"
\n",
" \n",
" 2 | \n",
" Argentina | \n",
" 81 | \n",
"
\n",
" \n",
" 25 | \n",
" England | \n",
" 62 | \n",
"
\n",
" \n",
" 29 | \n",
" Germany FR | \n",
" 62 | \n",
"
\n",
" \n",
" 26 | \n",
" France | \n",
" 61 | \n",
"
\n",
" \n",
" 66 | \n",
" Spain | \n",
" 59 | \n",
"
\n",
" \n",
" 45 | \n",
" Mexico | \n",
" 54 | \n",
"
\n",
" \n",
" 47 | \n",
" Netherlands | \n",
" 54 | \n",
"
\n",
" \n",
" 74 | \n",
" Uruguay | \n",
" 52 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country n\n",
"7 Brazil 108\n",
"39 Italy 83\n",
"2 Argentina 81\n",
"25 England 62\n",
"29 Germany FR 62\n",
"26 France 61\n",
"66 Spain 59\n",
"45 Mexico 54\n",
"47 Netherlands 54\n",
"74 Uruguay 52"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc = pd.read_csv(\"WorldCupMatches.csv\")\n",
"\n",
"# select columns\n",
"wc = wc.filter(['Year','Home Team Name','Away Team Name'])\n",
"\n",
"# make it tidy\n",
"wc = wc.melt(id_vars=[\"Year\"], var_name=\"var\", value_name=\"country\")\n",
"\n",
"# group by\n",
"wc = wc.groupby(\"country\")\n",
"\n",
"# count occurrences\n",
"wc = wc.size()\n",
"\n",
"# move index to column with new name\n",
"wc = wc.reset_index(name=\"n\")\n",
"\n",
"# sort\n",
"wc = wc.sort_values(by=\"n\", ascending=False)\n",
"\n",
"# print 10\n",
"wc.head(10)"
]
},
{
"cell_type": "markdown",
"id": "7f5fabe4",
"metadata": {},
"source": [
"#### Method 2: Pandas Pipe"
]
},
{
"cell_type": "code",
"execution_count": 99,
"id": "ed9353c9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"country\n",
"Algeria 14\n",
"Angola 3\n",
"Argentina 81\n",
"Australia 13\n",
"Austria 29\n",
"Belgium 43\n",
"Bolivia 6\n",
"Brazil 108\n",
"Bulgaria 26\n",
"Cameroon 23\n",
"dtype: int64"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc = pd.read_csv(\"WorldCupMatches.csv\")\n",
"\n",
"# select columns\n",
"wc_10 = ( # encapsulate\n",
" wc.filter(['Year','Home Team Name','Away Team Name']).\n",
" melt(id_vars=[\"Year\"], var_name=\"var\", value_name=\"country\"). \n",
" groupby(\"country\"). \n",
" size(). \n",
" reset_index(name=\"n\").\n",
" sort_values(by=\"n\", ascending=False)\n",
" ) # close\n",
"\n",
"# print 10\n",
"wc_10.head(10)"
]
},
{
"cell_type": "markdown",
"id": "b0aba9fd",
"metadata": {},
"source": [
"#### Final notes in Piping\n",
"\n",
"
\n",
"\n",
"To understand pipes, you should always remember: **data in, data out**. That's what pipes do: apply methods sequentially to `pandas dataframes`. \n",
" \n",
"
\n"
]
},
{
"cell_type": "markdown",
"id": "20cd3b1e",
"metadata": {},
"source": [
"It should be clear by now, but these are some of the advantages of piping your code: \n",
" \n",
"- Improves code structure\n",
"- Eliminates intermediate variables\t\n",
"- Can improve code readability\n",
"- Memory efficiency by eliminating intermediate variables\n",
"- Makes easier to add steps anywhere in the sequence of operations\n",
"\n",
"Keep in mind pipes also have some disadvantages. In my view, specially when working in notebook in which you do not execute line by line, pipes can make codes a bit harder to debug and also embed errors that could be more easily perceived by examining intermediate variables. "
]
},
{
"cell_type": "markdown",
"id": "0b0eefe1",
"metadata": {},
"source": [
"### Column-Wise Operations\n",
"\n",
"For data wrangling tasks at the columns of your data frame, we will discuss: \n",
"\n",
"- Select columns\n",
"- Drop columns\n",
"- Create new columns\n",
"- Rename columns\n",
"\n",
"\n",
"### For all operations we will see index vs function-based implementation"
]
},
{
"cell_type": "markdown",
"id": "9ef25853",
"metadata": {},
"source": [
"### Select Columns\n",
"\n",
"**Functionality:**\n",
"\n",
"- Select specific variables/column indices\n",
"\n",
"**Implementation**\n",
"\n",
"- Traditional indexing in pandas\n",
"- `.loc()` methods\n",
"- `pd.filter()` methods (allows piping). \n"
]
},
{
"cell_type": "markdown",
"id": "e7027ba9",
"metadata": {},
"source": [
"#### Select via index"
]
},
{
"cell_type": "code",
"execution_count": 100,
"id": "dbc59d95",
"metadata": {},
"outputs": [],
"source": [
"# load worldcup dataset\n",
"wc = pd.read_csv(\"WorldCups.csv\")\n",
"wc_matches = pd.read_csv(\"WorldCupMatches.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 101,
"id": "f78ee536",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1930\n",
"1 1934\n",
"2 1938\n",
"3 1950\n",
"4 1954\n",
" ... \n",
"15 1998\n",
"16 2002\n",
"17 2006\n",
"18 2010\n",
"19 2014\n",
"Name: Year, Length: 20, dtype: int64"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## simple index for columns\n",
"wc[\"Year\"]"
]
},
{
"cell_type": "code",
"execution_count": 102,
"id": "82fddbe5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1930\n",
"1 1934\n",
"2 1938\n",
"3 1950\n",
"4 1954\n",
" ... \n",
"15 1998\n",
"16 2002\n",
"17 2006\n",
"18 2010\n",
"19 2014\n",
"Name: Year, Length: 20, dtype: int64"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## using dot method\n",
"wc.Year"
]
},
{
"cell_type": "code",
"execution_count": 103,
"id": "3c4fa022",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1930\n",
"1 1934\n",
"2 1938\n",
"3 1950\n",
"4 1954\n",
" ... \n",
"15 1998\n",
"16 2002\n",
"17 2006\n",
"18 2010\n",
"19 2014\n",
"Name: Year, Length: 20, dtype: int64"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## using .loc method\n",
"wc.loc[:,\"Year\"]"
]
},
{
"cell_type": "markdown",
"id": "8d1987b7",
"metadata": {},
"source": [
"Notice, in all cases, the output comes as a `pandas series`. If you would like the output to be a full data frame, or if you need to select multiple columns, you should give a list of indexes as inputs"
]
},
{
"cell_type": "code",
"execution_count": 105,
"id": "721044a7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" Year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Uruguay | \n",
" 1930 | \n",
"
\n",
" \n",
" 1 | \n",
" Italy | \n",
" 1934 | \n",
"
\n",
" \n",
" 2 | \n",
" Italy | \n",
" 1938 | \n",
"
\n",
" \n",
" 3 | \n",
" Uruguay | \n",
" 1950 | \n",
"
\n",
" \n",
" 4 | \n",
" Germany FR | \n",
" 1954 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" France | \n",
" 1998 | \n",
"
\n",
" \n",
" 16 | \n",
" Brazil | \n",
" 2002 | \n",
"
\n",
" \n",
" 17 | \n",
" Italy | \n",
" 2006 | \n",
"
\n",
" \n",
" 18 | \n",
" Spain | \n",
" 2010 | \n",
"
\n",
" \n",
" 19 | \n",
" Germany | \n",
" 2014 | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Winner Year\n",
"0 Uruguay 1930\n",
"1 Italy 1934\n",
"2 Italy 1938\n",
"3 Uruguay 1950\n",
"4 Germany FR 1954\n",
".. ... ...\n",
"15 France 1998\n",
"16 Brazil 2002\n",
"17 Italy 2006\n",
"18 Spain 2010\n",
"19 Germany 2014\n",
"\n",
"[20 rows x 2 columns]"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc[[\"Winner\",\"Year\"]]"
]
},
{
"cell_type": "markdown",
"id": "2c9e7622",
"metadata": {},
"source": [
"#### `.loc()` methods for selecting columns\n",
"\n",
"It allows you to select multible variables in between column names!"
]
},
{
"cell_type": "code",
"execution_count": 106,
"id": "7bfe7854",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" 1998 | \n",
" France | \n",
" France | \n",
"
\n",
" \n",
" 16 | \n",
" 2002 | \n",
" Korea/Japan | \n",
" Brazil | \n",
"
\n",
" \n",
" 17 | \n",
" 2006 | \n",
" Germany | \n",
" Italy | \n",
"
\n",
" \n",
" 18 | \n",
" 2010 | \n",
" South Africa | \n",
" Spain | \n",
"
\n",
" \n",
" 19 | \n",
" 2014 | \n",
" Brazil | \n",
" Germany | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 3 columns
\n",
"
"
],
"text/plain": [
" Year Country Winner\n",
"0 1930 Uruguay Uruguay\n",
"1 1934 Italy Italy\n",
"2 1938 France Italy\n",
"3 1950 Brazil Uruguay\n",
"4 1954 Switzerland Germany FR\n",
".. ... ... ...\n",
"15 1998 France France\n",
"16 2002 Korea/Japan Brazil\n",
"17 2006 Germany Italy\n",
"18 2010 South Africa Spain\n",
"19 2014 Brazil Germany\n",
"\n",
"[20 rows x 3 columns]"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# .loc for a single or multiple columns\n",
"wc.loc[: , \"Year\":\"Winner\"]"
]
},
{
"cell_type": "code",
"execution_count": 107,
"id": "0790cf61",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
"
\n",
" \n",
" 5 | \n",
" 1958 | \n",
" Sweden | \n",
" Brazil | \n",
"
\n",
" \n",
" 6 | \n",
" 1962 | \n",
" Chile | \n",
" Brazil | \n",
"
\n",
" \n",
" 7 | \n",
" 1966 | \n",
" England | \n",
" England | \n",
"
\n",
" \n",
" 8 | \n",
" 1970 | \n",
" Mexico | \n",
" Brazil | \n",
"
\n",
" \n",
" 9 | \n",
" 1974 | \n",
" Germany | \n",
" Germany FR | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner\n",
"0 1930 Uruguay Uruguay\n",
"1 1934 Italy Italy\n",
"2 1938 France Italy\n",
"3 1950 Brazil Uruguay\n",
"4 1954 Switzerland Germany FR\n",
"5 1958 Sweden Brazil\n",
"6 1962 Chile Brazil\n",
"7 1966 England England\n",
"8 1970 Mexico Brazil\n",
"9 1974 Germany Germany FR"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# iloc for numeric positions\n",
"wc.iloc[0:10,0:3]"
]
},
{
"cell_type": "markdown",
"id": "26c983dc",
"metadata": {},
"source": [
"#### `.filter()` method\n",
"\n",
"The `filter` methods in Pandas works similarly to the `select` function in the Tidyverse in R. It has the following advantages: \n",
"\n",
"- Allows for a piping approach\n",
"- Can be combined with regex queries for selectins columns"
]
},
{
"cell_type": "code",
"execution_count": 108,
"id": "715ea259",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Winner | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" Italy | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Uruguay | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Germany FR | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" 1998 | \n",
" France | \n",
"
\n",
" \n",
" 16 | \n",
" 2002 | \n",
" Brazil | \n",
"
\n",
" \n",
" 17 | \n",
" 2006 | \n",
" Italy | \n",
"
\n",
" \n",
" 18 | \n",
" 2010 | \n",
" Spain | \n",
"
\n",
" \n",
" 19 | \n",
" 2014 | \n",
" Germany | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Year Winner\n",
"0 1930 Uruguay\n",
"1 1934 Italy\n",
"2 1938 Italy\n",
"3 1950 Uruguay\n",
"4 1954 Germany FR\n",
".. ... ...\n",
"15 1998 France\n",
"16 2002 Brazil\n",
"17 2006 Italy\n",
"18 2010 Spain\n",
"19 2014 Germany\n",
"\n",
"[20 rows x 2 columns]"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# simple filter\n",
"wc.filter([\"Year\", \"Winner\"])"
]
},
{
"cell_type": "markdown",
"id": "4bf56abe",
"metadata": {},
"source": [
"**Using the like parameter:** Select columns that contain a specific substring."
]
},
{
"cell_type": "code",
"execution_count": 109,
"id": "57078e79",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Away Team Goals | \n",
" Away Team Name | \n",
" Half-time Away Goals | \n",
" Away Team Initials | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Mexico | \n",
" 0 | \n",
" MEX | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" Belgium | \n",
" 0 | \n",
" BEL | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Brazil | \n",
" 0 | \n",
" BRA | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" Peru | \n",
" 0 | \n",
" PER | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" France | \n",
" 0 | \n",
" FRA | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 847 | \n",
" 0 | \n",
" Costa Rica | \n",
" 0 | \n",
" CRC | \n",
"
\n",
" \n",
" 848 | \n",
" 7 | \n",
" Germany | \n",
" 5 | \n",
" GER | \n",
"
\n",
" \n",
" 849 | \n",
" 0 | \n",
" Argentina | \n",
" 0 | \n",
" ARG | \n",
"
\n",
" \n",
" 850 | \n",
" 3 | \n",
" Netherlands | \n",
" 2 | \n",
" NED | \n",
"
\n",
" \n",
" 851 | \n",
" 0 | \n",
" Argentina | \n",
" 0 | \n",
" ARG | \n",
"
\n",
" \n",
"
\n",
"
852 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Away Team Goals Away Team Name Half-time Away Goals Away Team Initials\n",
"0 1 Mexico 0 MEX\n",
"1 0 Belgium 0 BEL\n",
"2 1 Brazil 0 BRA\n",
"3 1 Peru 0 PER\n",
"4 0 France 0 FRA\n",
".. ... ... ... ...\n",
"847 0 Costa Rica 0 CRC\n",
"848 7 Germany 5 GER\n",
"849 0 Argentina 0 ARG\n",
"850 3 Netherlands 2 NED\n",
"851 0 Argentina 0 ARG\n",
"\n",
"[852 rows x 4 columns]"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# like parameter. In R: data %>% select(contains(\"Away\"))\n",
"wc_matches.filter(like=\"Away\")"
]
},
{
"cell_type": "markdown",
"id": "5851289c",
"metadata": {},
"source": [
"**Using the Regex:** You can also input regex queries for selecting columns. More and More flexibility"
]
},
{
"cell_type": "code",
"execution_count": 110,
"id": "1caf62e9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Away Team Goals | \n",
" Away Team Name | \n",
" Away Team Initials | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Mexico | \n",
" MEX | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" Belgium | \n",
" BEL | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Brazil | \n",
" BRA | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" Peru | \n",
" PER | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" France | \n",
" FRA | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 847 | \n",
" 0 | \n",
" Costa Rica | \n",
" CRC | \n",
"
\n",
" \n",
" 848 | \n",
" 7 | \n",
" Germany | \n",
" GER | \n",
"
\n",
" \n",
" 849 | \n",
" 0 | \n",
" Argentina | \n",
" ARG | \n",
"
\n",
" \n",
" 850 | \n",
" 3 | \n",
" Netherlands | \n",
" NED | \n",
"
\n",
" \n",
" 851 | \n",
" 0 | \n",
" Argentina | \n",
" ARG | \n",
"
\n",
" \n",
"
\n",
"
852 rows × 3 columns
\n",
"
"
],
"text/plain": [
" Away Team Goals Away Team Name Away Team Initials\n",
"0 1 Mexico MEX\n",
"1 0 Belgium BEL\n",
"2 1 Brazil BRA\n",
"3 1 Peru PER\n",
"4 0 France FRA\n",
".. ... ... ...\n",
"847 0 Costa Rica CRC\n",
"848 7 Germany GER\n",
"849 0 Argentina ARG\n",
"850 3 Netherlands NED\n",
"851 0 Argentina ARG\n",
"\n",
"[852 rows x 3 columns]"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# starts with\n",
"wc_matches.filter(regex=\"^Away\")"
]
},
{
"cell_type": "code",
"execution_count": 111,
"id": "ea192241",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Home Team Initials | \n",
" Away Team Initials | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" FRA | \n",
" MEX | \n",
"
\n",
" \n",
" 1 | \n",
" USA | \n",
" BEL | \n",
"
\n",
" \n",
" 2 | \n",
" YUG | \n",
" BRA | \n",
"
\n",
" \n",
" 3 | \n",
" ROU | \n",
" PER | \n",
"
\n",
" \n",
" 4 | \n",
" ARG | \n",
" FRA | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 847 | \n",
" NED | \n",
" CRC | \n",
"
\n",
" \n",
" 848 | \n",
" BRA | \n",
" GER | \n",
"
\n",
" \n",
" 849 | \n",
" NED | \n",
" ARG | \n",
"
\n",
" \n",
" 850 | \n",
" BRA | \n",
" NED | \n",
"
\n",
" \n",
" 851 | \n",
" GER | \n",
" ARG | \n",
"
\n",
" \n",
"
\n",
"
852 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Home Team Initials Away Team Initials\n",
"0 FRA MEX\n",
"1 USA BEL\n",
"2 YUG BRA\n",
"3 ROU PER\n",
"4 ARG FRA\n",
".. ... ...\n",
"847 NED CRC\n",
"848 BRA GER\n",
"849 NED ARG\n",
"850 BRA NED\n",
"851 GER ARG\n",
"\n",
"[852 rows x 2 columns]"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# ends with\n",
"wc_matches.filter(regex=\"Initials$\")"
]
},
{
"cell_type": "markdown",
"id": "e7a32aac",
"metadata": {},
"source": [
"### Drop Columns\n",
"\n",
"**Functionality:**\n",
"\n",
"Drop specific variables/column indices\n",
"\n",
"**Implementation:**\n",
"\n",
"- Indexing + Boolean operations\n",
"- .loc() methods\n",
"- pd.drop() methods (allows piping)."
]
},
{
"cell_type": "code",
"execution_count": 112,
"id": "f8749af8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
"
\n",
" \n",
" 1 | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
"
\n",
" \n",
" 2 | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
"
\n",
" \n",
" 3 | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
"
\n",
" \n",
" 4 | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" France | \n",
" Brazil | \n",
" Croatia | \n",
" Netherlands | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 2.785.100 | \n",
"
\n",
" \n",
" 16 | \n",
" Brazil | \n",
" Germany | \n",
" Turkey | \n",
" Korea Republic | \n",
" 161 | \n",
" 32 | \n",
" 64 | \n",
" 2.705.197 | \n",
"
\n",
" \n",
" 17 | \n",
" Italy | \n",
" France | \n",
" Germany | \n",
" Portugal | \n",
" 147 | \n",
" 32 | \n",
" 64 | \n",
" 3.359.439 | \n",
"
\n",
" \n",
" 18 | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
"
\n",
" \n",
" 19 | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 8 columns
\n",
"
"
],
"text/plain": [
" Winner Runners-Up Third Fourth GoalsScored \\\n",
"0 Uruguay Argentina USA Yugoslavia 70 \n",
"1 Italy Czechoslovakia Germany Austria 70 \n",
"2 Italy Hungary Brazil Sweden 84 \n",
"3 Uruguay Brazil Sweden Spain 88 \n",
"4 Germany FR Hungary Austria Uruguay 140 \n",
".. ... ... ... ... ... \n",
"15 France Brazil Croatia Netherlands 171 \n",
"16 Brazil Germany Turkey Korea Republic 161 \n",
"17 Italy France Germany Portugal 147 \n",
"18 Spain Netherlands Germany Uruguay 145 \n",
"19 Germany Argentina Netherlands Brazil 171 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"0 13 18 590.549 \n",
"1 16 17 363.000 \n",
"2 15 18 375.700 \n",
"3 13 22 1.045.246 \n",
"4 16 26 768.607 \n",
".. ... ... ... \n",
"15 32 64 2.785.100 \n",
"16 32 64 2.705.197 \n",
"17 32 64 3.359.439 \n",
"18 32 64 3.178.856 \n",
"19 32 64 3.386.810 \n",
"\n",
"[20 rows x 8 columns]"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## select all but \"Year\" and \"Country\"\n",
"wc[wc.columns[~wc.columns.isin([\"Year\", \"Country\"])]]"
]
},
{
"cell_type": "code",
"execution_count": 113,
"id": "9999d5bc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Winner', 'Runners-Up', 'Third', 'Fourth', 'GoalsScored',\n",
" 'QualifiedTeams', 'MatchesPlayed', 'Attendance'],\n",
" dtype='object')"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Indexing: Bit too much?!\n",
"# .isin returns a boolean.\n",
"wc.columns[~wc.columns.isin([\"Year\", \"Country\"])]"
]
},
{
"cell_type": "code",
"execution_count": 114,
"id": "7a457c2c",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([False, False, True, True, True, True, True, True, True,\n",
" True])"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# What is going on here?\n",
"~wc.columns.isin([\"Year\", \"Country\"])"
]
},
{
"cell_type": "markdown",
"id": "d45f1bfb",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"bol_col is an array of booleans. If you throw them directly as a index, Pandas interprets `wc[bol_col]` as trying to index the rows of the DataFrame, not the columns. This is because when you pass a boolean array directly to the DataFrame indexing operator [], pandas assumes you're trying to index rows based on the boolean array.\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 115,
"id": "3c469c67",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
"
\n",
" \n",
" 1 | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
"
\n",
" \n",
" 2 | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
"
\n",
" \n",
" 3 | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
"
\n",
" \n",
" 4 | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" France | \n",
" Brazil | \n",
" Croatia | \n",
" Netherlands | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 2.785.100 | \n",
"
\n",
" \n",
" 16 | \n",
" Brazil | \n",
" Germany | \n",
" Turkey | \n",
" Korea Republic | \n",
" 161 | \n",
" 32 | \n",
" 64 | \n",
" 2.705.197 | \n",
"
\n",
" \n",
" 17 | \n",
" Italy | \n",
" France | \n",
" Germany | \n",
" Portugal | \n",
" 147 | \n",
" 32 | \n",
" 64 | \n",
" 3.359.439 | \n",
"
\n",
" \n",
" 18 | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
"
\n",
" \n",
" 19 | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 8 columns
\n",
"
"
],
"text/plain": [
" Winner Runners-Up Third Fourth GoalsScored \\\n",
"0 Uruguay Argentina USA Yugoslavia 70 \n",
"1 Italy Czechoslovakia Germany Austria 70 \n",
"2 Italy Hungary Brazil Sweden 84 \n",
"3 Uruguay Brazil Sweden Spain 88 \n",
"4 Germany FR Hungary Austria Uruguay 140 \n",
".. ... ... ... ... ... \n",
"15 France Brazil Croatia Netherlands 171 \n",
"16 Brazil Germany Turkey Korea Republic 161 \n",
"17 Italy France Germany Portugal 147 \n",
"18 Spain Netherlands Germany Uruguay 145 \n",
"19 Germany Argentina Netherlands Brazil 171 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"0 13 18 590.549 \n",
"1 16 17 363.000 \n",
"2 15 18 375.700 \n",
"3 13 22 1.045.246 \n",
"4 16 26 768.607 \n",
".. ... ... ... \n",
"15 32 64 2.785.100 \n",
"16 32 64 2.705.197 \n",
"17 32 64 3.359.439 \n",
"18 32 64 3.178.856 \n",
"19 32 64 3.386.810 \n",
"\n",
"[20 rows x 8 columns]"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# loc methods\n",
"wc.loc[:,~wc.columns.isin([\"Year\", \"Country\"])]"
]
},
{
"cell_type": "code",
"execution_count": 119,
"id": "2fa44e72",
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Item wrong length 10 instead of 20.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[119], line 4\u001b[0m\n\u001b[1;32m 2\u001b[0m bol_col \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m~\u001b[39mwc\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39misin([\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mYear\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCountry\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n\u001b[1;32m 3\u001b[0m bol_col\n\u001b[0;32m----> 4\u001b[0m wc[bol_col]\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3798\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3796\u001b[0m \u001b[38;5;66;03m# Do we have a (boolean) 1d indexer?\u001b[39;00m\n\u001b[1;32m 3797\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m com\u001b[38;5;241m.\u001b[39mis_bool_indexer(key):\n\u001b[0;32m-> 3798\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_bool_array(key)\n\u001b[1;32m 3800\u001b[0m \u001b[38;5;66;03m# We are left with two options: a single key, and a collection of keys,\u001b[39;00m\n\u001b[1;32m 3801\u001b[0m \u001b[38;5;66;03m# We interpret tuples as collections only for non-MultiIndex\u001b[39;00m\n\u001b[1;32m 3802\u001b[0m is_single_key \u001b[38;5;241m=\u001b[39m \u001b[38;5;28misinstance\u001b[39m(key, \u001b[38;5;28mtuple\u001b[39m) \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_list_like(key)\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3845\u001b[0m, in \u001b[0;36mDataFrame._getitem_bool_array\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3839\u001b[0m warnings\u001b[38;5;241m.\u001b[39mwarn(\n\u001b[1;32m 3840\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mBoolean Series key will be reindexed to match DataFrame index.\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 3841\u001b[0m \u001b[38;5;167;01mUserWarning\u001b[39;00m,\n\u001b[1;32m 3842\u001b[0m stacklevel\u001b[38;5;241m=\u001b[39mfind_stack_level(),\n\u001b[1;32m 3843\u001b[0m )\n\u001b[1;32m 3844\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(key) \u001b[38;5;241m!=\u001b[39m \u001b[38;5;28mlen\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mindex):\n\u001b[0;32m-> 3845\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 3846\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mItem wrong length \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mlen\u001b[39m(key)\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m instead of \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mlen\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mindex)\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 3847\u001b[0m )\n\u001b[1;32m 3849\u001b[0m \u001b[38;5;66;03m# check_bool_indexer will throw exception if Series key cannot\u001b[39;00m\n\u001b[1;32m 3850\u001b[0m \u001b[38;5;66;03m# be reindexed to match DataFrame rows\u001b[39;00m\n\u001b[1;32m 3851\u001b[0m key \u001b[38;5;241m=\u001b[39m check_bool_indexer(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mindex, key)\n",
"\u001b[0;31mValueError\u001b[0m: Item wrong length 10 instead of 20."
]
}
],
"source": [
"# indexing -> error\n",
"bol_col = ~wc.columns.isin([\"Year\", \"Country\"])\n",
"bol_col\n",
"wc[bol_col]"
]
},
{
"cell_type": "code",
"execution_count": 120,
"id": "e448a7e7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
"
\n",
" \n",
" 1 | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
"
\n",
" \n",
" 2 | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
"
\n",
" \n",
" 3 | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
"
\n",
" \n",
" 4 | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" France | \n",
" Brazil | \n",
" Croatia | \n",
" Netherlands | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 2.785.100 | \n",
"
\n",
" \n",
" 16 | \n",
" Brazil | \n",
" Germany | \n",
" Turkey | \n",
" Korea Republic | \n",
" 161 | \n",
" 32 | \n",
" 64 | \n",
" 2.705.197 | \n",
"
\n",
" \n",
" 17 | \n",
" Italy | \n",
" France | \n",
" Germany | \n",
" Portugal | \n",
" 147 | \n",
" 32 | \n",
" 64 | \n",
" 3.359.439 | \n",
"
\n",
" \n",
" 18 | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
"
\n",
" \n",
" 19 | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 8 columns
\n",
"
"
],
"text/plain": [
" Winner Runners-Up Third Fourth GoalsScored \\\n",
"0 Uruguay Argentina USA Yugoslavia 70 \n",
"1 Italy Czechoslovakia Germany Austria 70 \n",
"2 Italy Hungary Brazil Sweden 84 \n",
"3 Uruguay Brazil Sweden Spain 88 \n",
"4 Germany FR Hungary Austria Uruguay 140 \n",
".. ... ... ... ... ... \n",
"15 France Brazil Croatia Netherlands 171 \n",
"16 Brazil Germany Turkey Korea Republic 161 \n",
"17 Italy France Germany Portugal 147 \n",
"18 Spain Netherlands Germany Uruguay 145 \n",
"19 Germany Argentina Netherlands Brazil 171 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"0 13 18 590.549 \n",
"1 16 17 363.000 \n",
"2 15 18 375.700 \n",
"3 13 22 1.045.246 \n",
"4 16 26 768.607 \n",
".. ... ... ... \n",
"15 32 64 2.785.100 \n",
"16 32 64 2.705.197 \n",
"17 32 64 3.359.439 \n",
"18 32 64 3.178.856 \n",
"19 32 64 3.386.810 \n",
"\n",
"[20 rows x 8 columns]"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# With .loc, it works\n",
"wc.loc[:,bol_col]"
]
},
{
"cell_type": "markdown",
"id": "c372fd66",
"metadata": {},
"source": [
"#### `pd.drop()` methods: easier way to go"
]
},
{
"cell_type": "code",
"execution_count": 121,
"id": "e4025f7a",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
"
\n",
" \n",
" 1 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
"
\n",
" \n",
" 2 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
"
\n",
" \n",
" 3 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
"
\n",
" \n",
" 4 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" France | \n",
" France | \n",
" Brazil | \n",
" Croatia | \n",
" Netherlands | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 2.785.100 | \n",
"
\n",
" \n",
" 16 | \n",
" Korea/Japan | \n",
" Brazil | \n",
" Germany | \n",
" Turkey | \n",
" Korea Republic | \n",
" 161 | \n",
" 32 | \n",
" 64 | \n",
" 2.705.197 | \n",
"
\n",
" \n",
" 17 | \n",
" Germany | \n",
" Italy | \n",
" France | \n",
" Germany | \n",
" Portugal | \n",
" 147 | \n",
" 32 | \n",
" 64 | \n",
" 3.359.439 | \n",
"
\n",
" \n",
" 18 | \n",
" South Africa | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
"
\n",
" \n",
" 19 | \n",
" Brazil | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 9 columns
\n",
"
"
],
"text/plain": [
" Country Winner Runners-Up Third Fourth \\\n",
"0 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 Italy Italy Czechoslovakia Germany Austria \n",
"2 France Italy Hungary Brazil Sweden \n",
"3 Brazil Uruguay Brazil Sweden Spain \n",
"4 Switzerland Germany FR Hungary Austria Uruguay \n",
".. ... ... ... ... ... \n",
"15 France France Brazil Croatia Netherlands \n",
"16 Korea/Japan Brazil Germany Turkey Korea Republic \n",
"17 Germany Italy France Germany Portugal \n",
"18 South Africa Spain Netherlands Germany Uruguay \n",
"19 Brazil Germany Argentina Netherlands Brazil \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance \n",
"0 70 13 18 590.549 \n",
"1 70 16 17 363.000 \n",
"2 84 15 18 375.700 \n",
"3 88 13 22 1.045.246 \n",
"4 140 16 26 768.607 \n",
".. ... ... ... ... \n",
"15 171 32 64 2.785.100 \n",
"16 161 32 64 2.705.197 \n",
"17 147 32 64 3.359.439 \n",
"18 145 32 64 3.178.856 \n",
"19 171 32 64 3.386.810 \n",
"\n",
"[20 rows x 9 columns]"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc.drop(columns=[\"Year\"])"
]
},
{
"cell_type": "code",
"execution_count": 122,
"id": "8ef26407",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Help on function drop in module pandas.core.frame:\n",
"\n",
"drop(self, labels: 'IndexLabel' = None, *, axis: 'Axis' = 0, index: 'IndexLabel' = None, columns: 'IndexLabel' = None, level: 'Level' = None, inplace: 'bool' = False, errors: 'IgnoreRaise' = 'raise') -> 'DataFrame | None'\n",
" Drop specified labels from rows or columns.\n",
" \n",
" Remove rows or columns by specifying label names and corresponding\n",
" axis, or by specifying directly index or column names. When using a\n",
" multi-index, labels on different levels can be removed by specifying\n",
" the level. See the `user guide `\n",
" for more information about the now unused levels.\n",
" \n",
" Parameters\n",
" ----------\n",
" labels : single label or list-like\n",
" Index or column labels to drop. A tuple will be used as a single\n",
" label and not treated as a list-like.\n",
" axis : {0 or 'index', 1 or 'columns'}, default 0\n",
" Whether to drop labels from the index (0 or 'index') or\n",
" columns (1 or 'columns').\n",
" index : single label or list-like\n",
" Alternative to specifying axis (``labels, axis=0``\n",
" is equivalent to ``index=labels``).\n",
" columns : single label or list-like\n",
" Alternative to specifying axis (``labels, axis=1``\n",
" is equivalent to ``columns=labels``).\n",
" level : int or level name, optional\n",
" For MultiIndex, level from which the labels will be removed.\n",
" inplace : bool, default False\n",
" If False, return a copy. Otherwise, do operation\n",
" inplace and return None.\n",
" errors : {'ignore', 'raise'}, default 'raise'\n",
" If 'ignore', suppress error and only existing labels are\n",
" dropped.\n",
" \n",
" Returns\n",
" -------\n",
" DataFrame or None\n",
" DataFrame without the removed index or column labels or\n",
" None if ``inplace=True``.\n",
" \n",
" Raises\n",
" ------\n",
" KeyError\n",
" If any of the labels is not found in the selected axis.\n",
" \n",
" See Also\n",
" --------\n",
" DataFrame.loc : Label-location based indexer for selection by label.\n",
" DataFrame.dropna : Return DataFrame with labels on given axis omitted\n",
" where (all or any) data are missing.\n",
" DataFrame.drop_duplicates : Return DataFrame with duplicate rows\n",
" removed, optionally only considering certain columns.\n",
" Series.drop : Return Series with specified index labels removed.\n",
" \n",
" Examples\n",
" --------\n",
" >>> df = pd.DataFrame(np.arange(12).reshape(3, 4),\n",
" ... columns=['A', 'B', 'C', 'D'])\n",
" >>> df\n",
" A B C D\n",
" 0 0 1 2 3\n",
" 1 4 5 6 7\n",
" 2 8 9 10 11\n",
" \n",
" Drop columns\n",
" \n",
" >>> df.drop(['B', 'C'], axis=1)\n",
" A D\n",
" 0 0 3\n",
" 1 4 7\n",
" 2 8 11\n",
" \n",
" >>> df.drop(columns=['B', 'C'])\n",
" A D\n",
" 0 0 3\n",
" 1 4 7\n",
" 2 8 11\n",
" \n",
" Drop a row by index\n",
" \n",
" >>> df.drop([0, 1])\n",
" A B C D\n",
" 2 8 9 10 11\n",
" \n",
" Drop columns and/or rows of MultiIndex DataFrame\n",
" \n",
" >>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],\n",
" ... ['speed', 'weight', 'length']],\n",
" ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],\n",
" ... [0, 1, 2, 0, 1, 2, 0, 1, 2]])\n",
" >>> df = pd.DataFrame(index=midx, columns=['big', 'small'],\n",
" ... data=[[45, 30], [200, 100], [1.5, 1], [30, 20],\n",
" ... [250, 150], [1.5, 0.8], [320, 250],\n",
" ... [1, 0.8], [0.3, 0.2]])\n",
" >>> df\n",
" big small\n",
" lama speed 45.0 30.0\n",
" weight 200.0 100.0\n",
" length 1.5 1.0\n",
" cow speed 30.0 20.0\n",
" weight 250.0 150.0\n",
" length 1.5 0.8\n",
" falcon speed 320.0 250.0\n",
" weight 1.0 0.8\n",
" length 0.3 0.2\n",
" \n",
" Drop a specific index combination from the MultiIndex\n",
" DataFrame, i.e., drop the combination ``'falcon'`` and\n",
" ``'weight'``, which deletes only the corresponding row\n",
" \n",
" >>> df.drop(index=('falcon', 'weight'))\n",
" big small\n",
" lama speed 45.0 30.0\n",
" weight 200.0 100.0\n",
" length 1.5 1.0\n",
" cow speed 30.0 20.0\n",
" weight 250.0 150.0\n",
" length 1.5 0.8\n",
" falcon speed 320.0 250.0\n",
" length 0.3 0.2\n",
" \n",
" >>> df.drop(index='cow', columns='small')\n",
" big\n",
" lama speed 45.0\n",
" weight 200.0\n",
" length 1.5\n",
" falcon speed 320.0\n",
" weight 1.0\n",
" length 0.3\n",
" \n",
" >>> df.drop(index='length', level=1)\n",
" big small\n",
" lama speed 45.0 30.0\n",
" weight 200.0 100.0\n",
" cow speed 30.0 20.0\n",
" weight 250.0 150.0\n",
" falcon speed 320.0 250.0\n",
" weight 1.0 0.8\n",
"\n"
]
}
],
"source": [
"# see here why you need the argument columns\n",
"help(pd.DataFrame.drop)"
]
},
{
"cell_type": "markdown",
"id": "fa1d2f33",
"metadata": {},
"source": [
"### Create new columns\n",
"\n",
"**Functionality:**\n",
"\n",
"- Create a new column/index given inputs and or transformations from other columns. \n",
"\n",
"**Implementation**\n",
"\n",
"Traditional index assignment. Advantages:\n",
" \n",
"+ Looks like a dictionary operation\n",
"+ Overwrites the data frame\n",
"\n",
"\n",
"`.assign()` method. Advantage:\n",
" \n",
"+ It returns a dataframe so you can chain/pipe operations\n",
"+ Can create multiple variables in a single call\n",
"+ Easy to combine with numpy + lambda functions\n",
"+ Improves readibility.\n",
" \n",
"Let's see examples of both methods:\n"
]
},
{
"cell_type": "markdown",
"id": "ec33c1a0",
"metadata": {},
"source": [
"#### Transformation via index assignment"
]
},
{
"cell_type": "code",
"execution_count": 123,
"id": "96d886de",
"metadata": {},
"outputs": [],
"source": [
"# With built in math operations\n",
"wc[\"av_goals_matches\"] = wc[\"GoalsScored\"]/wc[\"MatchesPlayed\"]"
]
},
{
"cell_type": "code",
"execution_count": 124,
"id": "fd1720a0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \n",
"0 70 13 18 590.549 3.888889 \n",
"1 70 16 17 363.000 4.117647 \n",
"2 84 15 18 375.700 4.666667 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 "
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc.head()"
]
},
{
"cell_type": "code",
"execution_count": 126,
"id": "f07a6e5e",
"metadata": {},
"outputs": [],
"source": [
"## with numpy\n",
"wc[\"winner_and_hoster\"] = np.where(wc.Country==wc.Winner, True, False)"
]
},
{
"cell_type": "code",
"execution_count": 127,
"id": "224fc1b8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Winner | \n",
" Country | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" Italy | \n",
" France | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Uruguay | \n",
" Brazil | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Germany FR | \n",
" Switzerland | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Winner Country winner_and_hoster\n",
"0 1930 Uruguay Uruguay True\n",
"1 1934 Italy Italy True\n",
"2 1938 Italy France False\n",
"3 1950 Uruguay Brazil False\n",
"4 1954 Germany FR Switzerland False"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc[[\"Year\", \"Winner\", \"Country\", \"winner_and_hoster\"]].head(5)"
]
},
{
"cell_type": "markdown",
"id": "c10290dc",
"metadata": {},
"source": [
"#### Transformation via `apply` methods by rows\n",
"\n",
"\n",
"As we saw before, `apply()` methods allow you to apply functions rowwise or columnwise in python. Here you are applying a certain data transformation to every row in your dataframe "
]
},
{
"cell_type": "code",
"execution_count": 130,
"id": "6ff64fe8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"0 70 13 18 590.549 3.888889 \n",
"1 70 16 17 363.000 4.117647 \n",
"2 84 15 18 375.700 4.666667 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
"\n",
" winner_and_hoster \n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False "
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# with an apply method + function\n",
"wc[\"av_goals_matches\"] = wc.apply(lambda x: x[\"GoalsScored\"]/x[\"MatchesPlayed\"], axis=1)\n",
"wc.head()"
]
},
{
"cell_type": "markdown",
"id": "bcfbf0e0",
"metadata": {},
"source": [
"#### `.assign()` method.\n",
"\n",
"Allows you to create variables with methods chaining. "
]
},
{
"cell_type": "code",
"execution_count": 133,
"id": "d6e9fda6",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Winner | \n",
" Country | \n",
" final | \n",
" winner_and_hoster_np | \n",
" av_goals_matches | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Uruguay vs Argentina | \n",
" True | \n",
" 3.888889 | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Italy vs Czechoslovakia | \n",
" True | \n",
" 4.117647 | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" Italy | \n",
" France | \n",
" Italy vs Hungary | \n",
" False | \n",
" 4.666667 | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Uruguay | \n",
" Brazil | \n",
" Uruguay vs Brazil | \n",
" False | \n",
" 4.000000 | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Germany FR | \n",
" Switzerland | \n",
" Germany FR vs Hungary | \n",
" False | \n",
" 5.384615 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Winner Country final \\\n",
"0 1930 Uruguay Uruguay Uruguay vs Argentina \n",
"1 1934 Italy Italy Italy vs Czechoslovakia \n",
"2 1938 Italy France Italy vs Hungary \n",
"3 1950 Uruguay Brazil Uruguay vs Brazil \n",
"4 1954 Germany FR Switzerland Germany FR vs Hungary \n",
"\n",
" winner_and_hoster_np av_goals_matches \n",
"0 True 3.888889 \n",
"1 True 4.117647 \n",
"2 False 4.666667 \n",
"3 False 4.000000 \n",
"4 False 5.384615 "
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## when the winner was also hosting the world cup?\n",
"(\n",
"wc.\n",
" # multiple variables\n",
" assign(final = wc.Winner + \" vs \" + wc['Runners-Up'],\n",
" winner_and_hoster_np = np.where(wc.Winner==wc.Country, True, False), \n",
" av_goals_matches = lambda x: x[\"GoalsScored\"]/x[\"MatchesPlayed\"], \n",
" ).\n",
"##allows for methods chaining\n",
" filter([\"Year\", \"Winner\", \"Country\", \"final\",\n",
" \"winner_and_hoster_np\", \"av_goals_matches\"]).\n",
" head(5))"
]
},
{
"cell_type": "markdown",
"id": "ea91fb4d",
"metadata": {},
"source": [
"#### Pay attention here! Using newly created variables!\n",
"\n",
"`.assign` also allows for the use of newly create variables in the same chain. To do that, you need to make use of the lambda function"
]
},
{
"cell_type": "code",
"execution_count": 136,
"id": "62420826",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" best_three | \n",
" final | \n",
" Country | \n",
" Third | \n",
" Runners-Up | \n",
" Winner | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Uruguay vs Argentina and USA | \n",
" Uruguay vs Argentina | \n",
" Uruguay | \n",
" USA | \n",
" Argentina | \n",
" Uruguay | \n",
"
\n",
" \n",
" 1 | \n",
" Italy vs Czechoslovakia and Germany | \n",
" Italy vs Czechoslovakia | \n",
" Italy | \n",
" Germany | \n",
" Czechoslovakia | \n",
" Italy | \n",
"
\n",
" \n",
" 2 | \n",
" Italy vs Hungary and Brazil | \n",
" Italy vs Hungary | \n",
" France | \n",
" Brazil | \n",
" Hungary | \n",
" Italy | \n",
"
\n",
" \n",
" 3 | \n",
" Uruguay vs Brazil and Sweden | \n",
" Uruguay vs Brazil | \n",
" Brazil | \n",
" Sweden | \n",
" Brazil | \n",
" Uruguay | \n",
"
\n",
" \n",
" 4 | \n",
" Germany FR vs Hungary and Austria | \n",
" Germany FR vs Hungary | \n",
" Switzerland | \n",
" Austria | \n",
" Hungary | \n",
" Germany FR | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" best_three final Country \\\n",
"0 Uruguay vs Argentina and USA Uruguay vs Argentina Uruguay \n",
"1 Italy vs Czechoslovakia and Germany Italy vs Czechoslovakia Italy \n",
"2 Italy vs Hungary and Brazil Italy vs Hungary France \n",
"3 Uruguay vs Brazil and Sweden Uruguay vs Brazil Brazil \n",
"4 Germany FR vs Hungary and Austria Germany FR vs Hungary Switzerland \n",
"\n",
" Third Runners-Up Winner \n",
"0 USA Argentina Uruguay \n",
"1 Germany Czechoslovakia Italy \n",
"2 Brazil Hungary Italy \n",
"3 Sweden Brazil Uruguay \n",
"4 Austria Hungary Germany FR "
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Notice calling the recently created variable final\n",
"(wc.\n",
" assign(final = wc.Winner + \" vs \" + wc['Runners-Up'],\n",
" best_three = lambda x: x[\"final\"] + \" and \" + x[\"Third\"]).\n",
" filter([\"best_three\",\"final\", \"Country\", \"Third\", \"Runners-Up\", \"Winner\"]).\n",
" head(5)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "71bed059",
"metadata": {},
"source": [
"\n",
"\n",
"**Alert:** Spend a few seconds trying to understand the use of the lambda function in the code above. \n",
"\n",
"
\n",
"\n",
"The combination of lambda (or a normal function) and `.assign()` is actually a nice property that resambles the properties of `mutate` in R. A lambda function (or a normal function) with .assign passes in the **current state of the dataframe** into the function. Because we create the variable in the state before, the lambda function allows us to access this new variable, and manipulate it in sequence. This also works for cases of grouped data, or chains in which we filter observations and transform the dataframe\n",
"\n",
"To make this point clear, see how doing the same operation without a lambda function will throw an error: \n"
]
},
{
"cell_type": "code",
"execution_count": 137,
"id": "fcb2b49c",
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "'final'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3802\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3801\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m-> 3802\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine\u001b[38;5;241m.\u001b[39mget_loc(casted_key)\n\u001b[1;32m 3803\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:138\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:165\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32mpandas/_libs/hashtable_class_helper.pxi:5745\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32mpandas/_libs/hashtable_class_helper.pxi:5753\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'final'",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[137], line 4\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# will throw an error\u001b[39;00m\n\u001b[1;32m 2\u001b[0m (wc\u001b[38;5;241m.\u001b[39m\n\u001b[1;32m 3\u001b[0m assign(final\u001b[38;5;241m=\u001b[39m wc\u001b[38;5;241m.\u001b[39mWinner \u001b[38;5;241m+\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m vs \u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m+\u001b[39m wc[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mRunners-Up\u001b[39m\u001b[38;5;124m'\u001b[39m],\n\u001b[0;32m----> 4\u001b[0m best_three \u001b[38;5;241m=\u001b[39m wc[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfinal\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m+\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m and \u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m+\u001b[39m wc[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mThird\u001b[39m\u001b[38;5;124m\"\u001b[39m])\u001b[38;5;241m.\u001b[39m\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28mfilter\u001b[39m([\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mbest_three\u001b[39m\u001b[38;5;124m\"\u001b[39m,\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfinal\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCountry\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n\u001b[1;32m 6\u001b[0m )\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3807\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3805\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mnlevels \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m 3806\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_multilevel(key)\n\u001b[0;32m-> 3807\u001b[0m indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mget_loc(key)\n\u001b[1;32m 3808\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m is_integer(indexer):\n\u001b[1;32m 3809\u001b[0m indexer \u001b[38;5;241m=\u001b[39m [indexer]\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3804\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3802\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine\u001b[38;5;241m.\u001b[39mget_loc(casted_key)\n\u001b[1;32m 3803\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[0;32m-> 3804\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n\u001b[1;32m 3805\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[1;32m 3806\u001b[0m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[1;32m 3807\u001b[0m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[1;32m 3808\u001b[0m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[1;32m 3809\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_indexing_error(key)\n",
"\u001b[0;31mKeyError\u001b[0m: 'final'"
]
}
],
"source": [
"# will throw an error\n",
"(wc.\n",
" assign(final= wc.Winner + \" vs \" + wc['Runners-Up'],\n",
" best_three = wc[\"final\"] + \" and \" + wc[\"Third\"]).\n",
" filter([\"best_three\",\"final\", \"Country\"])\n",
")"
]
},
{
"cell_type": "markdown",
"id": "4bb472fc",
"metadata": {},
"source": [
"### Renaming \n",
"\n",
"**Functionality:**\n",
"\n",
"Rename a columns directly or a via a function. \n",
"\n",
"**Implementation:**\n",
"\n",
"- Use dictionaries!"
]
},
{
"cell_type": "code",
"execution_count": 139,
"id": "6687c475",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia 70 \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria 70 \n",
"2 1938 France Italy Hungary Brazil Sweden 84 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"0 13 18 590.549 3.888889 \n",
"1 16 17 363.000 4.117647 \n",
"2 15 18 375.700 4.666667 \n",
"\n",
" winner_and_hoster \n",
"0 True \n",
"1 True \n",
"2 False "
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pandas: renaming variables using the rename method\n",
"wc.rename(columns={\"Year\":\"year\"}).head(3)"
]
},
{
"cell_type": "code",
"execution_count": 140,
"id": "405d7ba0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Year': 'year',\n",
" 'Country': 'country',\n",
" 'Winner': 'winner',\n",
" 'Runners-Up': 'runners-up',\n",
" 'Third': 'third',\n",
" 'Fourth': 'fourth',\n",
" 'GoalsScored': 'goalsscored',\n",
" 'QualifiedTeams': 'qualifiedteams',\n",
" 'MatchesPlayed': 'matchesplayed',\n",
" 'Attendance': 'attendance',\n",
" 'av_goals_matches': 'av_goals_matches',\n",
" 'winner_and_hoster': 'winner_and_hoster'}"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{col: col.lower() for col in wc.columns}"
]
},
{
"cell_type": "code",
"execution_count": 142,
"id": "21287b68",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" country | \n",
" winner | \n",
" runners-up | \n",
" third | \n",
" fourth | \n",
" goalsscored | \n",
" qualifiedteams | \n",
" matchesplayed | \n",
" attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year country winner runners-up third fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"\n",
" goalsscored qualifiedteams matchesplayed attendance av_goals_matches \\\n",
"0 70 13 18 590.549 3.888889 \n",
"1 70 16 17 363.000 4.117647 \n",
"2 84 15 18 375.700 4.666667 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
"\n",
" winner_and_hoster \n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False "
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can use dictionary comprehension to apply functions in all\n",
"wc.rename(columns={col: col.lower() for col in wc.columns}).head(5)"
]
},
{
"cell_type": "code",
"execution_count": 143,
"id": "efb454be",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year country Winner Runners-Up Third Fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"0 70 13 18 590.549 3.888889 \n",
"1 70 16 17 363.000 4.117647 \n",
"2 84 15 18 375.700 4.666667 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
"\n",
" winner_and_hoster \n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False "
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Or for only a set of the columns by given them as inputs\n",
"wc.rename(columns={col: col.lower() for col in [\"Year\", \"Country\"]}).head(5)"
]
},
{
"cell_type": "markdown",
"id": "9e04ff7c",
"metadata": {},
"source": [
"### Practice\n",
"\n",
"In this practice questions, we'll hone our data manipulation skills by examining conflict event data generated by the [Armed Conflict Location & Event Data Project (ACLED)](https://acleddata.com/#/dashboard). The aim is to practice some of the data manipulation functions covered in the lecture. \n",
"\n",
"#### Data\n",
"\n",
"ACLED is a \"disaggregated data collection, analysis, and crisis mapping project. ACLED collects the dates, actors, locations, fatalities, and modalities of all reported political violence and protest events across Africa, South Asia, Southeast Asia, the Middle East, Central Asia and the Caucasus, Latin America and the Caribbean, and Southeastern and Eastern Europe and the Balkans.\" For this exercise, we'll focus just on the data pertaining to _Africa_. For more information regarding these data, please consult the [ACLED methodology](https://acleddata.com/resources/methodology/).\n",
"\n",
"Download the ACLED data.\n",
"\n",
"- Open the Data\n",
"- Create a new dataset with all columns with the words \"event\" or \"actor\"\n",
"- Rename all columns replacing the the word event to \"event_acled_south_america\"\n",
"- Create three new columns: \n",
"\n",
" - a binary representation for fatalities, \n",
" - a single column combining the columns: \"region\",\t\"country\",\n",
" - a new column recoding the inter1 variable with three string labels. These labels should indicate when the iter1 variable is equal to its median, below the median and above the median values."
]
},
{
"cell_type": "code",
"execution_count": 144,
"id": "7a93d943",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['event_id_cnty', 'event_date', 'year', 'time_precision',\n",
" 'disorder_type', 'event_type', 'sub_event_type', 'actor1',\n",
" 'assoc_actor_1', 'inter1', 'actor2', 'assoc_actor_2', 'inter2',\n",
" 'interaction', 'civilian_targeting', 'iso', 'region', 'country',\n",
" 'admin1', 'admin2', 'admin3', 'location', 'latitude', 'longitude',\n",
" 'geo_precision', 'source', 'source_scale', 'notes', 'fatalities',\n",
" 'tags', 'timestamp'],\n",
" dtype='object')"
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# read\n",
"acled_sa = pd.read_csv(\"acled_south_america.csv\")\n",
"acled_sa.columns"
]
},
{
"cell_type": "code",
"execution_count": 146,
"id": "95b12ab0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" event_id_cnty | \n",
" event_date | \n",
" event_type | \n",
" sub_event_type | \n",
" actor1 | \n",
" assoc_actor_1 | \n",
" actor2 | \n",
" assoc_actor_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" GUF216 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (French Guiana) | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" ARG14050 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" CGT: General Confederation of Labour; CTA: Arg... | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" ARG14057 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" Labor Group (Argentina) | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" ARG14104 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" BOL4911 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Bolivia) | \n",
" Labor Group (Bolivia) | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 167582 | \n",
" VEN8468 | \n",
" 01 January 2018 | \n",
" Protests | \n",
" Excessive force against protesters | \n",
" Protesters (Venezuela) | \n",
" Women (Venezuela) | \n",
" Military Forces of Venezuela (1999-) GNB: Vene... | \n",
" NaN | \n",
"
\n",
" \n",
" 167583 | \n",
" BRA31568 | \n",
" 01 January 2018 | \n",
" Explosions/Remote violence | \n",
" Remote explosive/landmine/IED | \n",
" Unidentified Gang (Brazil) | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 167584 | \n",
" VEN15145 | \n",
" 01 January 2018 | \n",
" Violence against civilians | \n",
" Attack | \n",
" Unidentified Armed Group (Venezuela) | \n",
" NaN | \n",
" Civilians (Venezuela) | \n",
" NaN | \n",
"
\n",
" \n",
" 167585 | \n",
" COL12093 | \n",
" 01 January 2018 | \n",
" Violence against civilians | \n",
" Attack | \n",
" Unidentified Armed Group (Colombia) | \n",
" NaN | \n",
" Civilians (Colombia) | \n",
" NaN | \n",
"
\n",
" \n",
" 167586 | \n",
" COL3347 | \n",
" 01 January 2018 | \n",
" Explosions/Remote violence | \n",
" Remote explosive/landmine/IED | \n",
" Unidentified Armed Group (Colombia) | \n",
" NaN | \n",
" Civilians (Colombia) | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
167587 rows × 8 columns
\n",
"
"
],
"text/plain": [
" event_id_cnty event_date event_type \\\n",
"0 GUF216 29 September 2023 Protests \n",
"1 ARG14050 29 September 2023 Protests \n",
"2 ARG14057 29 September 2023 Protests \n",
"3 ARG14104 29 September 2023 Protests \n",
"4 BOL4911 29 September 2023 Protests \n",
"... ... ... ... \n",
"167582 VEN8468 01 January 2018 Protests \n",
"167583 BRA31568 01 January 2018 Explosions/Remote violence \n",
"167584 VEN15145 01 January 2018 Violence against civilians \n",
"167585 COL12093 01 January 2018 Violence against civilians \n",
"167586 COL3347 01 January 2018 Explosions/Remote violence \n",
"\n",
" sub_event_type \\\n",
"0 Peaceful protest \n",
"1 Peaceful protest \n",
"2 Peaceful protest \n",
"3 Peaceful protest \n",
"4 Peaceful protest \n",
"... ... \n",
"167582 Excessive force against protesters \n",
"167583 Remote explosive/landmine/IED \n",
"167584 Attack \n",
"167585 Attack \n",
"167586 Remote explosive/landmine/IED \n",
"\n",
" actor1 \\\n",
"0 Protesters (French Guiana) \n",
"1 Protesters (Argentina) \n",
"2 Protesters (Argentina) \n",
"3 Protesters (Argentina) \n",
"4 Protesters (Bolivia) \n",
"... ... \n",
"167582 Protesters (Venezuela) \n",
"167583 Unidentified Gang (Brazil) \n",
"167584 Unidentified Armed Group (Venezuela) \n",
"167585 Unidentified Armed Group (Colombia) \n",
"167586 Unidentified Armed Group (Colombia) \n",
"\n",
" assoc_actor_1 \\\n",
"0 NaN \n",
"1 CGT: General Confederation of Labour; CTA: Arg... \n",
"2 Labor Group (Argentina) \n",
"3 NaN \n",
"4 Labor Group (Bolivia) \n",
"... ... \n",
"167582 Women (Venezuela) \n",
"167583 NaN \n",
"167584 NaN \n",
"167585 NaN \n",
"167586 NaN \n",
"\n",
" actor2 assoc_actor_2 \n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"... ... ... \n",
"167582 Military Forces of Venezuela (1999-) GNB: Vene... NaN \n",
"167583 NaN NaN \n",
"167584 Civilians (Venezuela) NaN \n",
"167585 Civilians (Colombia) NaN \n",
"167586 Civilians (Colombia) NaN \n",
"\n",
"[167587 rows x 8 columns]"
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 2\n",
"colnames = acled_sa.filter(like=\"event\").columns.to_list() + acled_sa.filter(like=\"actor\").columns.to_list()\n",
"acled_sa[colnames]"
]
},
{
"cell_type": "code",
"execution_count": 147,
"id": "916dd800",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" event_id_cnty | \n",
" event_date | \n",
" event_type | \n",
" sub_event_type | \n",
" actor1 | \n",
" assoc_actor_1 | \n",
" actor2 | \n",
" assoc_actor_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" GUF216 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (French Guiana) | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" ARG14050 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" CGT: General Confederation of Labour; CTA: Arg... | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" ARG14057 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" Labor Group (Argentina) | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" ARG14104 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" BOL4911 | \n",
" 29 September 2023 | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Bolivia) | \n",
" Labor Group (Bolivia) | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" event_id_cnty event_date event_type sub_event_type \\\n",
"0 GUF216 29 September 2023 Protests Peaceful protest \n",
"1 ARG14050 29 September 2023 Protests Peaceful protest \n",
"2 ARG14057 29 September 2023 Protests Peaceful protest \n",
"3 ARG14104 29 September 2023 Protests Peaceful protest \n",
"4 BOL4911 29 September 2023 Protests Peaceful protest \n",
"\n",
" actor1 \\\n",
"0 Protesters (French Guiana) \n",
"1 Protesters (Argentina) \n",
"2 Protesters (Argentina) \n",
"3 Protesters (Argentina) \n",
"4 Protesters (Bolivia) \n",
"\n",
" assoc_actor_1 actor2 assoc_actor_2 \n",
"0 NaN NaN NaN \n",
"1 CGT: General Confederation of Labour; CTA: Arg... NaN NaN \n",
"2 Labor Group (Argentina) NaN NaN \n",
"3 NaN NaN NaN \n",
"4 Labor Group (Bolivia) NaN NaN "
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# another solution\n",
"acled_sa.filter(regex=\"event|actor\").head(5)"
]
},
{
"cell_type": "code",
"execution_count": 148,
"id": "c9320a12",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" event_acled_south_america_id_cnty | \n",
" event_acled_south_america_date | \n",
" year | \n",
" time_precision | \n",
" disorder_type | \n",
" event_acled_south_america_type | \n",
" sub_event_acled_south_america_type | \n",
" actor1 | \n",
" assoc_actor_1 | \n",
" inter1 | \n",
" ... | \n",
" location | \n",
" latitude | \n",
" longitude | \n",
" geo_precision | \n",
" source | \n",
" source_scale | \n",
" notes | \n",
" fatalities | \n",
" tags | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" GUF216 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (French Guiana) | \n",
" NaN | \n",
" 6 | \n",
" ... | \n",
" Saint-Laurent-du-Maroni | \n",
" 5.5010 | \n",
" -54.0294 | \n",
" 1 | \n",
" France Info | \n",
" International | \n",
" On 29 September 2023, in Saint-Laurent-du-Maro... | \n",
" 0 | \n",
" crowd size=around 100 | \n",
" 1696264529 | \n",
"
\n",
" \n",
" 1 | \n",
" ARG14050 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" CGT: General Confederation of Labour; CTA: Arg... | \n",
" 6 | \n",
" ... | \n",
" Buenos Aires - Comuna 1 | \n",
" -34.6036 | \n",
" -58.3817 | \n",
" 1 | \n",
" Diario Jornada; La Prensa (Argentina) | \n",
" National | \n",
" On 29 September 2023, in Buenos Aires - Comuna... | \n",
" 0 | \n",
" crowd size=no report | \n",
" 1696273132 | \n",
"
\n",
" \n",
" 2 | \n",
" ARG14057 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" Labor Group (Argentina) | \n",
" 6 | \n",
" ... | \n",
" Cordoba | \n",
" -31.4355 | \n",
" -64.2009 | \n",
" 1 | \n",
" La Voz del Interior (Argentina) | \n",
" Subnational | \n",
" On 29 September 2023, in Cordoba (Capital, Cor... | \n",
" 0 | \n",
" crowd size=no report | \n",
" 1696273132 | \n",
"
\n",
" \n",
" 3 | \n",
" ARG14104 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" NaN | \n",
" 6 | \n",
" ... | \n",
" La Plata - Villa Elvira | \n",
" -34.9294 | \n",
" -57.9192 | \n",
" 1 | \n",
" Diario El Dia | \n",
" Subnational | \n",
" On 29 September 2023, in La Plata - Villa Elvi... | \n",
" 0 | \n",
" crowd size=no report | \n",
" 1696273132 | \n",
"
\n",
" \n",
" 4 | \n",
" BOL4911 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Bolivia) | \n",
" Labor Group (Bolivia) | \n",
" 6 | \n",
" ... | \n",
" Sucre | \n",
" -19.0333 | \n",
" -65.2627 | \n",
" 1 | \n",
" Correo del Sur | \n",
" Subnational | \n",
" On 29 September 2023, in Sucre (Chuquisaca), u... | \n",
" 0 | \n",
" crowd size=no report | \n",
" 1696273132 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 31 columns
\n",
"
"
],
"text/plain": [
" event_acled_south_america_id_cnty event_acled_south_america_date year \\\n",
"0 GUF216 29 September 2023 2023 \n",
"1 ARG14050 29 September 2023 2023 \n",
"2 ARG14057 29 September 2023 2023 \n",
"3 ARG14104 29 September 2023 2023 \n",
"4 BOL4911 29 September 2023 2023 \n",
"\n",
" time_precision disorder_type event_acled_south_america_type \\\n",
"0 1 Demonstrations Protests \n",
"1 1 Demonstrations Protests \n",
"2 1 Demonstrations Protests \n",
"3 1 Demonstrations Protests \n",
"4 1 Demonstrations Protests \n",
"\n",
" sub_event_acled_south_america_type actor1 \\\n",
"0 Peaceful protest Protesters (French Guiana) \n",
"1 Peaceful protest Protesters (Argentina) \n",
"2 Peaceful protest Protesters (Argentina) \n",
"3 Peaceful protest Protesters (Argentina) \n",
"4 Peaceful protest Protesters (Bolivia) \n",
"\n",
" assoc_actor_1 inter1 ... \\\n",
"0 NaN 6 ... \n",
"1 CGT: General Confederation of Labour; CTA: Arg... 6 ... \n",
"2 Labor Group (Argentina) 6 ... \n",
"3 NaN 6 ... \n",
"4 Labor Group (Bolivia) 6 ... \n",
"\n",
" location latitude longitude geo_precision \\\n",
"0 Saint-Laurent-du-Maroni 5.5010 -54.0294 1 \n",
"1 Buenos Aires - Comuna 1 -34.6036 -58.3817 1 \n",
"2 Cordoba -31.4355 -64.2009 1 \n",
"3 La Plata - Villa Elvira -34.9294 -57.9192 1 \n",
"4 Sucre -19.0333 -65.2627 1 \n",
"\n",
" source source_scale \\\n",
"0 France Info International \n",
"1 Diario Jornada; La Prensa (Argentina) National \n",
"2 La Voz del Interior (Argentina) Subnational \n",
"3 Diario El Dia Subnational \n",
"4 Correo del Sur Subnational \n",
"\n",
" notes fatalities \\\n",
"0 On 29 September 2023, in Saint-Laurent-du-Maro... 0 \n",
"1 On 29 September 2023, in Buenos Aires - Comuna... 0 \n",
"2 On 29 September 2023, in Cordoba (Capital, Cor... 0 \n",
"3 On 29 September 2023, in La Plata - Villa Elvi... 0 \n",
"4 On 29 September 2023, in Sucre (Chuquisaca), u... 0 \n",
"\n",
" tags timestamp \n",
"0 crowd size=around 100 1696264529 \n",
"1 crowd size=no report 1696273132 \n",
"2 crowd size=no report 1696273132 \n",
"3 crowd size=no report 1696273132 \n",
"4 crowd size=no report 1696273132 \n",
"\n",
"[5 rows x 31 columns]"
]
},
"execution_count": 148,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# rename\n",
"acled_sa = acled_sa.rename(columns={col: col.replace(\"event\", \"event_acled_south_america\") for col in acled_sa.columns}).head(5)\n",
"acled_sa"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "4d9f0535",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" event_acled_south_america_id_cnty | \n",
" event_acled_south_america_date | \n",
" year | \n",
" time_precision | \n",
" disorder_type | \n",
" event_acled_south_america_type | \n",
" sub_event_acled_south_america_type | \n",
" actor1 | \n",
" assoc_actor_1 | \n",
" inter1 | \n",
" ... | \n",
" geo_precision | \n",
" source | \n",
" source_scale | \n",
" notes | \n",
" fatalities | \n",
" tags | \n",
" timestamp | \n",
" fatality_bin | \n",
" full_location | \n",
" inter_recoded | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" GUF216 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (French Guiana) | \n",
" NaN | \n",
" 6 | \n",
" ... | \n",
" 1 | \n",
" France Info | \n",
" International | \n",
" On 29 September 2023, in Saint-Laurent-du-Maro... | \n",
" 0 | \n",
" crowd size=around 100 | \n",
" 1696264529 | \n",
" 0 | \n",
" South America-French Guiana | \n",
" Median Value | \n",
"
\n",
" \n",
" 1 | \n",
" ARG14050 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" CGT: General Confederation of Labour; CTA: Arg... | \n",
" 6 | \n",
" ... | \n",
" 1 | \n",
" Diario Jornada; La Prensa (Argentina) | \n",
" National | \n",
" On 29 September 2023, in Buenos Aires - Comuna... | \n",
" 0 | \n",
" crowd size=no report | \n",
" 1696273132 | \n",
" 0 | \n",
" South America-Argentina | \n",
" Median Value | \n",
"
\n",
" \n",
" 2 | \n",
" ARG14057 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" Labor Group (Argentina) | \n",
" 6 | \n",
" ... | \n",
" 1 | \n",
" La Voz del Interior (Argentina) | \n",
" Subnational | \n",
" On 29 September 2023, in Cordoba (Capital, Cor... | \n",
" 0 | \n",
" crowd size=no report | \n",
" 1696273132 | \n",
" 0 | \n",
" South America-Argentina | \n",
" Median Value | \n",
"
\n",
" \n",
" 3 | \n",
" ARG14104 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Argentina) | \n",
" NaN | \n",
" 6 | \n",
" ... | \n",
" 1 | \n",
" Diario El Dia | \n",
" Subnational | \n",
" On 29 September 2023, in La Plata - Villa Elvi... | \n",
" 0 | \n",
" crowd size=no report | \n",
" 1696273132 | \n",
" 0 | \n",
" South America-Argentina | \n",
" Median Value | \n",
"
\n",
" \n",
" 4 | \n",
" BOL4911 | \n",
" 29 September 2023 | \n",
" 2023 | \n",
" 1 | \n",
" Demonstrations | \n",
" Protests | \n",
" Peaceful protest | \n",
" Protesters (Bolivia) | \n",
" Labor Group (Bolivia) | \n",
" 6 | \n",
" ... | \n",
" 1 | \n",
" Correo del Sur | \n",
" Subnational | \n",
" On 29 September 2023, in Sucre (Chuquisaca), u... | \n",
" 0 | \n",
" crowd size=no report | \n",
" 1696273132 | \n",
" 0 | \n",
" South America-Bolivia | \n",
" Median Value | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 34 columns
\n",
"
"
],
"text/plain": [
" event_acled_south_america_id_cnty event_acled_south_america_date year \\\n",
"0 GUF216 29 September 2023 2023 \n",
"1 ARG14050 29 September 2023 2023 \n",
"2 ARG14057 29 September 2023 2023 \n",
"3 ARG14104 29 September 2023 2023 \n",
"4 BOL4911 29 September 2023 2023 \n",
"\n",
" time_precision disorder_type event_acled_south_america_type \\\n",
"0 1 Demonstrations Protests \n",
"1 1 Demonstrations Protests \n",
"2 1 Demonstrations Protests \n",
"3 1 Demonstrations Protests \n",
"4 1 Demonstrations Protests \n",
"\n",
" sub_event_acled_south_america_type actor1 \\\n",
"0 Peaceful protest Protesters (French Guiana) \n",
"1 Peaceful protest Protesters (Argentina) \n",
"2 Peaceful protest Protesters (Argentina) \n",
"3 Peaceful protest Protesters (Argentina) \n",
"4 Peaceful protest Protesters (Bolivia) \n",
"\n",
" assoc_actor_1 inter1 ... \\\n",
"0 NaN 6 ... \n",
"1 CGT: General Confederation of Labour; CTA: Arg... 6 ... \n",
"2 Labor Group (Argentina) 6 ... \n",
"3 NaN 6 ... \n",
"4 Labor Group (Bolivia) 6 ... \n",
"\n",
" geo_precision source source_scale \\\n",
"0 1 France Info International \n",
"1 1 Diario Jornada; La Prensa (Argentina) National \n",
"2 1 La Voz del Interior (Argentina) Subnational \n",
"3 1 Diario El Dia Subnational \n",
"4 1 Correo del Sur Subnational \n",
"\n",
" notes fatalities \\\n",
"0 On 29 September 2023, in Saint-Laurent-du-Maro... 0 \n",
"1 On 29 September 2023, in Buenos Aires - Comuna... 0 \n",
"2 On 29 September 2023, in Cordoba (Capital, Cor... 0 \n",
"3 On 29 September 2023, in La Plata - Villa Elvi... 0 \n",
"4 On 29 September 2023, in Sucre (Chuquisaca), u... 0 \n",
"\n",
" tags timestamp fatality_bin \\\n",
"0 crowd size=around 100 1696264529 0 \n",
"1 crowd size=no report 1696273132 0 \n",
"2 crowd size=no report 1696273132 0 \n",
"3 crowd size=no report 1696273132 0 \n",
"4 crowd size=no report 1696273132 0 \n",
"\n",
" full_location inter_recoded \n",
"0 South America-French Guiana Median Value \n",
"1 South America-Argentina Median Value \n",
"2 South America-Argentina Median Value \n",
"3 South America-Argentina Median Value \n",
"4 South America-Bolivia Median Value \n",
"\n",
"[5 rows x 34 columns]"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 4\n",
"import numpy as np\n",
"(acled_sa.\n",
" assign(fatality_bin = np.where(acled_sa[\"fatalities\"]>0, 1, 0), \n",
" full_location = acled_sa[\"region\"]+ \"-\" + acled_sa[\"country\"], \n",
" inter_recoded = np.where(acled_sa[\"inter1\"]==acled_sa[\"inter1\"].median(), \"Median Value\", \n",
" np.where(acled_sa[\"inter1\"]>acled_sa[\"inter1\"].median(), \n",
" \"Higher Median\", \"Below Median\"))\n",
" ).head(5)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 149,
"id": "c9fdf76d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" inter1 | \n",
" inter_recoded | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 6 | \n",
" at median | \n",
"
\n",
" \n",
" 1 | \n",
" 6 | \n",
" at median | \n",
"
\n",
" \n",
" 2 | \n",
" 6 | \n",
" at median | \n",
"
\n",
" \n",
" 3 | \n",
" 6 | \n",
" at median | \n",
"
\n",
" \n",
" 4 | \n",
" 6 | \n",
" at median | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" inter1 inter_recoded\n",
"0 6 at median\n",
"1 6 at median\n",
"2 6 at median\n",
"3 6 at median\n",
"4 6 at median"
]
},
"execution_count": 149,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# solutions using np.select\n",
"median_value = acled_sa[\"inter1\"].median()\n",
"\n",
"# Define the conditions\n",
"conditions = [\n",
" acled_sa['inter1'] < median_value,\n",
" acled_sa['inter1'] == median_value,\n",
" acled_sa['inter1'] > median_value\n",
"]\n",
"\n",
"# Define the output values\n",
"choices = ['below median', 'at median', 'above median']\n",
"\n",
"# Use np.select \n",
"acled_sa['inter_recoded'] = np.select(conditions, choices, default='N/A')\n",
"\n",
"# see\n",
"acled_sa.filter([\"inter1\", \"inter_recoded\"])"
]
},
{
"cell_type": "markdown",
"id": "a3e5a1e1",
"metadata": {},
"source": [
"## Row-Wise Operations"
]
},
{
"cell_type": "markdown",
"id": "87c99eac",
"metadata": {},
"source": [
"For data wrangling tasks at the rows of your data frame, we will discuss: \n",
"\n",
"- Subsetting\n",
"- Filtering distinct values\n",
"- Recoding values \n",
"- Grouping and Summarizing\n",
"- Grouping and Transforming\n",
"- Sorting values"
]
},
{
"cell_type": "markdown",
"id": "caebebe7",
"metadata": {},
"source": [
"### Subsetting\n",
"\n",
"**Functionality:**\n",
"\n",
"Slice the dataframe row-wise following a certain input. \n",
"\n",
"**Implementation:**\n",
"\n",
"- index based implementation\n",
"- `.loc` or `.iloc` methods\n",
"- `.query`\n"
]
},
{
"cell_type": "markdown",
"id": "9d3d26bf",
"metadata": {},
"source": [
"#### Subsetting by index"
]
},
{
"cell_type": "code",
"execution_count": 150,
"id": "1af17d79",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 8 | \n",
" 1970 | \n",
" Mexico | \n",
" Brazil | \n",
" Italy | \n",
" Germany FR | \n",
" Uruguay | \n",
" 95 | \n",
" 16 | \n",
" 32 | \n",
" 1.603.975 | \n",
" 2.968750 | \n",
" False | \n",
"
\n",
" \n",
" 9 | \n",
" 1974 | \n",
" Germany | \n",
" Germany FR | \n",
" Netherlands | \n",
" Poland | \n",
" Brazil | \n",
" 97 | \n",
" 16 | \n",
" 38 | \n",
" 1.865.753 | \n",
" 2.552632 | \n",
" False | \n",
"
\n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
" 2.684211 | \n",
" True | \n",
"
\n",
" \n",
" 11 | \n",
" 1982 | \n",
" Spain | \n",
" Italy | \n",
" Germany FR | \n",
" Poland | \n",
" France | \n",
" 146 | \n",
" 24 | \n",
" 52 | \n",
" 2.109.723 | \n",
" 2.807692 | \n",
" False | \n",
"
\n",
" \n",
" 12 | \n",
" 1986 | \n",
" Mexico | \n",
" Argentina | \n",
" Germany FR | \n",
" France | \n",
" Belgium | \n",
" 132 | \n",
" 24 | \n",
" 52 | \n",
" 2.394.031 | \n",
" 2.538462 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
13 rows × 12 columns
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
".. ... ... ... ... ... ... \n",
"8 1970 Mexico Brazil Italy Germany FR Uruguay \n",
"9 1974 Germany Germany FR Netherlands Poland Brazil \n",
"10 1978 Argentina Argentina Netherlands Brazil Italy \n",
"11 1982 Spain Italy Germany FR Poland France \n",
"12 1986 Mexico Argentina Germany FR France Belgium \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"0 70 13 18 590.549 3.888889 \n",
"1 70 16 17 363.000 4.117647 \n",
"2 84 15 18 375.700 4.666667 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
".. ... ... ... ... ... \n",
"8 95 16 32 1.603.975 2.968750 \n",
"9 97 16 38 1.865.753 2.552632 \n",
"10 102 16 38 1.545.791 2.684211 \n",
"11 146 24 52 2.109.723 2.807692 \n",
"12 132 24 52 2.394.031 2.538462 \n",
"\n",
" winner_and_hoster \n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False \n",
".. ... \n",
"8 False \n",
"9 False \n",
"10 True \n",
"11 False \n",
"12 False \n",
"\n",
"[13 rows x 12 columns]"
]
},
"execution_count": 150,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# index based implementation\n",
"wc[wc.Year<1990]"
]
},
{
"cell_type": "code",
"execution_count": 152,
"id": "2b28b536",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
"
\n",
" \n",
" 5 | \n",
" 1958 | \n",
" Sweden | \n",
" Brazil | \n",
" Sweden | \n",
" France | \n",
" Germany FR | \n",
" 126 | \n",
" 16 | \n",
" 35 | \n",
" 819.810 | \n",
" 3.600000 | \n",
" False | \n",
"
\n",
" \n",
" 6 | \n",
" 1962 | \n",
" Chile | \n",
" Brazil | \n",
" Czechoslovakia | \n",
" Chile | \n",
" Yugoslavia | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 893.172 | \n",
" 2.781250 | \n",
" False | \n",
"
\n",
" \n",
" 7 | \n",
" 1966 | \n",
" England | \n",
" England | \n",
" Germany FR | \n",
" Portugal | \n",
" Soviet Union | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 1.563.135 | \n",
" 2.781250 | \n",
" True | \n",
"
\n",
" \n",
" 8 | \n",
" 1970 | \n",
" Mexico | \n",
" Brazil | \n",
" Italy | \n",
" Germany FR | \n",
" Uruguay | \n",
" 95 | \n",
" 16 | \n",
" 32 | \n",
" 1.603.975 | \n",
" 2.968750 | \n",
" False | \n",
"
\n",
" \n",
" 9 | \n",
" 1974 | \n",
" Germany | \n",
" Germany FR | \n",
" Netherlands | \n",
" Poland | \n",
" Brazil | \n",
" 97 | \n",
" 16 | \n",
" 38 | \n",
" 1.865.753 | \n",
" 2.552632 | \n",
" False | \n",
"
\n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
" 2.684211 | \n",
" True | \n",
"
\n",
" \n",
" 11 | \n",
" 1982 | \n",
" Spain | \n",
" Italy | \n",
" Germany FR | \n",
" Poland | \n",
" France | \n",
" 146 | \n",
" 24 | \n",
" 52 | \n",
" 2.109.723 | \n",
" 2.807692 | \n",
" False | \n",
"
\n",
" \n",
" 12 | \n",
" 1986 | \n",
" Mexico | \n",
" Argentina | \n",
" Germany FR | \n",
" France | \n",
" Belgium | \n",
" 132 | \n",
" 24 | \n",
" 52 | \n",
" 2.394.031 | \n",
" 2.538462 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"5 1958 Sweden Brazil Sweden France Germany FR \n",
"6 1962 Chile Brazil Czechoslovakia Chile Yugoslavia \n",
"7 1966 England England Germany FR Portugal Soviet Union \n",
"8 1970 Mexico Brazil Italy Germany FR Uruguay \n",
"9 1974 Germany Germany FR Netherlands Poland Brazil \n",
"10 1978 Argentina Argentina Netherlands Brazil Italy \n",
"11 1982 Spain Italy Germany FR Poland France \n",
"12 1986 Mexico Argentina Germany FR France Belgium \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
"5 126 16 35 819.810 3.600000 \n",
"6 89 16 32 893.172 2.781250 \n",
"7 89 16 32 1.563.135 2.781250 \n",
"8 95 16 32 1.603.975 2.968750 \n",
"9 97 16 38 1.865.753 2.552632 \n",
"10 102 16 38 1.545.791 2.684211 \n",
"11 146 24 52 2.109.723 2.807692 \n",
"12 132 24 52 2.394.031 2.538462 \n",
"\n",
" winner_and_hoster \n",
"3 False \n",
"4 False \n",
"5 False \n",
"6 False \n",
"7 True \n",
"8 False \n",
"9 False \n",
"10 True \n",
"11 False \n",
"12 False "
]
},
"execution_count": 152,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# or with multiple condition\n",
"wc[(wc.Year<1990)&(wc.Year>1940)]"
]
},
{
"cell_type": "code",
"execution_count": 153,
"id": "6b126969",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 True\n",
"4 True\n",
" ... \n",
"15 False\n",
"16 False\n",
"17 False\n",
"18 False\n",
"19 False\n",
"Name: Year, Length: 20, dtype: bool"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# see this\n",
"(wc.Year<1990)&(wc.Year>1940)"
]
},
{
"cell_type": "markdown",
"id": "9d0719b1",
"metadata": {},
"source": [
"#### Subsetting with `.loc()` method"
]
},
{
"cell_type": "code",
"execution_count": 155,
"id": "ffc870ef",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
"
\n",
" \n",
" 5 | \n",
" 1958 | \n",
" Sweden | \n",
" Brazil | \n",
" Sweden | \n",
" France | \n",
" Germany FR | \n",
" 126 | \n",
" 16 | \n",
" 35 | \n",
" 819.810 | \n",
" 3.600000 | \n",
" False | \n",
"
\n",
" \n",
" 6 | \n",
" 1962 | \n",
" Chile | \n",
" Brazil | \n",
" Czechoslovakia | \n",
" Chile | \n",
" Yugoslavia | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 893.172 | \n",
" 2.781250 | \n",
" False | \n",
"
\n",
" \n",
" 7 | \n",
" 1966 | \n",
" England | \n",
" England | \n",
" Germany FR | \n",
" Portugal | \n",
" Soviet Union | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 1.563.135 | \n",
" 2.781250 | \n",
" True | \n",
"
\n",
" \n",
" 8 | \n",
" 1970 | \n",
" Mexico | \n",
" Brazil | \n",
" Italy | \n",
" Germany FR | \n",
" Uruguay | \n",
" 95 | \n",
" 16 | \n",
" 32 | \n",
" 1.603.975 | \n",
" 2.968750 | \n",
" False | \n",
"
\n",
" \n",
" 9 | \n",
" 1974 | \n",
" Germany | \n",
" Germany FR | \n",
" Netherlands | \n",
" Poland | \n",
" Brazil | \n",
" 97 | \n",
" 16 | \n",
" 38 | \n",
" 1.865.753 | \n",
" 2.552632 | \n",
" False | \n",
"
\n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
" 2.684211 | \n",
" True | \n",
"
\n",
" \n",
" 11 | \n",
" 1982 | \n",
" Spain | \n",
" Italy | \n",
" Germany FR | \n",
" Poland | \n",
" France | \n",
" 146 | \n",
" 24 | \n",
" 52 | \n",
" 2.109.723 | \n",
" 2.807692 | \n",
" False | \n",
"
\n",
" \n",
" 12 | \n",
" 1986 | \n",
" Mexico | \n",
" Argentina | \n",
" Germany FR | \n",
" France | \n",
" Belgium | \n",
" 132 | \n",
" 24 | \n",
" 52 | \n",
" 2.394.031 | \n",
" 2.538462 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"5 1958 Sweden Brazil Sweden France Germany FR \n",
"6 1962 Chile Brazil Czechoslovakia Chile Yugoslavia \n",
"7 1966 England England Germany FR Portugal Soviet Union \n",
"8 1970 Mexico Brazil Italy Germany FR Uruguay \n",
"9 1974 Germany Germany FR Netherlands Poland Brazil \n",
"10 1978 Argentina Argentina Netherlands Brazil Italy \n",
"11 1982 Spain Italy Germany FR Poland France \n",
"12 1986 Mexico Argentina Germany FR France Belgium \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
"5 126 16 35 819.810 3.600000 \n",
"6 89 16 32 893.172 2.781250 \n",
"7 89 16 32 1.563.135 2.781250 \n",
"8 95 16 32 1.603.975 2.968750 \n",
"9 97 16 38 1.865.753 2.552632 \n",
"10 102 16 38 1.545.791 2.684211 \n",
"11 146 24 52 2.109.723 2.807692 \n",
"12 132 24 52 2.394.031 2.538462 \n",
"\n",
" winner_and_hoster \n",
"3 False \n",
"4 False \n",
"5 False \n",
"6 False \n",
"7 True \n",
"8 False \n",
"9 False \n",
"10 True \n",
"11 False \n",
"12 False "
]
},
"execution_count": 155,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# pretty much the same + column names if you wish\n",
"wc.loc[(wc.Year<1990)&(wc.Year>1940),:]"
]
},
{
"cell_type": "markdown",
"id": "10720601",
"metadata": {},
"source": [
"#### Subsetting with `.query()`: a pipe approach\n",
"\n",
"As before, you can use `.query()` methods to a more reable and pipeble approach. Notice the inside of the quotation marks use only the column name. "
]
},
{
"cell_type": "code",
"execution_count": 156,
"id": "fffc670b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
"
\n",
" \n",
" 5 | \n",
" 1958 | \n",
" Sweden | \n",
" Brazil | \n",
" Sweden | \n",
" France | \n",
" Germany FR | \n",
" 126 | \n",
" 16 | \n",
" 35 | \n",
" 819.810 | \n",
" 3.600000 | \n",
" False | \n",
"
\n",
" \n",
" 6 | \n",
" 1962 | \n",
" Chile | \n",
" Brazil | \n",
" Czechoslovakia | \n",
" Chile | \n",
" Yugoslavia | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 893.172 | \n",
" 2.781250 | \n",
" False | \n",
"
\n",
" \n",
" 7 | \n",
" 1966 | \n",
" England | \n",
" England | \n",
" Germany FR | \n",
" Portugal | \n",
" Soviet Union | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 1.563.135 | \n",
" 2.781250 | \n",
" True | \n",
"
\n",
" \n",
" 8 | \n",
" 1970 | \n",
" Mexico | \n",
" Brazil | \n",
" Italy | \n",
" Germany FR | \n",
" Uruguay | \n",
" 95 | \n",
" 16 | \n",
" 32 | \n",
" 1.603.975 | \n",
" 2.968750 | \n",
" False | \n",
"
\n",
" \n",
" 9 | \n",
" 1974 | \n",
" Germany | \n",
" Germany FR | \n",
" Netherlands | \n",
" Poland | \n",
" Brazil | \n",
" 97 | \n",
" 16 | \n",
" 38 | \n",
" 1.865.753 | \n",
" 2.552632 | \n",
" False | \n",
"
\n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
" 2.684211 | \n",
" True | \n",
"
\n",
" \n",
" 11 | \n",
" 1982 | \n",
" Spain | \n",
" Italy | \n",
" Germany FR | \n",
" Poland | \n",
" France | \n",
" 146 | \n",
" 24 | \n",
" 52 | \n",
" 2.109.723 | \n",
" 2.807692 | \n",
" False | \n",
"
\n",
" \n",
" 12 | \n",
" 1986 | \n",
" Mexico | \n",
" Argentina | \n",
" Germany FR | \n",
" France | \n",
" Belgium | \n",
" 132 | \n",
" 24 | \n",
" 52 | \n",
" 2.394.031 | \n",
" 2.538462 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"5 1958 Sweden Brazil Sweden France Germany FR \n",
"6 1962 Chile Brazil Czechoslovakia Chile Yugoslavia \n",
"7 1966 England England Germany FR Portugal Soviet Union \n",
"8 1970 Mexico Brazil Italy Germany FR Uruguay \n",
"9 1974 Germany Germany FR Netherlands Poland Brazil \n",
"10 1978 Argentina Argentina Netherlands Brazil Italy \n",
"11 1982 Spain Italy Germany FR Poland France \n",
"12 1986 Mexico Argentina Germany FR France Belgium \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
"5 126 16 35 819.810 3.600000 \n",
"6 89 16 32 893.172 2.781250 \n",
"7 89 16 32 1.563.135 2.781250 \n",
"8 95 16 32 1.603.975 2.968750 \n",
"9 97 16 38 1.865.753 2.552632 \n",
"10 102 16 38 1.545.791 2.684211 \n",
"11 146 24 52 2.109.723 2.807692 \n",
"12 132 24 52 2.394.031 2.538462 \n",
"\n",
" winner_and_hoster \n",
"3 False \n",
"4 False \n",
"5 False \n",
"6 False \n",
"7 True \n",
"8 False \n",
"9 False \n",
"10 True \n",
"11 False \n",
"12 False "
]
},
"execution_count": 156,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc.query('Year<1990 & Year>1940')"
]
},
{
"cell_type": "markdown",
"id": "42937846",
"metadata": {},
"source": [
"### Other types of subsetting"
]
},
{
"cell_type": "markdown",
"id": "9bc8a8c3",
"metadata": {},
"source": [
"#### Subset by distinct entry"
]
},
{
"cell_type": "code",
"execution_count": 157,
"id": "5aaa7b3b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
" 2.684211 | \n",
" True | \n",
"
\n",
" \n",
" 11 | \n",
" 1982 | \n",
" Spain | \n",
" Italy | \n",
" Germany FR | \n",
" Poland | \n",
" France | \n",
" 146 | \n",
" 24 | \n",
" 52 | \n",
" 2.109.723 | \n",
" 2.807692 | \n",
" False | \n",
"
\n",
" \n",
" 14 | \n",
" 1994 | \n",
" USA | \n",
" Brazil | \n",
" Italy | \n",
" Sweden | \n",
" Bulgaria | \n",
" 141 | \n",
" 24 | \n",
" 52 | \n",
" 3.587.538 | \n",
" 2.711538 | \n",
" False | \n",
"
\n",
" \n",
" 16 | \n",
" 2002 | \n",
" Korea/Japan | \n",
" Brazil | \n",
" Germany | \n",
" Turkey | \n",
" Korea Republic | \n",
" 161 | \n",
" 32 | \n",
" 64 | \n",
" 2.705.197 | \n",
" 2.515625 | \n",
" False | \n",
"
\n",
" \n",
" 18 | \n",
" 2010 | \n",
" South Africa | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
" 2.265625 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
15 rows × 12 columns
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
".. ... ... ... ... ... ... \n",
"10 1978 Argentina Argentina Netherlands Brazil Italy \n",
"11 1982 Spain Italy Germany FR Poland France \n",
"14 1994 USA Brazil Italy Sweden Bulgaria \n",
"16 2002 Korea/Japan Brazil Germany Turkey Korea Republic \n",
"18 2010 South Africa Spain Netherlands Germany Uruguay \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"0 70 13 18 590.549 3.888889 \n",
"1 70 16 17 363.000 4.117647 \n",
"2 84 15 18 375.700 4.666667 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
".. ... ... ... ... ... \n",
"10 102 16 38 1.545.791 2.684211 \n",
"11 146 24 52 2.109.723 2.807692 \n",
"14 141 24 52 3.587.538 2.711538 \n",
"16 161 32 64 2.705.197 2.515625 \n",
"18 145 32 64 3.178.856 2.265625 \n",
"\n",
" winner_and_hoster \n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False \n",
".. ... \n",
"10 True \n",
"11 False \n",
"14 False \n",
"16 False \n",
"18 False \n",
"\n",
"[15 rows x 12 columns]"
]
},
"execution_count": 157,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pandas: drop duplicative entries for a specific variable\n",
"# notice here you are actually deleting important rows\n",
"wc.drop_duplicates(\"Country\")"
]
},
{
"cell_type": "markdown",
"id": "5d7feefe",
"metadata": {},
"source": [
"### Subset by sampling"
]
},
{
"cell_type": "code",
"execution_count": 159,
"id": "48fb3ccf",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 7 | \n",
" 1966 | \n",
" England | \n",
" England | \n",
" Germany FR | \n",
" Portugal | \n",
" Soviet Union | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 1.563.135 | \n",
" 2.781250 | \n",
" True | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
"
\n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
" 2.684211 | \n",
" True | \n",
"
\n",
" \n",
" 5 | \n",
" 1958 | \n",
" Sweden | \n",
" Brazil | \n",
" Sweden | \n",
" France | \n",
" Germany FR | \n",
" 126 | \n",
" 16 | \n",
" 35 | \n",
" 819.810 | \n",
" 3.600000 | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
"
\n",
" \n",
" 9 | \n",
" 1974 | \n",
" Germany | \n",
" Germany FR | \n",
" Netherlands | \n",
" Poland | \n",
" Brazil | \n",
" 97 | \n",
" 16 | \n",
" 38 | \n",
" 1.865.753 | \n",
" 2.552632 | \n",
" False | \n",
"
\n",
" \n",
" 12 | \n",
" 1986 | \n",
" Mexico | \n",
" Argentina | \n",
" Germany FR | \n",
" France | \n",
" Belgium | \n",
" 132 | \n",
" 24 | \n",
" 52 | \n",
" 2.394.031 | \n",
" 2.538462 | \n",
" False | \n",
"
\n",
" \n",
" 6 | \n",
" 1962 | \n",
" Chile | \n",
" Brazil | \n",
" Czechoslovakia | \n",
" Chile | \n",
" Yugoslavia | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 893.172 | \n",
" 2.781250 | \n",
" False | \n",
"
\n",
" \n",
" 14 | \n",
" 1994 | \n",
" USA | \n",
" Brazil | \n",
" Italy | \n",
" Sweden | \n",
" Bulgaria | \n",
" 141 | \n",
" 24 | \n",
" 52 | \n",
" 3.587.538 | \n",
" 2.711538 | \n",
" False | \n",
"
\n",
" \n",
" 19 | \n",
" 2014 | \n",
" Brazil | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
" 2.671875 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"7 1966 England England Germany FR Portugal Soviet Union \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"10 1978 Argentina Argentina Netherlands Brazil Italy \n",
"5 1958 Sweden Brazil Sweden France Germany FR \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"9 1974 Germany Germany FR Netherlands Poland Brazil \n",
"12 1986 Mexico Argentina Germany FR France Belgium \n",
"6 1962 Chile Brazil Czechoslovakia Chile Yugoslavia \n",
"14 1994 USA Brazil Italy Sweden Bulgaria \n",
"19 2014 Brazil Germany Argentina Netherlands Brazil \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"7 89 16 32 1.563.135 2.781250 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"10 102 16 38 1.545.791 2.684211 \n",
"5 126 16 35 819.810 3.600000 \n",
"2 84 15 18 375.700 4.666667 \n",
"9 97 16 38 1.865.753 2.552632 \n",
"12 132 24 52 2.394.031 2.538462 \n",
"6 89 16 32 893.172 2.781250 \n",
"14 141 24 52 3.587.538 2.711538 \n",
"19 171 32 64 3.386.810 2.671875 \n",
"\n",
" winner_and_hoster \n",
"7 True \n",
"3 False \n",
"10 True \n",
"5 False \n",
"2 False \n",
"9 False \n",
"12 False \n",
"6 False \n",
"14 False \n",
"19 False "
]
},
"execution_count": 159,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pandas: randomly sample N number of rows from the data\n",
"wc.sample(10)"
]
},
{
"cell_type": "markdown",
"id": "780e5dc8",
"metadata": {},
"source": [
"### Recoding values\n",
"\n",
"**Functionality:**\n",
"\n",
"Recode a value given certain conditions. This type of transformation is one of the most importants in data cleaning. \n",
"\n",
"**Implementation:**\n",
"\n",
"There are many ways to recode variables in Python. We will showcase four of the most useful in my view. \n",
"\n",
"First, we will see more generalized row-wise approach in pandas using: \n",
"\n",
"- `map()`\n",
"- `apply()`\n",
"\n",
"Then we will see two vectorized solutions using numpy: \n",
"\n",
"- `np.where()`\n",
"- `np.select()` \n"
]
},
{
"cell_type": "markdown",
"id": "4f62d65c",
"metadata": {},
"source": [
"#### Recode with map() + dictionaries\n",
"\n",
"The `map()` function is used to substitute each value in a **Series** with another value. \n",
"\n",
"- It takes a series as input\n",
"- Uses a function/dictionary to transform values\n",
"- Returns a series. "
]
},
{
"cell_type": "code",
"execution_count": 160,
"id": "203d5658",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
" brazil_winner | \n",
"
\n",
" \n",
" \n",
" \n",
" 15 | \n",
" 1998 | \n",
" France | \n",
" France | \n",
" Brazil | \n",
" Croatia | \n",
" Netherlands | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 2.785.100 | \n",
" 2.671875 | \n",
" True | \n",
" 0.0 | \n",
"
\n",
" \n",
" 16 | \n",
" 2002 | \n",
" Korea/Japan | \n",
" Brazil | \n",
" Germany | \n",
" Turkey | \n",
" Korea Republic | \n",
" 161 | \n",
" 32 | \n",
" 64 | \n",
" 2.705.197 | \n",
" 2.515625 | \n",
" False | \n",
" 1.0 | \n",
"
\n",
" \n",
" 17 | \n",
" 2006 | \n",
" Germany | \n",
" Italy | \n",
" France | \n",
" Germany | \n",
" Portugal | \n",
" 147 | \n",
" 32 | \n",
" 64 | \n",
" 3.359.439 | \n",
" 2.296875 | \n",
" False | \n",
" 0.0 | \n",
"
\n",
" \n",
" 18 | \n",
" 2010 | \n",
" South Africa | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
" 2.265625 | \n",
" False | \n",
" 0.0 | \n",
"
\n",
" \n",
" 19 | \n",
" 2014 | \n",
" Brazil | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
" 2.671875 | \n",
" False | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"15 1998 France France Brazil Croatia Netherlands \n",
"16 2002 Korea/Japan Brazil Germany Turkey Korea Republic \n",
"17 2006 Germany Italy France Germany Portugal \n",
"18 2010 South Africa Spain Netherlands Germany Uruguay \n",
"19 2014 Brazil Germany Argentina Netherlands Brazil \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"15 171 32 64 2.785.100 2.671875 \n",
"16 161 32 64 2.705.197 2.515625 \n",
"17 147 32 64 3.359.439 2.296875 \n",
"18 145 32 64 3.178.856 2.265625 \n",
"19 171 32 64 3.386.810 2.671875 \n",
"\n",
" winner_and_hoster brazil_winner \n",
"15 True 0.0 \n",
"16 False 1.0 \n",
"17 False 0.0 \n",
"18 False 0.0 \n",
"19 False 0.0 "
]
},
"execution_count": 160,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# map + dictionary to recode to create a dummy for certain country\n",
"\n",
"# create a map function\n",
"mapping = {\"Brazil\":1}\n",
"\n",
"# map the values\n",
"wc[\"brazil_winner\"]= wc[\"Winner\"].map(mapping)\n",
"\n",
"# Fill missing values with a default value (e.g., 0)\n",
"wc[\"brazil_winner\"].fillna(0, inplace=True)\n",
"\n",
"# see results\n",
"wc.tail(5)"
]
},
{
"cell_type": "markdown",
"id": "8ce2e11a",
"metadata": {},
"source": [
"#### Recode with apply() + function\n",
"\n",
"The apply() function is used when you want to apply a function along the axis of a DataFrame (either rows or columns). This function can be both an in-built function or a user-defined function."
]
},
{
"cell_type": "code",
"execution_count": 161,
"id": "e986fd63",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
"Name: Winner, dtype: int64"
]
},
"execution_count": 161,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# apply + function to recode\n",
"def get_dummies(x):\n",
" if x ==\"Brazil\":\n",
" return 1\n",
" else:\n",
" return 0\n",
" \n",
"# apply function \n",
"wc['Winner'].apply(get_dummies).head()"
]
},
{
"cell_type": "markdown",
"id": "1316b44e",
"metadata": {},
"source": [
"Notice, we can make it more general by providing a argument to the function"
]
},
{
"cell_type": "code",
"execution_count": 163,
"id": "258fe411",
"metadata": {},
"outputs": [],
"source": [
"# apply + function to recode\n",
"def get_dummies(x, country):\n",
" if x == country:\n",
" return 1\n",
" else:\n",
" return 0\n",
" \n",
"# Apply function with Uruguay now\n",
"wc[\"winner_uruguay\"] = wc['Winner'].apply(get_dummies, country=\"Uruguay\").head(10)"
]
},
{
"cell_type": "markdown",
"id": "c9aec441",
"metadata": {},
"source": [
"Here we are using apply for a column, so we do not need to indicate axis=1. That's also why we are inputing the columns as a serie. Notice we could do the same with assign: "
]
},
{
"cell_type": "code",
"execution_count": 165,
"id": "f412f7c4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
" brazil_winner | \n",
" winner_uruguay | \n",
" winner_germany | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" 1998 | \n",
" France | \n",
" France | \n",
" Brazil | \n",
" Croatia | \n",
" Netherlands | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 2.785.100 | \n",
" 2.671875 | \n",
" True | \n",
" 0.0 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
" 16 | \n",
" 2002 | \n",
" Korea/Japan | \n",
" Brazil | \n",
" Germany | \n",
" Turkey | \n",
" Korea Republic | \n",
" 161 | \n",
" 32 | \n",
" 64 | \n",
" 2.705.197 | \n",
" 2.515625 | \n",
" False | \n",
" 1.0 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
" 17 | \n",
" 2006 | \n",
" Germany | \n",
" Italy | \n",
" France | \n",
" Germany | \n",
" Portugal | \n",
" 147 | \n",
" 32 | \n",
" 64 | \n",
" 3.359.439 | \n",
" 2.296875 | \n",
" False | \n",
" 0.0 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
" 18 | \n",
" 2010 | \n",
" South Africa | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
" 2.265625 | \n",
" False | \n",
" 0.0 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
" 19 | \n",
" 2014 | \n",
" Brazil | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
" 2.671875 | \n",
" False | \n",
" 0.0 | \n",
" NaN | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 15 columns
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third \\\n",
"0 1930 Uruguay Uruguay Argentina USA \n",
"1 1934 Italy Italy Czechoslovakia Germany \n",
"2 1938 France Italy Hungary Brazil \n",
"3 1950 Brazil Uruguay Brazil Sweden \n",
"4 1954 Switzerland Germany FR Hungary Austria \n",
".. ... ... ... ... ... \n",
"15 1998 France France Brazil Croatia \n",
"16 2002 Korea/Japan Brazil Germany Turkey \n",
"17 2006 Germany Italy France Germany \n",
"18 2010 South Africa Spain Netherlands Germany \n",
"19 2014 Brazil Germany Argentina Netherlands \n",
"\n",
" Fourth GoalsScored QualifiedTeams MatchesPlayed Attendance \\\n",
"0 Yugoslavia 70 13 18 590.549 \n",
"1 Austria 70 16 17 363.000 \n",
"2 Sweden 84 15 18 375.700 \n",
"3 Spain 88 13 22 1.045.246 \n",
"4 Uruguay 140 16 26 768.607 \n",
".. ... ... ... ... ... \n",
"15 Netherlands 171 32 64 2.785.100 \n",
"16 Korea Republic 161 32 64 2.705.197 \n",
"17 Portugal 147 32 64 3.359.439 \n",
"18 Uruguay 145 32 64 3.178.856 \n",
"19 Brazil 171 32 64 3.386.810 \n",
"\n",
" av_goals_matches winner_and_hoster brazil_winner winner_uruguay \\\n",
"0 3.888889 True 0.0 1.0 \n",
"1 4.117647 True 0.0 0.0 \n",
"2 4.666667 False 0.0 0.0 \n",
"3 4.000000 False 0.0 1.0 \n",
"4 5.384615 False 0.0 0.0 \n",
".. ... ... ... ... \n",
"15 2.671875 True 0.0 NaN \n",
"16 2.515625 False 1.0 NaN \n",
"17 2.296875 False 0.0 NaN \n",
"18 2.265625 False 0.0 NaN \n",
"19 2.671875 False 0.0 NaN \n",
"\n",
" winner_germany \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
".. ... \n",
"15 0 \n",
"16 0 \n",
"17 0 \n",
"18 0 \n",
"19 1 \n",
"\n",
"[20 rows x 15 columns]"
]
},
"execution_count": 165,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# using apply inside of assign\n",
"wc.assign(winner_germany=wc['Winner'].apply(get_dummies, country=\"Germany\"))"
]
},
{
"cell_type": "markdown",
"id": "b4723ac7",
"metadata": {},
"source": [
"#### Summary of `apply()` and `map()`\n",
"\n",
"- `apply()` is used to apply a function along an axis of the DataFrame or on values of Series.\n",
" - you are free to use lambda functions with map\n",
" - you can also apply functions column-wise changing the axis=1 argument. \n",
" \n",
"- `map()` is used to substitute each value in a Series with another value.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "060416a0",
"metadata": {},
"source": [
"### Recode using numpy"
]
},
{
"cell_type": "markdown",
"id": "19502f36",
"metadata": {},
"source": [
"#### `np.where`: if-else approach.\n",
"\n",
"- `np.where` is similar to ifelse in R\n",
"- Useful if there’s only 1-2 (True/False conditions)\n",
"- sintax: `np.where(condition, true, false)`\n",
"- condition can be anything that returns a boolean. \n",
"\n",
"Let's see some examples:"
]
},
{
"cell_type": "code",
"execution_count": 167,
"id": "c14d3492",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
" brazil_winner | \n",
" winner_uruguay | \n",
" winner_brazil | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"0 70 13 18 590.549 3.888889 \n",
"1 70 16 17 363.000 4.117647 \n",
"2 84 15 18 375.700 4.666667 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
"\n",
" winner_and_hoster brazil_winner winner_uruguay winner_brazil \n",
"0 True 0.0 1.0 0 \n",
"1 True 0.0 0.0 0 \n",
"2 False 0.0 0.0 0 \n",
"3 False 0.0 1.0 0 \n",
"4 False 0.0 0.0 0 "
]
},
"execution_count": 167,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create a new variable\n",
"wc[\"winner_brazil\"]=np.where(wc[\"Winner\"]==\"Brazil\", 1, 0)\n",
"wc.head()"
]
},
{
"cell_type": "code",
"execution_count": 168,
"id": "bee65d2f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" av_goals_matches | \n",
" winner_and_hoster | \n",
" brazil_winner | \n",
" winner_uruguay | \n",
" winner_brazil | \n",
" winner_brazil_ | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 3.888889 | \n",
" True | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 4.117647 | \n",
" True | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 4.666667 | \n",
" False | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 4.000000 | \n",
" False | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 5.384615 | \n",
" False | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 5 | \n",
" 1958 | \n",
" Sweden | \n",
" Brazil | \n",
" Sweden | \n",
" France | \n",
" Germany FR | \n",
" 126 | \n",
" 16 | \n",
" 35 | \n",
" 819.810 | \n",
" 3.600000 | \n",
" False | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 6 | \n",
" 1962 | \n",
" Chile | \n",
" Brazil | \n",
" Czechoslovakia | \n",
" Chile | \n",
" Yugoslavia | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 893.172 | \n",
" 2.781250 | \n",
" False | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 7 | \n",
" 1966 | \n",
" England | \n",
" England | \n",
" Germany FR | \n",
" Portugal | \n",
" Soviet Union | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 1.563.135 | \n",
" 2.781250 | \n",
" True | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 8 | \n",
" 1970 | \n",
" Mexico | \n",
" Brazil | \n",
" Italy | \n",
" Germany FR | \n",
" Uruguay | \n",
" 95 | \n",
" 16 | \n",
" 32 | \n",
" 1.603.975 | \n",
" 2.968750 | \n",
" False | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 9 | \n",
" 1974 | \n",
" Germany | \n",
" Germany FR | \n",
" Netherlands | \n",
" Poland | \n",
" Brazil | \n",
" 97 | \n",
" 16 | \n",
" 38 | \n",
" 1.865.753 | \n",
" 2.552632 | \n",
" False | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia \n",
"1 1934 Italy Italy Czechoslovakia Germany Austria \n",
"2 1938 France Italy Hungary Brazil Sweden \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain \n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay \n",
"5 1958 Sweden Brazil Sweden France Germany FR \n",
"6 1962 Chile Brazil Czechoslovakia Chile Yugoslavia \n",
"7 1966 England England Germany FR Portugal Soviet Union \n",
"8 1970 Mexico Brazil Italy Germany FR Uruguay \n",
"9 1974 Germany Germany FR Netherlands Poland Brazil \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance av_goals_matches \\\n",
"0 70 13 18 590.549 3.888889 \n",
"1 70 16 17 363.000 4.117647 \n",
"2 84 15 18 375.700 4.666667 \n",
"3 88 13 22 1.045.246 4.000000 \n",
"4 140 16 26 768.607 5.384615 \n",
"5 126 16 35 819.810 3.600000 \n",
"6 89 16 32 893.172 2.781250 \n",
"7 89 16 32 1.563.135 2.781250 \n",
"8 95 16 32 1.603.975 2.968750 \n",
"9 97 16 38 1.865.753 2.552632 \n",
"\n",
" winner_and_hoster brazil_winner winner_uruguay winner_brazil \\\n",
"0 True 0.0 1.0 0 \n",
"1 True 0.0 0.0 0 \n",
"2 False 0.0 0.0 0 \n",
"3 False 0.0 1.0 0 \n",
"4 False 0.0 0.0 0 \n",
"5 False 1.0 0.0 1 \n",
"6 False 1.0 0.0 1 \n",
"7 True 0.0 0.0 0 \n",
"8 False 1.0 0.0 1 \n",
"9 False 0.0 0.0 0 \n",
"\n",
" winner_brazil_ \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"5 1 \n",
"6 1 \n",
"7 0 \n",
"8 1 \n",
"9 0 "
]
},
"execution_count": 168,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# notice we can easily use np.where with assign\n",
"wc.assign(winner_brazil_=np.where(wc[\"Winner\"]==\"Brazil\", 1, 0)).head(10)"
]
},
{
"cell_type": "code",
"execution_count": 169,
"id": "f0552c5e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" south_america | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Uruguay | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" Italy | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" Italy | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" Uruguay | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" Germany FR | \n",
" 0 | \n",
"
\n",
" \n",
" 5 | \n",
" Brazil | \n",
" 1 | \n",
"
\n",
" \n",
" 6 | \n",
" Brazil | \n",
" 1 | \n",
"
\n",
" \n",
" 7 | \n",
" England | \n",
" 0 | \n",
"
\n",
" \n",
" 8 | \n",
" Brazil | \n",
" 1 | \n",
"
\n",
" \n",
" 9 | \n",
" Germany FR | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Winner south_america\n",
"0 Uruguay 1\n",
"1 Italy 0\n",
"2 Italy 0\n",
"3 Uruguay 1\n",
"4 Germany FR 0\n",
"5 Brazil 1\n",
"6 Brazil 1\n",
"7 England 0\n",
"8 Brazil 1\n",
"9 Germany FR 0"
]
},
"execution_count": 169,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# using string methods\n",
"(wc\n",
" .assign(south_america=np.where(wc[\"Winner\"].str.contains(\"Brazil|Uruguay|Argentina\"), 1, 0))\n",
" .filter([\"Winner\", \"south_america\"])\n",
" .head(10)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "01e6ed55",
"metadata": {},
"source": [
"#### `np.select`: for multiple conditions\n",
"\n",
"- np.select: similar to `case_when` in R. \n",
"- useful for when there’s multiple conditions to be recoded\n",
"- sintax: `np.select(conditon, choicelist, default)`"
]
},
{
"cell_type": "code",
"execution_count": 170,
"id": "14484ecd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" where_is_hoster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" 4+ | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 4+ | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" 1 | \n",
"
\n",
" \n",
" 16 | \n",
" 4+ | \n",
"
\n",
" \n",
" 17 | \n",
" 3 | \n",
"
\n",
" \n",
" 18 | \n",
" 4+ | \n",
"
\n",
" \n",
" 19 | \n",
" 4+ | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 1 columns
\n",
"
"
],
"text/plain": [
" where_is_hoster\n",
"0 1\n",
"1 1\n",
"2 4+\n",
"3 2\n",
"4 4+\n",
".. ...\n",
"15 1\n",
"16 4+\n",
"17 3\n",
"18 4+\n",
"19 4+\n",
"\n",
"[20 rows x 1 columns]"
]
},
"execution_count": 170,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# step one: create a list of conditions\n",
"condition = [wc[\"Winner\"]==wc[\"Country\"], \n",
" wc[\"Runners-Up\"]==wc[\"Country\"],\n",
" wc[\"Third\"]==wc[\"Country\"]\n",
" ]\n",
"# step two: create the choice list\n",
"choice_list = [1, 2, 3]\n",
"\n",
"# recode\n",
"(wc\n",
" .assign(where_is_hoster=np.select(condition, choice_list, default=\"4+\"))\n",
" .filter([\"where_is_hoster\"])\n",
")"
]
},
{
"cell_type": "markdown",
"id": "dbb2eba7",
"metadata": {},
"source": [
"### Group by: split-apply-combine\n",
"\n",
"**Functionality**:\n",
"\n",
"- Grouping data by specific variables/column indices\n",
"- Summarize/aggregate data by specific group features"
]
},
{
"cell_type": "markdown",
"id": "a28f879d",
"metadata": {},
"source": [
"**How it works**\n",
"\n",
"“group by” is a common data wrangling process that exists in any language (R, Python, SQL) and refers to a process involving one or more of the following steps:\n",
"\n",
"- Splitting the data into groups based on some criteria.\n",
"\n",
"- Applying a function to each group independently.\n",
"\n",
"- Combining the results into a data structure.\n",
"\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "1521fda3",
"metadata": {},
"source": [
"### groupby (it just splits!)\n",
"\n",
"The `groupby()` method in pandas splits your dataset in smaller parts. It generates an iterable where each group is broken up into a tuple (group,data). "
]
},
{
"cell_type": "code",
"execution_count": 171,
"id": "c1220088",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 171,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load worldcup dataset\n",
"wc = pd.read_csv(\"WorldCups.csv\")\n",
"wc_matches = pd.read_csv(\"WorldCupMatches.csv\")\n",
"\n",
"#groupby object\n",
"g = wc.groupby([\"Winner\"])\n",
"g"
]
},
{
"cell_type": "markdown",
"id": "3a247859",
"metadata": {},
"source": [
"The output of a groupby method is a flexible abstraction: although more complicated things are happening under the hood, in many ways, it can be treated as a collection of ``DataFrame``s. And as we learned, any collector can be iterated over!\n"
]
},
{
"cell_type": "code",
"execution_count": 172,
"id": "6c44b6ea",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Argentina\n",
"Brazil\n",
"England\n",
"France\n",
"Germany\n",
"Germany FR\n",
"Italy\n",
"Spain\n",
"Uruguay\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/jy/10_nyhkn3nv_rrbnd8f_fr940000gp/T/ipykernel_52085/2794934486.py:2: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.\n",
" for group, data in g:\n"
]
}
],
"source": [
"# Iteration for groups\n",
"for group, data in g:\n",
" print(group)"
]
},
{
"cell_type": "code",
"execution_count": 173,
"id": "996fa6dc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"10 1978 Argentina Argentina Netherlands Brazil Italy 102 \n",
"12 1986 Mexico Argentina Germany FR France Belgium 132 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"10 16 38 1.545.791 \n",
"12 24 52 2.394.031 \n",
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"5 1958 Sweden Brazil Sweden France Germany FR 126 \n",
"6 1962 Chile Brazil Czechoslovakia Chile Yugoslavia 89 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"5 16 35 819.810 \n",
"6 16 32 893.172 \n",
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"7 1966 England England Germany FR Portugal Soviet Union 89 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"7 16 32 1.563.135 \n",
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"15 1998 France France Brazil Croatia Netherlands 171 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"15 32 64 2.785.100 \n",
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"19 2014 Brazil Germany Argentina Netherlands Brazil 171 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"19 32 64 3.386.810 \n",
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"4 1954 Switzerland Germany FR Hungary Austria Uruguay 140 \n",
"9 1974 Germany Germany FR Netherlands Poland Brazil 97 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"4 16 26 768.607 \n",
"9 16 38 1.865.753 \n",
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"1 1934 Italy Italy Czechoslovakia Germany Austria 70 \n",
"2 1938 France Italy Hungary Brazil Sweden 84 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"1 16 17 363.000 \n",
"2 15 18 375.700 \n",
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"18 2010 South Africa Spain Netherlands Germany Uruguay 145 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"18 32 64 3.178.856 \n",
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"0 1930 Uruguay Uruguay Argentina USA Yugoslavia 70 \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain 88 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"0 13 18 590.549 \n",
"3 13 22 1.045.246 \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/jy/10_nyhkn3nv_rrbnd8f_fr940000gp/T/ipykernel_52085/4250474981.py:2: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.\n",
" for group, data in g:\n"
]
}
],
"source": [
"# iteration for data grouped\n",
"for group, data in g:\n",
" print(data.head(2))"
]
},
{
"cell_type": "markdown",
"id": "b048520c",
"metadata": {},
"source": [
"And we can acess a specific group:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "ffb740c9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
"
\n",
" \n",
" 12 | \n",
" 1986 | \n",
" Mexico | \n",
" Argentina | \n",
" Germany FR | \n",
" France | \n",
" Belgium | \n",
" 132 | \n",
" 24 | \n",
" 52 | \n",
" 2.394.031 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"10 1978 Argentina Argentina Netherlands Brazil Italy 102 \n",
"12 1986 Mexico Argentina Germany FR France Belgium 132 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"10 16 38 1.545.791 \n",
"12 24 52 2.394.031 "
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"g.get_group(\"Argentina\")"
]
},
{
"cell_type": "markdown",
"id": "a5c249cb",
"metadata": {},
"source": [
"### Aggreations (summarize)\n",
"\n",
"The power of a grouping function (like `.groupby()` shines when coupled with an aggregation operation.\n",
"\n",
"An aggregation is any operation that reduces the dimensionality of the data!"
]
},
{
"cell_type": "markdown",
"id": "2cea9c79",
"metadata": {},
"source": [
"#### `pandas`: `.groupby()` + built-in methods"
]
},
{
"cell_type": "code",
"execution_count": 174,
"id": "95448a61",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/jy/10_nyhkn3nv_rrbnd8f_fr940000gp/T/ipykernel_52085/2212436017.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n",
" wc.groupby([\"Winner\"]).mean()\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
"
\n",
" \n",
" Winner | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Argentina | \n",
" 1982.000000 | \n",
" 117.000000 | \n",
" 20.000000 | \n",
" 45.000000 | \n",
"
\n",
" \n",
" Brazil | \n",
" 1977.200000 | \n",
" 122.400000 | \n",
" 20.800000 | \n",
" 43.000000 | \n",
"
\n",
" \n",
" England | \n",
" 1966.000000 | \n",
" 89.000000 | \n",
" 16.000000 | \n",
" 32.000000 | \n",
"
\n",
" \n",
" France | \n",
" 1998.000000 | \n",
" 171.000000 | \n",
" 32.000000 | \n",
" 64.000000 | \n",
"
\n",
" \n",
" Germany | \n",
" 2014.000000 | \n",
" 171.000000 | \n",
" 32.000000 | \n",
" 64.000000 | \n",
"
\n",
" \n",
" Germany FR | \n",
" 1972.666667 | \n",
" 117.333333 | \n",
" 18.666667 | \n",
" 38.666667 | \n",
"
\n",
" \n",
" Italy | \n",
" 1965.000000 | \n",
" 111.750000 | \n",
" 21.750000 | \n",
" 37.750000 | \n",
"
\n",
" \n",
" Spain | \n",
" 2010.000000 | \n",
" 145.000000 | \n",
" 32.000000 | \n",
" 64.000000 | \n",
"
\n",
" \n",
" Uruguay | \n",
" 1940.000000 | \n",
" 79.000000 | \n",
" 13.000000 | \n",
" 20.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year GoalsScored QualifiedTeams MatchesPlayed\n",
"Winner \n",
"Argentina 1982.000000 117.000000 20.000000 45.000000\n",
"Brazil 1977.200000 122.400000 20.800000 43.000000\n",
"England 1966.000000 89.000000 16.000000 32.000000\n",
"France 1998.000000 171.000000 32.000000 64.000000\n",
"Germany 2014.000000 171.000000 32.000000 64.000000\n",
"Germany FR 1972.666667 117.333333 18.666667 38.666667\n",
"Italy 1965.000000 111.750000 21.750000 37.750000\n",
"Spain 2010.000000 145.000000 32.000000 64.000000\n",
"Uruguay 1940.000000 79.000000 13.000000 20.000000"
]
},
"execution_count": 174,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# mean of all numeric variables grouping by winners\n",
"wc.groupby([\"Winner\"]).mean()"
]
},
{
"cell_type": "markdown",
"id": "20413049",
"metadata": {},
"source": [
"or select a specific variable to perform the aggregation step on. "
]
},
{
"cell_type": "code",
"execution_count": 176,
"id": "53b1f10d",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Winner\n",
"Argentina 117.000000\n",
"Brazil 122.400000\n",
"England 89.000000\n",
"France 171.000000\n",
"Germany 171.000000\n",
"Germany FR 117.333333\n",
"Italy 111.750000\n",
"Spain 145.000000\n",
"Uruguay 79.000000\n",
"Name: GoalsScored, dtype: float64"
]
},
"execution_count": 176,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# With a specific input\n",
"wc.groupby([\"Winner\"])[\"GoalsScored\"].mean()"
]
},
{
"cell_type": "markdown",
"id": "fc1e3d2c",
"metadata": {},
"source": [
"Notice the results come in a index structure. You can easily convert back to a fully formated dataframe by: "
]
},
{
"cell_type": "code",
"execution_count": 177,
"id": "024eb5e6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" GoalsScored | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Argentina | \n",
" 117.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" Brazil | \n",
" 122.400000 | \n",
"
\n",
" \n",
" 2 | \n",
" England | \n",
" 89.000000 | \n",
"
\n",
" \n",
" 3 | \n",
" France | \n",
" 171.000000 | \n",
"
\n",
" \n",
" 4 | \n",
" Germany | \n",
" 171.000000 | \n",
"
\n",
" \n",
" 5 | \n",
" Germany FR | \n",
" 117.333333 | \n",
"
\n",
" \n",
" 6 | \n",
" Italy | \n",
" 111.750000 | \n",
"
\n",
" \n",
" 7 | \n",
" Spain | \n",
" 145.000000 | \n",
"
\n",
" \n",
" 8 | \n",
" Uruguay | \n",
" 79.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Winner GoalsScored\n",
"0 Argentina 117.000000\n",
"1 Brazil 122.400000\n",
"2 England 89.000000\n",
"3 France 171.000000\n",
"4 Germany 171.000000\n",
"5 Germany FR 117.333333\n",
"6 Italy 111.750000\n",
"7 Spain 145.000000\n",
"8 Uruguay 79.000000"
]
},
"execution_count": 177,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# reseting the index\n",
"wc.groupby([\"Winner\"])[\"GoalsScored\"].mean().reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 178,
"id": "8523a68e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" GoalsScored | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Argentina | \n",
" 117.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" Brazil | \n",
" 122.400000 | \n",
"
\n",
" \n",
" 2 | \n",
" England | \n",
" 89.000000 | \n",
"
\n",
" \n",
" 3 | \n",
" France | \n",
" 171.000000 | \n",
"
\n",
" \n",
" 4 | \n",
" Germany | \n",
" 171.000000 | \n",
"
\n",
" \n",
" 5 | \n",
" Germany FR | \n",
" 117.333333 | \n",
"
\n",
" \n",
" 6 | \n",
" Italy | \n",
" 111.750000 | \n",
"
\n",
" \n",
" 7 | \n",
" Spain | \n",
" 145.000000 | \n",
"
\n",
" \n",
" 8 | \n",
" Uruguay | \n",
" 79.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Winner GoalsScored\n",
"0 Argentina 117.000000\n",
"1 Brazil 122.400000\n",
"2 England 89.000000\n",
"3 France 171.000000\n",
"4 Germany 171.000000\n",
"5 Germany FR 117.333333\n",
"6 Italy 111.750000\n",
"7 Spain 145.000000\n",
"8 Uruguay 79.000000"
]
},
"execution_count": 178,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## as_index=false\n",
"wc.groupby([\"Winner\"], as_index=False)[\"GoalsScored\"].mean()"
]
},
{
"cell_type": "markdown",
"id": "51d4cfb8",
"metadata": {},
"source": [
"Pandas offers many built-in methods to perform aggregations. Here is a list:"
]
},
{
"cell_type": "markdown",
"id": "e04a008d",
"metadata": {},
"source": [
"#### Built-in Methods\n",
"\n",
"[See user guide from Pandas Documentation](https://pandas.pydata.org/docs/user_guide/groupby.html)\n",
"\n",
"| Method | Functionality |\n",
"|:---------------:|:-----------------------------|\n",
"|`.any`|Compute whether any of the values in the groups are truthy| \n",
"|`.all`|Compute whether all of the values in the groups are truthy|\n",
"|`.count`|Compute the number of non-NA values in the groups|\n",
"|`.cov` |Compute the covariance of the groups|\n",
"|`.first`|Compute the first occurring value in each group|\n",
"|`.idxmax` |Compute the index of the maximum value in each group|\n",
"| `.idxmin`| Compute the index of the minimum value in each group|\n",
"|`.last`|Compute the last occurring value in each group|\n",
"|`.max`|Compute the maximum value in each group|\n",
"|`.mean`|Compute the mean of each group|\n",
"|`.median`|Compute the median of each group|\n",
"|`.min`|Compute the minimum value in each group|\n",
"|`.nunique`|Compute the number of unique values in each group|\n",
"|`.prod`|Compute the product of the values in each group|\n",
"|`.quantile`|Compute a given quantile of the values in each group|\n",
"|`.sem`|Compute the standard error of the mean of the values in each group|\n",
"|`.size`|Compute the number of values in each group|\n",
"|`.skew` |Compute the skew of the values in each group|\n",
"|`.std`|Compute the standard deviation of the values in each group|\n",
"|`.sum`|Compute the sum of the values in each group|\n",
"|`.var`|Compute the variance of the values in each group|"
]
},
{
"cell_type": "markdown",
"id": "e7e885e8",
"metadata": {},
"source": [
"Let's see some examples:"
]
},
{
"cell_type": "code",
"execution_count": 179,
"id": "a811968b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Home Team Name | \n",
" Home Team Goals | \n",
"
\n",
" \n",
" \n",
" \n",
" 31 | \n",
" Hungary | \n",
" 10 | \n",
"
\n",
" \n",
" 71 | \n",
" Yugoslavia | \n",
" 9 | \n",
"
\n",
" \n",
" 69 | \n",
" Uruguay | \n",
" 8 | \n",
"
\n",
" \n",
" 62 | \n",
" Sweden | \n",
" 8 | \n",
"
\n",
" \n",
" 25 | \n",
" Germany | \n",
" 8 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 56 | \n",
" Serbia | \n",
" 0 | \n",
"
\n",
" \n",
" 17 | \n",
" Czech Republic | \n",
" 0 | \n",
"
\n",
" \n",
" 64 | \n",
" Togo | \n",
" 0 | \n",
"
\n",
" \n",
" 32 | \n",
" IR Iran | \n",
" 0 | \n",
"
\n",
" \n",
" 77 | \n",
" rn\">United Arab Emirates | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
78 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Home Team Name Home Team Goals\n",
"31 Hungary 10\n",
"71 Yugoslavia 9\n",
"69 Uruguay 8\n",
"62 Sweden 8\n",
"25 Germany 8\n",
".. ... ...\n",
"56 Serbia 0\n",
"17 Czech Republic 0\n",
"64 Togo 0\n",
"32 IR Iran 0\n",
"77 rn\">United Arab Emirates 0\n",
"\n",
"[78 rows x 2 columns]"
]
},
"execution_count": 179,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# max goal by team as home team\n",
"(wc_matches\n",
" .groupby([\"Home Team Name\"])\n",
" [\"Home Team Goals\"]\n",
" .max()\n",
" .reset_index()\n",
" .sort_values(\"Home Team Goals\", ascending=False))"
]
},
{
"cell_type": "code",
"execution_count": 180,
"id": "e66300a4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Home Team Name | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 7 | \n",
" Brazil | \n",
" 82 | \n",
"
\n",
" \n",
" 35 | \n",
" Italy | \n",
" 57 | \n",
"
\n",
" \n",
" 2 | \n",
" Argentina | \n",
" 54 | \n",
"
\n",
" \n",
" 26 | \n",
" Germany FR | \n",
" 43 | \n",
"
\n",
" \n",
" 22 | \n",
" England | \n",
" 35 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 32 | \n",
" IR Iran | \n",
" 1 | \n",
"
\n",
" \n",
" 43 | \n",
" New Zealand | \n",
" 1 | \n",
"
\n",
" \n",
" 33 | \n",
" Iran | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" Angola | \n",
" 1 | \n",
"
\n",
" \n",
" 77 | \n",
" rn\">United Arab Emirates | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
78 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Home Team Name size\n",
"7 Brazil 82\n",
"35 Italy 57\n",
"2 Argentina 54\n",
"26 Germany FR 43\n",
"22 England 35\n",
".. ... ...\n",
"32 IR Iran 1\n",
"43 New Zealand 1\n",
"33 Iran 1\n",
"1 Angola 1\n",
"77 rn\">United Arab Emirates 1\n",
"\n",
"[78 rows x 2 columns]"
]
},
"execution_count": 180,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of matches as home team\n",
"(wc_matches\n",
" .groupby([\"Home Team Name\"], as_index=False)\n",
" [\"Home Team Name\"]\n",
" .size()\n",
" .sort_values(\"size\", ascending=False)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f3852470",
"metadata": {},
"source": [
"#### `pandas`: `.groupby()` + `agg()`"
]
},
{
"cell_type": "markdown",
"id": "036a4bb2",
"metadata": {},
"source": [
"Alternatively, we can specify a whole range of operations to aggregate by (along with specific variable columns) using the `.aggregate()`/`.agg()` method. To keep track of which operations correspond which variable, `pandas` will generate a hierarchical index for column entries. "
]
},
{
"cell_type": "code",
"execution_count": 181,
"id": "f496a540",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" mean | \n",
" std | \n",
" median | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Argentina | \n",
" 117.000000 | \n",
" 21.213203 | \n",
" 117.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Brazil | \n",
" 122.400000 | \n",
" 30.476220 | \n",
" 126.0 | \n",
"
\n",
" \n",
" 2 | \n",
" England | \n",
" 89.000000 | \n",
" NaN | \n",
" 89.0 | \n",
"
\n",
" \n",
" 3 | \n",
" France | \n",
" 171.000000 | \n",
" NaN | \n",
" 171.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Germany | \n",
" 171.000000 | \n",
" NaN | \n",
" 171.0 | \n",
"
\n",
" \n",
" 5 | \n",
" Germany FR | \n",
" 117.333333 | \n",
" 21.594752 | \n",
" 115.0 | \n",
"
\n",
" \n",
" 6 | \n",
" Italy | \n",
" 111.750000 | \n",
" 40.532908 | \n",
" 115.0 | \n",
"
\n",
" \n",
" 7 | \n",
" Spain | \n",
" 145.000000 | \n",
" NaN | \n",
" 145.0 | \n",
"
\n",
" \n",
" 8 | \n",
" Uruguay | \n",
" 79.000000 | \n",
" 12.727922 | \n",
" 79.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Winner mean std median\n",
"0 Argentina 117.000000 21.213203 117.0\n",
"1 Brazil 122.400000 30.476220 126.0\n",
"2 England 89.000000 NaN 89.0\n",
"3 France 171.000000 NaN 171.0\n",
"4 Germany 171.000000 NaN 171.0\n",
"5 Germany FR 117.333333 21.594752 115.0\n",
"6 Italy 111.750000 40.532908 115.0\n",
"7 Spain 145.000000 NaN 145.0\n",
"8 Uruguay 79.000000 12.727922 79.0"
]
},
"execution_count": 181,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc.groupby([\"Winner\"])[\"GoalsScored\"].agg([\"mean\",\"std\",\"median\"]).reset_index()"
]
},
{
"cell_type": "markdown",
"id": "06cc51c1",
"metadata": {},
"source": [
"We can also **_user-defined functions_** into the `aggregate()` function as well."
]
},
{
"cell_type": "code",
"execution_count": 182,
"id": "2b97b99f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean | \n",
" std | \n",
" median | \n",
" mean_add_50 | \n",
"
\n",
" \n",
" Winner | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Argentina | \n",
" 117.000000 | \n",
" 21.213203 | \n",
" 117.0 | \n",
" 167.000000 | \n",
"
\n",
" \n",
" Brazil | \n",
" 122.400000 | \n",
" 30.476220 | \n",
" 126.0 | \n",
" 172.400000 | \n",
"
\n",
" \n",
" England | \n",
" 89.000000 | \n",
" NaN | \n",
" 89.0 | \n",
" 139.000000 | \n",
"
\n",
" \n",
" France | \n",
" 171.000000 | \n",
" NaN | \n",
" 171.0 | \n",
" 221.000000 | \n",
"
\n",
" \n",
" Germany | \n",
" 171.000000 | \n",
" NaN | \n",
" 171.0 | \n",
" 221.000000 | \n",
"
\n",
" \n",
" Germany FR | \n",
" 117.333333 | \n",
" 21.594752 | \n",
" 115.0 | \n",
" 167.333333 | \n",
"
\n",
" \n",
" Italy | \n",
" 111.750000 | \n",
" 40.532908 | \n",
" 115.0 | \n",
" 161.750000 | \n",
"
\n",
" \n",
" Spain | \n",
" 145.000000 | \n",
" NaN | \n",
" 145.0 | \n",
" 195.000000 | \n",
"
\n",
" \n",
" Uruguay | \n",
" 79.000000 | \n",
" 12.727922 | \n",
" 79.0 | \n",
" 129.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean std median mean_add_50\n",
"Winner \n",
"Argentina 117.000000 21.213203 117.0 167.000000\n",
"Brazil 122.400000 30.476220 126.0 172.400000\n",
"England 89.000000 NaN 89.0 139.000000\n",
"France 171.000000 NaN 171.0 221.000000\n",
"Germany 171.000000 NaN 171.0 221.000000\n",
"Germany FR 117.333333 21.594752 115.0 167.333333\n",
"Italy 111.750000 40.532908 115.0 161.750000\n",
"Spain 145.000000 NaN 145.0 195.000000\n",
"Uruguay 79.000000 12.727922 79.0 129.000000"
]
},
"execution_count": 182,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def mean_add_50(x):\n",
" return np.mean(x) + 50\n",
"\n",
"wc.groupby([\"Winner\"])[\"GoalsScored\"].agg([\"mean\",\"std\",\"median\",mean_add_50])"
]
},
{
"cell_type": "markdown",
"id": "c846d8ee",
"metadata": {},
"source": [
"`agg()` + `.rename()` provides an easy workflow to rename your newly created variables"
]
},
{
"cell_type": "code",
"execution_count": 183,
"id": "0313425f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" goals_mean | \n",
" goals_std | \n",
" goals_median | \n",
" mean_50_goals | \n",
"
\n",
" \n",
" Winner | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Argentina | \n",
" 117.000000 | \n",
" 21.213203 | \n",
" 117.0 | \n",
" 167.000000 | \n",
"
\n",
" \n",
" Brazil | \n",
" 122.400000 | \n",
" 30.476220 | \n",
" 126.0 | \n",
" 172.400000 | \n",
"
\n",
" \n",
" England | \n",
" 89.000000 | \n",
" NaN | \n",
" 89.0 | \n",
" 139.000000 | \n",
"
\n",
" \n",
" France | \n",
" 171.000000 | \n",
" NaN | \n",
" 171.0 | \n",
" 221.000000 | \n",
"
\n",
" \n",
" Germany | \n",
" 171.000000 | \n",
" NaN | \n",
" 171.0 | \n",
" 221.000000 | \n",
"
\n",
" \n",
" Germany FR | \n",
" 117.333333 | \n",
" 21.594752 | \n",
" 115.0 | \n",
" 167.333333 | \n",
"
\n",
" \n",
" Italy | \n",
" 111.750000 | \n",
" 40.532908 | \n",
" 115.0 | \n",
" 161.750000 | \n",
"
\n",
" \n",
" Spain | \n",
" 145.000000 | \n",
" NaN | \n",
" 145.0 | \n",
" 195.000000 | \n",
"
\n",
" \n",
" Uruguay | \n",
" 79.000000 | \n",
" 12.727922 | \n",
" 79.0 | \n",
" 129.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" goals_mean goals_std goals_median mean_50_goals\n",
"Winner \n",
"Argentina 117.000000 21.213203 117.0 167.000000\n",
"Brazil 122.400000 30.476220 126.0 172.400000\n",
"England 89.000000 NaN 89.0 139.000000\n",
"France 171.000000 NaN 171.0 221.000000\n",
"Germany 171.000000 NaN 171.0 221.000000\n",
"Germany FR 117.333333 21.594752 115.0 167.333333\n",
"Italy 111.750000 40.532908 115.0 161.750000\n",
"Spain 145.000000 NaN 145.0 195.000000\n",
"Uruguay 79.000000 12.727922 79.0 129.000000"
]
},
"execution_count": 183,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(wc.groupby([\"Winner\"])[\"GoalsScored\"].\n",
" agg([\"mean\",\"std\",\"median\",mean_add_50]).\n",
" rename(columns={\"mean\": \"goals_mean\", \n",
" \"std\": \"goals_std\",\n",
" \"median\": \"goals_median\", \n",
" \"mean_add_50\":\"mean_50_goals\"})\n",
") "
]
},
{
"cell_type": "markdown",
"id": "ff932c40",
"metadata": {},
"source": [
"#### `pandas`: `.groupby()` + `.apply()` \n",
"\n",
"Even though I cover this in more details on the miscellaneous notebook, here is a good moment to bring the `.apply()` methods from pandas one more time. \n",
"\n",
"The `apply` method lets you apply an arbitrary function by row or by columns on your data frame. As you can imagine, it also allow you to apply results by at a `groupby` object, and returns a summarizes dataset. \n",
"\n",
"The function should take a `DataFrame` and returns either a Pandas object (e.g., `DataFrame`, `Series`) or a scalar\n",
"\n",
"Let's an example with the mean add function:"
]
},
{
"cell_type": "code",
"execution_count": 184,
"id": "bfa10fbb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Country | \n",
" GoalsScored | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Argentina | \n",
" 152.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Brazil | \n",
" 179.5 | \n",
"
\n",
" \n",
" 2 | \n",
" Chile | \n",
" 139.0 | \n",
"
\n",
" \n",
" 3 | \n",
" England | \n",
" 139.0 | \n",
"
\n",
" \n",
" 4 | \n",
" France | \n",
" 177.5 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 10 | \n",
" Spain | \n",
" 196.0 | \n",
"
\n",
" \n",
" 11 | \n",
" Sweden | \n",
" 176.0 | \n",
"
\n",
" \n",
" 12 | \n",
" Switzerland | \n",
" 190.0 | \n",
"
\n",
" \n",
" 13 | \n",
" USA | \n",
" 191.0 | \n",
"
\n",
" \n",
" 14 | \n",
" Uruguay | \n",
" 120.0 | \n",
"
\n",
" \n",
"
\n",
"
15 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Country GoalsScored\n",
"0 Argentina 152.0\n",
"1 Brazil 179.5\n",
"2 Chile 139.0\n",
"3 England 139.0\n",
"4 France 177.5\n",
".. ... ...\n",
"10 Spain 196.0\n",
"11 Sweden 176.0\n",
"12 Switzerland 190.0\n",
"13 USA 191.0\n",
"14 Uruguay 120.0\n",
"\n",
"[15 rows x 2 columns]"
]
},
"execution_count": 184,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(wc.groupby([\"Country\"])[\"GoalsScored\"]. \n",
" apply(mean_add_50). \n",
" reset_index()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "97a69e83",
"metadata": {},
"source": [
"#### multi-index grouping"
]
},
{
"cell_type": "markdown",
"id": "d99abdfc",
"metadata": {},
"source": [
"We can also group by more than one variable (i.e. implement a **_multi-index on the rows_**)."
]
},
{
"cell_type": "code",
"execution_count": 185,
"id": "d454821f",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Winner | \n",
" Year | \n",
" GoalsScored | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Argentina | \n",
" 1978 | \n",
" 102.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Argentina | \n",
" 1986 | \n",
" 132.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Brazil | \n",
" 1958 | \n",
" 126.0 | \n",
"
\n",
" \n",
" 3 | \n",
" Brazil | \n",
" 1962 | \n",
" 89.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Brazil | \n",
" 1970 | \n",
" 95.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Winner Year GoalsScored\n",
"0 Argentina 1978 102.0\n",
"1 Argentina 1986 132.0\n",
"2 Brazil 1958 126.0\n",
"3 Brazil 1962 89.0\n",
"4 Brazil 1970 95.0"
]
},
"execution_count": 185,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc.groupby([\"Winner\", \"Year\"])[\"GoalsScored\"].mean().reset_index().head(5)"
]
},
{
"cell_type": "markdown",
"id": "fee42614",
"metadata": {},
"source": [
"#### `pandas`: `.groupby()` + `.transform()`\n",
"\n",
"Other times we want to implement data manipulations by some grouping variable but retain structure of the original data. Put differently, our aim is not to aggregate but to perform some operation across specific groups. \n",
"\n",
"To do so, we will combine `.groupby()` and `.transform()` methods. Let's see an example: "
]
},
{
"cell_type": "code",
"execution_count": 186,
"id": "8caf7cb5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 79.000000\n",
"1 111.750000\n",
"2 111.750000\n",
"3 79.000000\n",
"4 117.333333\n",
" ... \n",
"15 171.000000\n",
"16 122.400000\n",
"17 111.750000\n",
"18 145.000000\n",
"19 171.000000\n",
"Name: GoalsScored, Length: 20, dtype: float64"
]
},
"execution_count": 186,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create a new column\n",
"# notive you need to select the column you want to transform. \n",
"wc.groupby([\"Winner\"])[\"GoalsScored\"].transform(\"mean\")"
]
},
{
"cell_type": "code",
"execution_count": 188,
"id": "a4f9c1c3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
" goals_score_wc_mean_wc | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1930 | \n",
" Uruguay | \n",
" Uruguay | \n",
" Argentina | \n",
" USA | \n",
" Yugoslavia | \n",
" 70 | \n",
" 13 | \n",
" 18 | \n",
" 590.549 | \n",
" 79.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" 1934 | \n",
" Italy | \n",
" Italy | \n",
" Czechoslovakia | \n",
" Germany | \n",
" Austria | \n",
" 70 | \n",
" 16 | \n",
" 17 | \n",
" 363.000 | \n",
" 111.750000 | \n",
"
\n",
" \n",
" 2 | \n",
" 1938 | \n",
" France | \n",
" Italy | \n",
" Hungary | \n",
" Brazil | \n",
" Sweden | \n",
" 84 | \n",
" 15 | \n",
" 18 | \n",
" 375.700 | \n",
" 111.750000 | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
" 79.000000 | \n",
"
\n",
" \n",
" 4 | \n",
" 1954 | \n",
" Switzerland | \n",
" Germany FR | \n",
" Hungary | \n",
" Austria | \n",
" Uruguay | \n",
" 140 | \n",
" 16 | \n",
" 26 | \n",
" 768.607 | \n",
" 117.333333 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 15 | \n",
" 1998 | \n",
" France | \n",
" France | \n",
" Brazil | \n",
" Croatia | \n",
" Netherlands | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 2.785.100 | \n",
" 171.000000 | \n",
"
\n",
" \n",
" 16 | \n",
" 2002 | \n",
" Korea/Japan | \n",
" Brazil | \n",
" Germany | \n",
" Turkey | \n",
" Korea Republic | \n",
" 161 | \n",
" 32 | \n",
" 64 | \n",
" 2.705.197 | \n",
" 122.400000 | \n",
"
\n",
" \n",
" 17 | \n",
" 2006 | \n",
" Germany | \n",
" Italy | \n",
" France | \n",
" Germany | \n",
" Portugal | \n",
" 147 | \n",
" 32 | \n",
" 64 | \n",
" 3.359.439 | \n",
" 111.750000 | \n",
"
\n",
" \n",
" 18 | \n",
" 2010 | \n",
" South Africa | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
" 145.000000 | \n",
"
\n",
" \n",
" 19 | \n",
" 2014 | \n",
" Brazil | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
" 171.000000 | \n",
"
\n",
" \n",
"
\n",
"
20 rows × 11 columns
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third \\\n",
"0 1930 Uruguay Uruguay Argentina USA \n",
"1 1934 Italy Italy Czechoslovakia Germany \n",
"2 1938 France Italy Hungary Brazil \n",
"3 1950 Brazil Uruguay Brazil Sweden \n",
"4 1954 Switzerland Germany FR Hungary Austria \n",
".. ... ... ... ... ... \n",
"15 1998 France France Brazil Croatia \n",
"16 2002 Korea/Japan Brazil Germany Turkey \n",
"17 2006 Germany Italy France Germany \n",
"18 2010 South Africa Spain Netherlands Germany \n",
"19 2014 Brazil Germany Argentina Netherlands \n",
"\n",
" Fourth GoalsScored QualifiedTeams MatchesPlayed Attendance \\\n",
"0 Yugoslavia 70 13 18 590.549 \n",
"1 Austria 70 16 17 363.000 \n",
"2 Sweden 84 15 18 375.700 \n",
"3 Spain 88 13 22 1.045.246 \n",
"4 Uruguay 140 16 26 768.607 \n",
".. ... ... ... ... ... \n",
"15 Netherlands 171 32 64 2.785.100 \n",
"16 Korea Republic 161 32 64 2.705.197 \n",
"17 Portugal 147 32 64 3.359.439 \n",
"18 Uruguay 145 32 64 3.178.856 \n",
"19 Brazil 171 32 64 3.386.810 \n",
"\n",
" goals_score_wc_mean_wc \n",
"0 79.000000 \n",
"1 111.750000 \n",
"2 111.750000 \n",
"3 79.000000 \n",
"4 117.333333 \n",
".. ... \n",
"15 171.000000 \n",
"16 122.400000 \n",
"17 111.750000 \n",
"18 145.000000 \n",
"19 171.000000 \n",
"\n",
"[20 rows x 11 columns]"
]
},
"execution_count": 188,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# easily combined with assign\n",
"# create a new column\n",
"(wc.assign(goals_score_wc_mean_wc=wc.groupby([\"Winner\"])[\"GoalsScored\"].\n",
" transform(\"mean\")))"
]
},
{
"cell_type": "code",
"execution_count": 189,
"id": "b807e02c",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0 -9.000000\n",
"1 -41.750000\n",
"2 -27.750000\n",
"3 9.000000\n",
"4 22.666667\n",
" ... \n",
"15 0.000000\n",
"16 38.600000\n",
"17 35.250000\n",
"18 0.000000\n",
"19 0.000000\n",
"Name: GoalsScored, Length: 20, dtype: float64"
]
},
"execution_count": 189,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# also: very useful with lambda functions\n",
"wc.groupby(\"Winner\")[\"GoalsScored\"].transform(lambda x: x - x.mean())"
]
},
{
"cell_type": "markdown",
"id": "534b77f0",
"metadata": {},
"source": [
"### Sorting values"
]
},
{
"cell_type": "code",
"execution_count": 190,
"id": "dac3752f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
"
\n",
" \n",
" 19 | \n",
" 2014 | \n",
" Brazil | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
"
\n",
" \n",
" 3 | \n",
" 1950 | \n",
" Brazil | \n",
" Uruguay | \n",
" Brazil | \n",
" Sweden | \n",
" Spain | \n",
" 88 | \n",
" 13 | \n",
" 22 | \n",
" 1.045.246 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth GoalsScored \\\n",
"10 1978 Argentina Argentina Netherlands Brazil Italy 102 \n",
"19 2014 Brazil Germany Argentina Netherlands Brazil 171 \n",
"3 1950 Brazil Uruguay Brazil Sweden Spain 88 \n",
"\n",
" QualifiedTeams MatchesPlayed Attendance \n",
"10 16 38 1.545.791 \n",
"19 32 64 3.386.810 \n",
"3 13 22 1.045.246 "
]
},
"execution_count": 190,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pandas: sort values by a column variable (ascending)\n",
"wc.sort_values('Country').head(3)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"id": "e67e7a37",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 19 | \n",
" 2014 | \n",
" Brazil | \n",
" Germany | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" 171 | \n",
" 32 | \n",
" 64 | \n",
" 3.386.810 | \n",
"
\n",
" \n",
" 18 | \n",
" 2010 | \n",
" South Africa | \n",
" Spain | \n",
" Netherlands | \n",
" Germany | \n",
" Uruguay | \n",
" 145 | \n",
" 32 | \n",
" 64 | \n",
" 3.178.856 | \n",
"
\n",
" \n",
" 17 | \n",
" 2006 | \n",
" Germany | \n",
" Italy | \n",
" France | \n",
" Germany | \n",
" Portugal | \n",
" 147 | \n",
" 32 | \n",
" 64 | \n",
" 3.359.439 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"19 2014 Brazil Germany Argentina Netherlands Brazil \n",
"18 2010 South Africa Spain Netherlands Germany Uruguay \n",
"17 2006 Germany Italy France Germany Portugal \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance \n",
"19 171 32 64 3.386.810 \n",
"18 145 32 64 3.178.856 \n",
"17 147 32 64 3.359.439 "
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pandas: sort values by a column variable (descending)\n",
"wc.sort_values('Year',ascending=False).head(3)"
]
},
{
"cell_type": "code",
"execution_count": 191,
"id": "cba57ff5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Country | \n",
" Winner | \n",
" Runners-Up | \n",
" Third | \n",
" Fourth | \n",
" GoalsScored | \n",
" QualifiedTeams | \n",
" MatchesPlayed | \n",
" Attendance | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" 1978 | \n",
" Argentina | \n",
" Argentina | \n",
" Netherlands | \n",
" Brazil | \n",
" Italy | \n",
" 102 | \n",
" 16 | \n",
" 38 | \n",
" 1.545.791 | \n",
"
\n",
" \n",
" 12 | \n",
" 1986 | \n",
" Mexico | \n",
" Argentina | \n",
" Germany FR | \n",
" France | \n",
" Belgium | \n",
" 132 | \n",
" 24 | \n",
" 52 | \n",
" 2.394.031 | \n",
"
\n",
" \n",
" 6 | \n",
" 1962 | \n",
" Chile | \n",
" Brazil | \n",
" Czechoslovakia | \n",
" Chile | \n",
" Yugoslavia | \n",
" 89 | \n",
" 16 | \n",
" 32 | \n",
" 893.172 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Country Winner Runners-Up Third Fourth \\\n",
"10 1978 Argentina Argentina Netherlands Brazil Italy \n",
"12 1986 Mexico Argentina Germany FR France Belgium \n",
"6 1962 Chile Brazil Czechoslovakia Chile Yugoslavia \n",
"\n",
" GoalsScored QualifiedTeams MatchesPlayed Attendance \n",
"10 102 16 38 1.545.791 \n",
"12 132 24 52 2.394.031 \n",
"6 89 16 32 893.172 "
]
},
"execution_count": 191,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pandas: sort values by more than one column variable \n",
"wc.sort_values(['Winner', \"Country\"]).head(3)"
]
},
{
"cell_type": "markdown",
"id": "58dd7e47",
"metadata": {},
"source": [
"### That was a lot! but you will get to keep this notebook for you! \n",
"\n",
"And remember of the [cheat sheet for pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)"
]
},
{
"cell_type": "markdown",
"id": "c4033260",
"metadata": {},
"source": [
"## Practice \n",
"\n",
"Using the same ACLED data from the previous exercise, answer:\n",
"\n",
"1. What are the different event types recorded?\n",
"\n",
"2. How many events are recorded for each year?\n",
"\n",
"3. What’s the most common event type in the data?\n",
"\n",
"4. Which countries had the highest number of reported fatalities?"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "c3752239",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Protests',\n",
" 'Battles',\n",
" 'Strategic developments',\n",
" 'Violence against civilians',\n",
" 'Riots',\n",
" 'Explosions/Remote violence']"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1\n",
"acled_sa = pd.read_csv(\"acled_south_america.csv\")\n",
"acled_sa[\"event_type\"].drop_duplicates().to_list()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "c9f5582c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2018 | \n",
" 28758 | \n",
"
\n",
" \n",
" 1 | \n",
" 2019 | \n",
" 26441 | \n",
"
\n",
" \n",
" 2 | \n",
" 2020 | \n",
" 26615 | \n",
"
\n",
" \n",
" 3 | \n",
" 2021 | \n",
" 30190 | \n",
"
\n",
" \n",
" 4 | \n",
" 2022 | \n",
" 32039 | \n",
"
\n",
" \n",
" 5 | \n",
" 2023 | \n",
" 23544 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year size\n",
"0 2018 28758\n",
"1 2019 26441\n",
"2 2020 26615\n",
"3 2021 30190\n",
"4 2022 32039\n",
"5 2023 23544"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 2\n",
"(acled_sa.groupby(\"year\", as_index=False)\n",
" [\"year\"]\n",
" .size()\n",
" ) "
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "ad40fcb7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" event_type | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Protests | \n",
" 80850 | \n",
"
\n",
" \n",
" 1 | \n",
" Battles | \n",
" 34042 | \n",
"
\n",
" \n",
" 2 | \n",
" Violence against civilians | \n",
" 30001 | \n",
"
\n",
" \n",
" 3 | \n",
" Riots | \n",
" 13447 | \n",
"
\n",
" \n",
" 4 | \n",
" Strategic developments | \n",
" 6917 | \n",
"
\n",
" \n",
" 5 | \n",
" Explosions/Remote violence | \n",
" 2330 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" event_type size\n",
"0 Protests 80850\n",
"1 Battles 34042\n",
"2 Violence against civilians 30001\n",
"3 Riots 13447\n",
"4 Strategic developments 6917\n",
"5 Explosions/Remote violence 2330"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 3 \n",
"(acled_sa.groupby(\"event_type\", as_index=False)\n",
" [\"event_type\"]\n",
" .size()\n",
" .sort_values(\"size\", ascending=False)\n",
" .reset_index(drop=True)\n",
" ) "
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "dfb9cb18",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" fatalities | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" Brazil | \n",
" 35842 | \n",
"
\n",
" \n",
" 4 | \n",
" Colombia | \n",
" 10553 | \n",
"
\n",
" \n",
" 13 | \n",
" Venezuela | \n",
" 8824 | \n",
"
\n",
" \n",
" 10 | \n",
" Peru | \n",
" 231 | \n",
"
\n",
" \n",
" 9 | \n",
" Paraguay | \n",
" 132 | \n",
"
\n",
" \n",
" 5 | \n",
" Ecuador | \n",
" 130 | \n",
"
\n",
" \n",
" 0 | \n",
" Argentina | \n",
" 117 | \n",
"
\n",
" \n",
" 3 | \n",
" Chile | \n",
" 108 | \n",
"
\n",
" \n",
" 1 | \n",
" Bolivia | \n",
" 88 | \n",
"
\n",
" \n",
" 11 | \n",
" Suriname | \n",
" 12 | \n",
"
\n",
" \n",
" 8 | \n",
" Guyana | \n",
" 11 | \n",
"
\n",
" \n",
" 12 | \n",
" Uruguay | \n",
" 7 | \n",
"
\n",
" \n",
" 7 | \n",
" French Guiana | \n",
" 2 | \n",
"
\n",
" \n",
" 6 | \n",
" Falkland Islands | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country fatalities\n",
"2 Brazil 35842\n",
"4 Colombia 10553\n",
"13 Venezuela 8824\n",
"10 Peru 231\n",
"9 Paraguay 132\n",
"5 Ecuador 130\n",
"0 Argentina 117\n",
"3 Chile 108\n",
"1 Bolivia 88\n",
"11 Suriname 12\n",
"8 Guyana 11\n",
"12 Uruguay 7\n",
"7 French Guiana 2\n",
"6 Falkland Islands 0"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"acled_sa.groupby(\"country\", as_index=False)[\"fatalities\"].sum().sort_values(\"fatalities\", ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1bc7cc09",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[NbConvertApp] Converting notebook _week-6c-data_wrangling_pandas.ipynb to html\n",
"[NbConvertApp] Writing 562236 bytes to _week-6c-data_wrangling_pandas.html\n"
]
}
],
"source": [
"!jupyter nbconvert _week-6c-data_wrangling_pandas.ipynb --to html --template classic"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}