In this Notebook we cover some miscellaneous data wrangling techniques:
import pandas as pd
import numpy as np
If you started your data science career with R, or are learning R simultaneously, you were likely exposed to the tidyverse world and its famous pipe. Here it is for you:
Pipe operators, introduced by the magrittr package in R, transformed the way code is written in R, and for good reason. In the past few years, piping has also become more prevalent in Python.
Pipe operators allow you to chain a sequence of methods into a single workflow. In pandas, the pipe operator (.pipe()) was introduced relatively recently (version 0.16.2).
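Before chaining built-in methods, it is worth seeing .pipe() itself in action: it slots a user-defined function into a chain, passing the DataFrame as the function's first argument. A minimal sketch (the top_n helper is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"country": ["Brazil", "Italy", "Brazil", "Germany"],
                   "year": [1970, 1970, 1994, 1990]})

# a user-defined step: take the n most frequent values of a column
def top_n(data, col, n):
    return data[col].value_counts().head(n)

# .pipe() passes df as the first argument of top_n
df.pipe(top_n, col="country", n=2)
```

Because .pipe() returns whatever the function returns, such steps can sit anywhere inside a longer method chain.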
Let's see some examples using the World Cup data. We want to count how many World Cup games each country played.
# read world cup data
wc = pd.read_csv("WorldCupMatches.csv")
# select columns
wc = wc.filter(['Year','Home Team Name','Away Team Name'])
# make it tidy
wc = wc.melt(id_vars=["Year"], var_name="var", value_name="country")
# group by
wc = wc.groupby("country")
# count occurrences
wc = wc.size()
# move index to column with new name
wc = wc.reset_index(name="n")
# sort
wc = wc.sort_values(by="n", ascending=False)
# print 10
wc.head(10)
wc = pd.read_csv("WorldCupMatches.csv")
# chain all steps: select, tidy, group, count, sort
wc_10 = (wc
         .filter(['Year','Home Team Name','Away Team Name'])
         .melt(id_vars=["Year"], var_name="var", value_name="country")
         .groupby("country")
         .size()
         .reset_index(name="n")
         .sort_values(by="n", ascending=False)
         )
# print 10
wc_10.head(10)
You could also write the whole chain on a single line, but that is not nice code to read!
wc.filter(['Year','Home Team Name','Away Team Name']).melt(id_vars=["Year"], var_name="var", value_name="country").groupby("country").size().reset_index(name="n").sort_values(by="n", ascending=False).head(10)
It should be clear by now, but these are some of the advantages of piping your code: you avoid creating intermediate variables, you avoid repeatedly overwriting the same variable, and the workflow reads top to bottom as a single sequence of steps.
Keep in mind pipes also have some disadvantages. In my view, especially when working in notebooks where you do not execute code line by line, pipes can make code a bit harder to debug, and they can hide errors that would be easier to spot by examining intermediate variables.
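One way to mitigate this is a checkpoint step: a small function that prints a summary and returns the DataFrame unchanged, slotted into the chain with .pipe(). A sketch (the peek helper is made up for illustration):

```python
import pandas as pd

def peek(data, label=""):
    # print the shape, then pass the DataFrame through unchanged
    print(label, data.shape)
    return data

df = pd.DataFrame({"x": [1, 2, 2, 3]})

out = (df
       .pipe(peek, label="raw:")
       .drop_duplicates()
       .pipe(peek, label="deduped:"))
```

Because peek returns its input, it can be inserted or removed anywhere in the chain without changing the result.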
Read more about Missing Data in Python Data Science Handbook
Real datasets are messy.
Missingness (non-response in surveys, lack of information for a particular year, wrong input in a database) is a frequent visitor in our day-to-day work with datasets.
Let's see a few approaches to work with missing data in pandas
First, we should learn how pandas records missing data. Pandas uses sentinels (a global annotation) to handle missing values; more specifically, it uses two already-existing Python null values to identify missings: the Python object None and the floating-point value np.nan.
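A quick demonstration of how the two sentinels, None and np.nan, behave: np.nan is a float, and None placed in a numeric Series is silently converted to NaN:

```python
import numpy as np
import pandas as pd

# np.nan is a floating-point value
print(type(np.nan))

# None in a numeric Series is stored as NaN,
# and the Series dtype becomes float64
s = pd.Series([1, None, 3.0])
print(s)
print(s.isnull())
```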
Let's see how to work with these values
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': ["id_1929", "id_2982", "id_2902"],
'age': [30, 5, 30],
'gender': ["M", "F", None],
'politics': [np.nan, "strongly agree", "agree"],
'super_sensitive_item': [np.nan,np.nan,"prefer not to answer"],
'social_media': ["Twitter", "Facebook",""]})
df
pd.isnull(): detect nulls
# check nulls for all columns
df.isnull()
# lets see overall
df.isnull().sum()
There are three main approaches when we need to deal with missing values:
list-wise deletion: ignore them and drop them.
recoding values: recode answers as missing
imputation: guess a plausible value that the data could take on (your homework will focus on this!)
.dropna(): List-wise deletion
If missings are just noise and do not matter for your analysis, you could choose to just drop them:
# To drop row if any NaN values are present:
df.dropna(axis=0)
# To drop column if any NaN values are present:
df.dropna(axis=1)
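.dropna() also takes a few useful parameters: how='all' drops a row only when every value is missing, thresh keeps rows with at least that many non-missing values, and subset restricts the check to certain columns. A quick sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 4.0]})

# drop a row only if ALL of its values are missing
df.dropna(how="all")

# keep rows with at least 2 non-missing values
df.dropna(thresh=2)

# only consider column 'b' when deciding what to drop
df.dropna(subset=["b"])
```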
Recoding values: sometimes we have values we consider to be missing, for example empty responses or "prefer not to answer" type responses, that pandas does not flag as null. Let's see a few different ways to recode them.
# with assign method
(df.assign(social_media=np.where(df.social_media=="", np.nan, df.social_media)))
# with a simple replace
df.replace("", np.nan)
.fillna(): imputation
Alternatively, we can fill missing values with a placeholder.
# set all nas to zero
df.fillna(0)
# or fill with a text placeholder
df.fillna("Missing")
forward-fill: forward propagation of previous values into current values.
df.ffill()
back-fill: backward propagation of future values into current values.
df.bfill()
Forward and back fill make little sense when we're not explicitly dealing with a time series containing the same units (e.g. countries).
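When you do have a panel of repeated units, forward fill should respect group boundaries, otherwise a value from one country leaks into the next. Grouping first avoids that (a sketch on made-up data):

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({"country": ["A", "A", "B", "B"],
                      "year": [2000, 2001, 2000, 2001],
                      "gdp": [1.0, np.nan, np.nan, 4.0]})

# ffill within each country, so A's value never fills B's gap
panel["gdp_filled"] = panel.groupby("country")["gdp"].ffill()
```

Note that country B's first year stays missing: there is no earlier value within its group to propagate.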
Note that this _barely scratches the surface of imputation techniques_. But we'll always want to think carefully about what it means to manufacture data when data doesn't exist.
Most often, imputation will come from statistical assumptions about the data-generating process of your variable of interest. For example: can we impute partisanship for American voters if we know their gender, age, race, and where they vote? Probably yes...
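As a small illustration along those lines, we could fill a missing value with the mean of the respondent's group, here a made-up gender-group mean:

```python
import numpy as np
import pandas as pd

survey = pd.DataFrame({"gender": ["M", "M", "F", "F"],
                       "income": [50.0, np.nan, 60.0, 80.0]})

# fill missing income with the mean income of the same gender group
survey["income_imp"] = survey["income"].fillna(
    survey.groupby("gender")["income"].transform("mean"))
```

This is a crude sketch of group-mean imputation; real applications should also think about the uncertainty such filled-in values carry.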
Quite often as we do data analysis, we need ways to apply flexible user-defined functions to entries (rows, columns, or cells) of our data frames. Writing a for-loop to iterate through a pandas DataFrame or Series will often work for these tasks, but we know loops can be inefficient and hard to read.
Pandas offers a set of built-in methods to apply user-defined functions on DataFrames. These are similar to the apply family of functions in base R, or the map functions in the tidyverse.
This is additional material. You will see that some of these tasks can also be done with .transform() and .agg(). But as you start consuming Python code from other programmers, it is useful to know more about .apply(), .map(), and .applymap() in pandas.
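For comparison, here is what .transform() and .agg() look like on a simple grouped DataFrame, before we turn to .apply():

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "x": [1, 2, 10]})

# .agg() collapses each group to one value
df.groupby("group")["x"].agg("mean")

# .transform() returns a value for every original row
df.groupby("group")["x"].transform("mean")
```

The difference is the shape of the result: .agg() gives one row per group, while .transform() keeps the original index, which makes it handy for creating new columns.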
.apply()
The apply() function is used when you want to apply a function along an axis of a DataFrame (either rows or columns). This can be either a built-in or a user-defined function.
import pandas as pd
# Creating a simple dataframe
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40],
'C': [100, 200, 300, 400]
})
# Define a function to apply
def my_func(x):
return x*2
# Using apply function
df.apply(my_func)
In this case, the apply()
function will take each column (or row if axis=1 is specified) and apply the function my_func
to it. The result will be a DataFrame with all elements doubled.
map()
The map() function is used to substitute each value in a Series with another value. It is typically used for mapping categorical data to numerical data, for example when preparing data for a machine learning algorithm. It can take a function, a dictionary, or a Series.
#Here's an example usage of map():
import pandas as pd
# Creating a simple series
s = pd.Series(['cat', 'dog', 'cow', 'cat', 'dog', 'cow'])
# Creating a mapping dictionary
animals = {'cat': 1, 'dog': 2, 'cow': 3}
# Using map function
s.map(animals)
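Besides a dictionary, .map() also accepts a function, and values absent from a mapping dictionary become NaN:

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'fish'])

# mapping with a function: uppercase each value
s.map(str.upper)

# mapping with a dict: 'fish' is not a key, so it becomes NaN
s.map({'cat': 1, 'dog': 2})
```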
applymap()
The applymap() function is used to apply a function to every single element in a DataFrame. It is similar to apply(), but it works element-wise. Below, applymap() applies the function my_func to every element, and the result is a DataFrame with all elements doubled.
# Here's an example usage of applymap():
import pandas as pd
# Creating a simple dataframe
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40],
'C': [100, 200, 300, 400]
})
# Define a function to apply
def my_func(x):
return x*2
# Using applymap function
df.applymap(my_func)
apply() allows you to use multiple columns, which is useful when you need to perform operations that require values from several columns. When you set axis=1, the function you're applying receives each row (instead of each column, as with the default axis=0).
# Here is an example where the goal is to compute a new column as the product of two existing columns:
import pandas as pd
# Creating a simple dataframe
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40]
})
# Define a function to apply
def multiply_cols(row):
return row['A'] * row['B']
# Using apply function
df['C'] = df.apply(multiply_cols, axis=1)
df
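Worth knowing: row-wise apply is slow on large frames because it calls a Python function for every row; when the operation is simple arithmetic, a vectorized expression gives the same result much faster:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# vectorized: operates on whole columns at once,
# same result as df.apply(multiply_cols, axis=1)
df['C'] = df['A'] * df['B']
```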
You can also use .apply()
with multiple arguments
# If your function takes multiple arguments, you can apply it to a
# DataFrame or Series and pass the extra arguments in the apply() call.
# Here's an example:
# Creating a simple dataframe
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40]
})
# Define a function to apply that takes multiple arguments
def multiply_by_factor(x, factor):
return x * factor
# Using apply function with multiple arguments
df['A'] = df['A'].apply(multiply_by_factor, factor=2)
And of course, you can also use lambda functions:
# Creating a simple dataframe
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40]
})
# Using apply with a lambda function
df['C'] = df.apply(lambda row: row['A'] * row['B'], axis=1)
#Lambda functions can also take additional parameters. Here's how to do it:
# Creating a simple dataframe
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40]
})
# Using apply with a lambda function and an additional parameter
df['C'] = df.apply(lambda row, factor: row['A'] * row['B'] * factor, axis=1, factor=2)
df