PPOL 5203 Data Science I: Foundations

Miscellaneous in Pandas

Tiago Ventura


In this Notebook we cover some miscellaneous data wrangling techniques:

  • Piping operations in Pandas
  • Dealing with missing values / imputation
  • Apply and Map methods in Pandas

Setup

In [1]:
import pandas as pd
import numpy as np

Piping

If you started your data science career in R, or are learning R at the same time, it is likely you were exposed to the tidyverse world and its famous pipe. Here it is for you:

Pipe operators, introduced by the magrittr package in R, transformed the way code is written in R, and for good reasons. In the past few years, piping has also become more prevalent in Python.

Pipe operators allow you to:

  • Chain together data manipulations in a single operational sequence.

In pandas, piping allows you to chain together a sequence of methods in a single workflow. The .pipe() method was introduced relatively recently (version 0.16.2) and lets you slot arbitrary dataframe-in, dataframe-out functions into such a chain; the examples below mostly use plain method chaining.

Let's see some examples using the World Cup data. We want to count which countries played the most World Cup games.

In [2]:
# read world cup data
wc = pd.read_csv("WorldCupMatches.csv")

Method 1: sequentially overwrite the object

In [3]:
wc = pd.read_csv("WorldCupMatches.csv")

# select columns
wc = wc.filter(['Year','Home Team Name','Away Team Name'])

# make it tidy
wc = wc.melt(id_vars=["Year"], var_name="var", value_name="country")

# group by
wc = wc.groupby("country")

# count occurrences
wc = wc.size()

# move index to column with new name
wc = wc.reset_index(name="n")

# sort
wc = wc.sort_values(by="n", ascending=False)

# print 10
wc.head(10)
Out[3]:
country n
7 Brazil 108
39 Italy 83
2 Argentina 81
25 England 62
29 Germany FR 62
26 France 61
66 Spain 59
45 Mexico 54
47 Netherlands 54
74 Uruguay 52

Method 2: Pandas pipe (method chaining)

In [4]:
wc = pd.read_csv("WorldCupMatches.csv")

# chain all the steps in a single workflow
wc_10 = (wc
         .filter(['Year', 'Home Team Name', 'Away Team Name'])
         .melt(id_vars=["Year"], var_name="var", value_name="country")
         .groupby("country")
         .size()
         .reset_index(name="n")
         .sort_values(by="n", ascending=False)
        )

# print 10
wc_10.head(10)
Out[4]:
country n
7 Brazil 108
39 Italy 83
2 Argentina 81
25 England 62
29 Germany FR 62
26 France 61
66 Spain 59
45 Mexico 54
47 Netherlands 54
74 Uruguay 52

Notice that writing the whole chain on a single line would also work.

But it is not nice code to read!

In [5]:
wc.filter(['Year','Home Team Name','Away Team Name']).melt(id_vars=["Year"], var_name="var", value_name="country").groupby("country").size().reset_index(name="n").sort_values(by="n", ascending=False).head(10)
Out[5]:
country n
7 Brazil 108
39 Italy 83
2 Argentina 81
25 England 62
29 Germany FR 62
26 France 61
66 Spain 59
45 Mexico 54
47 Netherlands 54
74 Uruguay 52

Final notes on Piping

To understand pipes, you should always remember: **data in, data out**. That's what pipes do: apply methods sequentially to `pandas dataframes`.
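
For completeness, here is a minimal sketch of .pipe() itself, since the examples above rely on plain method chaining. The helper functions count_country_games and top_n are made up for illustration; any function that takes a dataframe and returns a dataframe can be slotted into a chain this way.

In [ ]:
# user-defined helpers: dataframe in, dataframe out
def count_country_games(df):
    return (df
            .filter(['Year', 'Home Team Name', 'Away Team Name'])
            .melt(id_vars=["Year"], var_name="var", value_name="country")
            .groupby("country")
            .size()
            .reset_index(name="n"))

def top_n(df, n=10):
    return df.sort_values(by="n", ascending=False).head(n)

# .pipe() slots these functions into a method chain;
# extra arguments (like n) are passed through
wc.pipe(count_country_games).pipe(top_n, n=10)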

It should be clear by now, but these are some of the advantages of piping your code:

  • Improves code structure
  • Eliminates intermediate variables, which also saves memory
  • Can improve code readability
  • Makes it easier to add steps anywhere in the sequence of operations

Keep in mind that pipes also have some disadvantages. In my view, especially when working in notebooks where you do not execute code line by line, pipes can make code a bit harder to debug, and they can hide errors that would be more easily spotted by examining intermediate variables.

Missing Values in Pandas

Read more about missing data in the Python Data Science Handbook

Real datasets are messy.

Missingness (such as non-response in surveys, lack of information for a particular year, or wrong inputs in a database) is a frequent visitor in our day-to-day work with datasets.

Let's see a few approaches to working with missing data in pandas.

First, we should learn how pandas records missing data. Pandas uses sentinels (special marker values) to handle missing data; more specifically, it reuses two pre-existing Python null values to identify missings:

  • the None object
  • the floating-point NaN value (np.nan)
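
As a minimal sketch of how these two sentinels behave: inside a numeric Series, None is automatically upcast to NaN, and both are detected by pd.isna().

In [ ]:
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None])             # None is upcast to NaN
print(s.dtype)                          # float64
print(pd.isna(None), pd.isna(np.nan))   # True True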

Let's see how to work with these values

Handling Missing Data

In [6]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': ["id_1929", "id_2982", "id_2902"],
                  'age': [30, 5, 30],
                  'gender': ["M", "F", None],
                  'politics': [np.nan, "strongly agree", "agree"],
                  'super_sensitive_item': [np.nan,np.nan,"prefer not to answer"],
                  'social_media': ["Twitter", "Facebook",""]})
df
Out[6]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M NaN NaN Twitter
1 id_2982 5 F strongly agree NaN Facebook
2 id_2902 30 None agree prefer not to answer

pd.isnull(): detect nulls

In [7]:
# check nulls for all columns
df.isnull()
Out[7]:
id age gender politics super_sensitive_item social_media
0 False False False True True False
1 False False False False True False
2 False False True False False False
In [8]:
# lets see overall
df.isnull().sum()
Out[8]:
id                      0
age                     0
gender                  1
politics                1
super_sensitive_item    2
social_media            0
dtype: int64

Dealing with incompleteness

There are three main approaches when we need to deal with missing values:

  1. list-wise deletion: ignore the missing values and drop them.

  2. recoding values: recode certain answers as missing.

  3. imputation: guess a plausible value that the data could take on (your homework will focus on this!).

.dropna(): List-wise deletion

If missings are just noise, and do not matter for your analysis, you could choose to just drop them:

In [9]:
# To drop row if any NaN values are present:
df.dropna(axis=0)
Out[9]:
id age gender politics super_sensitive_item social_media
In [10]:
# To drop column if any NaN values are present:
df.dropna(axis=1)
Out[10]:
id age social_media
0 id_1929 30 Twitter
1 id_2982 5 Facebook
2 id_2902 30
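
Dropping on any NaN is often too aggressive. Two useful .dropna() parameters are subset (only consider certain columns) and thresh (require a minimum number of non-missing values); a short sketch:

In [ ]:
# drop rows only when 'gender' is missing
df.dropna(subset=["gender"])

# keep only columns with at least 2 non-missing values
df.dropna(axis=1, thresh=2)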

Recode Missings

Sometimes we have values we consider to be missing, such as empty responses or "prefer not to answer" type responses. Let's see a few different ways to recode them.

In [11]:
# with assign method
(df.assign(social_media=np.where(df.social_media=="", np.nan, df.social_media)))
Out[11]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M NaN NaN Twitter
1 id_2982 5 F strongly agree NaN Facebook
2 id_2902 30 None agree prefer not to answer NaN
In [12]:
# with a simple replace
df.replace("", np.nan)
Out[12]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M NaN NaN Twitter
1 id_2982 5 F strongly agree NaN Facebook
2 id_2902 30 None agree prefer not to answer NaN
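
.replace() also accepts a list, so several sentinel strings can be recoded at once. A short sketch (treating "prefer not to answer" as missing is a substantive choice, not a rule):

In [ ]:
# recode both empty strings and refusals as missing
df.replace(["", "prefer not to answer"], np.nan)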

.fillna(): imputation

Rather than dropping missing values, we can fill them with a placeholder.

In [36]:
# set all nas to zero
df.fillna(0) 
Out[36]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M 0 0 Twitter
1 id_2982 5 F strongly agree 0 Facebook
2 id_2902 30 0 agree prefer not to answer
In [37]:
# or fill with a text placeholder
df.fillna("Missing")
Out[37]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M Missing Missing Twitter
1 id_2982 5 F strongly agree Missing Facebook
2 id_2902 30 Missing agree prefer not to answer
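
.fillna() also accepts a dictionary mapping column names to fill values, which lets you use a different placeholder per column. A short sketch (the placeholder labels are made up for illustration):

In [ ]:
# column-specific placeholders
df.fillna({"politics": "no answer", "super_sensitive_item": "not asked"})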

forward-fill: forward propagation of previous values into current values.

In [38]:
df.ffill() 
Out[38]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M NaN NaN Twitter
1 id_2982 5 F strongly agree NaN Facebook
2 id_2902 30 F agree prefer not to answer

back-fill: backward propagation of future values into current values.

In [39]:
df.bfill()
Out[39]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M strongly agree prefer not to answer Twitter
1 id_2982 5 F strongly agree prefer not to answer Facebook
2 id_2902 30 None agree prefer not to answer

Forward and back fill make little sense unless we are explicitly dealing with a time series that tracks the same units over time (e.g., countries).

Note that this _barely scratches the surface of imputation techniques_. But we'll always want to think carefully about what it means to manufacture data when data doesn't exist.

Most often, imputation will come from statistical assumptions about the data generation process of your variable of interest. For example: can we impute partisanship for American voters if we know their gender, age, race, and where they vote? Probably yes...
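
As a minimal sketch of this idea, here is a conditional-mean imputation: fill a missing age with the mean age of respondents of the same gender. The toy data, and the assumption that age is missing at random within gender groups, are mine for illustration.

In [ ]:
# hypothetical survey data with one missing age
survey = pd.DataFrame({'gender': ["M", "F", "F", "M"],
                       'age': [30, 25, np.nan, 40]})

# fill each missing age with its gender-group mean
survey['age'] = survey['age'].fillna(
    survey.groupby('gender')['age'].transform('mean'))
survey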

Apply, Map and Applymap

Quite often as we do data analysis, we need ways to apply flexible user-defined functions to entries (rows, columns, or cells) of our data frames. Writing a for-loop to iterate through a Pandas DataFrame or Series will often work for these tasks, but we know loops can be inefficient and hard to read.

Pandas offers a set of built-in methods to apply user-defined functions to dataframes. These are similar to the apply family of functions in base R, or the map functions in the tidyverse. This is additional material: you will see that some of these tasks can also be done with .transform() and .agg(). But as you start consuming Python code from other programmers, it is useful to know more about .apply(), .map() and .applymap() in Pandas.

.apply()

The apply() function is used when you want to apply a function along an axis of a DataFrame (either rows or columns). This function can be either a built-in function or a user-defined function.

In [16]:
import pandas as pd
# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'C': [100, 200, 300, 400]
})

# Define a function to apply
def my_func(x):
    return x*2

# Using apply function
df.apply(my_func)
Out[16]:
A B C
0 2 20 200
1 4 40 400
2 6 60 600
3 8 80 800

In this case, the apply() function will take each column (or row if axis=1 is specified) and apply the function my_func to it. The result will be a DataFrame with all elements doubled.

map()

The map() function is used to substitute each value in a Series with another value. It's typically used for mapping categorical data to numerical data, for example, when preparing data for a machine learning algorithm. It can take a function, a dictionary, or a Series.

In [15]:
#Here's an example usage of map():

import pandas as pd

# Creating a simple series
s = pd.Series(['cat', 'dog', 'cow', 'cat', 'dog', 'cow'])

# Creating a mapping dictionary
animals = {'cat': 1, 'dog': 2, 'cow': 3}

# Using map function
s.map(animals)
Out[15]:
0    1
1    2
2    3
3    1
4    2
5    3
dtype: int64
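
map() also accepts a function instead of a dictionary, and the na_action='ignore' argument skips missing values rather than passing them to the function. A short sketch:

In [ ]:
s = pd.Series(['cat', 'dog', None])

# apply a function element-wise, leaving missing values as NaN
s.map(str.upper, na_action='ignore')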

applymap()

The applymap() function is used to apply a function to every single element in a DataFrame. It's similar to apply(), but it works element-wise. In the example below, applymap() applies the function my_func to every single element in the DataFrame, and the result is a DataFrame with all elements doubled.

In [14]:
# Here's an example usage of applymap():

import pandas as pd

# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'C': [100, 200, 300, 400]
})

# Define a function to apply
def my_func(x):
    return x*2

# Using applymap function
df.applymap(my_func)
Out[14]:
A B C
0 2 20 200
1 4 40 400
2 6 60 600
3 8 80 800
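
Note that in recent versions of pandas (2.1+), applymap() has been deprecated and renamed to DataFrame.map(), which behaves the same way. A sketch of the newer spelling:

In [ ]:
# element-wise application in pandas 2.1+
df.map(my_func)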

Summary

  • apply() is used to apply a function along an axis of the DataFrame or on values of Series.
  • map() is used to substitute each value in a Series with another value.
  • applymap() is used to apply a function to every single element in the DataFrame.

Multiple columns

apply() allows you to use multiple columns, which is useful when you need to perform operations that require values from several columns. When you set axis=1, the function you're applying will receive each row (instead of each column when axis=0).

In [13]:
# Here is an example where the goal is to compute a new column as the product of two existing columns:
import pandas as pd

# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Define a function to apply
def multiply_cols(row):
    return row['A'] * row['B']

# Using apply function
df['C'] = df.apply(multiply_cols, axis=1)
df
Out[13]:
A B C
0 1 10 10
1 2 20 40
2 3 30 90
3 4 40 160
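
Row-wise apply() is flexible but slow, since it calls the Python function once per row. When the operation is simple arithmetic, a vectorized expression gives the same result much faster:

In [ ]:
# vectorized equivalent of the row-wise apply above
df['C'] = df['A'] * df['B']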

You can also use .apply() with multiple arguments

In [18]:
# If you have a function that takes multiple arguments, you can apply it
# to a DataFrame or Series and specify the extra arguments in the apply() call.
# Here's an example:

# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Define a function to apply that takes multiple arguments
def multiply_by_factor(x, factor):
    return x * factor

# Using apply function with multiple arguments
df['A'] = df['A'].apply(multiply_by_factor, factor=2)
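
Extra positional arguments can also be passed through the args tuple, which is equivalent to the keyword form above:

In [ ]:
# equivalent call using positional arguments
df['A'] = df['A'].apply(multiply_by_factor, args=(2,))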

And of course, you can also use lambda functions:

In [20]:
# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Using apply with a lambda function
df['C'] = df.apply(lambda row: row['A'] * row['B'], axis=1)


# Lambda functions can also take additional parameters. Here's how to do it:


# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Using apply with a lambda function and an additional parameter
df['C'] = df.apply(lambda row, factor: row['A'] * row['B'] * factor, axis=1, factor=2)
In [21]:
df
Out[21]:
A B C
0 1 10 20
1 2 20 80
2 3 30 180
3 4 40 320

See this Medium piece to read more about these methods.
