PPOL 5203 Data Science I: Foundations

Miscellaneous in Pandas

Tiago Ventura


In this Notebook we cover some miscellaneous data wrangling techniques:

  • Piping operations in Pandas
  • Dealing with missing values / imputation
  • Apply and Map methods in Pandas

Setup

In [1]:
import pandas as pd
import numpy as np

Piping

If you started your data science career in R, or are learning R at the same time, it is likely you were exposed to the tidyverse world and its famous pipe. Here it is for you:

Pipe operators, introduced by the magrittr package in R, transformed the way code is written in R, and for good reasons. In the past few years, piping has also become more prevalent in Python.

Pipe operators allow you to:

  • Chain together data manipulations in a single operational sequence.

In pandas, piping allows you to chain together a sequence of methods in a single workflow. The .pipe() method was introduced relatively recently (version 0.16.2) and lets you slot arbitrary dataframe-in, dataframe-out functions into such a chain; the examples below mostly use plain method chaining.

Let's see some examples using the World Cup data. We want to count which countries played the most World Cup games.

In [2]:
# read world cup data
wc = pd.read_csv("WorldCupMatches.csv")

Method 1: sequentially overwrite the object

In [3]:
wc = pd.read_csv("WorldCupMatches.csv")

# select columns
wc = wc.filter(['Year','Home Team Name','Away Team Name'])

# make it tidy
wc = wc.melt(id_vars=["Year"], var_name="var", value_name="country")

# group by
wc = wc.groupby("country")

# count occurrences
wc = wc.size()

# move index to column with new name
wc = wc.reset_index(name="n")

# sort
wc = wc.sort_values(by="n", ascending=False)

# print 10
wc.head(10)
Out[3]:
country n
7 Brazil 108
39 Italy 83
2 Argentina 81
25 England 62
29 Germany FR 62
26 France 61
66 Spain 59
45 Mexico 54
47 Netherlands 54
74 Uruguay 52

Method 2: Pandas pipe (method chaining)

In [4]:
wc = pd.read_csv("WorldCupMatches.csv")

# chain all the steps in a single workflow
wc_10 = (wc
         .filter(['Year', 'Home Team Name', 'Away Team Name'])
         .melt(id_vars=["Year"], var_name="var", value_name="country")
         .groupby("country")
         .size()
         .reset_index(name="n")
         .sort_values(by="n", ascending=False)
        )

# print 10
wc_10.head(10)
Out[4]:
country n
7 Brazil 108
39 Italy 83
2 Argentina 81
25 England 62
29 Germany FR 62
26 France 61
66 Spain 59
45 Mexico 54
47 Netherlands 54
74 Uruguay 52

Notice that writing the whole chain on a single line would also work.

But it is not nice code to read!

In [5]:
wc.filter(['Year','Home Team Name','Away Team Name']).melt(id_vars=["Year"], var_name="var", value_name="country").groupby("country").size().reset_index(name="n").sort_values(by="n", ascending=False).head(10)
Out[5]:
country n
7 Brazil 108
39 Italy 83
2 Argentina 81
25 England 62
29 Germany FR 62
26 France 61
66 Spain 59
45 Mexico 54
47 Netherlands 54
74 Uruguay 52

Final notes on Piping

To understand pipes, you should always remember: **data in, data out**. That's what pipes do: apply methods sequentially to `pandas dataframes`.
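
For completeness, here is a minimal sketch of .pipe() itself, since the examples above rely on plain method chaining. The helper functions count_country_games and top_n are made up for illustration; any function that takes a dataframe and returns a dataframe can be slotted into a chain this way.

In [ ]:
# user-defined helpers: dataframe in, dataframe out
def count_country_games(df):
    return (df
            .filter(['Year', 'Home Team Name', 'Away Team Name'])
            .melt(id_vars=["Year"], var_name="var", value_name="country")
            .groupby("country")
            .size()
            .reset_index(name="n"))

def top_n(df, n=10):
    return df.sort_values(by="n", ascending=False).head(n)

# .pipe() slots these functions into a method chain;
# extra arguments (like n) are passed through
wc.pipe(count_country_games).pipe(top_n, n=10)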

It should be clear by now, but these are some of the advantages of piping your code:

  • Improves code structure
  • Eliminates intermediate variables, which also saves memory
  • Can improve code readability
  • Makes it easier to add steps anywhere in the sequence of operations

Keep in mind that pipes also have some disadvantages. In my view, especially when working in notebooks where you do not execute code line by line, pipes can make code a bit harder to debug, and they can hide errors that would be more easily spotted by examining intermediate variables.

Missing Values in Pandas

Read more about missing data in the Python Data Science Handbook

Real datasets are messy.

Missingness (such as non-response in surveys, lack of information for a particular year, or wrong inputs in a database) is a frequent visitor in our day-to-day work with datasets.

Let's see a few approaches to working with missing data in pandas.

First, we should learn how pandas records missing data. Pandas uses sentinels (special marker values) to handle missing data; more specifically, it reuses two pre-existing Python null values to identify missings:

  • the None object
  • the floating-point NaN value (np.nan)
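
As a minimal sketch of how these two sentinels behave: inside a numeric Series, None is automatically upcast to NaN, and both are detected by pd.isna().

In [ ]:
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None])             # None is upcast to NaN
print(s.dtype)                          # float64
print(pd.isna(None), pd.isna(np.nan))   # True True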

Let's see how to work with these values

Handling Missing Data

In [6]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': ["id_1929", "id_2982", "id_2902"],
                  'age': [30, 5, 30],
                  'gender': ["M", "F", None],
                  'politics': [np.nan, "strongly agree", "agree"],
                  'super_sensitive_item': [np.nan,np.nan,"prefer not to answer"],
                  'social_media': ["Twitter", "Facebook",""]})
df
Out[6]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M NaN NaN Twitter
1 id_2982 5 F strongly agree NaN Facebook
2 id_2902 30 None agree prefer not to answer

pd.isnull(): detect nulls

In [7]:
# check nulls for all columns
df.isnull()
Out[7]:
id age gender politics super_sensitive_item social_media
0 False False False True True False
1 False False False False True False
2 False False True False False False
In [8]:
# lets see overall
df.isnull().sum()
Out[8]:
id                      0
age                     0
gender                  1
politics                1
super_sensitive_item    2
social_media            0
dtype: int64

Dealing with incompleteness

There are three main approaches when we need to deal with missing values:

  1. list-wise deletion: ignore the missing values and drop them.

  2. recoding values: recode certain answers as missing.

  3. imputation: guess a plausible value that the data could take on (your homework will focus on this!).

.dropna(): List-wise deletion

If missings are just noise, and do not matter for your analysis, you could choose to just drop them:

In [9]:
# To drop row if any NaN values are present:
df.dropna(axis=0)
Out[9]:
id age gender politics super_sensitive_item social_media
In [10]:
# To drop column if any NaN values are present:
df.dropna(axis=1)
Out[10]:
id age social_media
0 id_1929 30 Twitter
1 id_2982 5 Facebook
2 id_2902 30
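
Dropping on any NaN is often too aggressive. Two useful .dropna() parameters are subset (only consider certain columns) and thresh (require a minimum number of non-missing values); a short sketch:

In [ ]:
# drop rows only when 'gender' is missing
df.dropna(subset=["gender"])

# keep only columns with at least 2 non-missing values
df.dropna(axis=1, thresh=2)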

Recode Missings

Sometimes we have values we consider to be missing, such as empty responses or "prefer not to answer" type responses. Let's see a few different ways to recode them.

In [11]:
# with assign method
(df.assign(social_media=np.where(df.social_media=="", np.nan, df.social_media)))
Out[11]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M NaN NaN Twitter
1 id_2982 5 F strongly agree NaN Facebook
2 id_2902 30 None agree prefer not to answer NaN
In [12]:
# with a simple replace
df.replace("", np.nan)
Out[12]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M NaN NaN Twitter
1 id_2982 5 F strongly agree NaN Facebook
2 id_2902 30 None agree prefer not to answer NaN
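
.replace() also accepts a list, so several sentinel strings can be recoded at once. A short sketch (treating "prefer not to answer" as missing is a substantive choice, not a rule):

In [ ]:
# recode both empty strings and refusals as missing
df.replace(["", "prefer not to answer"], np.nan)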

.fillna(): imputation

Rather than dropping missing values, we can fill them with a placeholder.

In [36]:
# set all nas to zero
df.fillna(0) 
Out[36]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M 0 0 Twitter
1 id_2982 5 F strongly agree 0 Facebook
2 id_2902 30 0 agree prefer not to answer
In [37]:
# or fill with a text placeholder
df.fillna("Missing")
Out[37]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M Missing Missing Twitter
1 id_2982 5 F strongly agree Missing Facebook
2 id_2902 30 Missing agree prefer not to answer
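
.fillna() also accepts a dictionary mapping column names to fill values, which lets you use a different placeholder per column. A short sketch (the placeholder labels are made up for illustration):

In [ ]:
# column-specific placeholders
df.fillna({"politics": "no answer", "super_sensitive_item": "not asked"})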

forward-fill: forward propagation of previous values into current values.

In [38]:
df.ffill() 
Out[38]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M NaN NaN Twitter
1 id_2982 5 F strongly agree NaN Facebook
2 id_2902 30 F agree prefer not to answer

back-fill: backward propagation of future values into current values.

In [39]:
df.bfill()
Out[39]:
id age gender politics super_sensitive_item social_media
0 id_1929 30 M strongly agree prefer not to answer Twitter
1 id_2982 5 F strongly agree prefer not to answer Facebook
2 id_2902 30 None agree prefer not to answer

Forward and back fill make little sense unless we are explicitly dealing with a time series that tracks the same units over time (e.g., countries).

Note that this _barely scratches the surface of imputation techniques_. But we'll always want to think carefully about what it means to manufacture data when data doesn't exist.

Most often, imputation will come from statistical assumptions about the data generation process of your variable of interest. For example: can we impute partisanship for American voters if we know their gender, age, race, and where they vote? Probably yes...
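
As a minimal sketch of this idea, here is a conditional-mean imputation: fill a missing age with the mean age of respondents of the same gender. The toy data, and the assumption that age is missing at random within gender groups, are mine for illustration.

In [ ]:
# hypothetical survey data with one missing age
survey = pd.DataFrame({'gender': ["M", "F", "F", "M"],
                       'age': [30, 25, np.nan, 40]})

# fill each missing age with its gender-group mean
survey['age'] = survey['age'].fillna(
    survey.groupby('gender')['age'].transform('mean'))
survey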

Apply, Map and Applymap

Quite often as we do data analysis, we need ways to apply flexible user-defined functions to entries (rows, columns, or cells) of our data frames. Writing a for-loop to iterate through a Pandas DataFrame or Series will often work for these tasks, but we know loops can be inefficient and hard to read.

Pandas offers a set of built-in methods to apply user-defined functions to dataframes. These are similar to the apply family of functions in base R, or the map functions in the tidyverse. This is additional material: you will see that some of these tasks can also be done with .transform() and .agg(). But as you start consuming Python code from other programmers, it is useful to know more about .apply(), .map() and .applymap() in Pandas.

.apply()

The apply() function is used when you want to apply a function along an axis of a DataFrame (either rows or columns). This function can be either a built-in function or a user-defined function.

In [16]:
import pandas as pd
# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'C': [100, 200, 300, 400]
})

# Define a function to apply
def my_func(x):
    return x*2

# Using apply function
df.apply(my_func)
Out[16]:
A B C
0 2 20 200
1 4 40 400
2 6 60 600
3 8 80 800

In this case, the apply() function will take each column (or row if axis=1 is specified) and apply the function my_func to it. The result will be a DataFrame with all elements doubled.

map()

The map() function is used to substitute each value in a Series with another value. It's typically used for mapping categorical data to numerical data, for example, when preparing data for a machine learning algorithm. It can take a function, a dictionary, or a Series.

In [15]:
#Here's an example usage of map():

import pandas as pd

# Creating a simple series
s = pd.Series(['cat', 'dog', 'cow', 'cat', 'dog', 'cow'])

# Creating a mapping dictionary
animals = {'cat': 1, 'dog': 2, 'cow': 3}

# Using map function
s.map(animals)
Out[15]:
0    1
1    2
2    3
3    1
4    2
5    3
dtype: int64
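
map() also accepts a function instead of a dictionary, and the na_action='ignore' argument skips missing values rather than passing them to the function. A short sketch:

In [ ]:
s = pd.Series(['cat', 'dog', None])

# apply a function element-wise, leaving missing values as NaN
s.map(str.upper, na_action='ignore')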

applymap()

The applymap() function is used to apply a function to every single element in a DataFrame. It's similar to apply(), but it works element-wise. In the example below, applymap() applies the function my_func to every single element in the DataFrame, and the result is a DataFrame with all elements doubled.

In [14]:
# Here's an example usage of applymap():

import pandas as pd

# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'C': [100, 200, 300, 400]
})

# Define a function to apply
def my_func(x):
    return x*2

# Using applymap function
df.applymap(my_func)
Out[14]:
A B C
0 2 20 200
1 4 40 400
2 6 60 600
3 8 80 800
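
Note that in recent versions of pandas (2.1+), applymap() has been deprecated and renamed to DataFrame.map(), which behaves the same way. A sketch of the newer spelling:

In [ ]:
# element-wise application in pandas 2.1+
df.map(my_func)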

Summary

  • apply() is used to apply a function along an axis of the DataFrame or on values of Series.
  • map() is used to substitute each value in a Series with another value.
  • applymap() is used to apply a function to every single element in the DataFrame.

Multiple columns

apply() allows you to use multiple columns, which is useful when you need to perform operations that require values from several columns. When you set axis=1, the function you're applying will receive each row (instead of each column when axis=0).

In [13]:
# Here is an example where the goal is to compute a new column as the product of two existing columns:
import pandas as pd

# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Define a function to apply
def multiply_cols(row):
    return row['A'] * row['B']

# Using apply function
df['C'] = df.apply(multiply_cols, axis=1)
df
Out[13]:
A B C
0 1 10 10
1 2 20 40
2 3 30 90
3 4 40 160
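
Row-wise apply() is flexible but slow, since it calls the Python function once per row. When the operation is simple arithmetic, a vectorized expression gives the same result much faster:

In [ ]:
# vectorized equivalent of the row-wise apply above
df['C'] = df['A'] * df['B']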

You can also use .apply() with multiple arguments

In [18]:
# If you have a function that takes multiple arguments, you can apply it
# to a DataFrame or Series and specify the extra arguments in the apply() call.
# Here's an example:

# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Define a function to apply that takes multiple arguments
def multiply_by_factor(x, factor):
    return x * factor

# Using apply function with multiple arguments
df['A'] = df['A'].apply(multiply_by_factor, factor=2)
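
Extra positional arguments can also be passed through the args tuple, which is equivalent to the keyword form above:

In [ ]:
# equivalent call using positional arguments
df['A'] = df['A'].apply(multiply_by_factor, args=(2,))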

And of course, you can also use lambda functions:

In [20]:
# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Using apply with a lambda function
df['C'] = df.apply(lambda row: row['A'] * row['B'], axis=1)


# Lambda functions can also take additional parameters. Here's how to do it:


# Creating a simple dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Using apply with a lambda function and an additional parameter
df['C'] = df.apply(lambda row, factor: row['A'] * row['B'] * factor, axis=1, factor=2)
In [21]:
df
Out[21]:
A B C
0 1 10 20
1 2 20 80
2 3 30 180
3 4 40 320

See this Medium piece to read more about these methods.
