PPOL 5203 - Data Science I: Foundations

Week 7: Tidying, Joining and Visualizing Data

Professor: Tiago Ventura

2024-10-22

Plans for Today

  • Logistics: Problem Sets and Mid-Semester Survey.

  • Tidying and Joining Data

  • Data Visualization: Principles and Practice

Problem Sets

  • Problem Set II

    • Most of you made some mistake on question 5.

    • When you are not sure how to answer a question, come to my office hours or just ask on Slack! I am happy to give more guidance on what I am expecting.

    • I am happy to show you my solutions during my office hours or the TAs' office hours.

    • You also lost points for not fully explaining your answers. Code alone is not enough, particularly on more complicated questions. Explain your findings. This is really important for Problem Set III.

  • Problem Set III is REALLY long. Please start working on your solutions!

Mid-Semester Survey: Start Doing

  • “More in class exercises”

  • “More practice during class!”

  • “I was going to say maybe more in-class exercises but it sounds like that’s where we’re heading!”

  • “Hold a lab” – I will raise this issue with Professor Bailey. Meanwhile, use the office hours.

Mid-Semester Survey: Stop Doing

  • Mostly “NAs” – Great!

  • More short breaks instead of a long break

  • Some of you prefer the biweekly classes. I will consider this for next year! Thanks for the feedback!

Mid-Semester Survey: Continue Doing

  • Comparisons between Python and R

  • “more in-class assignments - we are already started and I like that. And of course, the break in between helps too.”

  • “I find the in-class exercises and structure of the lecture notes really helpful! The lecture notes are particularly helpful because they’re so hands-on and explicit.”

  • “Increasing the frequency of problem sets”

  • “Being the best professor” - :’)

Tidying, Joining and Visualizing Data

Readings for this week are very important!!

Tidy + Join

On a higher level, we will learn:

  • Concept of tidy data (Wickham, 2014):

    • TLDR: a way to standardize your datasets

    • Each variable must have its own column.

    • Each observation must have its own row.

    • Each value must have its own cell.

  • Advantages: fits well with the grammar of graphics and facilitates split-apply-combine

  • Joining Methods: Ways to connect related tables.
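As a rough sketch of the ideas above, the snippet below reshapes a small wide table into tidy form with pandas, joins it to a related table, and then runs a split-apply-combine step. The table names (`wide`, `regions`), the column names, and all values are made up for illustration; they are not from the course notebook.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year (made-up values)
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2000": [10.0, 20.0],
    "2010": [12.0, 25.0],
})

# Tidy it: each row becomes one country-year observation,
# each variable (country, year, gdp) gets its own column
tidy = wide.melt(id_vars="country", var_name="year", value_name="gdp")
tidy["year"] = tidy["year"].astype(int)

# A second, related table keyed by country (also made up)
regions = pd.DataFrame({
    "country": ["A", "C"],
    "region": ["North", "South"],
})

# Left join: keeps every row of `tidy`; countries missing from `regions` get NaN
left = tidy.merge(regions, on="country", how="left")

# Inner join: keeps only countries present in both tables
inner = tidy.merge(regions, on="country", how="inner")

# Split-apply-combine on the tidy table: mean gdp per country
mean_gdp = tidy.groupby("country", as_index=False)["gdp"].mean()
```

Because the data are tidy, the join key and the grouping variable are ordinary columns, so `merge` and `groupby` need no extra reshaping.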

Pandas Notebook Join and Tidy

Data Visualization

TLDR

  • Visualization matters! A figure is almost always better than a table

  • You have a full semester ahead of you for Data Visualization

  • We will cover the basics:

    • Focus more on grammar of graphics and plotnine - a ggplot2 implementation in Python

    • Skim through native Python libraries (matplotlib and seaborn)

What do you see?

  • How many variables (data) are mapped in this graph?

  • How are these variables (non-constant) represented in the figure?

  • What non-data (constant) information is presented in the graph?

Aesthetics

The key aspect of data visualization is to take data points and convert them into visual elements.

“All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as aesthetics.” (Claus Wilke, Fundamentals of Data Visualization)

Cartesian coordinate system: 2D

More dimensions

We often want to map more variables into the graph. We do this by exploring new aesthetics.

Color Aesthetics to Distinguish

Color Aesthetics to Highlight

To represent visually a sequence of data points

Grammar of Graphics

Grammar: set of structural rules that dictate how words in a language can be combined to form meaningful sentences.

Grammar of Graphics: a similar effort to establish structural rules for data visualizations.

Implementation:

  • ggplot2 in R

  • plotnine in Python

Major Components of the Grammar of Graphics

  • plotnine/ggplot2 graphs have three key steps

    • Data Step: The raw data that you want to plot.

    • Geometries step: The geometric shapes that will represent the data.

    • Aesthetics <aes()> step: Aesthetics of the geometric and statistical objects, such as position, color, size, shape, and transparency

Additional Components of the Grammar of Graphics:

  • Facets: create subplots based on a specific variable

  • Annotations: labels, titles, subtitles, captions.

  • Coordinates & Scales: additional functions to adjust the aesthetics you are mapping (change colors, size, alpha, scale of x and y coordinates)

  • Theme: Control the finer presentation details like font size, background color, grid line styles, etc.

Gapminder

import pandas as pd
import numpy as np
from plotnine import * # to imitate ggplot
from gapminder import gapminder # bring data

import warnings
warnings.filterwarnings('ignore') # Ignore warnings

# create two new log variables
gapminder = gapminder.assign(
    lngdpPercap=np.log(gapminder["gdpPercap"]),
    lnpop=np.log(gapminder["pop"]),
)

ggplot(data = <DATA>)

+

geom_<representation>(aes(<aesthetics>))

# build a plotnine graph
(
    # step 1: data
    ggplot(data=gapminder)
    # step 2: geom
    + geom_point(
        # step 3: aesthetics
        aes(x="lngdpPercap", y="lifeExp")
    )
)

ggplot(data = <DATA>)

+

geom_<representation>(aes(<aesthetics>))

+

geom_<representation>(aes(<aesthetics>))

+

scale_<aesthetics>()

+

theme_<>

+

facet_<>

+

labels

Notebook for Data Viz - Practice!