PPOL 5203 - Data Science I: Foundations

Week 7: Tidying, Joining and Visualizing Data

Professor: Tiago Ventura

Plans for Today

  • Grouped operations in Pandas

  • Tidying and Joining Data

  • Data Visualization: Principles and Practice

Problem Set 3 is due this week! It is a long problem set! Remember to explain your responses and answer with words!

Grouped Operations in Pandas

From Python Data Science Handbook by Jake VanderPlas
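
A grouped operation follows the split-apply-combine pattern: split the rows by a key, apply a function to each group, and combine the results. The sketch below is a minimal illustration on a made-up toy table; the column names `key` and `data` are assumptions, not taken from the course notebooks.

```python
import pandas as pd

# Hypothetical toy data: three groups, two rows each.
df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": [1, 2, 3, 4, 5, 6],
})

# Split by "key", apply sum to each group, combine into one Series.
totals = df.groupby("key")["data"].sum()

# Several summaries at once, using named aggregation.
summary = df.groupby("key").agg(
    total=("data", "sum"),
    mean=("data", "mean"),
)

print(totals)
print(summary)
```

Named aggregation keeps the output column names explicit, which tends to make the result easier to join or plot later.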

Tidy + Join

On a higher level, we will learn:

  • Concept of tidy data (Wickham, 2014):

    • TLDR: a way to standardize your datasets

    • Each variable must have its own column.

    • Each observation must have its own row.

    • Each value must have its own cell.

  • Advantages: fits well with the grammar of graphics and facilitates split-apply-combine

  • Joining Methods: Ways to connect related tables.
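
The tidy-data rules above can be sketched with `melt`, which reshapes a wide table into one row per observation. The table and its column names (`country`, `year`, `cases`) are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
# The year columns hold values of a variable (year), so it is not tidy.
wide = pd.DataFrame({
    "country": ["Brazil", "Chile"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# Tidy version: each row is one (country, year) observation,
# and each variable (country, year, cases) has its own column.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```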

Pandas Notebooks: Data Wrangling (Last week) and Join and Tidy
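
Joining related tables, as mentioned above, can be sketched with `DataFrame.merge`. The two small tables below are invented for illustration; the point is how the `how` argument decides which rows are kept.

```python
import pandas as pd

# Hypothetical tables that share the key column "country".
left = pd.DataFrame({
    "country": ["Brazil", "Chile", "Peru"],
    "cases": [10, 20, 30],
})
right = pd.DataFrame({
    "country": ["Brazil", "Chile", "Mexico"],
    "population": [200, 19, 126],
})

# Inner join: only keys present in both tables.
inner = left.merge(right, on="country", how="inner")

# Left join: every row of `left`; unmatched rows get NaN.
left_join = left.merge(right, on="country", how="left")

# Outer join: all keys from both tables; `indicator=True` adds a
# "_merge" column showing where each row came from.
outer = left.merge(right, on="country", how="outer", indicator=True)

print(outer)
```

Checking the `_merge` column after an outer join is a quick way to see which keys failed to match before settling on a join type.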

Data Visualization

TLDR

  • Visualization matters! A figure is almost always better than a table

  • You have a full semester of Data Visualization ahead of you

  • Readings for this week are very important!!

  • We will cover the basics:

    • Focus more on grammar of graphics and plotnine - a ggplot2 implementation in Python

    • Skim through native Python libraries (matplotlib and seaborn)

Quiz: What do you see?

  • How many variables (data) are mapped in this graph?

  • How are these variables (non-constant) represented in the figure?

  • What non-data (constant) information is presented in the graph?

Aesthetics

"All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as aesthetics." (Claus Wilke, Fundamentals of Data Visualization)

Aesthetics: visual mappings that connect data variables to visual attributes of graphical elements

Cartesian coordinate system: 2D

More dimensions

We often want to map more variables into the graph. We do this by using additional aesthetics.

Color Aesthetics to Distinguish

To visually represent a sequence of data points

Grammar of Graphics

Grammar: set of structural rules that dictate how words in a language can be combined to form meaningful sentences.

Grammar of Graphics: brings a similar effort to establish structural rules to data visualizations

Implementation:

  • ggplot2 in R

  • plotnine in Python

Major Components of the Grammar of Graphics

  • plotnine/ggplot2 graphs have three key steps

    • Data Step: The raw data that you want to plot.

    • Geometries step: The geometric shapes that will represent the data.

    • Aesthetics step (aes()): The aesthetics of the geometric and statistical objects, such as position, color, size, shape, and transparency.

Additional Components of the Grammar of Graphics:

  • Facets: produce subplots based on a specific variable.

  • Annotations: labels, titles, subtitles, captions.

  • Coordinates & Scales: additional functions to adjust the aesthetics you are mapping (colors, size, alpha, and the scales of the x and y coordinates)

  • Theme: Control the finer presentation details like font size, background color, grid line styles, etc.

Notebook for Data Viz - Practice!