Week 7: Tidying, Joining and Visualizing Data
Grouped operations in Pandas
Tidying and Joining Data
Data Visualization: Principles and Practice
From Python Data Science Handbook by Jake VanderPlas
On a higher level, we will learn:
Concept of tidy data (Wickham, 2014):
TLDR: a way to standardize your datasets
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
Advantages: fits well with the grammar of graphics + facilitates split-apply-combine
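A minimal sketch of the three tidy-data rules in pandas. The table, its column names, and its values are made up for illustration. The wide table stores the variable "year" in its column headers; `melt()` reshapes it so that each variable has its own column and each observation its own row, after which a split-apply-combine step is a one-liner.

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is hidden in the column names.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 20],
    "2020": [12, 18],
})

# Tidy it: one column per variable (country, year, cases),
# one row per observation (a country in a given year).
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)

# Split-apply-combine on the tidy table:
# split by country, apply the mean, combine into one Series.
mean_cases = tidy.groupby("country")["cases"].mean()
print(tidy)
print(mean_cases)
```

The same tidy table can be passed directly to a grammar-of-graphics plotting call, since each aesthetic maps to exactly one column.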
Joining Methods: Ways to connect related tables.
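A short sketch of how two related tables can be connected with `DataFrame.merge`. The tables and the key column `student_id` are hypothetical; the `how` argument selects which rows survive the join.

```python
import pandas as pd

# Two hypothetical related tables that share the key column "student_id".
students = pd.DataFrame({"student_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
grades = pd.DataFrame({"student_id": [1, 2, 4], "grade": [90, 85, 70]})

# Inner join: keep only keys present in both tables.
inner = students.merge(grades, on="student_id", how="inner")

# Left join: keep every student; missing grades become NaN.
left = students.merge(grades, on="student_id", how="left")

# Outer join: keep keys from either table; the indicator column "_merge"
# records which table each row came from.
outer = students.merge(grades, on="student_id", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

Checking row counts before and after a join is a cheap way to catch unexpected duplicates or dropped keys.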
Visualization matters! A figure is almost always better than a table
You have a full semester ahead of you for Data Visualization
Readings for this week are very important!!
We will cover the basics:
Focus more on the grammar of graphics and plotnine, a Python implementation modeled on ggplot2
Skim through native Python libraries (matplotlib and seaborn)
How many variables (data) are mapped in this graph?
How are these variables (non-constant) represented in the figure?
What non-data (constant) information is presented in the graph?
"All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as aesthetics." (Fundamentals of Data Visualization, Claus Wilke)
Aesthetics: visual mappings that connect data variables to visual attributes of graphical elements
Grammar: set of structural rules that dictate how words in a language can be combined to form meaningful sentences.
Grammar of Graphics: applies the same idea to data visualization, with structural rules for how graphical elements can be combined to form meaningful figures
Implementation:
ggplot2 in R
plotnine in Python
plotnine/ggplot2 graphs have three key steps
Data Step: The raw data that you want to plot.
Geometries step: The geometric shapes that will represent the data.
Aesthetics <aes()> step: Aesthetics of the geometric and statistical objects, such as position, color, size, shape, and transparency
Facets: create subplots based on a specific variable.
Annotations: labels, titles, subtitles, captions.
Coordinates & Scales: additional functions to adjust the aesthetics you are mapping (colors, size, alpha, scale of the x and y coordinates).
Theme: Control the finer presentation details like font size, background color, grid line styles, etc.
Data science I: Foundations