PPOL 5203 - Data Science I: Foundations

Week 6: Data Wrangling in Pandas

Author

Professor: Tiago Ventura

Published

October 14, 2024

Where we are…

  • We started with the basics of being a data scientist

  • We moved over to the primitives of Python as your main DS tool

  • Then we started our journey working with tabular data:

    • Numpy for matrices

    • Pandas for heterogeneous dataframes ~ introductions, slicing, constructors

Plans for Today & Next Week

  • Data Wrangling in Pandas

    • Loading and Writing Data

    • Data Processsing: row, columns, and grouped

    • Tidying and Joining (Next Week + Visualization)

    • Miscelanneous (Work by yourselves in the notebook)

  • More in class exercises

Class Website: https://tiagoventura.github.io/ppol5203/weeks/week-06.html

Loading/Writing Data in Pandas

Format Type Data Description Reader Writer
text CSV read_csv to_csv
text JSON read_json to_json
text HTML read_html to_html
text Local clipboard read_clipboard to_clipboard
binary MS Excel read_excel to_excel
binary HDF5 Format read_hdf to_hdf
binary Feather Format read_feather to_feather
binary Parquet Format read_parquet to_parquet
binary Msgpack read_msgpack to_msgpack
binary Stata read_stata to_stata
binary SAS read_sas
binary Python Pickle Format read_pickle to_pickle
SQL SQL read_sql to_sql
SQL Google Big Query read_gbq to_gbq

Read more about all the input/output methods here.

Data Wrangling

pandas dplyr\(^\dagger\) Description
.filter() select() select column variables/index
.drop() select() drop selected column variables/index
.rename() rename() rename column variables/index
.query() filter() row-wise subset of a data frame by a values of a column variable/index
.assign() mutate() Create a new variable on the existing data frame
.sort_values() arrange() Arrange all data values along a specified (set of) column variable(s)/indices
.groupby() group_by() Index data frame by specific (set of) column variable(s)/index value(s)
.agg() summarize() aggregate data by specific function rules
.pivot_table() spread() cast the data from a “long” to a “wide” format
pd.melt() gather() cast the data from a “wide” to a “long” format
.() %>% piping, fluid programming, or the passing one function output to the next

Lecture Notes

Class mid-semester survey

Help me understand what is working and not working in the course, take the survey here: https://forms.gle/7QqsojrSLV3B493i6