PPOL 5203 - Data Science I: Foundations

Week 6: Data Wrangling in Pandas

Author

Professor: Tiago Ventura

Published

October 14, 2024

Where we are…

We started with the basics of being a data scientist
We moved over to the primitives of Python as your main DS tool
Then we started our journey working with tabular data:
- Numpy for matrices
- Pandas for heterogeneous dataframes ~ introductions, slicing, constructors

Data Wrangling in Pandas
- Loading and Writing Data
- Data Processing: rows, columns, and grouped operations
- Tidying and Joining (Next Week + Visualization)
- Miscellaneous (Work by yourselves in the notebook)
- More in-class exercises

Format Type	Data Description	Reader	Writer
text	CSV	`read_csv`	`to_csv`
text	JSON	`read_json`	`to_json`
text	HTML	`read_html`	`to_html`
text	Local clipboard	`read_clipboard`	`to_clipboard`
binary	MS Excel	`read_excel`	`to_excel`
binary	HDF5 Format	`read_hdf`	`to_hdf`
binary	Feather Format	`read_feather`	`to_feather`
binary	Parquet Format	`read_parquet`	`to_parquet`
binary	Msgpack (removed in pandas 1.0)	`read_msgpack`	`to_msgpack`
binary	Stata	`read_stata`	`to_stata`
binary	SAS	`read_sas`
binary	Python Pickle Format	`read_pickle`	`to_pickle`
SQL	SQL	`read_sql`	`to_sql`
SQL	Google Big Query	`read_gbq`	`to_gbq`

Read more about all the input/output methods here.
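As a minimal sketch of the reader/writer pairing in the table, the example below round-trips a small data frame through CSV and JSON. The data frame and its column names (`unit`, `score`) are made up for illustration, and an in-memory buffer stands in for a file on disk so the snippet runs anywhere.

```python
import io

import pandas as pd

# Hypothetical toy data, invented only for this illustration.
df = pd.DataFrame({
    "unit": ["a", "b", "c"],
    "score": [1.5, 2.0, 3.25],
})

# Writer: with no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)

# Reader: read_csv accepts a file path or any file-like object.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same writer/reader pattern applies to JSON.
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_csv)
print(df_from_json)
```

In practice you would usually pass a file path (for example `df.to_csv("my_file.csv", index=False)`, where the file name is a placeholder) instead of a buffer.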

`pandas`	`dplyr`\(^\dagger\)	Description
`.filter()`	`select()`	select column variables/index
`.drop()`	`select()`	drop selected column variables/index
`.rename()`	`rename()`	rename column variables/index
`.query()`	`filter()`	row-wise subset of a data frame by the values of a column variable/index
`.assign()`	`mutate()`	Create a new variable on the existing data frame
`.sort_values()`	`arrange()`	Arrange all data values along a specified (set of) column variable(s)/indices
`.groupby()`	`group_by()`	Index data frame by specific (set of) column variable(s)/index value(s)
`.agg()`	`summarize()`	aggregate data by specific function rules
`.pivot_table()`	`spread()`	cast the data from a “long” to a “wide” format
`pd.melt()`	`gather()`	cast the data from a “wide” to a “long” format
`.()`	`%>%`	piping, fluent programming, or passing one function's output to the next
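The verbs in the table can be combined into a single method chain, which is the pandas counterpart of a `%>%` pipeline. The sketch below is one possible illustration, not the only way to write it. The data frame and all column names (`region`, `year`, `cases`, `pop`, `notes`) are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented only for this illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "year": [2020, 2021, 2020, 2021, 2021],
    "cases": [10, 30, 5, 20, 40],
    "pop": [100, 100, 50, 50, 100],
    "notes": ["", "", "", "", ""],
})

# Wrapping the chain in parentheses lets each verb sit on its own line.
summary = (
    df
    .drop(columns=["notes"])                       # drop a column
    .rename(columns={"pop": "population"})         # rename a column
    .query("year == 2021")                         # keep rows by value
    .assign(rate=lambda d: d["cases"] / d["population"])  # new variable
    .groupby("region", as_index=False)             # group rows
    .agg(mean_rate=("rate", "mean"),               # summarize each group
         total_cases=("cases", "sum"))
    .sort_values("mean_rate", ascending=False)     # arrange rows
)
print(summary)

# Long to wide: one row per region, one column per year.
wide = df.pivot_table(index="region", columns="year",
                      values="cases", aggfunc="sum")

# Wide back to long: one row per region-year pair.
long = pd.melt(wide.reset_index(), id_vars="region",
               var_name="year", value_name="cases")
print(long)
```

Each step returns a new data frame, so the original `df` is left unchanged and any line of the chain can be commented out to inspect the intermediate result.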

Help me understand what is and is not working in the course by taking the survey here: https://forms.gle/7QqsojrSLV3B493i6