PPOL 5203 - Data Science I: Foundations
Week 6: Data Wrangling in Pandas
Where we are…
We started with the basics of being a data scientist
We moved over to the primitives of Python as your main DS tool
Then we started our journey working with tabular data:
Numpy for matrices
Pandas for heterogeneous dataframes ~ introductions, slicing, constructors
Plans for Today & Next Week
Data Wrangling in Pandas
Loading and Writing Data
Data Processsing: row, columns, and grouped
Tidying and Joining (Next Week + Visualization)
Miscelanneous (Work by yourselves in the notebook)
More in class exercises
Class Website: https://tiagoventura.github.io/ppol5203/weeks/week-06.html
Loading/Writing Data in Pandas
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv |
to_csv |
text | JSON | read_json |
to_json |
text | HTML | read_html |
to_html |
text | Local clipboard | read_clipboard |
to_clipboard |
binary | MS Excel | read_excel |
to_excel |
binary | HDF5 Format | read_hdf |
to_hdf |
binary | Feather Format | read_feather |
to_feather |
binary | Parquet Format | read_parquet |
to_parquet |
binary | Msgpack | read_msgpack |
to_msgpack |
binary | Stata | read_stata |
to_stata |
binary | SAS | read_sas |
|
binary | Python Pickle Format | read_pickle |
to_pickle |
SQL | SQL | read_sql |
to_sql |
SQL | Google Big Query | read_gbq |
to_gbq |
Read more about all the input/output methods here.
Data Wrangling
pandas |
dplyr \(^\dagger\) |
Description |
---|---|---|
.filter() |
select() |
select column variables/index |
.drop() |
select() |
drop selected column variables/index |
.rename() |
rename() |
rename column variables/index |
.query() |
filter() |
row-wise subset of a data frame by a values of a column variable/index |
.assign() |
mutate() |
Create a new variable on the existing data frame |
.sort_values() |
arrange() |
Arrange all data values along a specified (set of) column variable(s)/indices |
.groupby() |
group_by() |
Index data frame by specific (set of) column variable(s)/index value(s) |
.agg() |
summarize() |
aggregate data by specific function rules |
.pivot_table() |
spread() |
cast the data from a “long” to a “wide” format |
pd.melt() |
gather() |
cast the data from a “wide” to a “long” format |
.() |
%>% |
piping, fluid programming, or the passing one function output to the next |
Lecture Notes
Class mid-semester survey
Help me understand what is working and not working in the course, take the survey here: https://forms.gle/7QqsojrSLV3B493i6