Week 6: Data Wrangling in Pandas
We started with the basics of being a data scientist
We moved over to the primitives of Python as your main DS tool
Then we started our journey working with tabular data:
NumPy for matrices
Pandas for heterogeneous dataframes ~ data structures, slicing, constructors.
Data Wrangling in Pandas
Loading and Writing Data
Data Processing: rows, columns, and groups
Tidying and Joining (Next Week + Visualization)
Miscellaneous (Work by yourselves in the notebook)
More in class exercises
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Parquet Format | read_parquet | to_parquet |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google Big Query | read_gbq | to_gbq |
Read more about all the input/output methods here.
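A minimal sketch of the reader/writer pairing from the table, assuming a small made-up data frame (the column names and values are hypothetical). It writes to in-memory text instead of a file so it runs anywhere; in practice you would pass a file path to the same functions.

```python
import io

import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({"name": ["Ana", "Ben", "Cy"], "score": [90, 85, 77]})

# Writer: with no path given, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)

# Reader: read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same pattern with JSON, one record per row.
json_text = df.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_csv)
print(df_json)
```

Note the `index=False` in `to_csv`: without it, the row index is written as an extra unnamed column and comes back as data when the file is read again.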
pandas | dplyr \(^\dagger\) | Description |
---|---|---|
.filter() | select() | select column variables/index |
.drop() | select() | drop selected column variables/index |
.rename() | rename() | rename column variables/index |
.query() | filter() | row-wise subset of a data frame by the values of a column variable/index |
.assign() | mutate() | create a new variable on the existing data frame |
.sort_values() | arrange() | arrange all data values along a specified (set of) column variable(s)/indices |
.groupby() | group_by() | index data frame by specific (set of) column variable(s)/index value(s) |
.agg() | summarize() | aggregate data by specific function rules |
.pivot_table() | spread() | cast the data from a “long” to a “wide” format |
pd.melt() | gather() | cast the data from a “wide” to a “long” format |
.() | %>% | piping, fluid programming, or passing one function's output to the next |
What is it? A data science project, applying concepts learned throughout the course.
Involves collecting data, cleaning and analyzing it, and presenting your findings
The project is composed of three parts:
a 2-page project proposal (which should be discussed with and approved by me),
an in-class presentation,
a 10-page project report.
Requirement | Due | Length | Percentage |
---|---|---|---|
Project Proposal | October 31 | 2 pages | 5% |
Presentation | December 9 | 10-15 minutes | 10% |
Project Report | December 16 | 10 pages | 25% |
Six groups of three students and two groups of two. You pick your groups.
Before October 31, you should have met with me to discuss your proposal.
At least one hour before our meeting, send me a draft of your proposal.
Add the info here
Help me understand what is working and not working in the course, take the survey here: https://forms.gle/E584UtfrivXrbruE6
Data science I: Foundations