PPOL 5203 Data Science I: Foundations

Reproducibility, Version Control, Git/GitHub

Tiago Ventura

Learning goals¶

Today we will discuss version control, workflow, and reproducibility in data science. In practice, this means learning how to incorporate Git and GitHub into our day-to-day data science workflow.

These are the main concepts we will learn:

Reproducibility in Data Science;
Understanding what version control is and how it works;
Basics of Git and operating it from the command line;
Creating repositories on GitHub;
Linking local repositories with online remotes;
Dealing with merge conflicts;
Advanced functionality:
- Branching

I will cover some of these concepts here. However, I expect you to do the weekly readings to fully grasp the concepts described here and covered during the lecture.

Cheatsheets¶

Note that the purpose of this notebook is to serve as a cheatsheet/reference for future use. There are better ones out there than what I've outlined here, and I've listed a few below for your use.

Reproducibility¶

Your data science workflow will generally involve:

Think about a research question/data analysis
Collect data
Clean the data
Run models
Visualize the results
Explain in plain language the substantive importance of your results.

Some key challenges will emerge as you evolve in your career as a data scientist:

Your projects will grow.
- When projects scale up, you need to keep track of your work. Sometimes you decide to experiment with a different model, share your code with a colleague who suggested a different approach, or try a different visualization. All of this should ideally happen without altering the previous draft of your results.
Data Science is a collaborative field
- Data Science is a new and very collaborative field, both in academia and industry. Most of your projects will be shared with other data scientists, and you will want to be able to work on the project at the same time, pushing your ideas in different directions, without affecting what you have already consolidated.
You will be constantly checking your code and others' code.
- To do so, you need code that is well documented and transparent, and you need to be able to check how the code changed over time.
Projects will come and go.
- Sometimes a project will slow down, and you will return to it a few months later. Trust me, you will thank your past self if your code is well documented, transparent, and your steps are fully reproducible.

These challenges will appear across different projects, but, most importantly, you are very likely to repeat these tasks for the different iterations of the same project. As you do this, you need to make sure your results are reproducible.

Best Practices for Reproducibility and Project Management¶

Reproducibility is fundamental to the scientific method. When you publish an article, data report, or data analysis, you need to make sure that others (or even you in the future) can easily trace what you did with the data (cleaning, measurement choices, models) and how you presented the results, all in a reproducible way.

Here are some suggestions to overcome these challenges with minimal headaches:

Documentation

Use comments (#) to describe each step of your code

Readability

Make your code readable in plain English. This usually means giving your variables and functions names that fully describe your intent.
Avoid:
- Abbreviations
- Generic names
- Misleading names

Naming

Use meaningful names for your code/data/notebooks.
- File names should be meaningful
- DO NOT USE SPACES. Use snake case (_) style for your files and code

Portability

Use computational environments for your projects (for example, pyenv).
Avoid absolute file paths
Keep track of the versions of your packages/libraries

Use Self-Contained Projects

Organize each data analysis into a specific folder on your computer.
In my case, every project folder has the following subfolders:
- /code
- /data
- /outputs
- /literature
- /text
- /misc
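
As a minimal sketch, the folder layout above could be created from the command line as follows. The project name my_analysis_project is a placeholder of mine, not something from the course, and the sketch works inside a temporary directory so that it does not touch your real files.

```shell
set -e
# Work in a throwaway directory (assumption: you would normally use your own projects folder).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical project name: snake case, no spaces.
project_dir="my_analysis_project"

# Create the project folder and its six subfolders in one call.
mkdir -p "$project_dir"/code "$project_dir"/data "$project_dir"/outputs \
         "$project_dir"/literature "$project_dir"/text "$project_dir"/misc

# Show the resulting layout.
ls "$project_dir"
```

Because everything lives under one folder, code inside /code can refer to data with relative paths such as ../data/, which keeps the project portable across machines.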

Version Control¶

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. It keeps a complete record of changes to a single file, a folder of material, or a whole project.

Avoids versioned file names such as file1.doc, file1_mod.doc, file1_modagain.doc, file1_mod_thelast.doc, etc.
Makes it easy to handle collaboration with co-workers (through GitHub)
Allows you to “rewind the tape” to earlier incarnations of your notes, drafts, papers and code
It is like Microsoft Word's track changes, but for your entire project

Version control is the most complex and most important of the best practices described here for reproducibility in data science. It requires us to learn a tool to perform version control. We will use Git to do so.

Git & Github¶

Git: one of many options for version control in data science. You rarely hear about the other options, though, because Git has become the de facto standard for version control.

GitHub: a public remote host for Git repositories. It works as a place where software developers and social scientists make their work available, and where you can contribute to ongoing projects or make your own public.

In a nutshell, you will use Git locally, and share and collaborate with others using GitHub.

Git basic workflow in words¶

Git has three main states that your files can reside in:

Modified - changed the file but have not committed it to the database yet.
Staged - marked a modified file in its current version to go into the next commit snapshot.
Committed - data is safely stored in your local database.

The fundamental Git process can be paraphrased as follows:

You alter files within your working directory.
You deliberately choose and stage only the modifications that you plan to include in your upcoming commit, transferring these changes to the staging area.
You make a commit, which captures the current state of the files in the staging area and permanently records that version in your Git directory.
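
The three states can be observed directly with git status --short. The sketch below is only an illustration: the file name notes.txt and the identity are placeholders, and it runs in a temporary directory. In the short format, the first column describes the staging area and the second column describes the working directory.

```shell
set -e
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
git init -q
# Placeholder identity, stored only in this repository (no --global).
git config user.name "Example Student"
git config user.email "student@example.com"

# Start from one committed file.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "add notes"

# Modified: the file differs from the last commit but is not staged yet.
echo "second draft" >> notes.txt
state_modified=$(git status --short)

# Staged: the change is marked to go into the next commit.
git add notes.txt
state_staged=$(git status --short)

# Committed: the snapshot is stored and the working directory is clean again.
git commit -q -m "revise notes"
state_committed=$(git status --short)

echo "modified:  '$state_modified'"
echo "staged:    '$state_staged'"
echo "committed: '$state_committed'"
```

An M in the second column means modified but not staged, an M in the first column means staged, and empty output means everything is committed.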

Basic Git commands¶

Set up your identity

git config --global user.name "myname": set your user name
git config --global user.email your-email@georgetown.edu: set your email account

Starting or getting a git repository on your machine

git init: start a new repository from a working directory
git clone <url or location to repository>: clone an existing repository
git status: get the current status of the repository.

Staging any changes you made

git add <file>: stage a file to be committed
git add .: stage all files to be committed
git reset HEAD <file>: un-stage a file that was staged to be committed (on Git 2.23 or newer, git restore --staged <file> does the same)

Saving the staged changes

git commit -m "some message": commit staged changes to repository
git commit: commit staged changes to repository (will be prompted to leave a message)

Getting current state from the remote (Github) or sending changes to it.

git fetch: download recent changes from the remote repository (but do not merge them with your local version)
git pull: download recent changes from the remote repository and merge them with your local version
git push: push commits to the remote (e.g. a GitHub repository)

Getting Help

git help <verb>
man git-<verb>

Git in Practice¶

The code below covers what we will do in class. It shows you the basics of using git for local version control.

1) Create an empty directory to be our git walkthrough¶

Open your command line.

Check your home directory or where you are when you open your terminal

pwd writes to standard output the full path name of your current directory
mkdir gitwalkthrough creates a new folder

Change working directory

cd gitwalkthrough change working directory

2) Check if you have a git repository¶

Let's check whether you have a git repository here (you know you do not, because you just created this folder)

git status get the current status of the repository
- returns: fatal: not a git repository (or any of the parent directories): .git

Create your first git repository. This command sets up local version control for this project. This is different from creating a repository on a server, such as GitHub.

git init start a new repository from a working directory

If you want to see what is going on under the hood, this command creates a hidden directory .git in your folder. This directory stores your version control history.

3) Tracking and staging new files¶

Now that we have a git repo, let's make our first commit.

First, let's create some random file using the command line. I will use vim as a text editor here.

vim test1.txt to open and edit the file

Remember, a file in your working directory can be untracked, tracked, or staged. To move between these states, we use the add command.

git status will tell us we have a new untracked file.
git add test1.txt start tracking/stage the file test1.txt
git status see we have tracked and staged our files

Let's create a new file and modify our test1.txt

vim test1.txt modify as you wish
vim test2.txt create a new file
git status check the output now: you have a new untracked file, and also unstaged changes in the old file
git add test1.txt test2.txt to track and stage all

4) First commit¶

With all our changes made, we are ready to make the first commit. In other words, we take the first snapshot of our directory.

git commit -m 'first commit' make your first commit
git log see the history of your commits
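
Steps 1 to 4 can also be replayed without an editor. In the sketch below, echo stands in for editing files in vim, the identity is a placeholder, and everything happens in a temporary directory; the file contents are made up for illustration.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir gitwalkthrough
cd gitwalkthrough
git init -q
# Placeholder identity, stored only in this repository.
git config user.name "Example Student"
git config user.email "student@example.com"

# echo stands in for creating the file in vim.
echo "some text" > test1.txt
status_untracked=$(git status --short)   # test1.txt is untracked

git add test1.txt
status_staged=$(git status --short)      # test1.txt is staged as a new file

# Modify the staged file again and create a second file.
echo "more text" >> test1.txt
echo "another file" > test2.txt
status_mixed=$(git status --short)       # unstaged edits in test1.txt, test2.txt untracked

# Stage everything and take the first snapshot.
git add test1.txt test2.txt
git commit -q -m "first commit"
commit_count=$(git rev-list --count HEAD)

echo "$status_untracked"
echo "$status_staged"
echo "$status_mixed"
git log --oneline
```

In the short status format, ?? marks an untracked file, A marks a newly added (staged) file, and AM marks a staged new file that has further unstaged modifications.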

5) Returning to a previous snapshot¶

Let's see how we can use git to move back and forth between different stages of our project.

First, create a new test3.txt file

vim test3.txt
git add . track and stage
git commit -m 'second commit' second commit

Check the log and move back

git log
git checkout <hash> to move to a different, past snapshot

Where is file 3?

ls no file test3.txt. We moved back to our first commit.

Oh, I didn't want to lose file 3! No problemo!

git log --reflog gets the reference log, which still lists the second commit
git checkout <hash> use the hash of the second commit to move forward again
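
The round trip above can be sketched as a script. This is an illustration under a few assumptions: echo replaces vim, the identity is a placeholder, and the script returns to the latest snapshot by checking out the branch name instead of looking up the hash in the reflog, which is an alternative to the approach shown above. The branch name is read from git because the default can be master or main depending on your configuration.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"
# Name of the default branch (master or main, depending on configuration).
main_branch=$(git symbolic-ref --short HEAD)

echo "one" > test1.txt
git add .
git commit -q -m "first commit"
first_hash=$(git rev-parse --short HEAD)

echo "three" > test3.txt
git add .
git commit -q -m "second commit"

# Move back to the first snapshot: test3.txt disappears from the folder.
git checkout -q "$first_hash"
files_at_first=$(ls)

# Return to the tip of the branch: test3.txt is back.
git checkout -q "$main_branch"
files_at_tip=$(ls)

echo "at first commit: $files_at_first"
echo "at branch tip:   $files_at_tip"
```

Checking out a hash leaves you in a "detached HEAD" state, which is fine for looking around; checking out the branch name puts you back on the branch.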

Git: Branching and Merging¶

Branching allows us to work on different paths in git. It is very useful for two purposes:

Experimenting with code
Collaborating with colleagues.

An interesting way to visualize what branching means is by using the Visualize Git tool

Below you can see the basic commands to create branches and merge them in git.

Create a new branch
- git checkout -b <branch-name>
Write code or create new files
- vim test4.txt
Stage and commit
- git add .
- git commit -m "hello from alternative world"
Check status across different branches
- ls test4.txt should be here
- git checkout master move back to the master branch
- ls no test4.txt
Then we can merge our branches. Here we are doing a fast-forward merge, moving our master to catch up with the alternative branch
- git merge <branch-name>
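
The branching steps above can be replayed in a temporary repository. In this sketch the branch name alternative and the file contents are placeholders, and --ff-only is added to the merge so that it succeeds only if a fast-forward is possible, which makes the kind of merge explicit.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"
# Name of the default branch (master or main, depending on configuration).
main_branch=$(git symbolic-ref --short HEAD)

echo "base" > test1.txt
git add .
git commit -q -m "first commit"

# Create and switch to a new branch (hypothetical name), then commit there.
git checkout -q -b alternative
echo "hello" > test4.txt
git add .
git commit -q -m "hello from alternative world"

# Back on the default branch, test4.txt is not in the folder.
git checkout -q "$main_branch"
if [ -e test4.txt ]; then before_merge="present"; else before_merge="absent"; fi

# Fast-forward merge: the default branch has no new commits of its own,
# so git only moves its pointer forward to the tip of alternative.
git merge -q --ff-only alternative
if [ -e test4.txt ]; then after_merge="present"; else after_merge="absent"; fi

echo "before merge: $before_merge, after merge: $after_merge"
```

After the merge both branch names point to the same commit, which is what "fast-forward" means.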

In general, at this stage, you would use git push to send your local commits to your remote repository (usually on GitHub), and git pull to bring changes from the remote into your local branch. You can think of git push and git pull as merging steps between your local branch and its counterpart on the remote.

Git Remotes: Git + Github.¶

Most of the time, you will use git integrated with GitHub. This is because git is most useful when it allows multiple researchers to write and share code at the same time.

This is the workflow we will go through in class. It is usually how I use git in my day-to-day work. It is also inspired by a tutorial we developed at the Center for Social Media and Politics (CSMaP), where I worked as a Postdoctoral Researcher.

Starting a New Project¶

Before you write any code:

Go to your GitHub account and create a new repository
Open your terminal, and clone the project
- git clone <url>
Move your working directory to this new folder
- cd <project-directory>
Write code!
Track your changes: git add .
Commit your changes: git commit -m 'describe your commit'
Push the changes in your local repository to GitHub:
- git push -u origin [branch-name] (if the branch is protected, you need to open a pull request (PR) instead)

Say you already have the repository cloned, and you have colleagues working on the code. Now you want to work on your part of the project. You follow pretty much the same sequence as above, but instead of cloning as your first step, you pull from the remote repository.

cd <gitrepo>
git pull
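
The clone, commit, push, and pull cycle can be tried without a GitHub account. In the sketch below a local bare repository (remote.git) stands in for the GitHub repository; this is my substitution for illustration only, and the folder names, file name, and identities are placeholders. With a real project you would clone the URL shown on the repository page instead.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"

# A bare repository plays the role of the GitHub remote (assumption for this sketch).
git init -q --bare remote.git

# You: clone, write code, commit, and push.
git clone -q remote.git my_copy
cd my_copy
git config user.name "Example Student"
git config user.email "student@example.com"
echo "analysis v1" > analysis.txt
git add .
git commit -q -m "add analysis"
branch=$(git symbolic-ref --short HEAD)
git push -q -u origin "$branch"
cd "$workdir"

# A colleague: clone, change the file, commit, and push.
git clone -q remote.git colleague_copy
cd colleague_copy
git config user.name "Example Colleague"
git config user.email "colleague@example.com"
echo "colleague edit" >> analysis.txt
git commit -q -am "extend analysis"
git push -q
cd "$workdir"

# You again: start the session with a pull to receive the colleague's work.
cd my_copy
git pull -q --ff-only
last_message=$(git log -1 --pretty=format:%s)
echo "latest commit in my copy: $last_message"
```

The -u flag on the first push records origin as the upstream of the branch, which is why the later git pull needs no further arguments.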

Git: Conflicts¶

When merging across different branches, sometimes there are conflicts between branches or between remote versions of a repository. This can happen when you experiment using different branches, but it is most common when collaborating on large-scale projects. Say you changed some part of a file by deleting a function, and a colleague changed the same file by modifying that function. This would be an example of a conflict.

Git does not know which version is the correct one, so it marks the conflicting part of the file using special delimiters.

<<<<<<< HEAD
ADD EXAMPLE FROM class
=======
ADD EXAMPLE FROM CLASS
>>>>>>> new-branch

What do these delimiters mean?

the top half is from the branch you are merging into
the bottom half is from the commit that you are trying to merge in

How to solve a conflict?

Open your text editor and navigate to the file that has merge conflicts.
Edit the file so that it contains what you want in the final merge (which may incorporate changes from both branches), and delete the conflict markers <<<<<<<, =======, >>>>>>>.
Stage your changes (git add)
Commit your changes (git commit)

Let's work through an example:

1) Create a new branch, change a file, and commit¶

git checkout -b new: create a branch called new and check it out directly
vim test1.txt: make some modification
git add test1.txt
git commit -m "new file 1": commit your changes

2) Now, let's do the same in the master branch¶

git checkout master: move back to master
vim test1.txt: make some modification to the same lines and see that the modification from the new branch is not here
git add test1.txt
git commit -m "new file 1 from master": commit your changes

3) Merge and solve conflict¶

Because we are on the master branch, we just need to merge the new branch into master

git merge new: this merges new into master. If we wanted to do the opposite, we would need to check out new and use git merge master

Git reports a conflict: each branch changed the same part of the same file.

See the conflict:

vim test1.txt: see that the file now combines the versions from both branches, separated by conflict markers.
Fix the issue in your text editor
git add test1.txt
git commit -m "fixed conflict": commit your changes
git log: to see your merge complete
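
The whole conflict example can be reproduced in a temporary repository. In this sketch echo replaces vim, the file contents and identity are placeholders, and the conflict is resolved by keeping both lines, which is only one of several possible resolutions.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"
# Name of the default branch (master or main, depending on configuration).
main_branch=$(git symbolic-ref --short HEAD)

echo "original line" > test1.txt
git add test1.txt
git commit -q -m "first commit"

# 1) Change the file on a new branch.
git checkout -q -b new
echo "edited on the new branch" > test1.txt
git commit -q -am "new file 1"

# 2) Change the same line on the default branch.
git checkout -q "$main_branch"
echo "edited on the main branch" > test1.txt
git commit -q -am "new file 1 from master"

# 3) Merge: git stops with a conflict; "|| true" keeps the script running.
git merge new > /dev/null 2>&1 || true
conflict_view=$(cat test1.txt)
echo "$conflict_view"

# Resolve by writing the final content (here: keep both lines), then stage and commit.
printf 'edited on the main branch\nedited on the new branch\n' > test1.txt
git add test1.txt
git commit -q -m "fixed conflict"

# A merge commit has two parents.
parent_count=$(git log -1 --pretty=format:%p | wc -w | tr -d ' ')
echo "parents of the merge commit: $parent_count"
```

The printed file shows the version from the current branch (HEAD) above the separator line and the version from new below it, matching the description of the delimiters above.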

.gitignore: ignoring specific files or file types¶

Sometimes we do not want to track certain file types.

For example, GitHub has a file size limit of 100 MB, meaning that we cannot push really big data sources up to the repository. We might want to avoid uploading any data files to our GitHub repository for this reason. To do this, we can ignore specific file types, such as .csv (comma separated values) or .Rdata (an R data file type), by creating a special file that tells Git which files not to track.

We can exclude these files by adding a .gitignore file to our project folder.

*.ipynb_checkpoints 
*.Rdata
*.csv
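
To check that the patterns behave as intended, git check-ignore reports whether a path is ignored. The sketch below writes the three patterns above into a .gitignore in a temporary repository; the file names survey.csv, model.Rdata, and analysis.py are made up for illustration.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q

# Write the ignore patterns, one per line.
printf '%s\n' '*.ipynb_checkpoints' '*.Rdata' '*.csv' > .gitignore

# Hypothetical project files: two data files and one script.
touch survey.csv model.Rdata analysis.py

# git status lists only the files that are not ignored.
visible=$(git status --short)
echo "$visible"

# git check-ignore -q exits with 0 when the path is ignored, 1 when it is not.
if git check-ignore -q survey.csv; then csv_ignored="yes"; else csv_ignored="no"; fi
if git check-ignore -q model.Rdata; then rdata_ignored="yes"; else rdata_ignored="no"; fi
if git check-ignore -q analysis.py; then py_ignored="yes"; else py_ignored="no"; fi
echo "csv: $csv_ignored, Rdata: $rdata_ignored, py: $py_ignored"
```

Note that .gitignore only affects untracked files. A file that was already committed stays tracked until you remove it from the index (for example with git rm --cached <file>).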

Advanced Git commands¶

Accessing the logs → who did what to which file and when?

git log: look at the commit history
- Useful arguments:
  - --oneline: view a condensed summary
  - --all: include commits from all branches and other references, not only the current branch
  - --graph: view a text graph of the commit sequence
  - --stat: abbreviated stats for each commit
  - --since=2.weeks: review commits within some temporal range
- Easily format the log
  - git log --pretty=format:"%h - %an, %ar : %s"
  - see Git Basics on Viewing the Commit History for more insight into the different possible configurations and customizations

Tracking Differences

git diff : explore the differences between files
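- Use the hexadecimal hash to compare commits
  - e.g. git diff 44d14b2 2adbea3
git whatchanged: an older command that lists the files changed in each commit; the Git documentation encourages new users to use git log instead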

Tracking Movement: Git does not record renames explicitly; it infers them by comparing file contents. If we rename or move a file outside of git, we have to stage both the removal of the old path and the addition of the new one.

git mv old-file-location new-file-location: Move a file and stage the move in one step, so that the file's history remains easy to follow
git mv old-file-name new-file-name: Rename a file and stage the rename in one step
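
A small sketch of a rename, with made-up file names and a placeholder identity. It shows how git reports the staged rename and how git log --follow continues the history of a file across the rename.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

echo "draft results" > old_name.txt
git add old_name.txt
git commit -q -m "add file"

# Rename and stage the rename in one step.
git mv old_name.txt new_name.txt
status_rename=$(git status --short)
echo "$status_rename"
git commit -q -m "rename file"

# --follow keeps listing commits from before the rename.
history_length=$(git log --follow --oneline -- new_name.txt | wc -l | tr -d ' ')
echo "commits touching the file, including before the rename: $history_length"
```

Without --follow, git log -- new_name.txt would list only the commits made under the new name.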

Time Traveling

git checkout <commit-hash>: Move to prior snapshots of the project
git revert <commit-hash>: Create a new commit that undoes the changes introduced by the given commit, leaving the earlier history intact
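
The difference between the two commands is easiest to see in the commit count: git checkout only moves you around the history, while git revert adds a commit. A minimal sketch with a made-up file and a placeholder identity:

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

echo "clean result" > report.txt
git add report.txt
git commit -q -m "add report"

echo "mistaken line" >> report.txt
git commit -q -am "introduce mistake"
bad_hash=$(git rev-parse HEAD)

# Revert adds a third commit that undoes the second one.
# --no-edit accepts the default commit message without opening an editor.
git revert --no-edit "$bad_hash" > /dev/null

commit_count=$(git rev-list --count HEAD)
report_after=$(cat report.txt)
echo "commits: $commit_count"
echo "report.txt now contains: $report_after"
```

Because the mistaken commit stays in the history, revert is safe to use on commits that have already been pushed and shared with collaborators.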

Branching: A branch in git is a lightweight, movable pointer to a commit. Git names the default branch "master" unless configured otherwise; new repositories created on GitHub use "main".

git branch <name-of-new-branch>: create a new branch
git checkout <name-of-branch>: checkout a branch
git checkout -b <name-of-new-branch>: create & checkout a branch simultaneously
git merge <name-of-branch-to-be-merged>: merge the named branch into the branch you currently have checked out (check out the main branch first)
git branch -d <name-of-branch>: deleting branches
git branch -v: seeing the last commit on each branch

Git Remotes¶

Adding a remote

git remote add origin https://github.com/user/repo.git: connect a local git repository to a GitHub repository
- generic version: git remote add <name-of-our-remote> <REMOTE_URL>
- We can add another remote pointing to a different git hosting service, like Bitbucket.

Looking at our different remotes

git remote: print available remotes in the console
git ls-remote: Displays references available in a remote repository along with the associated commit IDs.
git remote -v: shows the URL of the remotes

Fetching from a remote

git fetch <remote-name>

Pushing changes to the remote

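git push -u <remote> <branch>: push the branch to the given remote; -u records that remote branch as the upstream, so that later git push and git pull need no arguments
git push -u origin master: the same command with the remote origin and the branch master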

Inspecting Remotes

git remote show origin
git remote show

Renaming Remotes

git remote rename origin my-go-to-remote

Removing Remotes

git remote remove <remote-name>
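
The remote management commands can be tried safely against a local stand-in. In this sketch a bare repository (hosted.git) replaces the GitHub URL, which is my substitution for illustration; the folder names are placeholders.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"

# A bare repository stands in for a hosted repository (assumption for this sketch).
git init -q --bare hosted.git
git init -q project
cd project

# Add a remote called origin and list it with its URL.
git remote add origin "$workdir/hosted.git"
remotes_added=$(git remote)
git remote -v

# Rename the remote.
git remote rename origin my-go-to-remote
remotes_renamed=$(git remote)

# Remove the remote again.
git remote remove my-go-to-remote
remotes_removed=$(git remote)

echo "after add: $remotes_added"
echo "after rename: $remotes_renamed"
echo "after remove: '$remotes_removed'"
```

A remote is only a named shortcut for a URL or path, so adding, renaming, or removing one does not change any commits in either repository.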
