PPOL 5203 Data Science I: Foundations

Reproducibility, Version Control, Git/Github

Tiago Ventura


Learning goals

Today we will discuss version control, workflow, and reproducibility in data science. In practice, this means learning how to incorporate Git and GitHub into our day-to-day data science workflow.

These are the main concepts we will learn:

  • Reproducibility in Data Science
  • Conceptualize what version control is and how it works;
  • Basics of Git and operating it with command line arguments;
  • Generating repositories on Github;
  • Linking local repositories with online remotes;
  • Dealing with merge conflicts;
  • Advanced functionality:
    • Branching

I will cover some of the concepts here. However, I expect you to do the weekly readings to fully grasp the concepts I describe here and cover during the lecture.

Cheatsheets

Note that the purpose of this notebook is to serve as a cheatsheet/reference for future use, though there are better ones out there than what I've outlined here. I've listed a few below for your use.

Reproducibility

Your data science workflow will generally involve:

  • Think about a research question/data analysis
  • Collect data
  • Clean the data
  • Run models
  • Visualize the results
  • Explain in plain language the substantive importance of your results.

Some key challenges will emerge as you evolve in your career as a data scientist:

  • Your projects will grow.

    • When projects scale up, you need to keep track of your work. Sometimes you decide to experiment with a different model, share your code with a colleague who suggested a different approach, or try a different visualization. Ideally, you do all of this without altering the previous draft of your results.
  • Data Science is a collaborative field

    • Data Science is a new and very collaborative field, both in academia and industry. Most of your projects will be shared with other data scientists, and you will want to be able to work on the project at the same time, pushing your ideas in different directions, without affecting what you have already consolidated.
  • You will be constantly checking your code and others' code.

    • To do so, you need code that is well documented and transparent, and you need to be able to check how the code changed over time.
  • Projects will come and go.

    • Sometimes a project will slow down, and you will return to it a few months later. Trust me, you will thank your past self if your code is well documented and transparent, and your steps are fully reproducible.

These challenges will appear across different projects, but, most importantly, you are very likely to repeat these tasks for the different iterations of the same project. As you do this, you need to make sure your results are reproducible.

Best Practices for Reproducibility and Project Management

Reproducibility is fundamental to the scientific method. When you publish an article, data report, or data analysis, you need to make sure that others (or even you in the future) can easily trace what you did to the data (cleaning, measurement choices, models) and how you presented the results, all in a reproducible way.

Here are some suggestions to help you overcome these challenges with minimal headaches:

Documentation

  • Use # to describe every single step of your code

Readability

  • Make your code readable in plain English. This usually means giving your variables and functions names that fully describe what you intend them to do.
  • Avoid:
    • Abbreviations
    • Generic names
    • Misleading names

Naming

  • Use meaningful names for your code/data/notebooks.
    • File names should be meaningful
    • DO NOT USE SPACES. Use snake case (_) style for your files and code

Portability

  • Use computational environments for your projects (e.g., pyenv)
  • Avoid absolute file paths
  • Keep track of the versions of your packages/libraries

Use Self-Contained Projects

  • Organize each data analysis into a specific folder on your computer.
  • In my case, every project folder has the following subfolders:
    • /code
    • /data
    • /outputs
    • /literature
    • /text
    • /misc
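
As a minimal sketch, the folder layout above can be created from the command line. The project name my_analysis_project and the temporary parent directory are assumptions made for illustration; use your own names and location.

```shell
# Hypothetical project location; replace with your own (snake case, no spaces).
project_root="$(mktemp -d)/my_analysis_project"

# Create the project folder and one subfolder per kind of material.
for folder in code data outputs literature text misc; do
    mkdir -p "$project_root/$folder"
done

# List the subfolders that were created.
ls "$project_root"
```

Keeping everything for one analysis under a single root folder is what makes relative file paths workable later on.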

Version Control

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. It keeps a complete record of changes to a single file, a folder of material, or a whole project.

  • Version control avoids: file1.doc, file1_mod.doc, file1_modagain.doc, file1_mod_thelast.doc, etc.

  • Easily handle collaboration with co-workers (through GitHub)

  • Avoids version file names

  • Allows you to “rewind the tape” to earlier incarnations of your notes, drafts, papers and code

  • It is like Microsoft Word's track changes, but for your entire project

Version control is the most complex and most important of the best practices described here for reproducibility in data science. It requires us to learn a tool to perform version control. We will use Git to do so.

Git & Github

Git: one of many options for version control in data science. You rarely hear about the other options, though, because Git has become the standard for version control.

GitHub: a public remote host for Git repositories. It works as a place where software developers and social scientists make their work available, and where you can contribute to ongoing projects or make your own public.

In a nutshell, you will use Git locally, and share and collaborate with others using GitHub.

Git basic workflow in words

Git has three main states that your files can reside in:

  • Modified - changed the file but have not committed it to the database yet.
  • Staged - marked a modified file in its current version to go into the next commit snapshot.
  • Committed - data is safely stored in your local database.

The fundamental Git process can be paraphrased as follows:

  • You alter files within your working directory.

  • You deliberately choose and stage only the modifications that you plan to include in your upcoming commit, transferring these changes to the staging area.

  • You make a commit, which captures the current state of the files in the staging area and permanently records that version into your Git directory
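
To make the three states concrete, here is a small sketch that walks one file through modified, staged, and committed, printing the short status at each step. The temporary directory, the file name notes.txt, and the placeholder identity are assumptions for illustration only.

```shell
set -e                                   # stop at the first failing command
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "add notes"             # committed: stored in the local database

echo "second draft" >> notes.txt         # modified: changed, but not staged yet
modified_state=$(git status --short)
echo "modified:  $modified_state"

git add notes.txt                        # staged: marked for the next commit
staged_state=$(git status --short)
echo "staged:    $staged_state"

git commit -q -m "revise notes"          # committed again: nothing left to report
committed_state=$(git status --short)
echo "committed: ${committed_state:-<clean>}"
```

In the short status format, the first column describes the staging area and the second column describes the working directory, which is why the M moves from the second column to the first after git add.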

Basic Git commands

Set up your identity

  • git config --global user.name "myname": set your user name
  • git config --global user.email your-email@georgetown.edu: set your email account

Starting or getting a git repository on your machine

  • git init: start a new repository from a working directory
  • git clone <url or location to repository>: clone an existing repository
  • git status: get the current status of the repository.

Staging any changes you made

  • git add <file>: stage a file to be committed
  • git add .: stage all files to be committed
  • git reset HEAD <file>: un-stage a file that was staged to be committed

Saving the staged changes

  • git commit -m "some message": commit staged changes to repository
  • git commit: commit staged changes to repository (will be prompted to leave a message)

Getting current state from the remote (Github) or sending changes to it.

  • git fetch: download recent changes in the remote repository (but do not explicitly merge with your local version)
  • git pull: download recent changes in the remote repository and merge them with your local version
  • git push: push commits to remote (e.g. github repository)

Getting Help

  • git help <verb>
  • man git-<verb>

Git in Practice

The code below covers what we will do in class. It shows you the basics of using git for local version control.

1) Create an empty directory to be our git walkthrough

Open your command line.

Check your home directory, or wherever you are when you open your terminal:

  • pwd: writes the full path of your current directory to standard output
  • mkdir gitwalkthrough: creates a new folder

Change the working directory:

  • cd gitwalkthrough: change the working directory

2) Check whether you have a git repository

Let's check whether you have a git repository here (you know you do not, because you just created this folder).

  • git status: get the current status of the repository
    • returns: fatal: not a git repository (or any of the parent directories): .git

Create your first git repository. This command sets up local version control for this project. This is different from creating a repository on a server, such as GitHub.

  • git init: start a new repository from a working directory

If you want to see what is going on under the hood, this command creates a hidden directory .git in your folder. This directory stores the version history of your project.
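
Steps 1 and 2 can be run end to end as the sketch below. It assumes a throwaway parent directory created with mktemp, so that nothing in your real folders is touched.

```shell
workdir=$(mktemp -d)                     # throwaway parent folder (assumption)
cd "$workdir"
pwd                                      # print the current directory

mkdir gitwalkthrough                     # create the walkthrough folder
cd gitwalkthrough

# There is no repository here yet, so git status exits with an error.
git status || echo "no repository here yet"

git init                                 # create the hidden .git directory
ls -a                                    # .git now shows up in the listing
git status                               # works now: no commits yet
```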

3) Tracking and staging new files

Now that we have a git repo, let's make our first commit.

First, let's create a file using the command line. I will use vim as a text editor here.

  • vim test1.txt: open and edit the file

Remember that git only records what you stage. A file in your working directory can be untracked, tracked, or staged, and we use the add command to move files between these states.

  • git status: tells us we have a new untracked file
  • git add test1.txt: start tracking and stage the file test1.txt
  • git status: shows that test1.txt is now tracked and staged

Let's create a new file and modify test1.txt:

  • vim test1.txt: modify as you wish
  • vim test2.txt: create a new file
  • git status: check the output now; you have a new untracked file, and also unstaged changes in the old file
  • git add test1.txt test2.txt: track and stage both files

4) First commit

With all our changes made, we are ready to make the first commit or, in other words, to take the first snapshot of our directory.

  • git commit -m 'first commit': make your first commit
  • git log: see the history of your commits
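
Here is a non-interactive sketch of steps 3 and 4. It uses echo in place of vim so it can run as a script, and it assumes a temporary repository and a placeholder identity.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"

echo "hello" > test1.txt                 # stands in for editing with vim
git status --short                       # test1.txt is untracked
git add test1.txt
git status --short                       # test1.txt is tracked and staged

echo "more text" >> test1.txt            # modify the file we already staged
echo "another file" > test2.txt          # create a second file
git status --short                       # unstaged change plus a new untracked file
git add test1.txt test2.txt              # track and stage both

git commit -q -m "first commit"          # take the first snapshot
git log --oneline                        # one commit in the history
```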

5) Returning to a previous snapshot

Let's see how we can use git to move back and forth between different stages of our project.

First, create a new test3.txt file:

  • vim test3.txt
  • git add .: track and stage
  • git commit -m 'second commit': second commit

Check the log and move back:

  • git log
  • git checkout <hash>: move to a past snapshot (git will tell you that you are in a "detached HEAD" state)

Where is file 3?

  • ls: no file test3.txt. We moved back to our first commit.

Oh, I didn't want to lose file 3! No problem!

  • git log --reflog: shows the commits recorded in the reference log, including the second commit
  • git checkout <hash>: move to the snapshot that contains test3.txt
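
The round trip above can be sketched as a script. Two choices here are assumptions rather than part of the walkthrough: the commit hash is captured in a variable instead of being copied from the log, and the return trip checks out the branch name rather than a hash, which avoids staying in a detached HEAD state.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"

echo "one" > test1.txt
git add .
git commit -q -m "first commit"
first_commit=$(git rev-parse HEAD)              # hash of the first snapshot
branch_name=$(git symbolic-ref --short HEAD)    # "master" or "main", depending on your setup

echo "three" > test3.txt
git add .
git commit -q -m "second commit"

git checkout -q "$first_commit"          # move back to the first snapshot
files_at_first_commit=$(ls)
echo "files at first commit: $files_at_first_commit"

git log --oneline --reflog               # the second commit is still listed
git checkout -q "$branch_name"           # return to the latest snapshot
ls                                       # test3.txt is back
```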

Git: Branching and Merging.

Branching allows us to work on different paths in git. It is very useful for two purposes:

  • Experimenting with code

  • Collaborating with colleagues.

An interesting way to visualize what branching means is by using the Visualize Git tool

Below you can see the basic commands to create branches and merge them in git.

  • Create a new branch

    • git checkout -b <branch-name>
  • write code or create new files

    • vim test4.txt
  • Stage and commit

    • git add .
    • git commit -m " hello from alternative world"
  • Check status across different branches

    • ls: test4.txt should be here
    • git checkout master: move back to the master branch
    • ls: no test4.txt
  • Then we can merge our branches. Here we are doing a fast-forward merge, moving our master to catch up with the alternative branch

    • git merge <branch-name>
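
A runnable sketch of this branch-and-merge sequence follows. The branch name alternative, the temporary repository, and the placeholder identity are assumptions; the script also reads the default branch name from git instead of assuming it is master.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"

echo "base" > test1.txt
git add .
git commit -q -m "first commit"
main_branch=$(git symbolic-ref --short HEAD)   # "master" or "main"

git checkout -q -b alternative           # create and switch to a new branch
echo "hello" > test4.txt
git add .
git commit -q -m "hello from alternative world"
ls                                       # test4.txt exists on this branch

git checkout -q "$main_branch"           # move back to the main branch
ls                                       # test4.txt is not here

git merge alternative                    # fast-forward: the main branch catches up
ls                                       # test4.txt is here now
```

Because the main branch had no new commits of its own, git only has to move its pointer forward, which is what makes this a fast-forward merge.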

In general, at this stage, you would use git push to send your local branch to your remote repository (usually on GitHub), and git pull to bring changes from the remote into your local branch. You can think of git push and git pull as merging steps between your local branch and the corresponding branch on the remote.

Git Remotes: Git + Github.

Most of the time, you will use git together with GitHub. This is because git is most useful when it allows multiple researchers to write and share code at the same time.

This is the workflow we will go through in class. It is usually how I use git in my day-to-day work. It is also inspired by a tutorial we developed at the Center for Social Media and Politics (CSMaP), where I worked as a Postdoctoral Researcher.

Starting a New Project

Before you write any code:

  • Go to your github, and create a new repository

  • Open your terminal, and clone the project

    • git clone <url>
  • Move your working directory to this new folder

    • cd <project-directory>
  • Write code!

  • Track your changes: git add .

  • Commit your changes git commit -m 'describe your commit'

  • Push the changes in your local repository to GitHub:

    • git push -u origin <branch-name>: if the branch is protected, you will need to open a pull request (PR) instead of pushing directly

Say you already have the directory cloned, and you have colleagues working on the code. Now you want to work on your part of the project. You follow pretty much the same sequence as above, but instead of cloning as your first step, you pull from the remote repository.

  • cd <gitrepo>

  • git pull
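
The clone, push, and pull cycle can be rehearsed without a GitHub account. The sketch below assumes a bare repository on your own disk as a stand-in for the GitHub remote, plus hypothetical folder names (my_project, colleague_copy) and placeholder identities. With a real GitHub repository you would clone its URL instead.

```shell
set -e
workdir=$(mktemp -d)                     # throwaway folder for the demo
cd "$workdir"

# Stand-in for a repository hosted on GitHub (assumption: a local bare repo).
git init -q --bare remote_repo.git

# Clone the (still empty) remote, write code, commit, and push.
git clone -q remote_repo.git my_project
cd my_project
git config user.name "Example User"
git config user.email "user@example.com"
echo "print('hello')" > analysis.py
git add .
git commit -q -m "add analysis script"
branch_name=$(git symbolic-ref --short HEAD)
git push -q -u origin "$branch_name"

# A colleague clones the same remote, changes the file, and pushes.
cd "$workdir"
git clone -q remote_repo.git colleague_copy
cd colleague_copy
git config user.name "Example Colleague"
git config user.email "colleague@example.com"
echo "print('world')" >> analysis.py
git commit -q -am "extend analysis script"
git push -q

# Back in our copy: pull before starting to work.
cd "$workdir/my_project"
git pull
cat analysis.py                          # now includes the colleague's line
```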

Git: Conflicts

When merging different branches, conflicts sometimes arise between branches or between the local and remote versions of a repository. This can happen when you experiment with different branches yourself, but it is most common when collaborating on large-scale projects. Say you changed part of a file by deleting a function, and a colleague changed the same file by modifying that function. This would be an example of a conflict.

Git does not know which version is the correct one, so it will mark the file as having a conflict using a special delimiter.

<<<<<<< HEAD
ADD EXAMPLE FROM class
=======
ADD EXAMPLE FROM CLASS
>>>>>>> new-branch

What do these delimiters mean?

  • the top half is from the branch you are merging into

  • the bottom half is from the commit that you are trying to merge in

How to solve a conflict?

  • Open your text editor and navigate to the file that has merge conflicts.

  • Resolve the conflict by editing the file into the version you want in the final merge (which may incorporate changes from both branches), and delete the conflict markers <<<<<<<, =======, and >>>>>>>.

  • Stage your changes (git add)

  • Commit your changes (git commit)

Let's work through an example:

1) Create a new branch, change a file, and commit

  • git checkout -b "new": create a branch called new and check it out directly
  • vim test1.txt: make some modification
  • git add test1.txt
  • git commit -m "new file 1": commit your changes

2) Now, let's do the same in the master branch

  • git checkout "master": move back to master
  • vim test1.txt: make some modification and see that the modification from the new branch is not here
  • git add test1.txt
  • git commit -m "new file 1 from master": commit your changes

3) Merge and solve conflict

Because we are on the master branch, we just need to merge the new branch into master.

  • git merge new: this merges new into master. If we wanted to do the opposite, we would check out new and run git merge master

The merge stops with a conflict, because each branch changed the same file in a different way.

See the conflict:

  • vim test1.txt: the file now combines the versions from both branches, separated by conflict markers.

  • Fix the issue in your text editor

  • git add test1.txt

  • git commit -m "fixed conflict": commit your changes

  • git log: to see your merge complete
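
The whole conflict example can be reproduced with the sketch below. It uses echo and printf in place of vim, resolves the conflict by keeping both lines (just one possible resolution), and assumes a temporary repository and a placeholder identity.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"

echo "original line" > test1.txt
git add test1.txt
git commit -q -m "first commit"
main_branch=$(git symbolic-ref --short HEAD)   # "master" or "main"

# 1) Change the file on a new branch.
git checkout -q -b new
echo "line edited on new" > test1.txt
git commit -q -am "new file 1"

# 2) Change the same line on the main branch.
git checkout -q "$main_branch"
echo "line edited on $main_branch" > test1.txt
git commit -q -am "new file 1 from $main_branch"

# 3) Merge: git stops and exits with a non-zero status because of the conflict.
git merge new || echo "merge stopped: conflict in test1.txt"
conflicted=$(cat test1.txt)
echo "$conflicted"                       # shows the <<<<<<<, =======, >>>>>>> markers

# Resolve by writing the final version (here: keep both lines), then commit.
printf 'line edited on %s\nline edited on new\n' "$main_branch" > test1.txt
git add test1.txt
git commit -q -m "fixed conflict"
git log --oneline --graph                # the merge commit joins both branches
```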

.gitignore: ignoring specific files or file types

Sometimes we do not want to track certain file types.

For example, GitHub blocks files larger than 100 MB, so we cannot push really big data sources to the repository. We might want to avoid uploading any data files to our GitHub repository for this reason. To do this, we can ignore specific file types, such as .csv (comma-separated values) or .Rdata (an R data file type), by creating a special file that tells Git which files not to track.

We can exclude these files by adding a .gitignore file to our project folder.

*.ipynb_checkpoints 
*.Rdata
*.csv
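
To check that the patterns behave as intended, the sketch below writes this .gitignore into a temporary repository and stages everything. The file names survey.csv and clean_data.py are hypothetical examples.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q

# Write the three patterns, one per line.
printf '%s\n' '*.ipynb_checkpoints' '*.Rdata' '*.csv' > .gitignore

echo "id,value" > survey.csv             # hypothetical data file: should be ignored
echo "print('hi')" > clean_data.py       # hypothetical script: should be tracked

git add .                                # stages everything that is not ignored
git status --short                       # survey.csv does not appear
git check-ignore -v survey.csv           # reports which pattern matched the file
```

git check-ignore is handy when a file is unexpectedly missing from git status and you want to know which rule is responsible.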

Advanced Git commands

Accessing the logs → who did what to which file and when?

  • git log: look at the commit history
    • Useful arguments:
      • --oneline: view a condensed summary
      • --all: include commits from all branches and other refs, not only the current branch
      • --graph: view a text graph of the commit sequence
      • --stat: abbreviated stats for each commit
      • --since=2.weeks: review commits within some temporal range
    • Easily format the log
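
The sketch below tries these log options on a small temporary repository with three commits. The last command shows one way to format the log with --pretty=format; the particular fields chosen (short hash, author, date, subject) are only an example.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"

# Create three small commits so there is some history to look at.
for n in 1 2 3; do
    echo "step $n" >> notes.txt
    git add notes.txt
    git commit -q -m "commit number $n"
done

git log --oneline                        # condensed summary
git log --oneline --graph --all          # text graph across all branches
git log --stat -n 1                      # file-level stats for the latest commit
git log --since=2.weeks --oneline        # commits from the last two weeks

# Custom format: short hash | author | date | subject.
git log --pretty=format:'%h | %an | %ad | %s' --date=short
```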

Tracking Differences

  • git diff : explore the differences between files
    • Use the hexadecimal commit hashes to compare commits
      • e.g. git diff 44d14b2 2adbea3
  • git whatchanged

Tracking Movement: If we were to just rename or move a file, git doesn't necessarily know that it was already tracking that file.

  • git mv old-file-location new-file-location: Move files around so that the git history is retained
  • git mv old-file-name new-file-name: Rename files so that the git history is retained
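
As a sketch, the example below moves and renames a file in one step and then asks for its history across the rename. The file and folder names are hypothetical, and the repository is a temporary one.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"

mkdir code
echo "x = 1" > script.py
git add script.py
git commit -q -m "add script"

git mv script.py code/clean_data.py      # move and rename in one step
git status --short                       # reported as a rename, already staged
git commit -q -m "move and rename script"

# --follow keeps tracing the file's history across the rename.
git log --oneline --follow -- code/clean_data.py
```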

Time Traveling

  • git checkout <commit-hash>: Move to prior snapshots of the project
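  • git revert <commit-hash>: create a new commit that undoes the changes introduced by the given commit

Branching: A branch in git is a lightweight, movable pointer to a commit. The default branch is traditionally named "master"; newer setups, including new GitHub repositories, often use "main".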
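
The difference between the two commands is easiest to see side by side. In this sketch (temporary repository, hypothetical file report.txt, placeholder identity), checkout only looks at an old snapshot, while revert adds a new commit that undoes an earlier one and leaves the history intact.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"

echo "good line" > report.txt
git add report.txt
git commit -q -m "add report"
first_commit=$(git rev-parse HEAD)
branch_name=$(git symbolic-ref --short HEAD)   # "master" or "main"

echo "bad line" >> report.txt
git commit -q -am "add a mistake"
bad_commit=$(git rev-parse HEAD)

# checkout: visit an old snapshot, then come back; history is unchanged.
git checkout -q "$first_commit"
cat report.txt                           # only the good line
git checkout -q "$branch_name"

# revert: record a new commit that undoes the changes made in bad_commit.
git revert --no-edit "$bad_commit"
cat report.txt                           # only the good line, on the branch itself
git log --oneline                        # three commits: the mistake is still recorded
```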

  • git branch <name-of-new-branch>: create a new branch
  • git checkout <name-of-branch>: checkout a branch
  • git checkout -b <name-of-new-branch>: create & checkout a branch simultaneously
  • git merge <name-of-branch-to-be-merged>: merge a branch into the branch you currently have checked out
  • git branch -d <name-of-branch>: deleting branches
  • git branch -v: seeing the last commit on each branch

Git Remotes

Git Remote

  • git remote add origin https://github.com/user/repo.git: connect a local git repository to a Github repository
    • generic version: git remote add <name-of-our-remote> <REMOTE_URL>
    • We can also add a remote on another git hosting service, such as Bitbucket.

Looking at our different remotes

  • git remote: print available remotes in the console
  • git ls-remote: Displays references available in a remote repository along with the associated commit IDs.
  • git remote -v: shows the URL of the remotes

Fetching from a remote

  • git fetch <remote-name>

Pushing changes to the remote

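  • git push -u <remote> <branch>: push a branch to the named remote; -u sets that remote branch as the upstream, so later git push and git pull need no arguments
  • git push -u origin master: for example, push the master branch to the remote called origin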

Inspecting Remotes

  • git remote show origin
  • git remote show

Renaming Remotes

  • git remote rename origin my-go-to-remote

Removing Remotes

  • git remote remove <remote-name>
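
The remote commands above can be tried safely with the sketch below. It assumes a bare repository on your own disk as a stand-in for a hosted repository, so the path used as the remote URL is hypothetical; with GitHub you would pass the repository's URL to git remote add.

```shell
set -e
workdir=$(mktemp -d)                     # throwaway folder for the demo
cd "$workdir"
git init -q --bare hosted_repo.git       # local stand-in for a hosted repository

git init -q local_repo
cd local_repo
git config user.name "Example User"      # placeholder identity, local to this repo
git config user.email "user@example.com"
echo "hello" > readme.txt
git add readme.txt
git commit -q -m "first commit"
branch_name=$(git symbolic-ref --short HEAD)   # "master" or "main"

git remote add origin "$workdir/hosted_repo.git"   # connect local repo to the remote
git remote -v                            # show remote names and URLs
git push -q -u origin "$branch_name"     # push and set the upstream branch
git ls-remote origin                     # references available on the remote
git remote show origin                   # detailed information about the remote

git remote rename origin my-go-to-remote
after_rename=$(git remote)
echo "remotes after rename: $after_rename"

git remote remove my-go-to-remote        # disconnect; the bare repo itself is untouched
```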