Today we will discuss Version Control, Workflow and Reproducibility in Data Science. In practice, this means learning how to incorporate Git and GitHub into our day-to-day data science workflow.
These are the main concepts we will learn:
Git, and operating it from the command line
GitHub
I will cover some of the concepts here. However, I expect you to do the weekly readings to fully grasp the concepts I describe here and cover during the lecture.
Note that the purpose of this notebook is to serve as a cheatsheet/reference for future use, though there are better ones out there than what I've outlined here. I've listed a few below for your use.
Your data science workflow will generally involve:
Some key challenges will emerge as you evolve in your career as a data scientist:
Your projects will grow.
Data Science is a collaborative field.
You will be constantly checking your code and others' code.
Projects will come and go.
These challenges will appear across different projects, but, most importantly, you are very likely to repeat these tasks for the different iterations of the same project. As you do this, you need to make sure your results are reproducible.
Reproducibility is fundamental to the scientific method. When you publish an article, data report, or data analysis, you need to make sure that others (or even you in the future) can easily retrace your actions on the data (cleaning, measurement choices, models) and the presentation of the results, all in a reproducible way.
Here are some suggestions for you to overcome these challenges with minimal headaches:
Documentation
Use # comments to describe every single step of your code
Readability
Naming: use an underscore (_) style for your files and code
Portability
Use Self-Contained Projects
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. It keeps a complete record of changes to a single file, a folder of material, or a whole project.
Version control avoids: file1.doc, file1_mod.doc, file1_modagain.doc, file1_mod_thelast.doc, etc.
Makes it easy to handle collaboration with co-workers (through GitHub)
Avoids version numbers in file names
Allows you to “rewind the tape” to earlier incarnations of your notes, drafts, papers and code
It is like Microsoft Word track changes, but for your entire project
Version control is the most complex and most important of the best practices described here for reproducibility in data science. It requires us to learn a tool to perform version control. We will use Git to do so.
Git: one of many options for version control in data science. You hardly hear about other options, though, as Git has become the standard for version control.
GitHub: a public remote host for Git repositories. It works as a place where software developers and social scientists make their work available, and where you can contribute to ongoing projects or make your own public.
In a nutshell, you will use Git locally, and share and collaborate with others using GitHub.
Git has three main states that your files can reside in: modified, staged, and committed.
The fundamental Git process can be paraphrased as follows:
You alter files within your working directory.
You deliberately choose and stage only the modifications that you plan to include in your upcoming commit, transferring these changes to the staging area.
You make a commit, which captures the current state of the files in the staging area and permanently records that version into your Git directory.
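The three steps above can be sketched in the terminal as follows. This is only a minimal illustration, assuming git is installed; the file name notes.txt and the "Demo User" identity are made up for the example, and everything happens in a throwaway directory.

```shell
# Work in a throwaway directory so nothing on your machine is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
# Repository-local identity (hypothetical values), needed before committing.
git config user.name "Demo User"
git config user.email "demo@example.com"

# 1. Alter a file in the working directory.
echo "first draft" > notes.txt
state_modified="$(git status --short)"    # the file is new and untracked

# 2. Stage the change you want in the upcoming commit.
git add notes.txt
state_staged="$(git status --short)"      # the file is now in the staging area

# 3. Commit: record the staged snapshot in the Git directory.
git commit -q -m "add notes"
state_committed="$(git status --short)"   # nothing left to commit

printf 'working directory: %s\nstaging area: %s\nafter commit: %s\n' \
  "$state_modified" "$state_staged" "$state_committed"
```

Running git status between the steps is a cheap way to see which state each file is in before you commit.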
Set up your identity
git config --global user.name "myname": set your user name
git config --global user.email your-email@georgetown.edu: set your email account
Starting or getting a git repository on your machine
git init: start a new repository from a working directory
git clone <url or location to repository>: clone an existing repository
git status: get the current status of the repository
Staging any changes you made
git add <file>: stage a file to be committed
git add .: stage all files to be committed
git reset HEAD <file>: un-stage a file that was staged to be committed
Saving the staged changes
git commit -m "some message": commit staged changes to the repository
git commit: commit staged changes to the repository (you will be prompted to leave a message)
Getting the current state from the remote (GitHub) or sending changes to it
git fetch: download recent changes in the remote repository (but do not merge them with your local version)
git pull: download recent changes in the remote repository and merge them with your local version
git push: push commits to the remote (e.g. a GitHub repository)
Getting Help
git help <verb>
man git-<verb>
The code below covers what we will do in class. It shows you the basics of using git for local version control.
Open your command line.
Check your home directory or where you are when you open your terminal
pwd: writes to standard output the full path name of your current directory
mkdir gitwalkthrough: creates a new folder
Change working directory
cd gitwalkthrough: change working directory
Let's check whether you have a git repository here (you know you do not, because you just created this folder)
git status: get the current status of the repository
fatal: not a git repository (or any of the parent directories): .git
Create your first git repository. This command creates local version control for this project. This is different from creating a repository on a server, such as GitHub.
git init: start a new repository from a working directory
If you want to see what is going on under the hood, this command creates a hidden directory .git in your folder. This directory tracks your version control.
Now that we have a git repo created, let's make our first commit.
First, let's create some random file using the command line. I will use vim as a text editor here.
vim test1.txt: open and edit the file
Remember, git has three basic states: untracked, tracked and staged files. To move between these states, we use the add command.
git status: will tell us we have a new untracked file
git add test1.txt: start tracking/stage the file test1.txt
git status: see that we have tracked and staged our file
Let's create a new file and modify our test1.txt
vim test1.txt: modify as you wish
vim test2.txt: create a new file
git status: check the output now; you have a new untracked file, and also unstaged changes in the old file
git add test1.txt test2.txt: track and stage all
With all our changes made, we are ready to make the first commit. Or, in other words, take the first snapshot of our directory.
git commit -m 'first commit': make your first commit
git log: see the history of your commits
Let's see here how we can use git to move back and forth between different snapshots of our project.
First, create a new test3.txt file
vim test3.txt
git add .: track and stage
git commit -m 'second commit': second commit
Check the log and move back
git log
git checkout <hash>: move to a past snapshot
Where is file 3?
ls: no file test3.txt. We moved back to our first commit. Oh, I didn't want to miss file 3! No problemo!
git log --reflog: gets the reference log
git checkout <hash>
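Here is a non-interactive sketch of the same back-and-forth, assuming git is installed. Instead of editing with vim it writes the files with echo, and instead of copying a hash from the log it stores the first commit's hash in a variable (first_hash, a name made up for this example). It returns to the latest snapshot by checking out the branch name rather than a hash, which also re-attaches you to the branch.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

echo "one" > test1.txt
git add test1.txt
git commit -q -m "first commit"
first_hash="$(git rev-parse HEAD)"        # remember the first snapshot

echo "three" > test3.txt
git add .
git commit -q -m "second commit"

# Move to the first snapshot: test3.txt disappears from the working directory.
git checkout -q "$first_hash"
files_in_past="$(ls)"

# Move back to the tip of master: test3.txt is back.
git checkout -q master
files_now="$(ls)"

echo "past snapshot: $files_in_past"
echo "latest snapshot:"
echo "$files_now"
```

Nothing was lost while we were looking at the first commit; the second commit stayed in the repository the whole time.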
Branching allows us to work on different paths in git. It is very useful for two purposes:
Experimenting with code
Collaborating with colleagues
An interesting way to visualize what branching means is by using the Visualize Git tool.
Below you can see the basic commands to create branches and merge in git.
Create a new branch
git checkout -b <branch-name>
Write code or create new files
vim test4.txt
Stage and commit
git add .
git commit -m "hello from alternative world"
Check status across different branches
ls: test4.txt should be here
git checkout master: move back to the master branch
ls: no test4.txt
Then we can merge our branches. Here we are doing a fast-forward merge, moving our master to keep up with the alternative branch
git merge <branch-name>
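The branching steps above could look like this when run end to end. It is a sketch under a few assumptions: git is installed, the new branch is called alternative (a name chosen for the example), and --ff-only is added so the merge fails instead of creating a merge commit if a fast-forward is not possible.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

echo "base" > test1.txt
git add .
git commit -q -m "first commit"

# Create a branch, add a file there, and commit.
git checkout -q -b alternative
echo "hello" > test4.txt
git add .
git commit -q -m "hello from alternative world"

# Back on master the new file is not there yet.
git checkout -q master
before_merge="$(ls)"

# Fast-forward master so it catches up with the alternative branch.
git merge -q --ff-only alternative
after_merge="$(ls)"

echo "before merge: $before_merge"
echo "after merge:"
echo "$after_merge"
```

After the merge, master and alternative point to the same commit, which is exactly what a fast-forward means.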
In general, at this stage, you would use git push to send your local branch to your remote repository (usually on GitHub), and git pull to bring the remote changes into your local branch. You can think of git push and git pull as merging steps between your local branch and the remote master branch.
Most times, you will use git integrated with GitHub. This is because git is most useful when it allows multiple researchers to write and share code at the same time.
This is the workflow we will go through in class. It is usually how I use git in my day-to-day work. This is also inspired by a tutorial we developed at the Center for Social Media and Politics (CSMaP), where I worked as a Postdoctoral Researcher.
Before you write any code:
Go to your GitHub account and create a new repository
Open your terminal, and clone the project
git clone <url>
Move your working directory to this new folder
cd <project-directory>
Write code!
Track your changes: git add .
Commit your changes: git commit -m 'describe your commit'
Push the changes in your local repository to GitHub:
git push -u origin [branch-name]
If this is a branch-protected repo, you need to open a pull request (PR).
Say you already have the directory cloned, and you have colleagues working on the code. Now you want to work on your part of the project. Then you follow pretty much the same sequence as above, but instead of cloning as your first step, you do a pull from the git repo.
cd <gitrepo>
git pull
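To try this workflow without touching GitHub, one option is to let a local bare repository play the role of the remote. The sketch below assumes git is installed; remote.git, me and colleague are made-up directory names, and in real use the clone URL would point to github.com instead of a local path.

```shell
work="$(mktemp -d)"
cd "$work"

# A bare repository stands in for the GitHub repository (assumption).
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# You: clone, write code, add, commit, push.
git clone -q remote.git me 2>/dev/null    # git warns that the repository is empty
cd me
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"
echo 'print("hello")' > analysis.py
git add .
git commit -q -m "add analysis script"
git push -q -u origin master
cd ..

# A colleague clones the same remote, changes the file, and pushes.
git clone -q remote.git colleague
cd colleague
git config user.name "Demo Colleague"     # hypothetical identity
git config user.email "colleague@example.com"
echo 'print("more")' >> analysis.py
git add .
git commit -q -m "extend analysis script"
git push -q origin master
cd ..

# Back in your copy: pull before you start working again.
cd me
git pull -q --ff-only
last_line="$(tail -n 1 analysis.py)"
echo "$last_line"
```

The final pull brings your colleague's commit into your local copy, so the last line of analysis.py is the one they added.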
When merging across different branches, sometimes there are conflicts between branches or remote versions of a repository. This can happen when you experiment using different branches, but it is especially common when collaborating on large-scale projects. Say you changed some part of a file by deleting a function and a colleague changed the same file by modifying that function. This would be an example of a conflict.
Git does not know which version is the correct one, so it will mark the file as having a conflict using a special delimiter.
<<<<<<< HEAD
ADD EXAMPLE FROM class
=======
ADD EXAMPLE FROM CLASS
>>>>>>> new-branch
What do these delimiters mean?
The top half is from the branch you are merging into
The bottom half is from the commit that you are trying to merge in
How to solve a conflict?
Open your text editor and navigate to the file that has merge conflicts.
Resolve the conflict (which may mean incorporating changes from both branches): delete the conflict markers <<<<<<<, =======, >>>>>>> and make the changes you want in the final merge.
Stage your changes (git add)
Commit your changes (git commit)
Let's work through an example:
git checkout -b "new": create a branch called new and check it out directly
vim test1.txt: make some modification
git add test1.txt
git commit -m "new file 1": commit your changes
git checkout "master": move back to master
vim test1.txt: make some modification and see that the old modification is not here
git add test1.txt
git commit -m "new file 1": from master, commit your changes
Because we are on the master branch, we just need to merge the new branch into master
git merge new: this merges new into master. If we wanted to do the opposite, we would check out new and use git merge master
We got an error: each branch has a different change to the same file
See the error:
vim test1.txt: see that the file is now a combination of the different versions from each branch
Fix the issue in your text editor
git add test1.txt
git commit -m "fixed conflict": commit your changes
git log: to see your merge complete
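The example above can also be reproduced without an editor. The sketch below assumes git is installed and uses echo in place of vim; the file contents are made up, and the "resolution" simply overwrites the file with the text we decide to keep.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

echo "original line" > test1.txt
git add test1.txt
git commit -q -m "first commit"

# Change the file on the new branch.
git checkout -q -b new
echo "change made on new" > test1.txt
git add test1.txt
git commit -q -m "edit on new"

# Change the same line differently on master.
git checkout -q master
echo "change made on master" > test1.txt
git add test1.txt
git commit -q -m "edit on master"

# The merge stops with a conflict, so git merge exits with a non-zero status.
if git merge new > /dev/null 2>&1; then
  merge_result="clean"
else
  merge_result="conflict"
fi
conflicted="$(cat test1.txt)"             # both versions, between conflict markers
echo "$conflicted"

# Resolve: write the content you want to keep (no markers), stage, commit.
echo "change made on master and on new" > test1.txt
git add test1.txt
git commit -q -m "fixed conflict"
git log --oneline
```

Printing the file right after the failed merge shows the HEAD (master) version on top and the version from new below the ======= line.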
Sometimes we do not want to track certain file types.
For example, GitHub blocks files larger than 100 MB, meaning that we wouldn't want to push really big data sources up to the repository. We might want to avoid uploading any data files to our GitHub repository for this reason. To do this, we may want to ignore specific file types, such as .csv (comma separated values) or .Rdata (an R data file type). We need to make a special file that Git reads to tell it which files not to track.
We can exclude these files by adding a .gitignore file to our project folder.
*.ipynb_checkpoints
*.Rdata
*.csv
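A quick way to check that the patterns above do what we expect is to create a few files and ask git what it sees. This is a sketch assuming git is installed; analysis.py, survey.csv and model.Rdata are hypothetical file names.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# Same patterns as in the .gitignore shown above.
printf '%s\n' '*.ipynb_checkpoints' '*.Rdata' '*.csv' > .gitignore

# Hypothetical project files: one script and two data files.
touch analysis.py survey.csv model.Rdata

# Only files that git is willing to track are listed as untracked.
untracked="$(git status --short)"
echo "$untracked"

# git check-ignore -v reports which pattern is responsible for ignoring a file.
ignored_by="$(git check-ignore -v survey.csv)"
echo "$ignored_by"
```

The data files never show up in git status, so git add . will not stage them by accident.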
Accessing the logs → who did what to which file and when?
git log: look at the commit history
--oneline: view a condensed summary
--all: view the history of all branches, not just the current one
--graph: view a text graph of the commit sequence
--stat: abbreviated stats for each commit
--since=2.weeks: review commits within some temporal range
git log --pretty=format:"%h - %an, %ar : %s"
Tracking Differences
git diff: explore the differences between versions of your files
git diff 44d14b2 2adbea3
git whatchanged
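As a small illustration of the log and diff commands above, the sketch below builds a two-commit history and inspects it. It assumes git is installed; the hashes are read into variables (first, second) instead of being typed by hand, and the identity is made up.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

echo "alpha" > test1.txt
git add test1.txt
git commit -q -m "first commit"
first="$(git rev-parse --short HEAD)"

echo "beta" >> test1.txt
git add test1.txt
git commit -q -m "second commit"
second="$(git rev-parse --short HEAD)"

# Condensed history: one line per commit, newest first.
git log --oneline

# Custom format: short hash, author name, commit message.
history="$(git log --pretty=format:'%h - %an : %s')"
echo "$history"

# Differences between the two snapshots.
changes="$(git diff "$first" "$second")"
echo "$changes"
```

In the diff, lines starting with + were added between the first and the second commit.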
Tracking Movement: If we were to just rename or move a file, git doesn't necessarily know that it was already tracking that file.
git mv old-file-location new-file-location: move files around so that the git history is retained
git mv old-file-name new-file-name: rename files so that the git history is retained
Time Traveling
git checkout <commit-hash>: move to prior snapshots of the project
git revert <commit-hash>: create a new commit that undoes the changes introduced by that commit
Branching: A branch in git is a lightweight, movable pointer to a commit. The default branch is named "master".
git branch <name-of-new-branch>: create a new branch
git checkout <name-of-branch>: check out a branch
git checkout -b <name-of-new-branch>: create and check out a branch simultaneously
git merge <name-of-branch-to-be-merged>: merge a branch into the branch you currently have checked out (check out <name-of-main-branch> first)
git branch -d <name-of-branch>: delete a branch
git branch -v: see the last commit on each branch
Git Remote
git remote add origin https://github.com/user/repo.git: connect a local git repository to a GitHub repository
git remote add <name-of-our-remote> <REMOTE_URL>
Looking at our different remotes
git remote: print available remotes in the console
git ls-remote: display references available in a remote repository along with the associated commit IDs
git remote -v: show the URLs of the remotes
Fetching from a remote
git fetch <remote-name>
Pushing changes to the remote
git push -u <remote> <branch>: push <branch> to <remote>; -u also sets it as the upstream, so later pushes and pulls know which remote to use
git push -u origin master: for example, push master to the remote named origin
Inspecting Remotes
git remote show origin
git remote show
Renaming Remotes
git remote rename origin my-go-to-remote
Removing Remotes
git remote remove <remote-name>
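The remote commands above can be tried safely against a local bare repository instead of GitHub. This is a sketch assuming git is installed; remote.git and project are made-up directory names, and the local path stands in for a URL such as the one used with git remote add above.

```shell
work="$(mktemp -d)"
cd "$work"

# A bare repository stands in for the GitHub repository (assumption).
git init -q --bare remote.git
git init -q project
cd project

# Connect the local repository to the remote and list it.
git remote add origin "$work/remote.git"
git remote -v
remotes_before="$(git remote)"

# Rename the remote, then check the new name.
git remote rename origin my-go-to-remote
remotes_renamed="$(git remote)"

# Remove the remote; the list is empty again.
git remote remove my-go-to-remote
remotes_after="$(git remote)"

echo "before: $remotes_before, renamed: $remotes_renamed, after: [$remotes_after]"
```

Renaming or removing a remote only changes your local configuration; the repository on the other end is left untouched.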