
Version control for academics

or

Git: from the kernel to reproducible research

Oskar Laverny @ YRD

Feb. 17, 2023


1 / 25

Version control?

Piled Higher and Deeper by Jorge Cham. Source: www.phdcomics.com
"Primitive" version controls are manual (and thus error-prone), but quite common:

1. When working on a project, you need some kind of history of what you did

2. Being able to roll back and forth between versions is nice

3. Collaboration is a mandatory feature

=> You need some kind of version control system.


2 / 25

Why version control

A Version Control System (VCS) is a system that keeps track of changes made to a collection of files and folders. Such systems usually also facilitate collaboration.

Data is tracked as a list of snapshots, which also contains metadata.

Why is it useful:

  • Permits collaborative development
  • Efficient for all project sizes
  • Automatically answers hard questions: who changed what, when, and why? Which change introduced a bug (bisection)?
  • Allows you to revert any change and go back to a previous state
  • Allows you to maintain several concurrent versions of the project

Remark: Printing + red pen + scanning is a primitive VCS. Mailing files around is a valid but very manual VCS.

But mailing is error-prone, and merging two versions is a mess...

3 / 25

Version control and reproducible research

As academics, our work usually outputs a lot of plain text files:

  • Main content as .tex files (definitions, theorems, proofs, etc.)
  • References as .bib files
  • Data as .csv files or other formats
  • Source code that analyzes data and produces graphs and tables
  • Compilation pipeline as a makefile

Academics usually value reproducibility of results. This requires that you provide others a way to... reproduce your results.

A good VCS facilitates reproducibility: it allows you and others to understand how the current result was made, which makes reproducing your results drastically easier.

4 / 25

Classical motivations

Good reasons to use version control:

  • Transparency of work done
  • Better organization of files and folders
  • Easy to track changes
  • Easy collaboration (no need to mail things around)
  • Ensures reproducibility of your work by anyone else
  • Huge potential for automation (discussed later)

More precisely:

  • An academic project is a set of changes made to a set of text files.
  • This is very close to a software project
  • Tools from software engineers are applicable to academic work.

=> You should consider your work as a code base.

5 / 25

Git: the de facto standard VCS

XKCD Comic
1. Git was created in 2005 by Linus Torvalds to manage the Linux kernel's code.

2. SODS 2021 (Stack Overflow Developer Survey): 94% of devs use git every day.

3. GitHub hosts over 200M repos from 73M users in 337 languages, holding humanity's code legacy.


Today's program:

I. Bottom-up introduction to Git (leaky abstraction)

II. Examples of application to academic work

6 / 25

I. Bottom-up introduction to Git

7 / 25

Git's data model

A git repository is a collection of files and folders in a directory, here called (root). A sample directory might look like this:

(root) (tree)
|
|- analysis (tree)
|  |
|  |- data.csv (blob)
|  |- script.jl (blob)
|
|- bibliography.bib (blob)
|- paper.tex (blob)
|- makefile (blob)

Folders are called trees, files are called blobs. The data structure is recursive: trees can contain other trees and blobs.

Snapshots in git are called commits. They have metadata attached to them:

  • Author
  • Description
  • Date
  • etc.
8 / 25

Git's history model

Git models history as a directed acyclic graph (DAG) of snapshots. Each snapshot encapsulates the whole set of files and folders, and carries a timestamp and other metadata. One piece of metadata is the parent snapshot, which is what links the snapshots into a graph.

Example of a linear history (each arrow points from a snapshot to its parent):

O <-- O <-- O

You can work on different things in parallel via branches:

O <-- O <-- O <-- O (you add a missing proof)
            ^
             \
              --- O <-- O (your coauthor fixes a notation)

Branches can then be merged:

O <-- O <-- O <-- O <---- O (both the proof and the notation fix are included)
            ^            /
             \          v
              --- O <-- O

Git also helps you resolve merge conflicts when both branches touch the same lines.

9 / 25

Pseudo-implementation

We can implement the git data model quite easily.

A file is just a bunch of bytes:

Blob = Vector{UInt8}

A directory contains named files and directories

Tree = Dict{String,Union{Tree,Blob}}

A commit has parents, metadata, and the top-level tree

struct Commit
    parent::Vector{Commit} # might be empty
    author::String
    message::String
    snapshot::Tree
end

This is a clean way to model history.
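
As an illustration only, the same data model can be sketched in Python (rather than the Julia-style pseudocode above) so that it runs with the standard library alone. The `history` helper, the author name and the sample file contents are hypothetical, added to show how parent links are followed:

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

Blob = bytes                              # a file is just a bunch of bytes
Tree = Dict[str, Union["Tree", Blob]]     # a directory maps names to sub-trees or blobs

@dataclass(frozen=True)
class Commit:
    parents: Tuple["Commit", ...]         # might be empty (first commit)
    author: str
    message: str
    snapshot: Tree                        # the top-level tree

def history(commit: Commit) -> list:
    """Collect the messages reachable from `commit` by following parent links."""
    seen, stack, messages = set(), [commit], []
    while stack:
        c = stack.pop()
        if id(c) in seen:                 # a commit can be reached through several paths
            continue
        seen.add(id(c))
        messages.append(c.message)
        stack.extend(c.parents)
    return messages

first = Commit((), "alice", "First commit", {"paper.tex": b"\\section{Intro}"})
second = Commit(
    (first,), "alice", "Add data",
    {"paper.tex": b"\\section{Intro}", "analysis": {"data.csv": b"x,y\n1,2\n"}},
)
print(history(second))   # ['Add data', 'First commit']
```

Each commit holds a full snapshot plus links to its parents, so walking the parent links is enough to recover the whole history.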

10 / 25

Objects and content-addressing

An object is a blob, tree or commit.

Object = Union{Blob,Tree,Commit}

Objects are content-addressed. What git maintains on disk is a store of objects indexed by name:

# The main git store:
objects = Dict{String, Object}() # we initialize it empty.

where the keys are the sha1-hashes of the objects:

function store!(objects, o)
    id = sha1(o)      # pseudocode: hash of the object's content
    objects[id] = o
    return id
end
function load(objects, id)
    return objects[id]
end

Note: everything is referenced by id and not copied around.
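
A minimal runnable sketch of such a content-addressed store, written in Python with `hashlib` rather than the Julia-style pseudocode above. It assumes objects are already raw bytes; real git also prepends a small header and compresses what it writes to disk, which is left out here:

```python
import hashlib

objects = {}  # the object store: SHA-1 hex digest -> raw bytes

def store(objects: dict, data: bytes) -> str:
    """Save `data` under the SHA-1 of its content and return that id."""
    oid = hashlib.sha1(data).hexdigest()
    objects[oid] = data
    return oid

def load(objects: dict, oid: str) -> bytes:
    """Fetch an object back by its id."""
    return objects[oid]

a = store(objects, b"x,y\n1,2\n")
b = store(objects, b"x,y\n1,2\n")   # same content: same id, stored only once
c = store(objects, b"x,y\n1,3\n")   # different content: different id

print(a == b, a == c, len(objects))  # True False 2
```

Because the key is derived from the content, storing the same file twice costs nothing, and an id can be passed around instead of the data itself.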

11 / 25

e156393cef8b62f30d58ca0eb37bf7a75221471e

A SHA-1 hash is a hexadecimal string of 40 characters (160 bits), such as e156393...71e.

  • SHA-1 is a cryptographic hash function: it is deterministic but chaotic (a tiny change in the input gives a completely different hash)
  • Gives a "unique" identifier to a commit, which is in "bijection" with its content.
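
To make this concrete, here is a short Python sketch of how git names a file's content: it hashes the word `blob`, the content length, a zero byte, and then the content itself. The helper name `git_blob_id` is made up for this example:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """SHA-1 of a git blob: header 'blob <size>' + NUL byte, followed by the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

empty = git_blob_id(b"")
print(empty)        # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(len(empty))   # 40

# Deterministic: same content, same id. Chaotic: one changed character, unrelated id.
print(git_blob_id(b"Theorem 1") == git_blob_id(b"Theorem 1"))   # True
print(git_blob_id(b"Theorem 1") == git_blob_id(b"Theorem 2"))   # False
```

The first value is the id git gives to any empty file, which can be checked with `git hash-object` on an empty file.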

Problem: Sha1 names are really inconvenient. Thus git maintains a set of references:

references = Dict{String,String}() # maps readable human names to sha1 hashes.

Ex: fix_bug_in_proof, master, HEAD, v0.1, submitted_version ...

We can now refer to things by name in the commit graph.

Remark: The actual commit graph is immutable, but references are mutable. In other words, you can move the fix_bug_in_proof reference, while you cannot change hashes of already-done commits.
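
The difference between immutable objects and mutable references can be sketched in a few lines of Python. This is a toy model, not git's actual storage; the `store` and `resolve` helpers and the sample contents are assumptions made for the example:

```python
import hashlib

objects = {}      # immutable part: SHA-1 -> content
references = {}   # mutable part: human-readable name -> SHA-1

def store(data: bytes) -> str:
    """Add `data` to the object store and return its SHA-1 id."""
    oid = hashlib.sha1(data).hexdigest()
    objects[oid] = data
    return oid

def resolve(name: str) -> bytes:
    """Follow a reference to the object it currently points to."""
    return objects[references[name]]

v1 = store(b"proof, first attempt")
v2 = store(b"proof, fixed")

references["fix_bug_in_proof"] = v1   # the name points to the first version
references["fix_bug_in_proof"] = v2   # moving the reference: no object is touched

print(resolve("fix_bug_in_proof"))    # b'proof, fixed'
print(objects[v1])                    # b'proof, first attempt' (still there, same hash)
```

Moving a name is a cheap dictionary update, while the old object stays reachable by its hash for as long as it is kept in the store.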

12 / 25

A few commands

All git commands make additions to the DAG and/or modifications to the references:

git commit
git checkout
git branch
git merge
---------------------------------------------------
O <-- O <-- O (HEAD -> master)
--------------------------------------------------- git commit
O <-- O <-- O <-- O (HEAD -> master)
--------------------------------------------------- git branch; git checkout; git commit x2
O <-- O <-- O <-- O (master)
            ^
             \
              --- O <-- O (HEAD -> bugfix)
--------------------------------------------------- git checkout; git merge
O <-- O <-- O <-- O <---- O (HEAD -> master)
            ^            /
             \          v
              --- O <-- O (bugfix)
---------------------------------------------------
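
The four commands can be mimicked by a toy Python class that only ever adds nodes to the DAG or moves references. `ToyRepo` and its methods are hypothetical and heavily simplified: commits carry no snapshot, and `merge` always creates a merge commit, whereas real git may fast-forward instead.

```python
import hashlib

class ToyRepo:
    def __init__(self):
        self.objects = {}              # commit id -> (parent ids, message)
        self.refs = {"master": None}   # branch name -> commit id (None: no commit yet)
        self.head = "master"           # name of the checked-out branch

    def commit(self, message, extra_parents=()):
        tip = self.refs[self.head]
        parents = tuple(p for p in (tip, *extra_parents) if p is not None)
        oid = hashlib.sha1(repr((parents, message)).encode()).hexdigest()
        self.objects[oid] = (parents, message)   # addition to the DAG
        self.refs[self.head] = oid               # the current branch moves forward
        return oid

    def branch(self, name):
        self.refs[name] = self.refs[self.head]   # a new reference, no new object

    def checkout(self, name):
        self.head = name                         # only HEAD changes

    def merge(self, other):
        # A merge is a commit with two parents: our tip and the other branch's tip.
        return self.commit(f"Merge {other}", extra_parents=(self.refs[other],))

repo = ToyRepo()
a = repo.commit("First commit")
repo.branch("bugfix")
repo.checkout("bugfix")
b = repo.commit("Fix notation")
repo.checkout("master")
c = repo.commit("Add missing proof")
m = repo.merge("bugfix")

print(repo.objects[m][0] == (c, b))   # True: the merge commit has two parents
print(repo.refs["bugfix"] == b)       # True: the bugfix reference did not move
```

Nothing already in `objects` is ever modified: every operation either adds a commit or updates an entry in `refs` or `head`.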
13 / 25

II. Examples of application to academic work

14 / 25

The templating problem (1/3)

Say we finished working on a paper.

lrnv@laptop paper $ git lg
* b9f1617 (HEAD -> master, tag: v1.0) Fix a typo Oskar Laverny 3 minutes ago (2023-01-12)
* 3cb7e6c Add proof of main Thm Oskar Laverny 4 minutes ago (2023-01-12)
* 88d916e Add definition of XX Oskar Laverny 4 minutes ago (2023-01-12)
* 147a080 First commit Oskar Laverny 5 minutes ago (2023-01-12)

The current version, tagged v1.0, is the one we push to arXiv and send for review. But the journal requires the version sent for review to follow specific notations, templating, etc. Use a branch:

lrnv@laptop paper $ git checkout -b templating
Switched to a new branch 'templating'

Now do stuff in that branch to comply with the requirements of the journal, and commit them.

15 / 25

The templating problem (2/3)

lrnv@laptop paper $ git lg
* d380354 (HEAD -> templating) Comply with journal's XX template Oskar Laverny 18 seconds ago (2023-01-12)
* b9f1617 (tag: v1.0, master) Fix a typo Oskar Laverny 8 minutes ago (2023-01-12)
* 3cb7e6c Add proof of main Thm Oskar Laverny 8 minutes ago (2023-01-12)
* 88d916e Add definition of XX Oskar Laverny 9 minutes ago (2023-01-12)
* 147a080 First commit Oskar Laverny 10 minutes ago (2023-01-12)

Now we send the version from the templating branch to the journal. While they review it, we find an error in the paper. We need to fix the mistake in all the versions, right?

* fde936d (HEAD -> master) Fix error about XXX Oskar Laverny 23 seconds ago (2023-01-12)
| * d380354 (templating) Comply with journal's XX template Oskar Laverny 4 minutes ago (2023-01-12)
|/
* b9f1617 (tag: v1.0) Fix a typo Oskar Laverny 11 minutes ago (2023-01-12)
* 3cb7e6c Add proof of main Thm Oskar Laverny 12 minutes ago (2023-01-12)
* 88d916e Add definition of XX Oskar Laverny 12 minutes ago (2023-01-12)
* 147a080 First commit Oskar Laverny 13 minutes ago (2023-01-12)
16 / 25

The templating problem (3/3)

lrnv@laptop paper $ git checkout templating && git merge master && git lg
* b9faeb0 (HEAD -> templating) Merge commit Oskar Laverny 1 seconds ago (2023-01-12)
|\
* | fde936d (master) Fix error about XXX Oskar Laverny 23 seconds ago (2023-01-12)
| * d380354 Comply with journal's XX template Oskar Laverny 4 minutes ago (2023-01-12)
|/
* b9f1617 (tag: v1.0) Fix a typo Oskar Laverny 11 minutes ago (2023-01-12)
* 3cb7e6c Add proof of main Thm Oskar Laverny 12 minutes ago (2023-01-12)
* 88d916e Add definition of XX Oskar Laverny 12 minutes ago (2023-01-12)
* 147a080 First commit Oskar Laverny 13 minutes ago (2023-01-12)

Now both versions contain the error fix. You can of course do the same thing with more versions...

Remember that references are mutable, but objects are not. The hashes of the existing commits are still exactly the same; only the references have moved, and a new merge commit was added to the DAG.

17 / 25

Github & Github Actions (1/2)

GitHub is a website that allows you to host git repositories and collaborate with others on the same projects. GitHub Actions are a great way to automate repetitive tasks around academic projects:

  • Run the analysis and re-render the paper at each commit.
  • Automatic testing and continuous integration of software.
  • Automatically compile stuff for you and your collaborators, or even reviewers, with fixed urls for outputs.

This type of continuous integration ensures constant reproducibility of your results. If you end up doing something repetitively, then you should probably automate it:

  • Less risk of error
  • No more forgetting a step (e.g. forgetting to update a graph after changing its input!)
18 / 25

Github Actions & latexdiff (2/2)

The git-latexdiff tool lets you compile a diffed version of a LaTeX document. This can be done online at each commit to track changes on the go:

XKCD Comic

These latexdiff outputs are also perfect to send back to reviewers, e.g. together with a commit log!

19 / 25

Using issues to discuss and collaborate.

If more than one author is involved, a GitHub repository to which everyone pushes modifications is usually a good thing. A GitHub repo comes with a discussion area:

  • Issues can be used to discuss potential modifications together. They can be linked in commit messages for a clearer history in retrospect.
  • Pull requests let you validate changes together before merging them.

Written, asynchronous collaboration is neat. What is written never gets lost, and everyone can catch up easily.

Known example : Textbook on informal homotopy type theory

  • 600-page book
  • written by more than 20 people
  • in 6 months.
  • More than 1000 issues and PRs

See Andrej Bauer's blog post about it.

20 / 25

Cost-free experimentation

Branching out is a neat way of experimenting, e.g. with a large refactoring of notations, without losing what you already have.

No risk of losing your work => Costless innovation.

"Branches are like soap: you should use them."

Example of automatic releases

I use GitHub Actions to automatically build my resume on each commit:

  • Add a ref in papers.bib or confs.bib or another one, commit & push.
  • Wait a little for it to compile cv.pdf and output at the fixed URL.

The link to output is fixed so that the person clicking on it will always get the last version.

21 / 25

Github Pages

Your personal website and/or your projects' websites can be compiled directly from source, online, and served as websites.

GitHub Pages allows you to publish statically generated websites such as blogs, documentation, applications of your packages, etc. It is a great way to communicate results that do not fit into the PDF format (such as datasets and other stuff) with a permanent link.

JOSS

JOSS is a journal that reviews and publishes code, entirely based on GitHub.

  • Get a deep review of your technical code
  • Don't write a paper about code you already wrote.
  • Value your work via DOI and citability (also see Zenodo)
22 / 25

Protection of research ideas.

You may add a licence to your project to ensure that the ideas developed in your paper are legally yours, including code and everything.

The online version of the history proves that you wrote something at a certain time. If you choose to make the repo public before the article is published, you can point to this public history to argue that the ideas were yours first.

Post-publication peer-review

Once your article is peer-reviewed, if it lives in a public GitHub repository (and it should), you are open to post-publication peer-review:

  • Anyone can come by and take a look, post a comment and propose modifications.
  • A new preprint can be produced if the modifications are important.

See there for a deeper discussion.

23 / 25

Conclusion

24 / 25

Doggy-bag

The key messages today are:

  • You probably already use some kind of version control.
  • Learning an efficient VCS such as git is probably worth your time.
  • Asynchronous written collaboration is a very powerful collaboration scheme for writing software, but also for research!

Thanks!

These slides are at https://lrnv.github.io/yrd2023/. A few more resources:

25 / 25
