Feb. 17, 2023
A Version Control System (VCS) is a system that keeps track of changes made to a collection of files and folders. Such systems usually also facilitate collaboration.
The data is tracked as a sequence of snapshots, each of which also carries metadata.
Why is it useful?
Remark: printing + red pen + scanning is a primitive VCS. Mailing files back and forth is a valid but very manual VCS.
But mailing is error-prone, and merging two versions is a mess...
As academics, our work usually outputs a lot of plain text files:
Academics usually value reproducibility of results. This requires that you provide others a way to... reproduce your results.
A good VCS facilitates reproducibility: it lets you and others understand how the current result was obtained, which makes reproducing it much easier.
Good reasons to use version control:
More precisely:
=> You should consider your work as a code base.
A git repository is a collection of files and folders in a directory, here called (root). A sample directory might look like this:
(root)                  (tree)
|
|- analysis             (tree)
|  |
|  |  data.csv          (blob)
|  |  script.jl         (blob)
|
|- bibliograpy.bib      (blob)
|- paper.tex            (blob)
|- makefile             (blob)
|
Folders are called trees, files are called blobs. The data structure is recursive: trees can contain other trees and blobs.
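To see trees and blobs in an actual repository, here is a small shell sketch. It assumes git is installed, works in a throwaway directory, and uses made-up file contents and a hypothetical author identity; only the git commands themselves are standard.

```shell
set -eu
cd "$(mktemp -d)"                              # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/master        # name the first branch explicitly
git config user.name "Example Author"          # hypothetical identity
git config user.email "author@example.org"

mkdir analysis
printf 'x,y\n1,2\n' > analysis/data.csv        # illustrative content
echo "draft" > paper.tex
git add .
git commit -q -m "First commit"

# Entries of the top-level tree: mode, type, id, name
git ls-tree HEAD

# Ask git for the type of a folder and of a file inside it
sub_type=$(git cat-file -t HEAD:analysis)
file_type=$(git cat-file -t HEAD:analysis/data.csv)
echo "analysis is a $sub_type, analysis/data.csv is a $file_type"
```

The `HEAD:path` notation names the object stored at that path in the current snapshot, so the same command works at any depth of the recursion.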
Snapshots in git are called commits. They have metadata attached to them:
Git models history as a directed acyclic graph (DAG). Each snapshot captures the whole set of files and folders, is timestamped, and carries further metadata. One piece of metadata is the parent snapshot.
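The metadata of a commit can be inspected directly. The sketch below (throwaway repository, hypothetical author and file contents, git assumed to be installed) makes two commits and prints the second one, which lists its top-level tree, its parent, the author with a timestamp, and the message.

```shell
set -eu
cd "$(mktemp -d)"                              # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"          # hypothetical identity
git config user.email "author@example.org"

echo "v1" > paper.tex
git add paper.tex
git commit -q -m "First commit"
first=$(git rev-parse HEAD)                    # id of the first snapshot

echo "v2" > paper.tex
git commit -q -am "Add definition"

# Raw content of the latest commit object: tree, parent, author, committer, message
git cat-file -p HEAD

parent=$(git rev-parse HEAD^)                  # HEAD^ means "the parent of HEAD"
echo "parent of HEAD: $parent"
```

The parent line is what turns a set of independent snapshots into a graph: each commit points back to the one it was built on.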
Example of a linear history:
O <-- O <-- O
You can work on different things in parallel via branches:
O <-- O <-- O <-- O             (you add a missing proof)
            ^
             \
              --- O <-- O       (your coauthor fixes a notation)
Branches can then be merged:
O <-- O <-- O <-- O <---- O     (both the proof and the notation fix are included)
            ^            /
             \          v
              --- O <-- O
Git also facilitates the resolution of merge conflicts.
We can implement the git data model quite easily.
A file is just a bunch of bytes:
Blob = Vector{UInt8}
A directory contains named files and directories
struct Tree # a struct, since a Julia type alias cannot refer to itself
    entries::Dict{String,Union{Tree,Blob}}
end
A commit has parents, metadata, and the top-level tree
struct Commit
    parent::Vector{Commit} # might be empty
    author::String
    message::String
    snapshot::Tree
end
This is a clean way to model history.
An object is a blob, tree or commit.
Object = Union{Blob,Tree,Commit}
Objects are content-addressed. What git maintains on disk is a store of objects with names:
# The main git store:
objects = Dict{String, Object}() # we initialize it empty.
where the keys are the SHA-1 hashes of the objects:
function store!(objects, o)
    id = sha1(o) # schematic: stands for the SHA-1 hash of a serialization of o
    objects[id] = o
end
function load(objects, id)
    return objects[id]
end
Note: everything is referenced by id and not copied around.
A SHA-1 hash is a hexadecimal string of 40 characters (160 bits), such as e156393...71e.
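Content addressing can be checked by hand. For a blob, git hashes a short header (`blob`, a space, the size in bytes, a NUL byte) followed by the content. The sketch below recomputes such an id with `sha1sum` and compares it with what `git hash-object` reports. It assumes git and GNU coreutils' `sha1sum` are available; the sample content is arbitrary.

```shell
set -eu
content='hello world'                          # arbitrary sample content

# Size in bytes of the content, including the trailing newline
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')

# Hash "blob <size>\0<content>" ourselves...
manual=$(printf 'blob %s\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)

# ...and ask git for the id of the same content (nothing is written anywhere)
bygit=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "by hand: $manual"
echo "by git:  $bygit"
```

Since the name is computed from the content alone, two identical files share one id and are stored once, whatever they are called in the tree.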
Problem: SHA-1 names are really inconvenient. Thus git maintains a set of references:
references = Dict{String,String}() # maps human-readable names to SHA-1 hashes.
Ex: fix_bug_in_proof, master, HEAD, v0.1, submitted_version, ...
We can now refer to things by name in the commit graph.
Remark: The actual commit graph is immutable, but references are mutable. In other words, you can move the fix_bug_in_proof reference, while you cannot change the hashes of existing commits.
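The following sketch illustrates the remark on a throwaway repository (hypothetical author and file contents, git assumed to be installed): a branch name is created on one commit, then moved to another, and the commit ids are compared before and after.

```shell
set -eu
cd "$(mktemp -d)"                              # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"          # hypothetical identity
git config user.email "author@example.org"

echo "v1" > paper.tex
git add paper.tex
git commit -q -m "First commit"
first=$(git rev-parse HEAD)

echo "v2" > paper.tex
git commit -q -am "Second commit"
second=$(git rev-parse HEAD)

git branch fix_bug_in_proof "$first"           # a readable name for the first commit
before=$(git rev-parse fix_bug_in_proof)
git branch -f fix_bug_in_proof "$second"       # move the name to the second commit
after=$(git rev-parse fix_bug_in_proof)

echo "before: $before"
echo "after:  $after"
git symbolic-ref HEAD                          # HEAD is itself a reference, here to a branch
```

Only the mapping from name to hash changed; both commits are still in the store under the ids they had when they were created.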
All git commands make additions to the DAG and/or modifications to the references:
git commit
git checkout
git branch
git merge
---------------------------------------------------
O <-- O <-- O (HEAD -> master)
--------------------------------------------------- git commit
O <-- O <-- O <-- O (HEAD -> master)
--------------------------------------------------- git branch; git checkout; git commit x2
O <-- O <-- O <-- O (master)
            ^
             \
              --- O <-- O (HEAD -> bugfix)
--------------------------------------------------- git checkout; git merge
O <-- O <-- O <-- O <---- O (HEAD -> master)
            ^            /
             \          v
              --- O <-- O (bugfix)
---------------------------------------------------
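A sequence in the spirit of the diagram above can be replayed in a throwaway repository. This is only a sketch: the file names, commit messages and author identity are invented, and the two branches touch different files so that the merge needs no manual conflict resolution.

```shell
set -eu
cd "$(mktemp -d)"                              # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"          # hypothetical identity
git config user.email "author@example.org"

echo "intro" > paper.tex
git add paper.tex
git commit -q -m "First commit"

git branch bugfix                              # new reference on the current commit
git checkout -q bugfix                         # HEAD -> bugfix
echo "fix" > notation.tex
git add notation.tex
git commit -q -m "Fix a notation"

git checkout -q master                         # HEAD -> master
echo "proof" > proof.tex
git add proof.tex
git commit -q -m "Add a missing proof"

git merge -q -m "Merge bugfix" bugfix          # adds one commit with two parents
git log --graph --oneline

nparents=$(git cat-file -p HEAD | grep -c '^parent ')
echo "parents of the merge commit: $nparents"
```

Every step either added a node to the graph (`commit`, `merge`) or created or moved a reference (`branch`, `checkout`), which is all these commands do to the data model.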
Say we finished working on a paper.
lrnv@laptop paper $ git lg
* b9f1617 (HEAD -> master, tag: v1.0) Fix a typo Oskar Laverny 3 minutes ago (2023-01-12)
* 3cb7e6c Add proof of main Thm Oskar Laverny 4 minutes ago (2023-01-12)
* 88d916e Add definition of XX Oskar Laverny 4 minutes ago (2023-01-12)
* 147a080 First commit Oskar Laverny 5 minutes ago (2023-01-12)
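Note that `lg` is not a built-in git command, so the session above presumably relies on a user-defined alias. The exact alias used here is not given; the sketch below defines one that produces similar-looking lines, as an assumption. It is set in a throwaway repository; adding `--global` to the `git config` call would make it available everywhere.

```shell
set -eu
cd "$(mktemp -d)"                              # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"          # hypothetical identity
git config user.email "author@example.org"

# Hypothetical alias: graph, short hash, references, subject, author, relative and short date
git config alias.lg "log --graph --date=short --pretty=format:'%h%d %s %an %cr (%cd)'"

echo "v1" > paper.tex
git add paper.tex
git commit -q -m "First commit"

out=$(git lg)
echo "$out"
```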
The current version, tagged v1.0, will be the version we push to arXiv and send to review. But the journal requires that the version sent to review uses specific notations, a specific template, etc. Use a branch:
lrnv@laptop paper $ git checkout -b templating
Switched to a new branch 'templating'
Now make the changes required by the journal in that branch, and commit them.
lrnv@laptop paper $ git lg
* d380354 (HEAD -> templating) Comply with journal's XX template Oskar Laverny 18 seconds ago (2023-01-12)
* b9f1617 (tag: v1.0, master) Fix a typo Oskar Laverny 8 minutes ago (2023-01-12)
* 3cb7e6c Add proof of main Thm Oskar Laverny 8 minutes ago (2023-01-12)
* 88d916e Add definition of XX Oskar Laverny 9 minutes ago (2023-01-12)
* 147a080 First commit Oskar Laverny 10 minutes ago (2023-01-12)
Now we send the version from the templating branch to the journal. While they review it, we find an error in the paper. We need to fix the mistake in all the versions, right?
We fix it once, on master:
* fde936d (HEAD -> master) Fix error about XXX Oskar Laverny 23 seconds ago (2023-01-12)
| * d380354 (templating) Comply with journal's XX template Oskar Laverny 4 minutes ago (2023-01-12)
|/
* b9f1617 (tag: v1.0) Fix a typo Oskar Laverny 11 minutes ago (2023-01-12)
* 3cb7e6c Add proof of main Thm Oskar Laverny 12 minutes ago (2023-01-12)
* 88d916e Add definition of XX Oskar Laverny 12 minutes ago (2023-01-12)
* 147a080 First commit Oskar Laverny 13 minutes ago (2023-01-12)
lrnv@laptop paper $ git checkout templating && git merge master && git lg
* b9faeb0 (HEAD -> templating) Merge commit Oskar Laverny 1 seconds ago (2023-01-12)
|\
* | fde936d (master) Fix error about XXX Oskar Laverny 23 seconds ago (2023-01-12)
| * d380354 Comply with journal's XX template Oskar Laverny 4 minutes ago (2023-01-12)
|/
* b9f1617 (tag: v1.0) Fix a typo Oskar Laverny 11 minutes ago (2023-01-12)
* 3cb7e6c Add proof of main Thm Oskar Laverny 12 minutes ago (2023-01-12)
* 88d916e Add definition of XX Oskar Laverny 12 minutes ago (2023-01-12)
* 147a080 First commit Oskar Laverny 13 minutes ago (2023-01-12)
Now both versions contain the error fix. You can of course do the same thing with more versions...
Recall that references are mutable, but objects are not. The hash of each existing commit is still exactly the same, but references have moved and the DAG has grown.
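The whole scenario can be replayed in a throwaway repository to check this claim. The sketch below uses invented file contents and a hypothetical author; it records the commit ids before the merge and compares them with the parents of the merge commit afterwards.

```shell
set -eu
cd "$(mktemp -d)"                              # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"          # hypothetical identity
git config user.email "author@example.org"

echo "main theorem" > paper.tex
git add paper.tex
git commit -q -m "First commit"
git tag v1.0                                   # the version sent to arXiv

git checkout -q -b templating                  # journal-specific version
echo "journal class" > template.tex
git add template.tex
git commit -q -m "Comply with journal's template"
templ=$(git rev-parse templating)

git checkout -q master                         # fix the error once, on master
echo "main theorem, corrected" > paper.tex
git commit -q -am "Fix error"
fix=$(git rev-parse master)

git checkout -q templating                     # bring the fix into the other version
git merge -q -m "Merge master into templating" master

echo "first parent:  $(git rev-parse HEAD^1)"  # the old tip of templating
echo "second parent: $(git rev-parse HEAD^2)"  # the fix made on master
cat paper.tex                                  # the templated version now has the fix
```

The merge added one new commit and moved `templating` onto it; `master`, `v1.0` and every earlier commit kept the ids they had before.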
GitHub is a website that allows you to host git repositories and collaborate with others on the same projects. GitHub Actions are a great way to automate some repetitive tasks around academic projects:
This type of continuous integration ensures constant reproducibility of your results. If you end up doing something repetitively, then you should probably automate it:
The git-latexdiff tool lets you compile a diffed version of the LaTeX document. This can be done online directly at each commit, to track changes on the go:
These latexdiff outputs are also perfect to send back to reviewers, e.g. together with a commit log!
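A commit log covering only the changes made since the submitted version can be produced with plain git, provided that version was tagged. The sketch below assumes a tag named `v1.0` marks the submission, as in the earlier example; the repository, file contents and author are invented for illustration.

```shell
set -eu
cd "$(mktemp -d)"                              # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"          # hypothetical identity
git config user.email "author@example.org"

echo "submitted text" > paper.tex
git add paper.tex
git commit -q -m "Submitted version"
git tag v1.0                                   # assumed tag of the submitted version

echo "clarified proof" >> paper.tex
git commit -q -am "Clarify proof of main Thm (reviewer 1)"
echo "new reference" >> paper.tex
git commit -q -am "Add missing reference (reviewer 2)"

git log --oneline v1.0..HEAD                   # commits made since the submission
git diff --stat v1.0 HEAD                      # which files changed, and by how much

count=$(git rev-list --count v1.0..HEAD)
echo "$count commits since v1.0"
```

The range `v1.0..HEAD` selects the commits reachable from the current version but not from the tagged one, which is the list a reviewer would want to see.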
If more than one author is involved, then a GitHub repository to which everyone pushes their modifications is usually a good thing. A GitHub repo comes with a discussion area:
Written, asynchronous collaboration is neat. What is written never gets lost, and everyone can catch up easily.
Known example: Textbook on informal homotopy type theory
See Andrej Bauer's blog post about it.
Branching out is a neat way of experimenting, e.g. with a large refactoring of notations, without losing what you already have.
No risk of losing your work => Costless innovation.
"Branches are like soap: you should use them."
I use GitHub Actions to automatically build my resume on each commit:
Edit papers.bib or confs.bib or another one, commit & push.
GitHub Actions then builds cv.pdf and outputs it at a fixed URL.
The link to the output is fixed, so that the person clicking on it will always get the latest version.
Your personal website and/or your project websites can be hosted, compiled directly from source online, and served as websites.
GitHub Pages allows you to publish statically generated websites such as blogs, documentation, applications of your packages, etc. It is a great way to communicate results that do not fit into the PDF standard (such as datasets and other material) with a permanent link.
JOSS is a journal that reviews and publishes code, entirely based on GitHub.
You may add a license to your project to ensure that the ideas developed in your paper are legally yours, including code and everything.
The online version of the history proves that you wrote something at a certain time. If you choose to make the repo public before the article is published, this public history is something you can point to.
Once your article is peer-reviewed, if it lives in a public GitHub repository (and it should), you are open to post-publication peer review:
See there for a deeper discussion.
The key messages today are:
These slides are at https://lrnv.github.io/yrd2023/. A few more resources: