Thread by @VLucet, I've spent a lot of time this past few weeks making my [...]

I've spent a lot of time these past few weeks making my thesis code as reproducible as possible, for future paper-submitting-me and for my lab.
Here's what I learned. (a thread, with gifs!)

⬇️

First, it is best to worry about reproducibility early on in your project. You can re-engineer a project into a more reproducible state later on, like I did, but it will save you many a headache to plan for that stuff early.
Have a plan!

Second, there are many frameworks out there and many concepts. And a plethora of tools. I'll focus on three concepts: computing environments, containerization and continuous integration.

So, computing environments. Ever tried to run code your supervisor wrote ages ago? They assured you it was gonna run in no time but you spent many hours trying to make it work, to no avail? It was likely because the environment (package versions, dependencies) was different.

It's important to make sure whoever is going to run your code in the future can replicate on their computer the same environment you used to write the code. You want to make sure the versions of the packages they are using are the same as the ones that were used to write the code. Same!
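To make the "same versions" idea concrete, here is a minimal Python sketch (Python because it is one of the languages used in this project). It compares what is installed on a machine against a set of pinned versions. The `find_mismatches` helper and the pinned values are made up for illustration; only `importlib.metadata` is real standard-library API.

```python
from importlib import metadata


def find_mismatches(pinned):
    """Return {package: (wanted, found)} for every package whose installed
    version differs from the pinned one. `found` is None when the package
    is not installed at all."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical pins; in a real project they would come from a lock file.
    pinned = {"numpy": "1.18.5", "a-package-that-does-not-exist": "0.1.0"}
    for name, (wanted, found) in find_mismatches(pinned).items():
        print(f"{name}: want {wanted}, found {found}")
```

An empty result means the machine matches the pins. Anything else is a hint that the code may not behave the way it did when it was written.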

In my project I use R a lot. Folks at @rstudio have developed the great 📦 {renv}, which documents the versions of the packages you need for your project in a "lock file" and makes sure those are the versions used when your code is run. You can use version control on the lock file.
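A lock file is just a text file, so you can read it like any other. As a rough sketch, assuming the lock file is JSON with a top-level `Packages` map whose entries each carry a `Version` field, listing the pinned versions could look like this in Python. The example content is hand-written for illustration, not copied from a real project, and `pinned_versions` is a hypothetical helper.

```python
import json

# Hand-written example in the shape assumed above; not a real lock file.
EXAMPLE_LOCK = """
{
  "R": {"Version": "4.0.2"},
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.0.0", "Source": "Repository"},
    "raster": {"Package": "raster", "Version": "3.3-7", "Source": "Repository"}
  }
}
"""


def pinned_versions(lock_text):
    """Map each package name in the lock file to its recorded version."""
    lock = json.loads(lock_text)
    packages = lock.get("Packages", {})
    return {name: entry["Version"] for name, entry in packages.items()}


if __name__ == "__main__":
    for name, version in sorted(pinned_versions(EXAMPLE_LOCK).items()):
        print(name, version)
```

Because the file is plain text, a version bump shows up as a small diff under version control, which is what makes it easy to track.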

I also use Python and Julia. Python is notorious for dependency nightmares, and it has a "virtual environment" feature which lets you save your computing environment and isolate it (I ended up not using it, but I'll come back to this later).
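For the curious, a virtual environment can be created from Python itself with the standard-library `venv` module. This is only a sketch: the `make_env` helper and the temporary directory are choices made for the example, and `with_pip=False` is there to keep it quick.

```python
import tempfile
import venv
from pathlib import Path


def make_env(path):
    """Create an isolated Python environment at `path` and return the path
    to its config file. with_pip=False keeps the sketch fast; a real
    project would normally want pip inside the environment."""
    venv.create(path, with_pip=False)
    return Path(path) / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = make_env(Path(tmp) / "env")
        print(cfg.read_text())
```

Packages installed into that environment stay out of the system-wide Python, so two projects can depend on different versions of the same library.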

What about Julia? Well, the makers of @JuliaLanguage designed it for reproducibility. A couple of TOML files, Project.toml and Manifest.toml (TOML is a bit like YAML), keep track of package versions and make sure you have the right dependencies. It all works from GitHub repos, and you can even pin a specific commit SHA. Neat!

So we can make sure that when we use R and such, we use the right versions... But sometimes you also need the right versions of your "system libraries", the programs that run on your computer and which can be specific to your operating system, or "OS" (macOS, Windows or Linux).

That is where containerization comes in. It's a bit like a virtual machine, but also not really. It is a way to encapsulate the minimum set of programs that you need to run your code, down to the OS. Think of it as the "I only buy what I need" of computing.

In a container, most of the time, the operating system is a flavor of GNU/Linux, which makes the container lightweight in terms of memory usage and therefore easier to share.

The container is an instance of a Docker image. This image is built from a Dockerfile, a file that lists all the things you want to install in your container (a bit like the renv lock file, but for your OS). Sounds complicated, but it is very easy to learn!
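To give a feel for what goes into a Dockerfile, here is a small Python sketch that writes one out as text. `FROM`, `RUN` and `COPY` are real Dockerfile instructions, but the base image, the system libraries, the paths and the `render_dockerfile` helper are placeholders chosen for illustration, not what the thesis project uses.

```python
def render_dockerfile(base_image, system_libs, copy_paths):
    """Return the text of a simple Dockerfile: start from a base image,
    install system libraries with apt-get, then copy project files in."""
    lines = [f"FROM {base_image}"]
    if system_libs:
        libs = " ".join(sorted(system_libs))
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{libs} && rm -rf /var/lib/apt/lists/*"
        )
    for src, dest in copy_paths:
        lines.append(f"COPY {src} {dest}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile(
        base_image="rocker/r-ver:4.0.2",             # placeholder base image
        system_libs=["libgdal-dev", "libproj-dev"],  # placeholder libraries
        copy_paths=[("renv.lock", "/project/renv.lock")],
    ))
```

In practice you would just write the Dockerfile by hand. The point is that it is a short, ordered list of instructions: pick a base, install what you need, copy your project in.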

You build and push the image to DockerHub, the GitHub for Docker images. You can even automate the building of the image there so you don't have to do it on your computer. All... in the cloud (do people still say that?)

Btw, thanks to the people who have put nice Docker resources out there, including @_ColinFay's blog post. A good place to get started is "An Introduction to Docker for R Users", a quick introduction on using Docker for reproducibility in R: https://colinfay.me/docker-r-reproducibility/

Again, you can install whatever you need. For my project I needed the OpenCV Python library, which I compiled in the container, removing the need for future users to install it themselves (and avoiding Python virtual environments). It's cleaner that way!

Finally, once you are sure that you can always run the code under the same conditions and that anyone can access it easily, you can use continuous integration to test and prove the reproducibility of your analysis. I use GitHub Actions for that.

Via a GitHub action, I pull my container from DockerHub, and make all my analyses and figures in the container. I publish the figures on a GitHub page. This process is repeated continuously to make sure it still works over time.
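The last step of a run like that can be a simple check that the analysis really produced what it should. Here is a hedged Python sketch of such a check. The function names and the idea of listing expected output files are assumptions for illustration, not a description of the actual workflow in the thesis repository.

```python
from pathlib import Path


def missing_outputs(root, expected):
    """Return the expected output files that do not exist under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_file()]


def exit_code(root, expected):
    """Return 0 when every expected file exists and 1 otherwise.
    A continuous integration step is marked as failed on a non-zero code."""
    missing = missing_outputs(root, expected)
    for name in missing:
        print(f"missing output: {name}")
    return 1 if missing else 0
```

A CI step could end with something like `sys.exit(exit_code("outputs", ["report.html"]))`, where the directory and file name are again placeholders. If a figure or report stops being produced, the run goes red and you find out right away instead of at submission time.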

To do this I make use of R Markdown, which allows you to compile PDF and/or HTML documents from code.

Of course it's not all perfect and there is no single recipe. There are many frameworks (for example, you can check out research compendiums, another reproducibility standard from the excellent Turing Way project): https://github.com/alan-turing-institute/the-turing-way

(Host repository for The Turing Way, a how-to guide for reproducible data science.)

I hope this was a little helpful! I wanted to share this journey into reproducibility. Any step taken in that direction, no matter how small, will improve the quality of our collective endeavor!

The link to the thesis code, if you'd like to see an example of these concepts applied: https://github.com/VLucet/landchange-connectivity-monteregie

VLucet/landchange-connectivity-monteregie
Valentin Lucet // MSc Thesis // 2018-2020.
