Thread by @tomkXY

I’ve seen a few discussions on whether we should cite software packages. As an author and maintainer of several packages and a contributor to open source projects, I think this misses the wider point. Citing them is a nice gesture but should not be necessary to maintain them.

I think we should all agree that it is important to cite recent packages that make significant contributions to your field. If a package implements a novel method that you couldn’t use otherwise, you should cite it to acknowledge the authors and for reproducibility.

If it’s a method in your field, you should treat it like any experimental protocol. However, many packages are used in typical workflows and it’s impractical to cite all of them and their dependencies. So where do we draw the line?

Should you cite Seurat? What about ggplot? Should every Python analysis cite NumPy? Do you need to cite which BLAS/LAPACK implementation you’ve used? These are valid questions but they miss the point. Publications and software packages are fundamentally different.

Papers and packages exist in different worlds. They should both be valued contributions for a scientific career. Citing packages as publications doesn’t address failure to recognise packages in their own right. Citing packages shouldn’t be necessary for authors to be recognised.

If packages are relegated to supplementary data, they’re often not reviewed during publication apart from the merits of their results. This doesn’t tell you if it’s user-friendly, reliable, can be applied to other data, or will be maintained.

Open source packages already get feedback separately. They have different metrics, but it is possible to demonstrate their impact. They’re already openly reviewed and it’s possible to acknowledge all contributors. We just collectively place more value on some metrics than others.

In the long term, we stop citing publications as well. Do you cite Mendel or Darwin or Watson and Crick when working in biology? Should we cite the Human Genome Project for our reference data or Mullis every time we do a PCR?

Published results quickly become dated and eventually part of our collective common knowledge. The difference with software packages is that we cannot build on a foundation of knowledge. We need to choose an implementation of software methods, legacy code and all.

In the long term, packages are infrastructure. They need to be maintained so they can be used in the future. A good package is well-documented, reliable, and useful. It must be maintained rather than left static in the void after a project is completed.

Should they be constantly striving for novelty? Should they be funded on transient grant cycles? Imagine how insane it would be if our roads were maintained like this. Imagine if we let our highways decay into ruin and be maintained by volunteers while we fund new roads to nowhere.

Maintaining a stable package takes more effort than a barebones one that gives a new result. Needing citations for packages perversely incentivises novelty over anything else. Some groups make excellent software but few can spare the resources to do so, even if they wanted to.

So should we cite software packages? Citations are how we show that something has an impact and that continued research should be funded. However, a package is not a paper and it’s often difficult to make one fit that format. We should fund them regardless of whether they’re cited.

A highly cited paper is usually groundbreaking or controversial. A highly used package is often neither. Packages are often adopted by the user community long before they’re “published”. Yet we use the same system and metrics to give credit and funding to authors.
