My statistician/data-science friends all tell me that if you're using a spreadsheet, you're not doing science, you're courting disaster. Real analysis requires Python, or, possibly, #julialang.

https://arstechnica.com/science/2020/10/the-unreasonable-effectiveness-of-the-julia-programming-language/

1/
But still, we keep using spreadsheets to do real work. I did it YESTERDAY. And made a stupid mistake.

The abstinence-only approach to spreadsheets has been a failure. Clearly, we need harm-reduction.

4/
That's where "Data Organization in Spreadsheets" - @kara_woo and @kwbroman's 2017 paper in @AmstatNews comes in. It lays out a crisp set of best practices for avoiding common errors, upping your CVS catastrophe game to really powerful mistakes!

https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989

5/
Here's how to spreadsheet:

* Be consistent: Don't use "Male," "male" and "m" as labels. Pick one

* Don't let trailing spaces creep in

* Use a consistent code for missing values (not a blank space and ESPECIALLY not a number like -99999)

6/
* Have a column for explanations about missing data (don't fill empty cells with explanations for their emptiness)

* Use consistent variable names and subject identifiers; treat as case-sensitive. No spaces!

* Lay out all your data consistently, in every file

7/
* Have a consistent (case-sensitive, no-spaces) filename convention. Do not tempt fate by calling a file "final" lest you have to pay penance with files named "final_ver2"

* Use YYYY-MM-DD for dates. No exceptions!

* Guard against trailing spaces in data!

8/
* Don't use special characters apart from _ and - in variables (avoid $, @, %, #, &, *, (, ), !, /, and other chars that have special meanings in some programming languages

* Format cells as "Text" to keep Excel from turning things like gene-names ("Oct-4") into dates

9/
* Consider giving year, month and date their own columns to prevent Excel from munging them (or write as an integer: 20201014)

* Only put one piece of data in each cell; use column labels to indicate units (eg "45" not "45g")

* Only one row of variable names per sheet

10/
* Maintain a separate "Data dictionary" file that defines every variable

* Datasets should not contain calculations; minimize how much typing you do in your dataset files lest you contaminate them inadvertently (calculations go in separate files)

11/
* Font colors and highlights are not data - put data in cells, not formatting (this gets lost in transitions)

* Backup multiple versions of your files, onsite and offsite

* Develop data validation tactics and regularly validate your data

12/
* Use CSV, not xlsx, as your canonical file-format - good old hard-to-corrupt text, flensed of all the fooforaw that Microsoft likes to insert at random intervals

Lurking behind every one of these tips is a postmortem on a data-tragedy. Ignore them at your peril.

eof/
You can follow @doctorow.
Tip: mention @twtextapp on a Twitter thread with the keyword “unroll” to get a link to it.

Latest Threads Unrolled: