My statistician/data-science friends all tell me that if you& #39;re using a spreadsheet, you& #39;re not doing science, you& #39;re courting disaster. Real analysis requires Python, or, possibly, #julialang.
https://arstechnica.com/science/2020/10/the-unreasonable-effectiveness-of-the-julia-programming-language/
1/">https://arstechnica.com/science/2...
https://arstechnica.com/science/2020/10/the-unreasonable-effectiveness-of-the-julia-programming-language/
1/">https://arstechnica.com/science/2...
Despite these warnings, plenty of mission-critical work gets done in spreadsheets, and (in support of these warnings), it can go horribly, horribly wrong.
It& #39;s not just the UK losing 16,000 covid cases:
https://www.popularmechanics.com/technology/a34274176/uk-coronavirus-excel-spreadsheet-lost-cases/
2/">https://www.popularmechanics.com/technolog...
It& #39;s not just the UK losing 16,000 covid cases:
https://www.popularmechanics.com/technology/a34274176/uk-coronavirus-excel-spreadsheet-lost-cases/
2/">https://www.popularmechanics.com/technolog...
It& #39;s years of destructive, crushing austerity - costing real human lives; trashing cities, regions, whole countries - due to spreadsheet formula errors:
https://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646
3/">https://theconversation.com/the-reinh...
https://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646
3/">https://theconversation.com/the-reinh...
But still, we keep using spreadsheets to do real work. I did it YESTERDAY. And made a stupid mistake.
The abstinence-only approach to spreadsheets has been a failure. Clearly, we need harm-reduction.
4/
The abstinence-only approach to spreadsheets has been a failure. Clearly, we need harm-reduction.
4/
That& #39;s where "Data Organization in Spreadsheets" - @kara_woo and @kwbroman& #39;s 2017 paper in @AmstatNews comes in. It lays out a crisp set of best practices for avoiding common errors, upping your CVS catastrophe game to really powerful mistakes!
https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989
5/">https://www.tandfonline.com/doi/full/...
https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989
5/">https://www.tandfonline.com/doi/full/...
Here& #39;s how to spreadsheet:
* Be consistent: Don& #39;t use "Male," "male" and "m" as labels. Pick one
* Don& #39;t let trailing spaces creep in
* Use a consistent code for missing values (not a blank space and ESPECIALLY not a number like -99999)
6/
* Be consistent: Don& #39;t use "Male," "male" and "m" as labels. Pick one
* Don& #39;t let trailing spaces creep in
* Use a consistent code for missing values (not a blank space and ESPECIALLY not a number like -99999)
6/
* Have a column for explanations about missing data (don& #39;t fill empty cells with explanations for their emptiness)
* Use consistent variable names and subject identifiers; treat as case-sensitive. No spaces!
* Lay out all your data consistently, in every file
7/
* Use consistent variable names and subject identifiers; treat as case-sensitive. No spaces!
* Lay out all your data consistently, in every file
7/
* Have a consistent (case-sensitive, no-spaces) filename convention. Do not tempt fate by calling a file "final" lest you have to pay penance with files named "final_ver2"
* Use YYYY-MM-DD for dates. No exceptions!
* Guard against trailing spaces in data!
8/
* Use YYYY-MM-DD for dates. No exceptions!
* Guard against trailing spaces in data!
8/
* Don& #39;t use special characters apart from _ and - in variables (avoid $, @, %, #, &, *, (, ), !, /, and other chars that have special meanings in some programming languages
* Format cells as "Text" to keep Excel from turning things like gene-names ("Oct-4") into dates
9/
* Format cells as "Text" to keep Excel from turning things like gene-names ("Oct-4") into dates
9/
* Consider giving year, month and date their own columns to prevent Excel from munging them (or write as an integer: 20201014)
* Only put one piece of data in each cell; use column labels to indicate units (eg "45" not "45g")
* Only one row of variable names per sheet
10/
* Only put one piece of data in each cell; use column labels to indicate units (eg "45" not "45g")
* Only one row of variable names per sheet
10/
* Maintain a separate "Data dictionary" file that defines every variable
* Datasets should not contain calculations; minimize how much typing you do in your dataset files lest you contaminate them inadvertently (calculations go in separate files)
11/
* Datasets should not contain calculations; minimize how much typing you do in your dataset files lest you contaminate them inadvertently (calculations go in separate files)
11/
* Font colors and highlights are not data - put data in cells, not formatting (this gets lost in transitions)
* Backup multiple versions of your files, onsite and offsite
* Develop data validation tactics and regularly validate your data
12/
* Backup multiple versions of your files, onsite and offsite
* Develop data validation tactics and regularly validate your data
12/
* Use CSV, not xlsx, as your canonical file-format - good old hard-to-corrupt text, flensed of all the fooforaw that Microsoft likes to insert at random intervals
Lurking behind every one of these tips is a postmortem on a data-tragedy. Ignore them at your peril.
eof/
Lurking behind every one of these tips is a postmortem on a data-tragedy. Ignore them at your peril.
eof/