A bioinformatician's thread on the use of #Excel for storing data. It's perfectly reasonable to have lab folks enter data into a spreadsheet - they aren't data experts and need some sort of system to interact with.
If using a spreadsheet, though, you'll need to limit what data can be entered into each cell, which is something spreadsheets can do. For example, restricting a cell to numbers prevents the letter O being entered in place of a zero.
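Spreadsheets support this natively (Data Validation in Excel), and the rules can even be baked into the template before it goes to the lab. A minimal sketch, assuming the openpyxl library and a hypothetical samples.xlsx template (the column name and range are illustrative):

```python
# Restrict a column to whole numbers so a stray letter O can't be
# entered where a zero belongs. Assumes: pip install openpyxl
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws["A1"] = "read_count"  # hypothetical column header

# Only whole numbers >= 0 are accepted; Excel shows an error otherwise.
dv = DataValidation(
    type="whole",
    operator="greaterThanOrEqual",
    formula1="0",
    showErrorMessage=True,
    errorTitle="Invalid entry",
    error="Enter a whole number (digits only, no letters).",
)
ws.add_data_validation(dv)
dv.add("A2:A1000")  # apply to the data-entry range

wb.save("samples.xlsx")
```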
The data pipeline for ingesting the spreadsheet data would then need to sense-check the input: is there any missing or unexpected data? Have row or column limits been exceeded? A rough sketch of that gatekeeping is below.
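A minimal sketch of such ingestion-time checks, assuming pandas and a hypothetical samples.csv; the column names and limits are illustrative, not from the thread:

```python
import pandas as pd

MAX_ROWS = 10_000                           # assumed limit for illustration
EXPECTED_COLS = {"sample_id", "read_count"}  # assumed schema

df = pd.read_csv("samples.csv")

errors = []
if set(df.columns) != EXPECTED_COLS:
    errors.append(f"unexpected columns: {sorted(df.columns)}")
if len(df) > MAX_ROWS:
    errors.append(f"row limit exceeded: {len(df)} > {MAX_ROWS}")
if df.isna().any().any():
    errors.append("missing values present")

# Refuse the whole file rather than silently ingesting bad rows.
if errors:
    raise ValueError("ingest failed: " + "; ".join(errors))
```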
Some folks are criticising the use of the CSV (comma-separated values) format on the basis that it's human- or spreadsheet-readable, and suggesting XML etc. as alternatives that would prevent problems.
This is not helpful. If you share your data in XML format, Excel (or many other parsers) can read and corrupt it just as happily as it does CSV.
Whatever file exchange format is used, the data ingestion pipeline will still need to perform all the checks you can think of on the incoming data.
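One defensive habit that works regardless of format: read everything as text first, then convert types deliberately, so any coercion failure is loud instead of silent. A minimal sketch, assuming pandas and the same hypothetical samples.csv:

```python
import pandas as pd

# dtype=str disables type guessing, so sample IDs like "000017"
# keep their leading zeros instead of becoming the integer 17.
df = pd.read_csv("samples.csv", dtype=str)

# Convert counts explicitly: junk like "1O4" (letter O, not zero)
# raises a ValueError here rather than slipping through.
df["read_count"] = pd.to_numeric(df["read_count"], errors="raise")
```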
Some have also sneered and said they only use databases for this sort of task, but this is a false distinction. It's perfectly easy to set up a database with limited data fields that silently cut off some of your data and don't complain (unless your pipeline picks up such errors).
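The exact failure mode depends on the engine (MySQL in non-strict mode, for instance, truncates over-long strings with only a warning), but the general point is easy to demonstrate. A minimal sketch using Python's built-in sqlite3, where type affinity lets text sail into an INTEGER column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (sample_id TEXT, read_count INTEGER)")

# The letter O instead of a zero goes straight in; no error is raised,
# because SQLite stores values it can't convert as TEXT regardless of
# the declared column type.
con.execute("INSERT INTO samples VALUES ('S1', '1O4')")
print(con.execute("SELECT read_count FROM samples").fetchone())  # ('1O4',)
```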
So there is no inherent problem with using CSV files or spreadsheets. The real issue is not having folks with the right data expertise on the team. Using an SQL database doesn't spare you from these problems.
It's just more likely that someone who can build a MySQL database will know enough about data science to build a robust system with error handling.
Have a nice day.