I'm going to do my level best to explain why Excel is being used in Test and Trace, and why the Govt probably did the right thing. https://twitter.com/MaxCRoser/status/1313046638915706880
Firstly, there are multiple orgs responsible for publishing T&T data: from existing hospitals, to Serco's test sites, through to the new Nightingale Hospitals. Also remember Serco is outsourcing some of its operations, so there are more orgs doing this than you think.
Each of those organisations will be running their own system for recording different types of data, e.g. trusts need to record admissions, test results and deaths. T&T sites need to record tests administered and results. Everyone needs to record demographic data too.
Hospitals have one type of system around managing patient admin and medical records. T&T suppliers will use their own systems, some of which will be around fulfilment (did a test get from A to B and be fulfilled); others will be bespoke case-management systems.
Now Govt has to gather all of this data into a single database to get meaningful, nationwide statistics in place. That means getting each record from each of these systems and putting it into a single database. They need to do this securely and very fast.
In an ideal world, the best option is to build a link between the two databases, so a central database calls out to collect the data in the secondary database, which checks that the right permissions are in place and then sends the data to the central database.
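To make that concrete, here's a rough sketch of what that kind of link might look like. It's purely illustrative: the endpoint, parameters and authentication are my assumptions, not anything Govt actually runs.

```python
# Sketch of the "central database calls out" approach. Every name here is
# invented for illustration; each real system would need its own version.
import requests

def pull_new_records(base_url: str, api_key: str, since: str) -> list:
    """Ask one secondary system for records added since a given date."""
    response = requests.get(
        f"{base_url}/records",
        params={"since": since},                         # e.g. "2020-10-01"
        headers={"Authorization": f"Bearer {api_key}"},   # the permissions check
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# records = pull_new_records("https://trust-a.example.nhs.uk/api", key, "2020-10-01")
# central_db.insert_many(records)   # hypothetical central database client
```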
The second way to do this is to say "let's just output the data from our secondary databases into a single file and then collate those records before importing them into our database". So the same data, but you don't need to build a bridge between the two databases.
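Again, just to make it concrete, a minimal sketch of that collation step, assuming every org exports a spreadsheet with the same agreed columns into one folder (the folder layout and column names are my assumptions):

```python
# Sketch of the "export to a file and collate" approach. Folder layout and
# column names are assumptions for illustration only.
from pathlib import Path
import pandas as pd

EXPECTED_COLUMNS = ["nhs_number", "test_date", "result", "postcode"]

def collate_exports(folder: str) -> pd.DataFrame:
    frames = []
    for path in sorted(Path(folder).glob("*.xlsx")):
        df = pd.read_excel(path)
        df = df[EXPECTED_COLUMNS]          # every supplier uses the same columns
        df["source_file"] = path.name      # keep provenance for later checks
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# collate_exports("incoming/2020-10-02").to_sql("test_results", central_db, if_exists="append")
```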
The problem with the export to a file approach is that you stumble into all sorts of problems, like making sure that you have the most recent data, and that the data is in the right format.
You also inherit a whole load of issues associated with the file format itself, e.g. older Excel .xls files cap the number of rows they can hold, which is why the recent T&T data went wrong.
But... this isn't a simple choice between 'robust' and 'flaky'. If you try to build bridges, you have to do this work for every different software package being used. And not just for each package, but for each configuration of each package.
You also have to know about each system: do they store their dates in a standard way? Can you ask for all the new records since yesterday? Do you get the data in a format that is easy to 'unpack' or do you have to write specific code to unpack each record?
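In practice that means writing a small adapter per system just to agree on basics like dates and field names. Something like this (both 'systems' and their formats are made up to show the point):

```python
# Each source system ends up needing its own adapter; multiply this by every
# package and every configuration of every package. Formats are invented.
from datetime import datetime

def from_system_a(record: dict) -> dict:
    # System A: ISO dates and snake_case field names
    return {
        "test_date": datetime.fromisoformat(record["test_date"]).date(),
        "result": record["result"],
    }

def from_system_b(record: dict) -> dict:
    # System B: UK-style dates and different field names for the same things
    return {
        "test_date": datetime.strptime(record["DateOfTest"], "%d/%m/%Y").date(),
        "result": record["Outcome"],
    }
```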
And what if the data you need is spread across different systems? So what if the test data is in one system and the admissions data is in another? Do you have to build connectors to both of those systems and compile the data for the report you need?
What about security too? Are you proposing to build a bridge right into patient admission databases? If so, you might either end up gathering data on non-covid patients or, worse still, creating a security loophole for others to exploit.
And don't forget, not all of these packages will let another computer come and take the data. Software companies go to extraordinary lengths to prevent data leaving their systems, so they 'forget' to add useful features like external connectors.
Finally, even if you do navigate all of these problems, it is likely that the whole thing will break if one system changes the configuration of their database or updates some records and not others.
In light of this, saying 'output your data to an Excel file like this' and then building a single interface for those files is probably the best way to manage this data pipeline. It has a bunch of very obvious advantages:
It can be done quickly (big, big advantage); you only need to validate one type of data file; it doesn't need to access individual systems, especially ones with sensitive data; and the import won't break if a secondary system changes.
But you need to establish way more validation and monitoring for these systems: you probably need to check for missing files, anomalous rows and deviations and, if at all possible, provide tools for data publishers to validate their output.
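That monitoring doesn't have to be clever to be useful. A sketch of the kind of checks I mean, assuming the collated data sits in a pandas DataFrame with a 'source_file' column like the collation sketch above (the thresholds, and the 65,536-row cap of old .xls files, are there to illustrate the idea):

```python
# Sketch of batch-level checks for the file-based pipeline. Column names and
# thresholds are illustrative assumptions, not a real specification.
import pandas as pd

XLS_ROW_LIMIT = 65_536  # old .xls format cap; hitting it suggests truncated data

def validate_batch(df: pd.DataFrame, expected_sources: set) -> list:
    problems = []

    missing = expected_sources - set(df["source_file"].unique())
    if missing:
        problems.append(f"No file received from: {sorted(missing)}")

    for name, n_rows in df.groupby("source_file").size().items():
        if n_rows >= XLS_ROW_LIMIT - 1:
            problems.append(f"{name} is at the .xls row limit; data may be cut off")

    if df["nhs_number"].isna().any():
        problems.append("Some rows are missing an NHS number")

    return problems
```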
However, absent the imagined alternative (lots of highly functioning APIs and easy-to-use, structured data), Excel is the best option. The real problem is that data in Government systems is routinely obfuscated & inaccessible.
Even when the data is accessible, common standards are not being used. Overcoming these sorts of problems means adopting open standards for data and data sharing. Until then, get used to Excel being the lingua franca of Government data sharing.
Since I wrote this, we’ve had some more information: https://www.bbc.co.uk/news/uk-54422505 The conclusion: lax procedures and rubbish tools were used to compile the data prior to import. :(