Firstly, what is this all about? Well, Prof. Ferguson used a computer program to model (simulate) the spread of COVID through society. As the article says, this is basically a video game without graphics. Think: virtual people, interacting in a Truman Show-style world.
This program gets run many times, and then you average together the results, and write up your report for the government. So far so good. The program is written in code (in this case in an engineering language called C++).
A code review means that someone other than the author is going to go through the code and figure out whether they think it has problems, or can be improved. Basically a second pair of eyes.
In this case, the reviewer has not been able to get hold of the original code that Prof Ferguson used -- only a cleaned-up version that was published on GitHub (a code storage and change tracking platform). Even so, it is clear that there are some problems!
First off, we notice that the original code was apparently just a single 15,000-line file. My understanding is that this is quite common in academia - there is a tendency to just dump the code in one place, with little thought for modularity. This is a bad idea.
If you don't break your system down into small modules (think of them as bits of Lego) then you make it very hard for other people (including "your future self") to understand, and you make it very hard to test. More on this in a bit.
Now, let's talk about determinism and randomness. Determinism is the idea that if you have a known set of inputs, you will always get the same outputs. For example, 2 + 3 always equals 5; it doesn't sometimes produce different results. The inputs *determine* the outputs.
In Prof. Ferguson's model, the inputs are the "starting state" of our simulation of society. So: how many people are there, whereabouts are they, what ages are they, how many hotels are there, and so on.
Note that it doesn't have to be an exact model of society: it just has to be representative of the UK. So, there are many *plausible* starting states. By running the simulation many times, each with a plausible starting state, you get an idea of the likely outcome for the UK.
So in one run, you get 85,000 deaths. In another run, you get 87,000; in another you get 82,000 deaths. After maybe 1000 runs, you average this up and get the value for your report; say, 83,525 deaths +/- 3244.
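Just to make that averaging step concrete, here's a tiny C++ sketch (my own illustration, nothing to do with the model's actual code) that turns a handful of made-up run results into a mean and a spread:

```cpp
#include <cmath>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Death counts from a handful of simulation runs (made-up numbers).
    std::vector<double> deaths = {85000, 87000, 82000, 84000, 86000};

    // Mean: sum of all runs divided by the number of runs.
    double mean = std::accumulate(deaths.begin(), deaths.end(), 0.0) / deaths.size();

    // Standard deviation: how spread out the runs are around the mean.
    double sq_sum = 0.0;
    for (double d : deaths) sq_sum += (d - mean) * (d - mean);
    double sd = std::sqrt(sq_sum / deaths.size());

    std::cout << "Estimate: " << mean << " +/- " << sd << " deaths\n";
}
```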
Now, what happens if you go back to one of the starting states (say, the one that gave you 85,000 deaths) and you run the model again with exactly that same starting state? How many deaths do you get? If the code is deterministic, *you get 85,000, again*.
But what if you don't get this result again? Then your code is non-deterministic. And how is that possible? One answer is that there is *randomness* in your model. For example, maybe in each step you "roll a dice" to see what your virtual people will do...
Perhaps they will go to the shops (40%), or stay home (50%), or go to the cinema (10%).

Now your model can produce different outcomes even with the same starting state.
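To make the "dice roll" idea concrete, here's a minimal C++ sketch (my own illustration, not the model's code) of a weighted random choice between those three activities:

```cpp
#include <iostream>
#include <random>
#include <string>

// Pick an activity for one virtual person using a weighted "dice roll":
// 40% shops, 50% stay home, 10% cinema.
std::string choose_activity(std::mt19937& rng) {
    std::discrete_distribution<int> dist({40.0, 50.0, 10.0});
    switch (dist(rng)) {
        case 0:  return "shops";
        case 1:  return "home";
        default: return "cinema";
    }
}

int main() {
    std::random_device rd;
    std::mt19937 rng(rd());  // seeded from hardware entropy, so runs differ
    for (int i = 0; i < 5; ++i)
        std::cout << "Person " << i << " goes to: " << choose_activity(rng) << "\n";
}
```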
However, we don't use "real dice" in a computer. We simulate the dice! We use a "pseudo-random number generator" - a bit of maths that produces a sequence of numbers that look just like regular random dice rolls, but have a guilty secret...
Their secret is that you can replay the sequence *perfectly*. Each sequence is "seeded" with a number to start it. If you use the same "seed", you will get an identical replay of the dice rolls.
So if you use the same starting state, and the same seed number, then you should (if your code is deterministic) get exactly the same number of 'deaths' coming out.
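Here's what that "guilty secret" looks like in a small C++ sketch (again, just an illustration): seed the generator with the same number and you get exactly the same rolls back, every single time.

```cpp
#include <iostream>
#include <random>
#include <vector>

// Roll a six-sided "die" n times using a generator seeded with `seed`.
std::vector<int> roll_dice(unsigned seed, int n) {
    std::mt19937 rng(seed);                        // pseudo-random generator
    std::uniform_int_distribution<int> die(1, 6);  // fair six-sided die
    std::vector<int> rolls;
    for (int i = 0; i < n; ++i) rolls.push_back(die(rng));
    return rolls;
}

int main() {
    // Same seed => identical replay of the "dice rolls", every run of the program.
    for (int roll : roll_dice(12345, 5)) std::cout << roll << ' ';
    std::cout << '\n';
    for (int roll : roll_dice(12345, 5)) std::cout << roll << ' ';
    std::cout << '\n';
}
```

Both loops print the same five rolls, run after run - which is exactly what lets you reproduce a simulation.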

But what if you don't?
Then you have a bug! In this case, it looks as if there were a few places where errors had crept in:

- Seed Re-Setting
- Performance Modes
- Multi-threading (& architecture dependencies)

I'll look at each in turn.
Seed Re-setting -- the code review only mentions this in passing, but it looks like the mechanism for setting the "starting state" and the "starting seed" back to previous known values got broken by a change at some point.
I've not seen the code, so it's hard to know the nature of this, but I get the impression that the "known seed" being supplied to the model gets ignored. Thus the model runs with a new seed, and the results are not what you were expecting.
They might still be completely valid (your starting state was plausible, and the new seed still produces a valid sequence of "dice rolls") but the code isn't behaving how you wanted, so this is a bug.
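Since I haven't seen the code, the following is a purely hypothetical C++ sketch of the *kind* of bug being described - a function that accepts a "known seed" from the caller but then quietly ignores it:

```cpp
#include <ctime>
#include <iostream>
#include <random>

// Hypothetical illustration only -- NOT the actual model code.
// The caller passes a known seed, expecting a reproducible run...
std::mt19937 make_generator(unsigned requested_seed) {
    (void)requested_seed;  // ...but the parameter is silently ignored (the bug)
    // ...and the generator is reseeded from the clock instead, so every run
    // produces a different (though still "plausible") sequence of rolls.
    return std::mt19937(static_cast<unsigned>(std::time(nullptr)));
}

int main() {
    std::uniform_int_distribution<int> die(1, 6);
    std::mt19937 rng = make_generator(12345);  // the "known seed" has no effect
    for (int i = 0; i < 5; ++i) std::cout << die(rng) << ' ';
    std::cout << '\n';  // different output on each run, despite the fixed seed
}
```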
Performance modes -- these models are large and slow to run, so it seems that there were several attempts to speed them up. One of these "high performance modes" used data tables in a faster format. The problem? For the same inputs, it produced different outputs from the "slow mode".
This is a worrying bug. Two "modules" of code (one fast, one slow) that are supposed to do *exactly the same thing* somehow produce different results.
What is going on? Which module is correct? How can you be sure which result is the right one? This kind of bug could well be invalidating your model (maybe it is losing track of infections, or something). These results can't be trusted.
This is one example where "unit testing" could have saved the day. With unit testing you write a small piece of test code that checks the behaviour of each module. In this case you would use the same test for both the "fast" and "slow" modules - both should pass.
When one of them didn't pass, then you'd have a much stronger idea of which one was wrong. Unit tests only really work when code is broken down into small, testable chunks, which is one reason why monolithic 15,000-line files are not a good idea.
Unit testing is a bit like a kids' shape-matching toy. The wooden "frame" is the testing code, and the "pieces" are your modules. If the module is the wrong 'shape', it won't pass through the test. If at some point you change the shape without realising, it will no longer 'fit'.
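Here's a toy C++ sketch of what such a test might look like (the function names are hypothetical, just to show the idea): the same check runs against both the "fast" and "slow" implementations, so any divergence is caught immediately.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the "slow" and "fast" modules.
// Both are supposed to compute exactly the same thing.
int count_infections_slow(const std::vector<int>& contacts) {
    int total = 0;
    for (int c : contacts) total += c;
    return total;
}

int count_infections_fast(const std::vector<int>& contacts) {
    int total = 0;
    for (int c : contacts) total += c;  // imagine a faster data layout here
    return total;
}

int main() {
    // Unit test: identical inputs must give identical outputs from both paths.
    const std::vector<int> contacts = {3, 0, 5, 2, 1};
    assert(count_infections_slow(contacts) == 11);
    assert(count_infections_fast(contacts) == 11);
    assert(count_infections_slow(contacts) == count_infections_fast(contacts));
    return 0;  // all checks passed
}
```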
Next we look at multi-threading. Most of us know about "dual core" and "quad core" computers, which have multiple processors in them. Multi-threading (more or less) means you run your simulation on more than one processor at the same time.
This can cause a problem if two different processors try to access and modify a piece of data at the same time. For example, what if Processor 1 sends a person to the shops, but Processor 2 simultaneously sends *the same person* to the cinema. Your model state is now invalid.
These are called "race conditions" because the outcome can be different depending on which processor "wins the race" to get access to the data first. For this reason they produce non-deterministic behaviour *and* invalid behaviour. And they are hard to spot.
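Here's a minimal C++ sketch of a race condition (again, my own illustration): two threads update the same counter with no locking, so the final total varies from run to run.

```cpp
#include <iostream>
#include <thread>

int main() {
    int infections = 0;  // shared state, with no lock protecting it

    auto worker = [&infections]() {
        for (int i = 0; i < 100000; ++i)
            ++infections;  // read-modify-write: two threads can interleave here
    };

    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();

    // Expected 200000, but updates get lost when the threads "race",
    // so the result is usually smaller -- and different on every run.
    std::cout << "infections = " << infections << "\n";
}
```

Making this safe would mean protecting the counter with a std::mutex, or using std::atomic<int> - exactly the sort of discipline "thread safety" demands.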
Making code "thread safe" is a discipline in itself. It is hard! I have seen big, industrial-scale banking systems that have failed because of this. It seems clear that Prof Ferguson's code was not safe to run in multi-threaded mode.
I will just add here that this kind of multi-threading bug also tends to lead to "architecture dependencies". So the code might work on, say, an AMD chip, but not an Intel one, because the processors schedule their work in slightly different ways.
I should point out that all this doesn't *necessarily* invalidate Prof Ferguson's report. Maybe all his modelling runs were in single-threaded mode, using the working performance options, and maybe the reseeding bug didn't invalidate the average results... BUT...
I would be a *lot* more confident if there was proper testing in place, results were reproducible, and the "lego bricks" of the code could be shown to be behaving correctly.
In conclusion, Prof Ferguson's code appears to suffer from a number of issues that seem to be especially common in academic coding:

- failure to break down into modules
- failure to write unit tests
- failure to demonstrate reproducible results
- willingness to gloss over issues
I think the reviewer goes too far in calling for any kind of defunding. I'd much rather see all academics who have to write code *trained* in software engineering, at least up to a level where they save themselves huge pain & save us from untrustworthy results. END