I’ve spent the past six months modelling covid-19 in households. Here’s a thread on some Matlab code I’ve written to generate synthetic populations for infectious disease models from publicly available data on household composition. 1/
The code represents a few solid weeks of work and should be useful to scientists modelling infection along age- and household-structured lines. We used it to generate the synthetic populations for this study: https://wellcomeopenresearch.org/articles/5-213 . 2/
The code is here: https://github.com/JBHilton/processing-household-composition-data. Its basic purpose is to produce distributions which capture the age-stratified household structure of a population at a chosen spatial scale. 3/
These distributions tell you things like the proportion of households in a country which contain X adults and Y children. From here you can build a synthetic population to use in an epidemic model, which looks close to the real population the original data came from. 4/
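To make that step concrete, here is a rough Python sketch (not the repo's Matlab code) of drawing a synthetic population from a composition distribution. The distribution values, the function names and the (adults, children) ordering are all made up for illustration.

```python
import random

# Hypothetical distribution: (adults, children) -> proportion of households.
# These numbers are invented for illustration, not taken from ONS data.
composition_dist = {
    (1, 0): 0.30,
    (2, 0): 0.35,
    (2, 1): 0.15,
    (2, 2): 0.15,
    (1, 1): 0.05,
}

def sample_households(dist, n_households, seed=None):
    """Draw household compositions with probability equal to their proportion."""
    rng = random.Random(seed)
    compositions = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(compositions, weights=weights, k=n_households)

def build_population(households):
    """Expand households into individuals tagged with household id and age class."""
    people = []
    for hh_id, (n_adults, n_children) in enumerate(households):
        people += [(hh_id, "adult")] * n_adults
        people += [(hh_id, "child")] * n_children
    return people

households = sample_households(composition_dist, 1000, seed=1)
population = build_population(households)
```

Each entry of `population` is a (household id, age class) pair, which is roughly the minimum a household-structured epidemic model needs to know about an individual.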
The code was designed with data from the ONS on households in England and Wales in mind. However, I will also be adding scripts to the repo in the near future which allow for data from other sources – in particular the 2014 Kenya Demographic and Health Survey. 5/
The ONS data is from the 2011 census and lists all of the household compositions found in each Output Area (usually a few postcodes) in England and Wales in terms of ten-year age bands, plus the number of times that composition appears in that Output Area. 6/
Some typical datapoints look like this. The Output Areas are nested within larger spatial units. What we don’t immediately know from this data is what the composition distribution looks like on the level of those larger spatial units, or in terms of other age bands. 7/
Getting distributions on the level of larger spatial units is difficult because of the large size of the dataset. My code speeds up this operation by converting the compositions to integers, making it much easier to count how many times a composition appears in a spatial unit. 8/
The code outputs a list of all the compositions appearing in the dataset, and the proportion of households in each spatial unit which have each composition. The level of spatial resolution is chosen by the user. 9/
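The integer trick and the per-unit proportions might look something like the Python sketch below. This is an illustration of the general idea, not the repo's Matlab implementation: the base-16 encoding, the three-band compositions, the record layout and the area names are all assumptions.

```python
from collections import Counter

BASE = 16  # assumed upper bound (exclusive) on people per age band in one household

def encode(composition, base=BASE):
    """Map a tuple of per-age-band counts to a single integer (fixed-base digits)."""
    code = 0
    for count in composition:
        if not 0 <= count < base:
            raise ValueError("count outside assumed range")
        code = code * base + count
    return code

def decode(code, n_bands, base=BASE):
    """Invert encode(): recover the tuple of per-age-band counts."""
    counts = []
    for _ in range(n_bands):
        code, count = divmod(code, base)
        counts.append(count)
    return tuple(reversed(counts))

# Hypothetical records: (output_area, larger_unit, composition, number_of_households)
records = [
    ("OA1", "RegionA", (0, 1, 2), 4),
    ("OA2", "RegionA", (0, 1, 2), 3),
    ("OA2", "RegionA", (1, 0, 2), 5),
    ("OA3", "RegionB", (0, 1, 2), 2),
]

# Count households per (larger unit, integer-coded composition).
counts = Counter()
for _, unit, composition, n in records:
    counts[(unit, encode(composition))] += n

# Normalise within each larger unit to get proportions.
totals = Counter()
for (unit, _), n in counts.items():
    totals[unit] += n
proportions = {key: n / totals[key[0]] for key, n in counts.items()}
```

Comparing single integers is cheaper than comparing whole vectors of counts, which is the point of the encoding. `decode` turns the codes back into compositions for the final output.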
The GitHub repo’s readme explains how all this is done. The repo includes two examples which demonstrate how to calculate composition distributions for user-specified age bands and spatial resolutions. 10/
The first example calculates the England-and-Wales-level composition distribution in terms of children (<20yrs) and adults (20+). The second divides adults into vulnerable and non-vulnerable classes based on ONS shielding data – useful for modelling risk alongside age structure. 11/
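As a rough idea of what the first example involves, here is a Python sketch (again, not the repo's Matlab code) that collapses ten-year age bands into children and adults and recomputes the distribution. The nine-band layout (0-9, 10-19, ..., 80+) and the household counts are assumptions made for illustration.

```python
from collections import Counter

N_CHILD_BANDS = 2  # bands 0-9 and 10-19 together make up the under-20s

def to_children_adults(composition):
    """Collapse a ten-year-band composition into (adults, children)."""
    children = sum(composition[:N_CHILD_BANDS])
    adults = sum(composition[N_CHILD_BANDS:])
    return (adults, children)

# Hypothetical household counts by fine-grained composition (9 bands assumed).
fine_counts = {
    (1, 1, 0, 2, 0, 0, 0, 0, 0): 10,
    (0, 2, 0, 1, 1, 0, 0, 0, 0): 5,
    (0, 0, 0, 0, 0, 0, 1, 1, 0): 15,
}

# Different fine compositions can map to the same coarse one, so sum their counts.
coarse_counts = Counter()
for composition, n in fine_counts.items():
    coarse_counts[to_children_adults(composition)] += n

total = sum(coarse_counts.values())
coarse_dist = {comp: n / total for comp, n in coarse_counts.items()}
```

The first two fine compositions both become two adults with two children, so their counts merge. Splitting adults into vulnerable and non-vulnerable classes would need extra data on top of this, which the sketch does not attempt.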
This code hopefully provides a (relatively) quick and easy way to generate synthetic populations for household-structured disease modelling. I’m still actively working on the repo and am happy to talk about features I could add or things I could be doing better. 12/
I’m not too happy about this being in a proprietary language, but large matrix operations make Matlab a good choice performance-wise. I could look at a Julia/Python/R version if interest exists (particularly interested in the views of researchers based in LMICs on this point). 13/
Fingers crossed, this repo will save a bit of work for modellers looking at age- and household-structured infection transmission, cutting out the “generate a population” step and letting you go straight to the actual modelling. (end of thread)