This is an unusual #fieldwork trip, and not just because of the political revolution. Rather than collecting new data, I'm trying to save old data: nearly 20 years' worth of accumulated interviews from the Tsimane Health and Life History Project. Here's a thread on how. 1/
The basic problem is volume. The THLHP field team has been active continuously for years, and most of the major interviews are on paper. The end result is LOTS OF PAPER, too much to feasibly export to an archive. A few years ago, this was where things stood:
Transcribing the data off the paper interviews has been ongoing, but a persistent problem has been the inability to go back to the original paper interviews when questions arise.
Why is this important? Data provenance. We often want to directly reference the primary source documents that datasets originated from - for quality assurance, resolving discrepancies, checking unusual values, or coding in previously untranscribed data.
So for the last few years, we've been heavily investing in building an archive at the site itself, and getting all the interviews organized. Here's how things look today:
In that time, I've experimented with a few different methods for getting lots of interviews scanned in a practical timeframe - flatbed scanners, tablets, DSLR cameras. By far the most practical has been using SMARTPHONES with kickass PDF-making apps like Tiny Scanner.
The team is now using Androids, specifically the Samsung Galaxy A10. The best part is that these are available locally in San Borja, so I can get as many as I need without having to import.
The other advantage is scale - if you have N smartphones, you can have N people scanning simultaneously. It's completely parallelizable. Right now I have 14 phones snapping pictures all day. To keep track of files, each device gets its own unique identifier (an animal).
Using Tiny Scanner to digitize our field interviews, each scanner averages about 93 pages per hour, so with 14 scanners we are digitizing about 10,000 pages per *day*. We're also getting more efficient with practice.
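As a quick back-of-envelope check on that rate (the 8-hour scanning day here is my assumption for illustration, not a figure from the thread):

```shell
# throughput sketch: 93 pages/hr per phone x 14 phones x an assumed 8-hr day
pages_per_day=$(( 93 * 14 * 8 ))
echo "$pages_per_day"   # prints 10416, i.e. roughly 10,000 pages per day
```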
After 9 days of work, the team has digitized 16,000 interviews, and with 9 more days to go, we are likely to break 30,000 total, which is about how many interviews are currently in the archive.
Managing all these PDFs is also tricky - a combination of shell scripts and R scripts I've written pulls the files off the phones, renames them, organizes them by interview type, saves file metadata, and sends them to @MPI_EVA_Leipzig's Nextcloud server for further processing.
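A minimal sketch of what that ingest step might look like - all device names, paths, and filenames here are illustrative stand-ins, not the project's actual scripts:

```shell
#!/bin/sh
# Hypothetical ingest sketch: take scans pulled off one phone, rename them
# with the device's animal identifier, sort them into a folder per interview
# type, and log basic file metadata to a CSV for later processing.
DEVICE="jaguar"                   # each phone's unique animal identifier
SRC="incoming/$DEVICE"           # where files land after being pulled off the phone
DEST="archive/demography"        # one folder per interview type

# create a couple of dummy scans so the sketch runs end to end
mkdir -p "$SRC" "$DEST"
printf 'scan' > "$SRC/IMG_001.pdf"
printf 'scan' > "$SRC/IMG_002.pdf"

i=1
for f in "$SRC"/*.pdf; do
  new="$DEST/${DEVICE}_$(printf '%04d' "$i").pdf"
  mv "$f" "$new"
  # record filename and ingest date; a real pipeline would log more metadata
  printf '%s,%s\n' "$new" "$(date +%F)" >> metadata.csv
  i=$((i + 1))
done
ls "$DEST"
```

The same loop, pointed at a synced Nextcloud folder, would cover the handoff to the server; the real pipeline reportedly mixes shell and R for this.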
Our team in Germany, the Data Provenance Research Group, completes the final step, which is integrating the files themselves into the Tsimane project database. Humans open each PDF and record the essential metadata - the interview date, the community, the interviewee, etc.
This part takes a lot of time, but the end result will be pretty amazing: for every data point in the Tsimane Health and Life History database, we will be able to click a hyperlink to open a PDF of the *original paper interview* that data point came from.
Well, at least depending on the stability of Bolivian politics. Ojalá. /thread
You can follow @babeheim.