Folks who want to use AI/ML for good generally think of things like predictive models, but actually... smart methods for extracting data from forms would do more for journalism, climate science, medicine, democracy etc. than almost any other application. A THREAD.
1/x
Here's how form extraction could help climate science
https://twitter.com/ed_hawkins/status/1167769410238595072
2/x
Here's what I know about solving the form extraction problem. First, the basics:
- Tesseract 4.0 is near-SOTA OCR in 100+ languages; you probably want to start here https://github.com/tesseract-ocr/tesseract
- PDFPlumber will spit out tokens with their bounding boxes (quick sketch below) https://github.com/jsvine/pdfplumber
5/x
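To make that concrete, here's a minimal sketch of both (filenames are made up; the dict keys are pdfplumber's and pytesseract's standard word-level output fields, but double-check against your installed versions):

```python
import pdfplumber
import pytesseract
from PIL import Image

# Case 1: the PDF has a text layer -- pdfplumber gives you tokens + bounding boxes directly.
with pdfplumber.open("disclosure_form.pdf") as pdf:  # hypothetical filename
    for page_num, page in enumerate(pdf.pages):
        for word in page.extract_words():
            # Each word dict carries its text plus x0/x1/top/bottom coordinates (in points).
            print(page_num, word["text"], word["x0"], word["top"], word["x1"], word["bottom"])

# Case 2: it's a scan -- OCR a page image with Tesseract and get word-level boxes back.
ocr = pytesseract.image_to_data(
    Image.open("scanned_page.png"),  # hypothetical filename
    output_type=pytesseract.Output.DICT,
)
for text, left, top, width, height in zip(
    ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
):
    if text.strip():
        print(text, left, top, width, height)
```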
If you have a simple enough table, and only one type of form layout, you may be able to use PDFPlumber or Tabula to solve the problem directly (sketch after the links)
https://tabula.technology/ 
https://github.com/jsvine/pdfplumber#extracting-tables
6/x
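If the layout really is that regular, the whole "pipeline" can be a few lines. A sketch using pdfplumber's extract_table() and tabula-py (filename is made up; check each library's docs for the options your tables need):

```python
import pdfplumber
import tabula  # tabula-py, the Python wrapper around Tabula

# pdfplumber: pull the first detected table on page 1 as a list of rows.
with pdfplumber.open("simple_report.pdf") as pdf:  # hypothetical filename
    rows = pdf.pages[0].extract_table()
    for row in rows or []:
        print(row)

# tabula-py: read every table in the file into pandas DataFrames.
tables = tabula.read_pdf("simple_report.pdf", pages="all")
for df in tables:
    print(df.head())
```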
There are several commercial products that may solve your form extraction needs
https://web.altair.com/monarch-pdf-to-excel
https://rossum.ai/ 
https://aws.amazon.com/textract/ 
7/x
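For example, if you're already on AWS, Textract is just an API call. A hedged sketch (filename is made up; see the boto3/Textract docs for the exact response shape):

```python
import boto3

textract = boto3.client("textract")

# Ask Textract for form fields and tables in a single page image or one-page PDF.
with open("disclosure_form.png", "rb") as f:  # hypothetical filename
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS", "TABLES"],
    )

# The response is a flat list of "Blocks" (PAGE, LINE, WORD, KEY_VALUE_SET, CELL, ...).
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"], block["Geometry"]["BoundingBox"])
```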
If that doesn't work (or it's too expensive, or you need an open solution), then it's research time! First off, you want "multi-modal" approaches that consider text, geometry, maybe even font weight to understand document structure. E.g. Fonduer https://sing.stanford.edu/site/publications/fonduer-sigmod18.pdf
8/x
Here's some Google work that does representation learning on tokens + geometry input (exactly what PDFPlumber outputs)
- "Representation Learning for Information Extraction from Form-like Documents" https://aclweb.org/anthology/2020.acl-main.580.pdf
9/x
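The input side of that kind of model is basically what PDFPlumber already hands you. A rough sketch of turning its word dicts into (token, normalized-box) pairs -- the 0-1000 scaling is my own simplification of the usual layout-model convention, not the paper's exact recipe:

```python
import pdfplumber

def words_with_normalized_boxes(path):
    """Yield (token_text, [x0, y0, x1, y1]) with coordinates scaled to 0-1000."""
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            w, h = page.width, page.height
            for word in page.extract_words():
                box = [
                    int(1000 * word["x0"] / w),
                    int(1000 * word["top"] / h),
                    int(1000 * word["x1"] / w),
                    int(1000 * word["bottom"] / h),
                ]
                yield word["text"], box

# These pairs become the model's input: a text embedding plus geometry embeddings per token.
for token, box in words_with_normalized_boxes("disclosure_form.pdf"):  # hypothetical file
    print(token, box)
```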
And here is what seems to be the (public?) SOTA of multi-modal form extraction. Uses text + geometry + image embeddings in a BERT model
"LayoutLM: Pre-training of Text and Layout for Document Image Understanding"
https://arxiv.org/abs/1912.13318 
11/x
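LayoutLM is on the HuggingFace hub, so you can prototype without training from scratch. A minimal sketch (the classification head here is untrained, so the outputs are meaningless until you fine-tune; boxes and labels are made-up examples, and the API may shift between transformers versions):

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=3  # e.g. B-AMOUNT / I-AMOUNT / O
)

words = ["Gross", "amount:", "$1,250.00"]  # tokens from pdfplumber/Tesseract
boxes = [[60, 70, 110, 82], [115, 70, 180, 82], [300, 70, 380, 82]]  # 0-1000 normalized

# Expand word-level boxes to subword tokens, then add [CLS]/[SEP] boxes at the ends.
token_boxes = []
for word, box in zip(words, boxes):
    token_boxes.extend([box] * len(tokenizer.tokenize(word)))
encoding = tokenizer(" ".join(words), return_tensors="pt")
bbox = torch.tensor([[[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]])

outputs = model(
    input_ids=encoding["input_ids"],
    bbox=bbox,
    attention_mask=encoding["attention_mask"],
)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```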
Just in case you're not yet convinced, here's a longer argument about why form extraction (and data wrangling generally) is likely the most productive application of AI to journalism -- the problem is well-specified, and data prep takes a LOT of time.
https://www.researchgate.net/profile/Jonathan_Stray/publication/334182207_Making_Artificial_Intelligence_Work_for_Investigative_Journalism/links/5e41b987a6fdccd9659a1737/Making-Artificial-Intelligence-Work-for-Investigative-Journalism.pdf
12/x
Finally, a big shoutout to the deepform team: @metaphdor, @moredataneeded, @danielfennelly, Andrea Lowe and Gray Davidson. We've been working hard on extracting the FCC political TV ad PDFs -- "public" info that costs $100k to buy as clean data. Code:
https://github.com/project-deepform/deepform
13/x
If you are an ML engineer who would like to get involved in form-extraction-for-democracy, @weights_biases is very kindly hosting a public benchmark for the FCC ad data. Can you advance the open state of the art and beat our baseline? We'd love that!
https://wandb.ai/deepform/political-ad-extraction/benchmark
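Even if you just want to poke at it, logging a run against the benchmark project is a couple of lines (the metric names and numbers below are placeholders, not the benchmark's official ones -- see the benchmark page for the real setup):

```python
import wandb

# Start a run in the public benchmark project and log your extraction metrics.
run = wandb.init(project="political-ad-extraction", entity="deepform")
wandb.config.update({"model": "my-layoutlm-baseline"})  # hypothetical config

# ...train and evaluate your extractor here...
wandb.log({"token_accuracy": 0.87, "doc_exact_match": 0.62})  # placeholder numbers
run.finish()
```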
~END~