Here's the thing with #260papers: some days you read more, some days you read less, but averaging a paper per work day seems like a lot, especially since it assumes you keep reading even on holidays.

Perhaps I underestimate myself, but I'm aiming for #156papers this year (avg of 3/week).
I figure that #156papers encourages steady effort to keep reading, while allowing that sometimes you're on holiday or have deadlines for producing rather than consuming knowledge, which would keep you from hitting #260papers. I don't want to think about #365papers...
Paper 3 of #156papers is "Rethinking the agreement in human evaluation tasks (Position Paper)" (Amidei, Piwek, Willis 2018, https://www.aclweb.org/anthology/C18-1281/)

The authors highlight some of the linguistic variation which can be erased in a quest for higher interannotator agreement.
Paper 5 of #156papers is "Survey on Evaluation Methods for Dialogue Systems" (Deriu et al. 2019, preprint; https://arxiv.org/abs/1905.04071).

Nice place to start looking at what's been done in evaluating dialogue systems. Emphasizes automated metrics more than I'd like.
Paper 6 of #156papers is "A Quantitative Study of Data in the NLP community" (Mieskes 2017, https://www.aclweb.org/anthology/W17-1603/)

Looking at NAACL, ACL, EMNLP, LREC, and COLING pubs from 2016, about 40% reported publication of some data. More than 15% of the links provided didn't work.
Paper 8 of #156papers is "Why We Need New Evaluation Metrics for NLG" (Novikova et al. 2017, http://aclweb.org/anthology/D17-1238)

Looking at word-overlap metrics (e.g. BLEU) and grammar-based metrics (e.g. parsability scores), and how they correlate with each other and with human scores.
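
To make "word-overlap" concrete, here's a tiny Python sketch using NLTK's sentence_bleu. The sentences and the smoothing choice are my own toy example, not anything from the paper:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Word-overlap metrics compare a system output against one or more references.
reference = "there was no cost estimate for the second phase".split()
hypothesis = "there was no cost estimate for phase two".split()

# Smoothing avoids a zero score when some higher-order n-gram never matches.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"sentence-level BLEU: {score:.3f}")
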
Paper 9 of #156papers is "The reliability of acceptability judgments across languages" (Linzen & Oseki 2018, http://www.glossa-journal.org/articles/10.5334/gjgl.528/)

Acceptability judgements inform syntactic theory. They argue for experimental support for disputed cases, but not for every grammaticality judgement.
Paper 10 of #156papers is "Quantifying sentence acceptability measures: Reliability, bias, and variability" (Langsford et al. 2018, https://www.glossa-journal.org/article/10.5334/gjgl.396/)

They look at Likert scales, magnitude estimation, & forced-choice tasks to assess acceptability, finding high reliability.
Paper 11 of #156papers is "Investigating stability and reliability of crowdsourcing output" (Qarout et al. 2018, http://ceur-ws.org/Vol-2276/paper10.pdf)

Short one comparing MTurk & Figure Eight on an IR task. Turkers were faster per subject with higher accuracy; Figure Eight was faster overall. Both accuracies were consistent over time.
Paper 12 of #156papers is "On Understanding the Relation between Expert Annotations of Text Readability and Target Reader Comprehension" (Vajjala & Lučić 2019, https://www.aclweb.org/anthology/W19-4437/)

Previous simplification & readability work focused on writer-leveled text. Vajjala & Lučić study reader comprehension instead.
Paper 13 of #156papers is "A method for comparing fluency measures and its application to ITS natural language question generation" (Wilson 2003, https://www.aaai.org/Library/Symposia/Spring/2003/ss03-06-027.php)

Compares bigram and parse-based language models for scoring question fluency.
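
Roughly what the bigram side of that looks like (my own toy sketch, not Wilson's actual setup): score a question by its average add-one-smoothed bigram log-probability under a small training corpus, so more fluent word orders score higher.

import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def fluency_score(tokens, unigrams, bigrams, vocab_size):
    """Average add-one-smoothed bigram log-probability; higher = more fluent."""
    padded = ["<s>"] + tokens + ["</s>"]
    logp = 0.0
    for prev, word in zip(padded, padded[1:]):
        logp += math.log((bigrams[(prev, word)] + 1) /
                         (unigrams[prev] + vocab_size))
    return logp / (len(padded) - 1)

# Toy usage with a tiny made-up corpus (illustrative only).
corpus = [s.split() for s in ["what is the second phase",
                              "what is the cost estimate",
                              "the second phase has no cost estimate"]]
uni, bi = train_bigram_lm(corpus)
vocab = len(uni)
print(fluency_score("what is the cost".split(), uni, bi, vocab))   # higher
print(fluency_score("cost the is what".split(), uni, bi, vocab))   # lower
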
Paper 14 of #156papers is "Evaluation metrics for generation" (Bangalore et al. 2000, https://www.aclweb.org/anthology/W00-1401/)

Pre-dating BLEU, they correlated string- and dependency-tree-based edit distance metrics with human evaluations of understandability and quality.
Their example text:
There was no cost estimate for the second phase.
versus
There was estimate for phase the second no cost.

Simple String Acc: 0.44
w/ movement errors: 0.56
Simple Tree Acc: 0.33
w/ movement errors: 0.67

BLEU: 17.75 (100.0/44.4/6.2/3.6)
These metrics still require references (like BLEU does) and so can only be used in relatively limited generation contexts, but I like that they make use of syntactic (dependency) trees rather than treating all token movements the same.
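
Here's a quick Python sketch of Simple String Acc as I read it: 1 minus the token-level edit distance (insertions + deletions + substitutions) divided by the reference length. The whitespace tokenization and dropping the final period are my assumptions, picked so the toy example reproduces the 0.44 above.

def levenshtein(ref, hyp):
    """Token-level edit distance (insertions, deletions, substitutions)."""
    m, n = len(ref), len(hyp)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n]

def simple_string_accuracy(ref, hyp):
    """1 - normalized edit distance; can go negative for very bad outputs."""
    return 1 - levenshtein(ref, hyp) / len(ref)

ref = "There was no cost estimate for the second phase".split()
hyp = "There was estimate for phase the second no cost".split()
print(round(simple_string_accuracy(ref, hyp), 2))  # 0.44

The tree-based variants do the same kind of accounting over dependency trees, which is why scrambling whole phrases hurts them differently than scrambling individual tokens.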

Don't know if anyone's used them recently.
Erm, this thread continues here https://twitter.com/_dmh/status/1226612902540513281