Here's the thing with #260papers: some days you read more, some days you read less, but averaging a paper per work day seems like a lot, especially since that expects you to keep reading even on holidays.

Perhaps I underestimate myself, but I'm aiming for #156papers this year (avg of 3/week).
I figure that #156papers encourages steady effort to keep reading while allowing for the fact that sometimes you're on holiday or have deadlines related to producing rather than consuming knowledge, keeping you from hitting #260papers. I don't want to think about #365papers...
Paper 3 of #156papers is "Rethinking the agreement in human evaluation tasks (Position Paper)" (Amidei, Piwek, Willis 2018, https://www.aclweb.org/anthology/C18-1281/)

The">https://www.aclweb.org/anthology... authors highlight some of the linguistic variation which can be erased in a quest for higher interannotator agreement.
Paper 4 of #156papers is "A Practical Taxonomy of Reproducibility for Machine Learning Research" (Tatman, VanderPlas, Dane 2018; https://openreview.net/forum?id=B1eYYK5QgX)

I& #39;d">https://openreview.net/forum... love to see folks sharing not just NLP code and data, but also evaluation materials (surveys, scores, analysis)
Paper 5 of #156papers is "Survey on Evaluation Methods for Dialogue Systems" (Deriu et al. 2019, preprint/manuscript; https://arxiv.org/abs/1905.04071).

Nice">https://arxiv.org/abs/1905.... place to start looking at what& #39;s been done in evaluating dialogue systems. Emphasizes automated metrics more than I& #39;d like.
Paper 6 of #156papers is "A Quantitative Study of Data in the NLP community" (Mieskes 2017, https://www.aclweb.org/anthology/W17-1603/)

Looking">https://www.aclweb.org/anthology... at NAACL, ACL, EMNLP, LREC, and COLING pubs from 2016, about 40% reported publication of some data. More than 15% of the links provided didn& #39;t work.
Paper 7 of #156papers is "What makes a good conversation? How controllable attributes affect human judgments" (See et al. 2019, https://www.aclweb.org/anthology/N19-1170/)

They">https://www.aclweb.org/anthology... use discrete input categories to manipulate repetition, etc, & examine their impact on user ratings for multiturn dialogs
Paper 8 of #156papers is "Why We Need New Evaluation Metrics for NLG" (Novikova et al. 2017, http://aclweb.org/anthology/D17-1238)

Looking">https://aclweb.org/anthology... at word-overlap metrics (e.g. BLEU) and grammar-based metrics (e.g. parsability scores) and the various correlations among them and human scores
Paper 9 of #156papers is "The reliability of acceptability judgments across languages" (Linzen & Oseki 2018, http://www.glossa-journal.org/articles/10.5334/gjgl.528/)

Acceptability">https://www.glossa-journal.org/articles/... judgements inform syntactic theory. They argue for experimental support for disputed cases, but not for every grammaticality jdgmnt
Paper 10 of #156papers is "Quantifying sentence acceptability measures: Reliability, bias, and variability" (Langsford et al. 2018, https://www.glossa-journal.org/article/10.5334/gjgl.396/)

They">https://www.glossa-journal.org/article/1... look at Likert scales, magnitude estimation, & forced choice tasks to assess acceptability, finding high reliability
Paper 11 of #156papers is "Investigating stability and reliability of crowdsourcing output" (Qarout et al. 2018, http://ceur-ws.org/Vol-2276/paper10.pdf)

Short">https://ceur-ws.org/Vol-2276/... one on MTurk & FigureEight for an IR task. Turkers: faster per subj w/higher acc. F8: faster overall. Both acc.s consistent over time.
Paper 12 of #156papers is "On Understanding the Relation between Expert Annotations of Text Readability & Target Reader Comprehension" (Vajjala & Lučić 2019, https://www.aclweb.org/anthology/W19-4437/)

Prev">https://www.aclweb.org/anthology... simplification&readability work focused on writer-leveled text. V&L study reader comp instead
Paper 13 of #156papers is "A method for comparing fluency measures and its application to ITS natural language question generation" (Wilson 2003, https://www.aaai.org/Library/Symposia/Spring/2003/ss03-06-027.php)

Compares">https://www.aaai.org/Library/S... bigram and parse-based language models for scoring question fluency.
Paper 14 of #156papers is "Evaluation metrics for generation" (Bangalore et al. 2000, https://www.aclweb.org/anthology/W00-1401/)

Pre-dating">https://www.aclweb.org/anthology... BLEU, they correlated string- and dependency-tree-based edit distance metrics with human evaluations of understandability and quality.
Their example text:
There was no cost estimate for the second phase.
versus
There was estimate for phase the second no cost.

Simple String Acc: 0.44
w/movement errors: 0.56
Simple Tree Acc: 0.33
w/movement errors: 0.67

BLEU: 17.75 (100.0/44.4/6.2/3.6)
These metrics still require references (like BLEU does) and so can only be used in relatively limited generation contexts, but I like that they make use of syntactic (dependency) trees rather than treating all token movements the same.

Don't know if anyone's used 'em recently.
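(Aside: here's a rough sketch of the simple string accuracy idea, i.e. 1 minus token-level edit distance over reference length, applied to the example above. The movement-aware and dependency-tree-based variants from the paper aren't reproduced here.)

```python
# Minimal sketch of simple string accuracy:
#   1 - (token-level edit distance / reference length)
# Only the plain string variant is shown; the movement-aware and
# tree-based scores from Bangalore et al. (2000) are not.

def edit_distance(ref, hyp):
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def simple_string_accuracy(ref, hyp):
    return 1.0 - edit_distance(ref, hyp) / len(ref)

ref = "There was no cost estimate for the second phase".split()
hyp = "There was estimate for phase the second no cost".split()
print(round(simple_string_accuracy(ref, hyp), 2))  # ~0.44 for the example above
```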
Erm, this thread continues here: https://twitter.com/_dmh/status/1226612902540513281