Here's the thing with #260papers: some days you read more, some days you read less, but averaging a paper per work day seems like a lot, especially since it assumes you're still reading on holidays.
Perhaps I underestimate myself, but I'm aiming for #156papers this year (avg of 3/week)
I figure that #156papers encourages steady effort to keep reading while allowing for the fact that sometimes you're on holiday or have deadlines related to producing rather than consuming knowledge, keeping you from hitting #260papers. I don't want to think about #365papers...
Anyway, back to work yesterday and just finished paper 1 of #156papers: "Ordinal Regression Models in Psychology: A Tutorial" (BĂĽrkner & Vuorre 2019, https://journals.sagepub.com/doi/10.1177/2515245918823199)
Reading to aid my effort to better model human evaluation data for #NLG.
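For anyone curious what that looks like in practice, here's a minimal sketch (the tutorial itself uses brms in R; this is statsmodels in Python, with made-up ratings and a made-up system indicator) of treating 1-5 Likert judgments as ordered categories rather than metric data:

```python
# Sketch only: hypothetical human evaluation data, not from the paper.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.DataFrame({
    "rating": [5, 4, 4, 2, 3, 5, 3, 4, 1, 2, 4, 3, 2, 3, 5, 4],  # 1-5 judgments
    "system": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 = new system, 0 = baseline
})
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

# Cumulative-link ("ordinal regression") model: estimates thresholds between
# rating categories plus the effect of the system indicator.
model = OrderedModel(df["rating"], df[["system"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```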
Paper 2 of #156papers is "Estimating the reproducibility of psychological science" (Open Science Collaboration 2015, http://www.sciencemag.org/cgi/doi/10.1126/science.aac4716)
Understanding reproducibility, replicability, etc., is important for figuring out how we can improve #NLProc & #ComputationalLinguistics
Paper 3 of #156papers is "Rethinking the agreement in human evaluation tasks (Position Paper)" (Amidei, Piwek, Willis 2018, https://www.aclweb.org/anthology/C18-1281/)
The authors highlight some of the linguistic variation which can be erased in a quest for higher interannotator agreement.
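For context, the agreement number usually being chased is something like Cohen's kappa. A toy computation (not from the paper; the labels are invented):

```python
# Two hypothetical annotators labelling the same ten items.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["good", "good", "bad", "good", "bad", "good", "bad", "bad", "good", "good"]
annotator_b = ["good", "bad",  "bad", "good", "bad", "good", "good", "bad", "good", "bad"]

# Cohen's kappa: chance-corrected agreement between the two annotators.
print(cohen_kappa_score(annotator_a, annotator_b))
```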
Paper 4 of #156papers is "A Practical Taxonomy of Reproducibility for Machine Learning Research" (Tatman, VanderPlas, Dane 2018; https://openreview.net/forum?id=B1eYYK5QgX)
I'd love to see folks sharing not just NLP code and data, but also evaluation materials (surveys, scores, analysis)
Paper 5 of #156papers is "Survey on Evaluation Methods for Dialogue Systems" (Deriu et al. 2019, preprint/manuscript; https://arxiv.org/abs/1905.04071).
Nice place to start looking at what's been done in evaluating dialogue systems. Emphasizes automated metrics more than I'd like.
Paper 6 of #156papers is "A Quantitative Study of Data in the NLP community" (Mieskes 2017, https://www.aclweb.org/anthology/W17-1603/)
Looking at NAACL, ACL, EMNLP, LREC, and COLING pubs from 2016, about 40% reported publication of some data. More than 15% of the links provided didn't work.
Paper 7 of #156papers is "What makes a good conversation? How controllable attributes affect human judgments" (See et al. 2019, https://www.aclweb.org/anthology/N19-1170/)
They use discrete input categories to manipulate repetition, etc., & examine their impact on user ratings of multi-turn dialogs
Paper 8 of #156papers is "Why We Need New Evaluation Metrics for NLG" (Novikova et al. 2017, http://aclweb.org/anthology/D17-1238)
Looking at word-overlap metrics (e.g. BLEU) and grammar-based metrics (e.g. parsability scores), and how they correlate with each other and with human scores
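A rough sketch of the kind of analysis they run, under my own assumptions (invented sentences and human ratings): score outputs with a word-overlap metric and check the rank correlation with human judgments.

```python
# Sketch only: toy references, system outputs, and human ratings.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

data = [
    # (reference, system output, human rating)
    ("the hotel is near the river", "the hotel is near the river", 5),
    ("the hotel is near the river", "the hotel is by the river", 4),
    ("the restaurant serves cheap italian food", "cheap italian food is served here", 3),
    ("there is no parking available", "parking parking parking parking", 1),
]

smooth = SmoothingFunction().method1
bleu = [sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
        for ref, hyp, _ in data]
human = [h for _, _, h in data]

# Spearman rank correlation between the metric and the human ratings.
rho, p = spearmanr(bleu, human)
print(f"Spearman rho between BLEU and human ratings: {rho:.2f} (p={p:.2f})")
```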
Paper 9 of #156papers is "The reliability of acceptability judgments across languages" (Linzen & Oseki 2018, http://www.glossa-journal.org/articles/10.5334/gjgl.528/)
Acceptability judgements inform syntactic theory. They argue for experimental support for disputed cases, but not for every grammaticality judgment
Paper 10 of #156papers is "Quantifying sentence acceptability measures: Reliability, bias, and variability" (Langsford et al. 2018, https://www.glossa-journal.org/article/10.5334/gjgl.396/)
They look at Likert scales, magnitude estimation, & forced choice tasks to assess acceptability, finding high reliability
Paper 11 of #156papers is "Investigating stability and reliability of crowdsourcing output" (Qarout et al. 2018, http://ceur-ws.org/Vol-2276/paper10.pdf)
Short one on MTurk & Figure Eight for an IR task. MTurk workers were faster per subject and more accurate; Figure Eight was faster overall. Both platforms' accuracy was consistent over time.
Paper 12 of #156papers is "On Understanding the Relation between Expert Annotations of Text Readability & Target Reader Comprehension" (Vajjala & Lučić 2019, https://www.aclweb.org/anthology/W19-4437/)
Previous simplification & readability work focused on text leveled by writers/experts; V&L study target readers' comprehension instead
Paper 13 of #156papers is "A method for comparing fluency measures and its application to ITS natural language question generation" (Wilson 2003, https://www.aaai.org/Library/Symposia/Spring/2003/ss03-06-027.php)
Compares bigram and parse-based language models for scoring question fluency.
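Haven't seen Wilson's code, but the bigram half of that comparison is easy to sketch from scratch: score a sentence by its average add-one-smoothed bigram log-probability under some corpus (a toy corpus here).

```python
# From-scratch sketch (not Wilson's code): higher average bigram
# log-probability ~ more fluent, relative to the toy corpus below.
import math
from collections import Counter

corpus = [
    "what is the capital of france",
    "what is the largest planet",
    "who wrote the origin of species",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def bigram_logprob(sentence: str) -> float:
    """Average add-one-smoothed bigram log-probability per token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return logp / (len(tokens) - 1)

print(bigram_logprob("what is the capital of france"))   # relatively high
print(bigram_logprob("capital what the of is france"))   # relatively low
```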
Paper 14 of #156papers is "Evaluation metrics for generation" (Bangalore et al. 2000, https://www.aclweb.org/anthology/W00-1401/)
Pre-dating BLEU, they correlated string- and dependency-tree-based edit distance metrics with human evaluations of understandability and quality.
Their example text:
There was no cost estimate for the second phase.
versus
There was estimate for phase the second no cost.
Simple String Acc: 0.44
w/movement errors: 0.56
Simple Tree Acc: 0.33
w/movement errors: 0.67
BLEU: 17.75 (100.0/44.4/6.2/3.6)
These metrics still require references (like BLEU does) and so can only be used in relatively limited generation contexts, but I like that they make use of syntactic (dependency) trees rather than treating all token movements the same.
Don't know if anyone's used them recently.
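For concreteness, here's my reading of Simple String Accuracy as a sketch (not the authors' code): 1 minus the token-level edit distance divided by the reference length, with movement errors not handled. With the example above, and my guess at the tokenization, it lands on roughly the 0.44 quoted.

```python
# Sketch of Simple String Accuracy: 1 - (insertions + deletions + substitutions) / reference length.
def simple_string_accuracy(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 1 - d[len(ref)][len(hyp)] / len(ref)

print(simple_string_accuracy(
    "There was no cost estimate for the second phase",
    "There was estimate for phase the second no cost"))  # ~0.44
```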
Erm, this thread continues here https://twitter.com/_dmh/status/1226612902540513281