Here's the thing with #260papers: some days you read more, some days you read less, but averaging a paper per work day seems like a lot, especially since it assumes you're still reading on holidays.
Perhaps I underestimate myself, but I'm aiming for #156papers this year (avg of 3/week)
I figure that #156papers encourages steady effort to keep reading while allowing for the fact that sometimes you're on holiday or have deadlines related to producing rather than consuming knowledge, keeping you from hitting #260papers. I don't want to think about #365papers...
Anyway, back to work yesterday and just finished paper 1 of #156papers: "Ordinal Regression Models in Psychology: A Tutorial" (BĂĽrkner & Vuorre 2019, https://journals.sagepub.com/doi/10.1177/2515245918823199)
Reading to aid my effort to better model human evaluation data for #NLG.
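For anyone curious what that looks like in practice, here's a minimal sketch (the tutorial itself uses brms in R; this is statsmodels in Python, with made-up ratings and a made-up system indicator) of treating 1-5 Likert judgments as ordered categories rather than metric data:

```python
# Sketch only: hypothetical human evaluation data, not from the paper.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.DataFrame({
    "rating": [5, 4, 4, 2, 3, 5, 3, 4, 1, 2, 4, 3, 2, 3, 5, 4],  # 1-5 judgments
    "system": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 = new system, 0 = baseline
})
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

# Cumulative-link ("ordinal regression") model: estimates thresholds between
# rating categories plus the effect of the system indicator.
model = OrderedModel(df["rating"], df[["system"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```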
Paper 2 of #156papers is "Estimating the reproducibility of psychological science" (Open Science Collaboration 2015, http://www.sciencemag.org/cgi/doi/10.1126/science.aac4716)
Understanding reproducibility, replicability, etc., is important for figuring out how we can improve #NLProc & #ComputationalLinguistics
Paper 3 of #156papers is "Rethinking the agreement in human evaluation tasks (Position Paper)" (Amidei, Piwek, Willis 2018, https://www.aclweb.org/anthology/C18-1281/)
The authors highlight some of the linguistic variation which can be erased in a quest for higher interannotator agreement.
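For context, the agreement number usually being chased is something like Cohen's kappa. A toy computation (not from the paper; the labels are invented):

```python
# Two hypothetical annotators labelling the same ten items.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["good", "good", "bad", "good", "bad", "good", "bad", "bad", "good", "good"]
annotator_b = ["good", "bad",  "bad", "good", "bad", "good", "good", "bad", "good", "bad"]

# Cohen's kappa: chance-corrected agreement between the two annotators.
print(cohen_kappa_score(annotator_a, annotator_b))
```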
Paper 4 of #156papers is "A Practical Taxonomy of Reproducibility for Machine Learning Research" (Tatman, VanderPlas, Dane 2018; https://openreview.net/forum?id=B1eYYK5QgX)
I'd love to see folks sharing not just NLP code and data, but also evaluation materials (surveys, scores, analysis)
Paper 5 of #156papers is "Survey on Evaluation Methods for Dialogue Systems" (Deriu et al. 2019, preprint/manuscript; https://arxiv.org/abs/1905.04071).
Nice place to start looking at what's been done in evaluating dialogue systems. Emphasizes automated metrics more than I'd like.
Paper 6 of #156papers is "A Quantitative Study of Data in the NLP community" (Mieskes 2017, https://www.aclweb.org/anthology/W17-1603/)
Looking at NAACL, ACL, EMNLP, LREC, and COLING pubs from 2016, about 40% reported publication of some data. More than 15% of the links provided didn't work.
Paper 7 of #156papers is "What makes a good conversation? How controllable attributes affect human judgments" (See et al. 2019, https://www.aclweb.org/anthology/N19-1170/)
They use discrete input categories to manipulate repetition, etc., & examine their impact on user ratings of multi-turn dialogs
Paper 8 of #156papers is "Why We Need New Evaluation Metrics for NLG" (Novikova et al. 2017, http://aclweb.org/anthology/D17-1238)
Looking at word-overlap metrics (e.g. BLEU) and grammar-based metrics (e.g. parsability scores), and how they correlate with each other and with human scores
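A rough sketch of the kind of analysis they run, under my own assumptions (invented sentences and human ratings): score outputs with a word-overlap metric and check the rank correlation with human judgments.

```python
# Sketch only: toy references, system outputs, and human ratings.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

data = [
    # (reference, system output, human rating)
    ("the hotel is near the river", "the hotel is near the river", 5),
    ("the hotel is near the river", "the hotel is by the river", 4),
    ("the restaurant serves cheap italian food", "cheap italian food is served here", 3),
    ("there is no parking available", "parking parking parking parking", 1),
]

smooth = SmoothingFunction().method1
bleu = [sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
        for ref, hyp, _ in data]
human = [h for _, _, h in data]

# Spearman rank correlation between the metric and the human ratings.
rho, p = spearmanr(bleu, human)
print(f"Spearman rho between BLEU and human ratings: {rho:.2f} (p={p:.2f})")
```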
Paper 9 of #156papers is "The reliability of acceptability judgments across languages" (Linzen & Oseki 2018, http://www.glossa-journal.org/articles/10.5334/gjgl.528/)
Acceptability judgements inform syntactic theory. They argue for experimental support for disputed cases, but not for every grammaticality judgment
Paper 10 of #156papers is "Quantifying sentence acceptability measures: Reliability, bias, and variability" (Langsford et al. 2018, https://www.glossa-journal.org/article/10.5334/gjgl.396/)
They look at Likert scales, magnitude estimation, & forced choice tasks to assess acceptability, finding high reliability
Paper 11 of #156papers is "Investigating stability and reliability of crowdsourcing output" (Qarout et al. 2018, http://ceur-ws.org/Vol-2276/paper10.pdf)
Short one on MTurk & Figure Eight for an IR task. MTurk workers were faster per subject and more accurate; Figure Eight was faster overall. Both platforms' accuracy was consistent over time.
Paper 12 of #156papers is "On Understanding the Relation between Expert Annotations of Text Readability & Target Reader Comprehension" (Vajjala & Lučić 2019, https://www.aclweb.org/anthology/W19-4437/)
Previous simplification & readability work focused on text leveled by writers/experts; V&L study target readers' comprehension instead
Paper 13 of #156papers is "A method for comparing fluency measures and its application to ITS natural language question generation" (Wilson 2003, https://www.aaai.org/Library/Symposia/Spring/2003/ss03-06-027.php)
Compares bigram and parse-based language models for scoring question fluency.
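Haven't seen Wilson's code, but the bigram half of that comparison is easy to sketch from scratch: score a sentence by its average add-one-smoothed bigram log-probability under some corpus (a toy corpus here).

```python
# From-scratch sketch (not Wilson's code): higher average bigram
# log-probability ~ more fluent, relative to the toy corpus below.
import math
from collections import Counter

corpus = [
    "what is the capital of france",
    "what is the largest planet",
    "who wrote the origin of species",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def bigram_logprob(sentence: str) -> float:
    """Average add-one-smoothed bigram log-probability per token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return logp / (len(tokens) - 1)

print(bigram_logprob("what is the capital of france"))   # relatively high
print(bigram_logprob("capital what the of is france"))   # relatively low
```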
Paper 14 of #156papers is "Evaluation metrics for generation" (Bangalore et al. 2000, https://www.aclweb.org/anthology/W00-1401/)
Pre-dating BLEU, they correlated string- and dependency-tree-based edit distance metrics with human evaluations of understandability and quality.
Their example text:
There was no cost estimate for the second phase.
versus
There was estimate for phase the second no cost.
Simple String Acc: 0.44
w/movement errors: 0.56
Simple Tree Acc: 0.33
w/movement errors: 0.67
BLEU: 17.75 (100.0/44.4/6.2/3.6)
These metrics still require references (like BLEU does) and so can only be used in relatively limited generation contexts, but I like that they make use of syntactic (dependency) trees rather than treating all token movements the same.
Don't know if anyone's used them recently.
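For concreteness, here's my reading of Simple String Accuracy as a sketch (not the authors' code): 1 minus the token-level edit distance divided by the reference length, with movement errors not handled. With the example above, and my guess at the tokenization, it lands on roughly the 0.44 quoted.

```python
# Sketch of Simple String Accuracy: 1 - (insertions + deletions + substitutions) / reference length.
def simple_string_accuracy(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 1 - d[len(ref)][len(hyp)] / len(ref)

print(simple_string_accuracy(
    "There was no cost estimate for the second phase",
    "There was estimate for phase the second no cost"))  # ~0.44
```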
Erm, this thread continues here https://twitter.com/_dmh/status/1226612902540513281