Some notes regarding inter-annotator agreement in NLG:

Krippendorff (2004; https://repository.upenn.edu/cgi/viewcontent.cgi?article=1250&context=asc_papers) compares different measures for inter-annotator agreement and shows (following earlier literature) that Cohen’s Kappa has some serious flaws.
Maybe a good idea to abandon Kappa in favor of other measures.

Or at least check that all the requirements for safely using Kappa are actually met. But we all know that this is easily forgotten.
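To make the comparison concrete, here’s a minimal sketch (my own, not from Krippendorff’s paper) that computes Cohen’s Kappa and Krippendorff’s alpha on the same toy annotations. It assumes scikit-learn and the third-party krippendorff package are installed, and uses made-up labels.

```python
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

# Toy annotations: two raters labelling ten items as "good"/"bad".
rater_a = ["good", "good", "bad", "good", "bad", "good", "good", "bad", "good", "good"]
rater_b = ["good", "bad",  "bad", "good", "bad", "good", "good", "good", "good", "good"]

# Cohen's Kappa: defined for exactly two raters.
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))

# Krippendorff's alpha expects a raters x items matrix of numeric codes;
# np.nan would mark missing judgements (none here).
codes = {"bad": 0, "good": 1}
reliability_data = np.array([[codes[x] for x in rater_a],
                             [codes[x] for x in rater_b]], dtype=float)
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```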
This also led me to re-read Amidei et al. (INLG 2019; https://www.aclweb.org/anthology/W19-8642.pdf).

I’m still not sure whether I fully support the arguments presented in the paper (although I do support the main point: use multiple measures, rather than one), but it’s useful to think about evaluation.
The paper does have one flaw, which I hope won’t get repeated in the literature:

The tables presenting cutoff values for IAA measures suggest that a single set of cutoffs applies to all measures. That’s not true: each measure is defined differently, and Krippendorff’s scale only applies to alpha.
Some other points:

The paper suggests it’s not good to report percentage agreement. I disagree, for the same reason that the authors argue we should use both IAA and correlation: it’s an informative measure!
Combined with IAA, it tells you something about the skew of the data. Ideally I’d like to see a confusion matrix for papers using a categorical annotation task, because that provides all the information at a glance. (Especially if you represent it as a heatmap table.)
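As an illustration (my own sketch, not something from the Amidei et al. paper), here’s how percentage agreement and a confusion-matrix heatmap for two raters could be reported. It assumes pandas, seaborn, and matplotlib, with made-up labels.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rater_a = ["good", "good", "bad", "good", "bad", "good", "good", "bad", "good", "good"]
rater_b = ["good", "bad",  "bad", "good", "bad", "good", "good", "good", "good", "good"]

# Percentage agreement: share of items on which the two raters give the same label.
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"Percentage agreement: {agreement:.0%}")

# Confusion matrix of rater A vs rater B; the skew of the data is visible at a glance.
matrix = pd.crosstab(pd.Series(rater_a, name="Rater A"),
                     pd.Series(rater_b, name="Rater B"))
sns.heatmap(matrix, annot=True, fmt="d", cmap="Blues")
plt.tight_layout()
plt.show()
```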
Why I have doubts about the arguments presented in the Amidei et al. paper: it suggests that the problem lies with the evaluation measures, and that NLG evaluation is just inherently subjective, i.e. that we cannot make the ratings more objective.

I don’t think we’ve tried hard enough.
So how can we make human evaluation more objective? There’s quite a few suggestions from the social sciences.

For example: “quality” is a complex construct that shouldn’t be measured with a single question. Ideally you would have multiple questions per item.
By having multiple questions per item, you can break down a complex construct into different aspects. This reduces variation between annotators because their interpretation of the scales is more uniform.
This does come with a cost: more questions per item means that either your task takes longer to annotate, or you’ll have to reduce the number of items to judge.
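For illustration only (the question names and the 1–7 scale are assumptions on my part, not from any particular evaluation), here is a tiny sketch of what multiple questions per item could look like, with the sub-scores aggregated into an overall quality score.

```python
import statistics

# Hypothetical 1-7 Likert ratings for one generated text, from one annotator.
ratings = {
    "fluency":     6,  # Is the text grammatical and easy to read?
    "clarity":     5,  # Is the intended meaning clear?
    "correctness": 7,  # Is the content faithful to the input?
}

# One simple aggregation: the mean of the sub-scores as an overall quality score.
# Reporting the sub-scores separately is usually more informative than the mean alone.
overall_quality = statistics.mean(ratings.values())
print(overall_quality)  # 6.0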

So what about the practical issues here?
I’ve discussed this with Chris van der Lee, my colleague at @TilburgU, and his argument was simple: would you rather have more items but an unreliable measure, or fewer items but a reliable measure?

Reliability trumps quantity. But it does mean you have to think about sampling.
Evaluation in #NLProc doesn’t always take sampling into account. It’s just assumed that an 80-10-10 train-val-test split will give you three representative sets of items. But does it? And what does it mean to be representative? We should probably think more about this.
For example: maybe we should use stratified random sampling, to make sure we include all relevant variation in the test/val data. And then report overall results + results for each of the strata.
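A minimal sketch of what that could look like, assuming a pandas DataFrame with a (made-up) “category” column that captures the relevant variation; it uses scikit-learn’s train_test_split with stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "text":     [f"item {i}" for i in range(100)],
    "category": ["weather"] * 50 + ["sports"] * 30 + ["finance"] * 20,
})

# Stratified 80-10-10 split: each split keeps the same category proportions.
train, rest = train_test_split(data, test_size=0.2,
                               stratify=data["category"], random_state=42)
val, test = train_test_split(rest, test_size=0.5,
                             stratify=rest["category"], random_state=42)

# Report results per stratum alongside the overall score.
for category, items in test.groupby("category"):
    print(category, len(items))
```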
And as @EhudReiter has suggested: talk about worst-case performance! Many papers just report average-case performance, and throw in some cherries and lemons for a qualitative impression of best/worst cases. We can probably do better than this! E.g. report the proportion of bad results.
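One way to do that (my own sketch with an assumed threshold, not a specification of Ehud Reiter’s proposal): report the worst score and the proportion of outputs rated at or below the threshold, alongside the mean.

```python
import statistics

scores = [6, 5, 7, 2, 6, 1, 5, 6, 4, 7]  # toy per-output quality ratings on a 1-7 scale
threshold = 3  # assumed cutoff for a "bad" output

print("mean:", statistics.mean(scores))
print("worst case:", min(scores))
bad = [s for s in scores if s <= threshold]
print(f"proportion of bad outputs (<= {threshold}):", len(bad) / len(scores))
```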