Conjecture: High-quality MT test sets are useless for MT evaluation, since the automatic metrics we have cannot tell whether you are actually improving if the test set is too good.

WMT test sets might just be the right amount of bad.
Or the other way round: our metrics are useless, since we cannot tell whether we improved on high-quality test sets. But that we kinda knew.
I have compiled a test set from the English version of a German online news site. They have an English editorial team, and it is easy to find corresponding articles and identify the translation direction. The translations are produced by the team members for publication. Very HQ.
I have four WMT test sets from recent years, produced by unaffiliated translation companies. I have MT systems that improved by 4-6 (!) BLEU points on the WMT test sets without domain adaptation (just from general data cleaning). WMT19 improves from 40.1 to 47.0 BLEU, visibly.
How much does BLEU improve on the HQ test set? About 0.1 BLEU, from 26.6 to 26.7. Also note the much lower BLEU scores in general.
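For concreteness, a minimal sketch of how such corpus-level BLEU comparisons can be reproduced with sacreBLEU, assuming plain-text files with one segment per line; the file names here are placeholders, not the actual data above:

```python
import sacrebleu

def corpus_bleu_from_files(hyp_path, ref_path):
    """Score a hypothesis file against a single reference file (one segment per line)."""
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    # sacreBLEU takes a list of reference streams (one per reference set)
    return sacrebleu.corpus_bleu(hyps, [refs]).score

# Hypothetical file names: baseline vs. cleaned-data system,
# scored on a WMT test set and on the high-quality in-house test set.
for test, ref in [("wmt19", "wmt19.ref.en"), ("hq", "hq.ref.en")]:
    for system in ["baseline", "cleaned"]:
        score = corpus_bleu_from_files(f"{system}.{test}.en", ref)
        print(f"{system} on {test}: {score:.1f} BLEU")
```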

Now what?