HI THERE would YOU like to know why your cnovel mtl is so unreadable? Well guess who's been researching mtl for the past few months (meee)? Idk if my research has produced any conclusive results but i do now know a decent amount about how mtl systems work, so let's GO (thread)
(2/?) most modern mtl is some mix of statistical machine translation (smt) and neural machine translation (nmt). smt provides what it calculates to be the most statistically likely translation, based on the corpus of texts the model was trained on. nmt is a newer approach
(3/?) that uses an artificial neural network to translate a whole sentence end to end in one go (only one sentence at a time though. this is a problem). FYI google translate started switching over to a full nmt system in 2016; google says its systems are trained on millions of source texts.
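if "most statistically likely translation" sounds abstract, here's a tiny sketch of the idea in python. big caveat: the counts are numbers i invented for illustration (not real corpus stats), `TOY_COUNTS` and `most_likely` are hypothetical names, and real smt/nmt systems are far more involved than a lookup table. the point is only that whatever the training data saw most often wins.

```python
from collections import Counter

# Invented counts standing in for "how often the training corpus paired
# this source phrase with each rendering". These are NOT real statistics.
TOY_COUNTS = {
    "三郎": Counter({"Saburo": 120, "San Lang": 15}),
}

def most_likely(source_phrase):
    """Return the rendering with the highest (toy) count.

    Phrases the table has never seen are passed through unchanged.
    """
    counts = TOY_COUNTS.get(source_phrase)
    if counts is None:
        return source_phrase
    return counts.most_common(1)[0][0]

print(most_likely("三郎"))
```

in this toy table the japanese reading has the bigger count, so it gets picked no matter what novel the name came from.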
(4/?) mtl quality drops the further a language is linguistically from english; spanish is going to get a much better (still not perfect) translation than chinese or arabic, because the grammatical and lexical differences between those languages and english are so much wider.
(6/?) Now cn mtl actually works ok when it comes to news articles and research papers! This is because these texts tend to use very standard grammar and set phrases, with few if any literary flourishes or neologisms.
(7/?) But in general, some of the key differences in how cn works vs. en (no spaces between words, no articles, pronouns that get dropped whenever context makes them obvious, sentence and phrase structures that would be considered run-ons in en) make cn very difficult for mtl systems to parse into english.
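to make the "no spaces" problem concrete, here's a minimal python sketch of dictionary-based word segmentation. assumptions: the five-word dictionary and the function names are mine, made up for this example, and greedy max-matching is a classic textbook method, not necessarily what google or any current mtl system uses. it just shows how the same string of characters can be split two different ways.

```python
# Toy dictionary, hypothetical and tiny on purpose.
# 研究 = research, 研究生 = grad student, 生命 = life, 命 = fate/life, 起源 = origin
TOY_DICT = {"研究", "研究生", "生命", "命", "起源"}
MAX_LEN = 3  # longest word in the toy dictionary

def forward_max_match(text, vocab=TOY_DICT, max_len=MAX_LEN):
    """Greedy segmentation, left to right: always take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character if nothing in the dictionary matches.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, vocab=TOY_DICT, max_len=MAX_LEN):
    """Same greedy idea, but scanning from the end of the string."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]

sentence = "研究生命起源"  # intended reading: "research the origin of life"
print(forward_max_match(sentence))   # ['研究生', '命', '起源']
print(backward_max_match(sentence))  # ['研究', '生命', '起源']
```

scanning left to right grabs "grad student" first and leaves a stray 命 behind; scanning right to left happens to get the intended split. english never has this problem because the spaces do the work for you.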
(8/?) Unfortunately for us, we want to translate novels/fic, which are full of slang, neologisms, and literary references. And it's here that the limits of mtl become really glaring. If you mtl a cnovel, here are some of the most common issues you'll encounter:
(9/?) Non context-sensitive translations. mtl will produce the most likely translation of a line of text, but won't consider the larger context of that line. ex: if a character is id'd as male once, that won't necessarily carry over to other times where his gender is ambiguous.
(10/?) This also applies to names, which is why 三郎 in tgcf occasionally gets translated as "saburo" or "shiro". i feel like it should be fairly easy to find the most likely tl of a name and stick with that throughout a doc, but apparently not?
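for what it's worth, the "pick the most likely tl and stick with it" idea can be sketched as a post-processing pass over the english output. this is a rough python sketch under some big assumptions: `pick_canonical` and `enforce_name` are hypothetical helpers i made up, the sample text is invented, and you have to hand it the list of variant spellings yourself (which is arguably the hard part). it is not how any real mtl system handles names.

```python
import re
from collections import Counter

def pick_canonical(variants, text):
    """Return whichever variant shows up most often in the translated text."""
    counts = Counter({v: len(re.findall(re.escape(v), text)) for v in variants})
    return counts.most_common(1)[0][0]

def enforce_name(text, variants):
    """Rewrite every listed variant of a name to the most frequent one."""
    canonical = pick_canonical(variants, text)
    # Try longer variants first so a short one can't match inside a longer one.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    return pattern.sub(canonical, text)

# Invented sample output where one character is rendered two different ways.
mtl_output = "San Lang smiled. Saburo said nothing. San Lang left."
print(enforce_name(mtl_output, ["San Lang", "Saburo", "Shiro"]))
# San Lang smiled. San Lang said nothing. San Lang left.
```

the catch is that majority vote only helps if the right rendering is actually the most common one in the doc; if the system gets the name wrong most of the time, this just makes it consistently wrong.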
(11/?) Relatedly, genders may switch seemingly randomly – or not so randomly. as seen here, mtl incorporates the existing biases of the corpus of texts that the system is trained on: https://twitter.com/seyyedreza/status/935291445723922432?s=20
(12/?) Inability to recognize idioms or classical references. These may have been drilled into anyone who’s ever attended school in China but they haven’t been drilled into google’s brain! Both are frequently mangled; you can see examples here: https://www.thebeijinger.com/blog/2020/02/17/chinese-words-trip-up-google-translate
(13/?) Completely wrong transliterations of names? I sincerely have no idea how this happens but it's very common – I've seen Xie Lian as Xie Lin, Xie Pi, Xie Pity etc, though to google's credit this does seem to be getting better.
(14/?) and then sometimes google decides to transliterate a name (hbn…) as something completely different to any possible pronunciation of the characters in that name! For no apparent reason!
thread continues here https://twitter.com/IneluctableM/status/1265109795314548737?s=20