Today I'm going to try combining two things I often tweet about: tech and Cantonese. So, let's talk about parsing.

I'm currently studying Cantonese at #CUHK in #HongKong.
In English I would say: "My name is Nathan." The written representation maps 1:1 with the words spoken. For bonus points, the characters in each word themselves contain information about how they are pronounced. Put differently, serialization and deserialization are symmetric.
In Cantonese, using the Yale romanization, I would say out loud to somebody, "Ngóh ge méng haih Nathan." One possible written serialization of that is: "我嘅名係Nathan." Wait, what? One *possible* serialization!?
Cantonese is a diglossic language. It has different written and spoken forms. "我嘅名係Nathan" is a direct 1:1 serialization of the spoken form of Cantonese to characters. If I am texting a friend, or writing tweets, I might use this symmetric serialization.
However, if I were a public official writing tweets, I might instead write the more-formal, written Chinese, "我的名字是Nathan." For bonus points, this has its own intrinsic pronunciation: "ngóh dīk mìhngjih sih Nathan."
A few things happened here:
- Some 1:1 substitutions happened (嘅 became 的, and 係 became 是).
- A character was added (字).
- And 名 went from being pronounced "méng" to "mìhng"!
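To make those changes concrete, here is a minimal sketch in Python of going from the symmetric serialization to the more-formal written one. The `SPOKEN_TO_WRITTEN` table and the `to_written` function are hypothetical names made up for this illustration, and the table covers only the example sentence above. It is not a real Cantonese converter.

```python
# Hypothetical per-character table, limited to the example sentence.
# A real mapping would depend on context and be far larger.
SPOKEN_TO_WRITTEN = {
    "我": "我",    # unchanged
    "嘅": "的",    # 1:1 substitution
    "名": "名字",  # a character is added (字)
    "係": "是",    # 1:1 substitution
}


def to_written(spoken: str) -> str:
    """Rewrite each spoken-form character; pass anything unknown through."""
    return "".join(SPOKEN_TO_WRITTEN.get(ch, ch) for ch in spoken)


print(to_written("我嘅名係Nathan"))
```

Note that the table says nothing about pronunciation, so the shift of 名 from "méng" to "mìhng" is not captured here at all.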

This is not at all a context-free grammar!
But fine, language is weird, it's by rule foreign, and it isn't a precise science. "I before E except after C" is a garbage rule too.
Notice above that I said "我的名字是Nathan" has its own "intrinsic pronunciation"? I chose those words specifically instead of "reading" because if you were instead to read them aloud (not just noting their pronunciation), you would actually say "我嘅名係Nathan" instead!
You deserialize the more-formal written Chinese into spoken Cantonese. On the fly. In a streaming manner. 🤯
Now, imagine you're reading a bedtime story to your kid. How do you teach your kid the difference between written and spoken characters when they're looking at one thing and you're saying something entirely different?
What if you're watching a movie with subtitles? Do you match what is being said, or what you're used to reading? Or do you mix it up and use the symmetric serialization for dialogue and written serialization for audio notes?
Or imagine you're a politician reading off a teleprompter to give a speech. Which serialization of spoken Cantonese would you use? Symmetric so you don't make mistakes, or the specifically more-formal, written Chinese for correctness?
The answer is, of course, "it depends." But more often than not you end up with the more-formal, written Chinese.
(Which is why the next one of you who tells me to check out subtitles as a way to learn Cantonese and/or written Chinese will henceforth be redirected to this thread.)
My mental model for this as a programmer is that you must read a lookahead of N characters from the input stream before you know which token you have received. I'm not a good parser: my lookahead for Cantonese is "until the end of the sentence," but I'm working on ... upgrading?
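That lookahead model can be sketched in Python as a streaming reader. Everything here is an assumption for illustration: the `WRITTEN_TO_SPOKEN` table, `read_aloud`, and `_take_token` are hypothetical names, and the table only knows the example sentence from earlier. Real reading aloud needs far more context than a fixed two-character window.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical written-to-spoken table covering only the example sentence.
WRITTEN_TO_SPOKEN = {
    "名字": "名",
    "的": "嘅",
    "是": "係",
}
# The longest written token decides how much lookahead is needed (N = 2 here).
MAX_LOOKAHEAD = max(len(key) for key in WRITTEN_TO_SPOKEN)


def _take_token(buffer: str) -> Tuple[str, str]:
    """Return (spoken token, remaining buffer), preferring the longest match."""
    for size in range(min(MAX_LOOKAHEAD, len(buffer)), 0, -1):
        candidate = buffer[:size]
        if candidate in WRITTEN_TO_SPOKEN:
            return WRITTEN_TO_SPOKEN[candidate], buffer[size:]
    # No table entry matches: pass the first character through unchanged.
    return buffer[0], buffer[1:]


def read_aloud(chars: Iterable[str]) -> Iterator[str]:
    """Consume written characters one at a time and yield spoken tokens."""
    buffer = ""
    for ch in chars:
        buffer += ch
        # Only commit to a token once enough lookahead has arrived.
        while len(buffer) >= MAX_LOOKAHEAD:
            token, buffer = _take_token(buffer)
            yield token
    # End of input: no more lookahead is coming, so flush what is left.
    while buffer:
        token, buffer = _take_token(buffer)
        yield token


print("".join(read_aloud("我的名字是Nathan")))
```

The reader cannot emit anything for 名 until it has seen whether 字 follows, which is the "you don't know the token yet" feeling in miniature. A lookahead of "until the end of the sentence" would amount to buffering the whole input before yielding anything.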
All that to say: every Chinese-reading human out there has a built-in parser and transpiler!
You can follow @nathanhammond.