Most people teach text representation in computers very poorly, leaving almost everyone with a shaky understanding of it. A thread.
See, as computer scientists and engineers, we are fascinated by bits. And a lot less by text.

As a result, most lectures will focus on bit representations.
You will learn to represent the letter A in funny ways with 5, 6 or 7 bits.

You will learn about how clever UTF-8 is.
You might learn about surrogate pairs, shift states & many idiosyncrasies of the 1000s of encodings designed in the 20th century.

Every time a character doesn't render properly, you will think the bits must be wrong,
and reminisce about all those bit patterns you learned.
But what are encodings?

Merely a serialization format for text - text being something humans can interpret.
The fact that there are thousands of encodings is an artifact of history.
So given a blob of memory, how does your computer know which encoding to use to decode that text?

It doesn't.

There is no general-purpose mechanism through which the encoding is communicated; you NEED out-of-band information.
Some formats embed a tag in the data itself,
like <meta charset="UTF-8"> in HTML documents; the encoding can also be communicated through an HTTP header or any other application-specific mechanism.
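To make that concrete, here is a minimal Python sketch (Python is just a convenient illustration, not something this thread prescribes): the same bytes decode cleanly under several encodings, so the bytes alone cannot tell you which one is right.

    # The same bytes, decoded under different assumed encodings.
    data = b"caf\xc3\xa9"            # "café" serialized as UTF-8

    print(data.decode("utf-8"))      # café
    print(data.decode("latin-1"))    # cafÃ©  - decodes fine, but wrong
    print(data.decode("cp1252"))     # cafÃ©  - also decodes fine, also wrong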

More often though, the program will just guess.
Some, like browsers, will do statistical analysis to see whether the bit patterns in your pages are likely to belong to a given encoding. It works 100% of the time, 70% of the time.
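That guessing can be sketched with the third-party chardet package, a Python port of Mozilla's universal charset detector; treat the result as a guess with a confidence score, never a certainty.

    # Guessing an encoding from byte statistics (pip install chardet).
    import chardet

    sample = "こんにちは、世界。ようこそ。".encode("utf-8")
    print(chardet.detect(sample))
    # e.g. {'encoding': 'utf-8', 'confidence': 0.93, 'language': ''} - a guess, not a fact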

Older systems, like Windows, assume that applications which
output things on the console use the locale encoding.
What's that?
In the 80s, people thought that the characters used in text depended on the language you speak and the region you live in, so a computer bought in western Europe would use a different encoding than one bought in Japan, etc. Windows still uses that model.
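You can ask Python what that locale encoding is; the answer is a property of the machine, not of any particular document, which is exactly the problem. A small sketch:

    # The locale ("ANSI") encoding depends on how the system was set up.
    import locale

    print(locale.getpreferredencoding())
    # Typically cp1252 on a western-European Windows, cp932 on a Japanese one,
    # UTF-8 on most modern Linux/macOS systems.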
Because, see, not all encodings are equal. They can't all represent the same number of characters, nor the same characters. The goal of these encodings was to use as few bytes as possible rather than to be able to represent all text.
Only encoding characters likely to be used in a given place was a good way to save bytes.
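A quick sketch of that byte-saving logic, comparing a regional encoding with UTF-8 on a bit of Japanese text:

    # Regional encodings bought compactness for local text.
    text = "こんにちは"                      # five hiragana
    print(len(text.encode("shift_jis")))    # 10 bytes (2 per character)
    print(len(text.encode("utf-8")))        # 15 bytes (3 per character)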

Like not encoding colors that are unlikely to be perceived by the human eye is a good way to encode a movie.

But then the Internet happened.
And you see, maybe it is okay for a blue to be slightly more blue, or for a black to be slightly less deep, but when a character is poorly decoded, meaning is lost immediately.
Characters ARE meaning.
Either an encoding can convey the meaning of the text, or it can't. Period.
But focusing on bits, it's easy to lose track of that.
So how do we make sure meaning is not lost when converting between encodings?
For a long time, encodings assigned characters to an index in a font file, and conversions were difficult.
At best you had one conversion table per pair of encodings.
The thing is, there are hundreds of thousands of these things
people call characters.
Circa 1991, enter Unicode.
Unicode aims to build one big table of all known characters,
which makes it actually possible to identify characters, and to convert to/from other encodings.
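In practice that means conversions pivot through Unicode: decode the source bytes to Unicode text, then encode to the target, with no pairwise tables needed. A minimal sketch:

    # Legacy bytes -> Unicode text -> target encoding.
    legacy_bytes = "déjà vu".encode("cp1252")    # bytes in a legacy encoding
    text = legacy_bytes.decode("cp1252")         # Unicode text, meaning preserved
    print(text.encode("utf-8"))                  # re-serialized as UTF-8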
Sure, Unicode made a few mistakes: the first was to think that 65K characters ought to be enough; the second was to misidentify different characters as identical. Both mistakes have since been addressed.
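The fix for the 65K mistake is visible in UTF-16 today: characters beyond the original range are stored as surrogate pairs. A small sketch:

    # A character above U+FFFF takes a surrogate pair (two 16-bit units) in UTF-16.
    ch = "😀"                               # U+1F600
    print(hex(ord(ch)))                     # 0x1f600, above 0xffff
    print(len(ch.encode("utf-16-le")))      # 4 bytes - a surrogate pair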

Unicode is not complete: some characters existing in other encodings are not in Unicode.
There are also a lot of characters not in any computer system.
[Chinese writing system interlude]
These characters are usually people's and place names written in Hanzi.
Hanzi are based on strokes, and despite rules, an almost infinite number of stroke arrangements is possible.
New Hanzi and Kanji are minted regularly; creating characters is a hobby and an art.
It was realized early on that encoding strokes would not be a viable solution, so Unicode and other encodings encode a few tens of thousands of the most used ones.
Nobody even knows how many such Hanzi there are (we know of upwards of 106K).
Computers may, strictly speaking, never support _all_ characters.
But Unicode knows about 99.9% of the characters in use. Pretty cool!
So Unicode, as a set of encodings, is great: encodings that can represent tantalizingly close to all human text.

But the great achievement of Unicode is that it is an index of characters we know about. It offers us a framework to talk about characters and their *representability*.
This is why everyone uses Unicode today. It's a superset of everything else.
(Almost) everything converts to Unicode, and when converting from Unicode (never the best idea), we can fail when a character is not representable in the target encoding.
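That failure is observable in any language with codec support; a Python sketch of both options, failing loudly or losing meaning:

    # Converting from Unicode to a smaller encoding: fail, or silently lose meaning.
    try:
        "naïve 😀".encode("ascii")
    except UnicodeEncodeError as err:
        print(err)                                          # loud, recoverable failure
    print("naïve 😀".encode("ascii", errors="replace"))     # b'na?ve ?' - meaning destroyed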
But anyway, that is how you should be thinking about text encodings: serialization formats for text which, if lossy, destroy the meaning of that text.
Stop talking about bits; care about preserving meaning.
Writing systems are serialization formats for human thoughts.