Tom's wiki

Unicode

A standard that aims to cover the characters of all human writing systems. In practice, it is a table that assigns a unique number (a code point, written as U+ followed by hexadecimal digits) to each character.
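A small sketch of the character-to-code-point mapping, using Python's built-in ord() and chr(). The helper name code_point_label is made up for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


for ch in ["A", "é", "€", "😀"]:
    print(ch, code_point_label(ch))

# chr() goes the other way: code point -> character.
print(chr(0x20AC))
```

Note that the code point is only a number; how it is stored as bytes is decided by the encoding.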


UTF

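Unicode Transformation Format. A family of encodings (UTF-8, UTF-16, UTF-32) that turn code points into byte sequences for storage in memory, in files, or for transfer over a network.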
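To make the difference between a code point and its encoded bytes concrete, here is a sketch that encodes one character with three codecs from Python's standard library. The helper encode_all is hypothetical; the big-endian codec variants are chosen only so that no BOM is prepended.

```python
def encode_all(ch: str) -> dict:
    """Encode one character with three UTF encodings (big-endian, no BOM)."""
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


# The same code point, U+20AC, becomes a different byte sequence each time.
for name, data in encode_all("€").items():
    print(f"{name:10} {data.hex(' ')}")
```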

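The Byte Order Mark (BOM) is the code point U+FEFF placed at the beginning of a file. Its main job is to signal the byte order of UTF-16 and UTF-32 data; as a side effect, it hints at which encoding the file uses. Its size depends on the encoding: 2 bytes in UTF-16 (FE FF for big-endian, FF FE for little-endian), 3 bytes in UTF-8 (EF BB BF), and 4 bytes in UTF-32. The BOM is optional, and UTF-8 files often omit it.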
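A minimal sketch of BOM sniffing, assuming the BOM constants from Python's codecs module. The function name detect_bom and the returned labels are made up for this example. A missing BOM tells you nothing, so the function returns None in that case rather than guessing.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer patterns are checked first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfhello"))  # UTF-8 BOM followed by ASCII text
print(detect_bom(b"hello"))              # no BOM
```

One caveat: a UTF-16 LE file whose first character is U+0000 starts with the same four bytes as a UTF-32 LE BOM, so this kind of detection is a heuristic, not a proof.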

UTF-8

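A variable-length encoding, backwards compatible with ASCII. Compact for English and other ASCII-heavy text (1 byte per character), less compact for East Asian text, where most characters take 3 bytes compared to 2 in UTF-16.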
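The size trade-off can be checked directly. This sketch compares encoded sizes of a short English and a short Japanese string; encoded_sizes is a hypothetical helper, and the sample strings are arbitrary.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes for each UTF encoding (no BOM)."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("hello"))   # 5 ASCII characters
print(encoded_sizes("日本語"))  # 3 characters from the U+0800 to U+FFFF range
```

For the ASCII string, UTF-8 is half the size of UTF-16; for the Japanese string, UTF-16 is the smaller one (6 bytes against 9).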

ASCII characters U+0000 to U+007F take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes.
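The ranges above follow from how UTF-8 packs the bits of a code point into a lead byte plus continuation bytes. Below is a hand-written encoder as a sketch; utf8_encode is an illustrative name, and in real code you would call str.encode("utf-8") instead.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates and out-of-range values are not valid code points to encode.
        raise ValueError(f"cannot encode U+{cp:04X}")
    if cp < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for ch in "Aé€😀":
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```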

See also UTF-8: Bits, Bytes, and Benefits by Russ Cox.

UTF-16

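A variable-length encoding. Wasteful for English text (2 bytes per character compared to 1 in UTF-8), compact for East Asian text (2 bytes for most characters compared to 3 in UTF-8). It is not compatible with ASCII, and byte order matters (UTF-16LE vs. UTF-16BE).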

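Code points U+0000 to U+FFFF take 2 bytes (one 16-bit code unit), code points U+10000 to U+10FFFF take 4 bytes (two code units, called a surrogate pair). The range U+D800 to U+DFFF is reserved for surrogates and does not contain characters.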
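How a code point above U+FFFF becomes two 16-bit units can be sketched as follows. The function name utf16_units is made up for this example; the arithmetic (subtract 0x10000, split the remaining 20 bits into two halves of 10 bits) is the standard surrogate-pair calculation.

```python
def utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if cp < 0x10000:
        # Fits in a single code unit.
        return [cp]
    v = cp - 0x10000              # 20 bits remain
    high = 0xD800 | (v >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


print([f"{u:04X}" for u in utf16_units(ord("€"))])   # one unit
print([f"{u:04X}" for u in utf16_units(ord("😀"))])  # surrogate pair
```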

UTF-32

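A fixed-length encoding, all code points take 4 bytes. Finding the n-th code point is a simple offset calculation, but ASCII text takes four times as much memory as in UTF-8. Rarely used for files or network transfer.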
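Because every code point occupies exactly 4 bytes, the n-th code point sits at byte offset 4 * n. A sketch, assuming big-endian UTF-32 without a BOM; utf32_code_point_at is a hypothetical helper name.

```python
def utf32_code_point_at(data: bytes, index: int) -> int:
    """Read the code point at position `index` from big-endian UTF-32 bytes."""
    start = index * 4  # fixed width, so no scanning is needed
    return int.from_bytes(data[start:start + 4], "big")


data = "A€😀".encode("utf-32-be")
print(len(data))                                  # 3 code points * 4 bytes
print(f"U+{utf32_code_point_at(data, 2):04X}")    # third code point
```

The same lookup in UTF-8 or UTF-16 would require walking the data from the start, since earlier characters can have different lengths.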