Unicode
A standard that aims to cover the characters of all human writing systems. In practice, it is a table that assigns a unique number (code point) to each character, written as U+ followed by hexadecimal digits, from U+0000 to U+10FFFF.
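As a small illustration, here is a sketch in Python (a language chosen for these examples, not something the notes prescribe). The built-ins `ord` and `chr` convert between a character and its code point; the helper name `code_point` is made up for this example.

```python
def code_point(ch: str) -> str:
    """Format the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# Escapes are used so the sample does not depend on how the file is saved.
samples = "A\u00e9\u20ac\U0001F600"  # A, é, €, 😀

for ch in samples:
    print(ch, code_point(ch))

# chr() is the inverse of ord(): code point -> character.
assert chr(0x20AC) == "\u20ac"
```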
Learn
- The Absolute Minimum Every Software Developer Must Know About Unicode in 2023
- Computerphile: Characters, Symbols and the Unicode Miracle
UTF
Unicode Transformation Format. A family of encodings (UTF-8, UTF-16, UTF-32) that define how code points are stored as bytes, in memory or in files.
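The Byte Order Mark (BOM) is the code point U+FEFF placed at the beginning of a file. It tells the byte order of UTF-16 and UTF-32 data and hints at which encoding the file is using. Its size depends on the encoding: 2 bytes in UTF-16 (FE FF or FF FE), 3 bytes in UTF-8 (EF BB BF), 4 bytes in UTF-32. The BOM is optional, and many UTF-8 files omit it.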
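Because the BOM is the code point U+FEFF, its byte form differs per encoding, and a reader can sniff it from the first few bytes. Below is a minimal sketch using the BOM constants from Python's standard `codecs` module; the function name `detect_bom` and the returned labels are assumptions made for this example.

```python
import codecs
from typing import Optional

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so it has to be checked before it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A missing BOM says nothing about the encoding, so `None` means "unknown", not "ASCII". The check is also only a heuristic: UTF-16 LE data that starts with a BOM followed by U+0000 has the same first four bytes as a UTF-32 LE BOM.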
UTF-8
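A variable-length encoding, backwards compatible with ASCII. Compact for English and other ASCII-heavy text (1 byte per character), less compact for most Asian scripts, whose characters typically take 3 bytes instead of the 2 they take in UTF-16.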
Code points U+0000 to U+007F (ASCII) take 1 byte, U+0080 to U+07FF take 2 bytes, U+0800 to U+FFFF take 3 bytes, and U+10000 to U+10FFFF take 4 bytes.
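The ranges above can be checked against a real encoder. The sketch below predicts the byte count from the code point and compares it with what Python's built-in `str.encode("utf-8")` produces; `utf8_length` is a hypothetical helper written for this example.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 needs for the code point `cp`."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    if cp <= 0x10FFFF:
        return 4
    raise ValueError("code point is outside the Unicode range")

# One sample per range: A (U+0041), é (U+00E9), € (U+20AC), 😀 (U+1F600).
for ch in "A\u00e9\u20ac\U0001F600":
    encoded = ch.encode("utf-8")
    assert utf8_length(ord(ch)) == len(encoded)
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```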
See also UTF-8: Bits, Bytes, and Benefits by Russ Cox.
UTF-16
A variable-length encoding. Wasteful for English text (2 bytes per ASCII character), compact for most Asian text (2 bytes for characters that take 3 bytes in UTF-8).
Code points U+0000 to U+FFFF take 2 bytes (one 16-bit code unit); code points U+10000 to U+10FFFF take 4 bytes (two code units, called a surrogate pair).
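The 4-byte case works by splitting the code point into two 16-bit code units, a high and a low surrogate. A minimal sketch of that arithmetic follows, checked against Python's built-in `utf-16-be` codec; the helper names `to_surrogates` and `utf16_units` are assumptions made for this example.

```python
from typing import List, Tuple

def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    v = cp - 0x10000                  # 20 significant bits remain
    high = 0xD800 + (v >> 10)         # top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits
    return high, low

def utf16_units(text: str) -> List[int]:
    """Return the 16-bit code units of `text` encoded as UTF-16."""
    data = text.encode("utf-16-be")   # big-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

assert utf16_units("\u20ac") == [0x20AC]                        # 2 bytes
assert utf16_units("\U0001F600") == list(to_surrogates(0x1F600))  # 4 bytes
```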
UTF-32
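A fixed-length encoding: every code point takes 4 bytes. Finding the n-th code point is a constant-time lookup, but ASCII text takes four times the space it would in UTF-8. Rarely used for storage or interchange.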
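To make the trade-off concrete, the sketch below indexes directly into UTF-32 bytes and then compares the size of the same text in all three encodings. It uses Python's built-in `utf-8`, `utf-16-le`, and `utf-32-le` codecs (the `-le` variants write no BOM); `code_point_at` and `sizes` are hypothetical helpers written for this example.

```python
from typing import Dict

def code_point_at(data: bytes, index: int) -> int:
    """Return the code point at `index` in BOM-less little-endian UTF-32 data."""
    start = 4 * index                 # fixed width: no scanning needed
    return int.from_bytes(data[start:start + 4], "little")

def sizes(text: str) -> Dict[str, int]:
    """Byte length of `text` in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

text = "A\u00e9\u20ac\U0001F600"      # A, é, €, 😀
data = text.encode("utf-32-le")
assert len(data) == 4 * len(text)
assert code_point_at(data, 3) == 0x1F600

print(sizes("hello"))        # English: UTF-8 is the smallest
print(sizes("こんにちは"))    # Japanese: UTF-16 is the smallest
```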