Boneyard Tools

Unicode code points, planes and UTF-8

What a code point is, how Unicode planes and the BMP are organized, and how UTF-8 encodes any character in one to four bytes.

Code points and the U+ notation

A code point is the unique number Unicode assigns to a character, written as U+ followed by at least four hexadecimal digits. The letter A is U+0041, which is 65 in decimal, and the grinning face emoji is U+1F600, which is 128512. The range runs from U+0000 to U+10FFFF, giving room for 1,114,112 code points. A code point is an abstract identity: it says which character you mean, but not how that character is stored in memory or in a file.
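As a rough illustration of the notation, the sketch below converts between a code point number and its U+ form. The helper names toUPlus and fromUPlus are made up for this example and are not part of any library; codePointAt, toString and padStart are standard JavaScript.

```javascript
// Hypothetical helper: format a code point number in U+ notation,
// padded to at least four hexadecimal digits.
function toUPlus(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical helper: parse U+ notation back into a plain number.
function fromUPlus(notation) {
  return parseInt(notation.replace(/^U\+/i, ""), 16);
}

const letterA = "A".codePointAt(0);            // 65
const grinningFace = "\u{1F600}".codePointAt(0); // 128512

console.log(toUPlus(letterA));       // U+0041
console.log(toUPlus(grinningFace));  // U+1F600
console.log(fromUPlus("U+1F600"));   // 128512
```

The padding only sets a minimum width, so code points above U+FFFF simply print with five or six digits.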

Planes and the Basic Multilingual Plane

Unicode divides its code space into 17 planes of 65,536 code points each. Plane 0, the Basic Multilingual Plane or BMP, holds almost every character in common use: Latin, Greek, Cyrillic, Arabic, Hebrew, the main CJK ideographs and thousands of symbols. The higher planes, often called astral or supplementary, hold most emoji, historic scripts, rarer ideographs and specialized symbols. Anything with a code point above U+FFFF lives outside the BMP.
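Because each plane holds exactly 65,536 (0x10000) code points, the plane number is just the code point shifted right by 16 bits. The following is a minimal sketch; planeOf and isBmp are hypothetical names chosen for this example.

```javascript
// Hypothetical helper: return the plane (0 to 16) a code point belongs to.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  // Each plane spans 0x10000 code points, so the plane is the value
  // above the low 16 bits.
  return codePoint >> 16;
}

// Hypothetical helper: true when the code point is in plane 0, the BMP.
function isBmp(codePoint) {
  return planeOf(codePoint) === 0;
}

console.log(planeOf(0x0041));   // 0  (letter A, in the BMP)
console.log(planeOf(0x1F600));  // 1  (grinning face, a supplementary plane)
console.log(planeOf(0x10FFFF)); // 16 (last code point of the last plane)
console.log(isBmp(0xFFFF));     // true
console.log(isBmp(0x10000));    // false
```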

How UTF-8 encodes a code point

UTF-8 is a variable-length encoding that stores each code point in one to four bytes. Code points up to U+007F, the ASCII range, take a single byte, so A stays 0x41. Code points up to U+07FF take two bytes, those up to U+FFFF take three, and everything above that takes four. The leading byte signals the length of the sequence (it starts with the bits 0, 110, 1110 or 11110 for one, two, three or four bytes), and every continuation byte begins with the bits 10. That is why é (U+00E9) becomes 0xC3 0xA9 and the emoji U+1F600 becomes 0xF0 0x9F 0x98 0x80.
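The byte layout described above can be sketched by hand with a few shifts and masks. This is an illustrative encoder for a single code point, not how any particular tool does it; encodeUtf8 and toHex are hypothetical names. In real code the built-in TextEncoder would normally do this work.

```javascript
// Hypothetical helper: encode one code point as an array of UTF-8 bytes.
function encodeUtf8(codePoint) {
  const isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF || isSurrogate) {
    throw new RangeError("cannot encode: " + codePoint);
  }
  if (codePoint <= 0x7F) {
    // 0xxxxxxx
    return [codePoint];
  }
  if (codePoint <= 0x7FF) {
    // 110xxxxx 10xxxxxx
    return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
  }
  if (codePoint <= 0xFFFF) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xE0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3F),
      0x80 | (codePoint & 0x3F),
    ];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xF0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3F),
    0x80 | ((codePoint >> 6) & 0x3F),
    0x80 | (codePoint & 0x3F),
  ];
}

// Hypothetical helper: print bytes the way this page writes them.
function toHex(bytes) {
  return bytes.map((b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

console.log(toHex(encodeUtf8(0x41)));    // 0x41
console.log(toHex(encodeUtf8(0xE9)));    // 0xC3 0xA9
console.log(toHex(encodeUtf8(0x20AC)));  // 0xE2 0x82 0xAC
console.log(toHex(encodeUtf8(0x1F600))); // 0xF0 0x9F 0x98 0x80
```

The surrogate range U+D800 to U+DFFF is rejected because those values are reserved for UTF-16 and are not valid in well-formed UTF-8.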

Surrogate pairs in UTF-16

JavaScript strings use UTF-16, which stores each BMP character in one 16-bit code unit but needs two units, a surrogate pair, for anything above U+FFFF. That is why a single emoji can have a string length of 2 even though it is one code point. Iterating with the spread operator or a for-of loop walks whole code points and keeps each pair together, which is exactly what this tool does so that each astral character is counted and shown once.
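To make the pairing concrete, here is a small sketch of the standard surrogate arithmetic next to the built-in string behaviour. The function name toSurrogatePair is an assumption made for this example; charCodeAt, the spread operator and for-of are standard JavaScript.

```javascript
// Hypothetical helper: split a supplementary code point into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point: " + codePoint);
  }
  const offset = codePoint - 0x10000;    // a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const face = "\u{1F600}";

console.log(toSurrogatePair(0x1F600).map((u) => u.toString(16))); // [ 'd83d', 'de00' ]
console.log(face.charCodeAt(0).toString(16)); // d83d
console.log(face.charCodeAt(1).toString(16)); // de00

console.log(face.length);      // 2, counts UTF-16 code units
console.log([...face].length); // 1, spread walks whole code points

// for-of also yields one string per code point, pair intact.
for (const ch of "A" + face) {
  console.log(ch.codePointAt(0).toString(16)); // 41, then 1f600
}
```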

Frequently asked questions

Why does an emoji have a JavaScript length of 2?

Emoji above U+FFFF are stored as a UTF-16 surrogate pair, which is two 16-bit units. The .length property counts those units, while this tool counts whole code points, so it reports the emoji as one character.

Is a code point the same as a byte?

No. A code point is the character's number, while bytes are its encoded form. In UTF-8 one code point becomes one to four bytes, so the number of code points and the number of bytes only match for text that is entirely ASCII.
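One way to see the difference is to measure the same string three ways. The sketch below assumes an environment where TextEncoder is available as a global, as in modern browsers and current Node.js; the function name measure is hypothetical.

```javascript
// Hypothetical helper: report three different "sizes" of a string.
function measure(text) {
  return {
    codeUnits: text.length,                            // UTF-16 code units
    codePoints: [...text].length,                      // whole code points
    utf8Bytes: new TextEncoder().encode(text).length,  // bytes once encoded as UTF-8
  };
}

// "A" is 1 byte, "é" is 2 bytes, the emoji is 4 bytes.
console.log(measure("A\u00E9\u{1F600}"));
// { codeUnits: 4, codePoints: 3, utf8Bytes: 7 }

// For pure ASCII all three numbers agree.
console.log(measure("plain ASCII"));
// { codeUnits: 11, codePoints: 11, utf8Bytes: 11 }
```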

What is the largest possible code point?

U+10FFFF. Unicode caps the range there so that every code point can be encoded in UTF-16: a surrogate pair cannot reach any higher. That gives 1,114,112 positions across the 17 planes, of which 2,048 are reserved for the surrogates themselves.
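The cap falls out of the surrogate arithmetic: decoding the largest possible high and low surrogates gives exactly U+10FFFF. The following sketch checks this; fromSurrogatePair is a hypothetical helper name, while String.fromCodePoint is the standard built-in.

```javascript
// Hypothetical helper: combine a high and low surrogate into a code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// The largest high surrogate is 0xDBFF and the largest low surrogate is 0xDFFF.
const maxCodePoint = fromSurrogatePair(0xDBFF, 0xDFFF);
console.log(maxCodePoint.toString(16).toUpperCase()); // 10FFFF

// 0x110000 positions split into planes of 0x10000 each gives 17 planes.
console.log((maxCodePoint + 1) / 0x10000); // 17

// The built-in accepts the maximum and rejects anything beyond it.
console.log(String.fromCodePoint(0x10FFFF).length); // 2
try {
  String.fromCodePoint(0x110000);
} catch (err) {
  console.log(err instanceof RangeError); // true
}
```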