Boneyard Tools

How text becomes binary: characters, bytes, and UTF-8

Follow a character from keystroke to 8-bit binary: code points, UTF-8 bytes, why padding matters, and how decoding reverses each step.

From character to code point to byte

Computers do not store letters; they store numbers. Every character has a Unicode code point, a number that identifies it, such as 72 (written U+0048 in hexadecimal) for a capital H. To put that number into a file or a stream, an encoding turns the code point into one or more bytes, and UTF-8 is the encoding the modern web overwhelmingly uses. For the letter H, the code point 72 becomes a single byte with the value 72, and the binary you see is just that byte written in base two.
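The chain from character to code point to byte can be traced in a few lines of Python. This is only an illustrative sketch using the language's built-in `ord` and `str.encode`; the variable names are chosen for this example and are not taken from the tool itself.

```python
# Trace one character from text to code point to UTF-8 byte.
char = "H"

code_point = ord(char)             # the Unicode code point as an integer
utf8_bytes = char.encode("utf-8")  # the bytes a UTF-8 encoder writes out

print(code_point)          # 72
print(list(utf8_bytes))    # [72]
print(bin(utf8_bytes[0]))  # 0b1001000 (base two, not yet padded)
```

For an ASCII letter the code point and the byte value happen to be the same number, which is why the two steps are easy to confuse.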

Why each byte is eight padded digits

A byte is eight bits, so it can represent 256 distinct values, from 0 up to 255. The number 72 in binary is 1001000, which is only seven digits, so the tool pads it on the left with a zero to make 01001000. That fixed width is not decoration: it is what lets a decoder read a long binary string and know that every eight digits form one complete byte. Without consistent padding, a run of digits could be split into bytes in more than one way, and the boundary between one byte and the next would be ambiguous.
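A short Python sketch makes the padding step concrete. The helper name `to_padded_bits` is hypothetical, and the code assumes nothing about how the tool is actually written; it only shows that a fixed width of eight lets an unbroken string be cut back into bytes.

```python
def to_padded_bits(byte_value: int) -> str:
    """Return byte_value as exactly eight binary digits."""
    if not 0 <= byte_value <= 255:
        raise ValueError("a byte must be between 0 and 255")
    return format(byte_value, "08b")  # "08b": base two, zero-padded to width 8


unpadded = format(72, "b")
padded = to_padded_bits(72)
print(unpadded)  # 1001000
print(padded)    # 01001000

# With a fixed width, an unbroken string can be sliced every eight digits.
stream = to_padded_bits(72) + to_padded_bits(105)  # the bytes for "Hi"
chunks = [stream[i:i + 8] for i in range(0, len(stream), 8)]
print(chunks)    # ['01001000', '01101001']
```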

Multi-byte characters and emoji

ASCII characters fit in a single UTF-8 byte, but most of the world's characters do not. Accented letters, non-Latin scripts, and emoji use sequences of two, three, or four bytes, and each of those bytes still becomes its own 8-bit group in the output. A single emoji can therefore expand into four space-separated binary groups, and an emoji built from several code points expands into more. Because UTF-8 defines exactly how those multi-byte sequences are formed, the decoder can reassemble them back into the original character without guesswork.
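To see the one, two, three, and four byte cases side by side, here is a minimal Python sketch. The function name `text_to_binary` is an assumption made for this example, and the sample characters are written as escapes so the source stays plain ASCII.

```python
def text_to_binary(text: str) -> str:
    """Encode text as UTF-8 and render each byte as an 8-digit group."""
    return " ".join(format(b, "08b") for b in text.encode("utf-8"))


print(text_to_binary("H"))           # 01001000
print(text_to_binary("\u00e9"))      # 11000011 10101001  (e with acute accent)
print(text_to_binary("\u20ac"))      # 11100010 10000010 10101100  (euro sign)
print(text_to_binary("\U0001F600"))  # 11110000 10011111 10011000 10000000  (grinning face)
```

Notice the pattern in the leading bits: the first byte of a sequence starts with 110, 1110, or 11110 depending on its length, and every continuation byte starts with 10. That is what lets a decoder find where each character begins.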

Decoding reverses every step

Going from binary back to text simply undoes the process. The decoder splits the input on whitespace, checks that each group is a valid eight-bit byte, and collects the byte values into a buffer. It then asks the UTF-8 decoder to interpret that buffer, which rebuilds the original characters, including any multi-byte ones. If a group is the wrong length or the bytes do not form a legal UTF-8 sequence, decoding fails cleanly with an error rather than returning scrambled text.
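The decoding steps described above could be sketched in Python roughly as follows. The function name `binary_to_text` and the error message are hypothetical, and this is one possible implementation under those assumptions rather than the tool's own code.

```python
def binary_to_text(binary: str) -> str:
    """Decode whitespace-separated 8-bit groups back into text."""
    byte_values = []
    for group in binary.split():  # split on any run of whitespace
        # Each group must be exactly eight characters, all 0 or 1.
        if len(group) != 8 or any(ch not in "01" for ch in group):
            raise ValueError(f"not an 8-bit group: {group!r}")
        byte_values.append(int(group, 2))
    # bytes.decode raises UnicodeDecodeError (a ValueError subclass)
    # if the collected bytes are not a legal UTF-8 sequence.
    return bytes(byte_values).decode("utf-8")


print(binary_to_text("01001000 01101001"))  # Hi
```

Both failure modes surface as exceptions: a seven-digit group such as `1001000` is rejected by the length check, and a lone lead byte such as `11000011` passes the length check but fails in the UTF-8 decoder because its continuation byte is missing.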

Frequently asked questions

Is this the same as ASCII binary?

For plain English text the output is identical to ASCII, because UTF-8 encodes ASCII characters as the same single-byte values. The difference only appears with characters beyond the ASCII range, which UTF-8 spreads across several bytes.
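A quick Python check illustrates the overlap, using the standard `ascii` and `utf-8` codecs. The sample strings are arbitrary choices for this example.

```python
plain = "Hello"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")
print(same_bytes)  # True

accented = "\u00e9"  # e with acute accent, outside the ASCII range
utf8_length = len(accented.encode("utf-8"))
print(utf8_length)  # 2

try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # ASCII has no byte for this character
print(ascii_ok)  # False
```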

Why did decoding my binary fail?

The most common causes are a group that is not exactly eight 0s and 1s, or bytes that do not form valid UTF-8. Check that every group has eight digits and is separated by a space, then try again.