ToolboxHub

How Computers Store Text: ASCII, Unicode, and UTF-8


Text is everywhere on a computer, yet a computer has no concept of a letter. At its core a machine stores only numbers, and underneath those numbers only bits, the 0s and 1s of binary. So how does the letter A, a Chinese character, or a smiling emoji end up on your screen? The answer is character encoding: an agreed-upon system that maps every character to a number, and every number to a sequence of bytes. This article traces that story from the early days of ASCII to the universal system, UTF-8, that powers almost all modern text.

Computers Only Store Numbers

A computer cannot store the shape of a letter directly. Everything in memory, on disk, or travelling across a network is ultimately a pattern of bits grouped into bytes. To represent text, we need a rule that assigns each character a number, and a way to write that number as bytes.

This two-part idea is the heart of the topic. First, decide which number stands for which character. Second, decide how that number is physically stored as one or more bytes. Different systems have answered these two questions in different ways, and that is the source of both the elegance and the historical chaos of text encoding.

ASCII: The Original Standard

ASCII (the American Standard Code for Information Interchange) was the first widely adopted answer. Defined in the 1960s, it uses 7 bits to represent 128 characters, numbered 0 to 127. That range covers the uppercase and lowercase English letters, the digits 0 through 9, common punctuation, the space, and a set of non-printing control codes such as newline, tab, and carriage return.

ASCII is beautifully simple. The letter A is 65, B is 66, and the digit 0 is 48. Because each character fits in 7 bits, it sits comfortably inside a single 8-bit byte with one bit to spare. For decades, if you were writing English, ASCII was all you needed.
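
To make those numbers concrete, here is a small Python sketch (Python is our choice for illustration, not something ASCII prescribes). It uses the built-in ord() to look up each character's code and prints the 7-bit pattern next to it.

```python
# ord() returns the number behind a character; chr() goes the other way.
samples = ["A", "B", "0", " "]
codes = [ord(ch) for ch in samples]

for ch, code in zip(samples, codes):
    # format(code, "07b") renders the value as exactly 7 binary digits.
    print(repr(ch), code, format(code, "07b"))

# Encoding plain English text as ASCII yields one byte per character,
# and every byte value stays below 128.
hello_bytes = list("Hello".encode("ascii"))
print(hello_bytes)  # [72, 101, 108, 108, 111]
```

Every value printed stays below 128, which is why a single byte always has that one bit to spare.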

The Limits of ASCII and the Code-Page Chaos

The problem is that the world does not write only in English. ASCII has no accented letters for French or Spanish, no characters for Greek, Cyrillic, Arabic, Hebrew, or the thousands of characters used in Chinese, Japanese, and Korean.

With one spare bit, a byte can hold 256 values, so the unused range from 128 to 255 became a free-for-all. Different regions filled it with different characters, creating dozens of incompatible 8-bit code pages. One code page put accented Western European letters there; another put Cyrillic; another put Greek. A document written with one code page would turn to nonsense when opened with another, because byte 200 might mean one letter in Paris and a completely different one in Moscow. This fragmentation made truly global text nearly impossible.
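
The byte-200 problem is easy to reproduce. The sketch below decodes the same single byte with three legacy code pages that ship with Python; the three chosen here (Latin-1, Windows-1251, ISO 8859-7) are just representative examples, not an exhaustive list.

```python
# One byte, value 200 (0xC8), read under three different code pages.
raw = bytes([200])

as_latin1 = raw.decode("latin-1")    # Western European: È
as_cp1251 = raw.decode("cp1251")     # Cyrillic: И
as_greek = raw.decode("iso8859_7")   # Greek: Θ

for name, ch in [("latin-1", as_latin1),
                 ("cp1251", as_cp1251),
                 ("iso8859_7", as_greek)]:
    print(f"{name:10} byte 200 -> {ch}")
```

The byte never changes; only the table used to interpret it does.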

Unicode: One Number for Every Character

Unicode was created to end the code-page wars. Its goal is sweeping: assign a single, unique number, called a code point, to every character in every writing system in the world, plus symbols, mathematical signs, and emoji.

Code points are usually written with a U+ prefix in hexadecimal, so the letter A is U+0041 (decimal 65, matching ASCII on purpose), and a common smiling emoji is U+1F600. Unicode has room for over a million code points and currently defines well over 140,000 of them. Crucially, Unicode by itself only assigns numbers to characters. It is a giant lookup table, not a rule for how those numbers are stored as bytes. That second job belongs to an encoding.
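
As a quick illustration, the following Python sketch converts characters to the U+ notation and back. The helper name code_point_label is made up for this example; ord(), chr(), and sys.maxunicode are standard Python.

```python
import sys

def code_point_label(ch: str) -> str:
    """Hypothetical helper: format one character as U+XXXX."""
    # At least four hex digits, uppercase, as in Unicode charts.
    return f"U+{ord(ch):04X}"

labels = [code_point_label(ch) for ch in ["A", "\u00e9", "\U0001F600"]]
print(labels)  # ['U+0041', 'U+00E9', 'U+1F600']

# chr() maps a code point back to its character.
letter_a = chr(0x41)

# The highest code point Unicode allows is U+10FFFF,
# which gives 1,114,112 possible values in total.
total_code_points = sys.maxunicode + 1
```

Nothing here says anything about bytes yet: these are just numbers in the lookup table.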

Character Set vs Encoding

This distinction trips up many people, so it is worth stating plainly. A character set, like Unicode, decides which number represents which character. An encoding, like UTF-8 or UTF-16, decides how those numbers are turned into actual bytes.

Unicode is the map from characters to code points. UTF-8 and UTF-16 are two different ways of writing those code points as bytes. The same Unicode code point can be stored as different byte sequences depending on which encoding you choose, just as the same amount of money can be paid in different combinations of coins. Confusing the character set with the encoding is a common root cause of text-handling bugs.
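
A short Python sketch shows the split in practice: one code point, three encodings, three different byte sequences. The big-endian codec variants are used here only so that no byte order mark is prepended to the output.

```python
# One character, one code point: U+00E9 (e with an acute accent).
ch = "\u00e9"
code_point = ord(ch)  # 233, the same regardless of encoding

# The encoding decides which bytes represent that number.
as_utf8 = ch.encode("utf-8").hex(" ")       # "c3 a9"
as_utf16 = ch.encode("utf-16-be").hex(" ")  # "00 e9"
as_utf32 = ch.encode("utf-32-be").hex(" ")  # "00 00 00 e9"

print(code_point, as_utf8, as_utf16, as_utf32, sep=" | ")
```

The character set answer (233) never moves; only the byte-level answer changes with the encoding.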

How UTF-8 Works and Why It Won

UTF-8 is the encoding that conquered the web, and for good reason. It is a variable-length encoding: a character takes between 1 and 4 bytes depending on its code point.

The genius of UTF-8 is that the first 128 code points, exactly the ASCII characters, are stored in a single byte with the identical value ASCII used. This makes UTF-8 perfectly backward compatible: any plain ASCII file is already valid UTF-8. Characters beyond ASCII use 2, 3, or 4 bytes, with the leading bits of each byte signalling how many bytes the character occupies: a first byte starting with 110, 1110, or 11110 announces a 2-, 3-, or 4-byte sequence, and every following byte starts with 10. This design keeps English text compact, represents every Unicode character, and never wastes space on simple content. Those properties are why UTF-8 now encodes the overwhelming majority of all web pages, far ahead of alternatives such as UTF-16 (itself variable-length, at 2 or 4 bytes per character) and the fixed-width UTF-32.
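
To show how the leading bits work, here is a hand-written encoder in Python. It is a teaching sketch rather than something to use in real code (Python's built-in str.encode already does this), and the function name utf8_encode is our own.

```python
def utf8_encode(code_point: int) -> bytes:
    """Illustrative sketch: encode one Unicode code point as UTF-8."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

The first byte of each result tells a decoder how many bytes to read, and the 10 prefix on continuation bytes lets it resynchronise if it lands in the middle of a character.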

Common Problems: Mojibake and the BOM

When the encoding used to read text does not match the one used to write it, you get mojibake: garbled characters where accented letters or emoji should be. The classic symptom is a word like café showing up as cafÃ©, which happens when UTF-8 bytes are mistakenly read as an old single-byte code page: the two bytes that encode é are each displayed as a separate character. The bytes are fine; only the interpretation is wrong.
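
Mojibake can be reproduced on purpose in a few lines of Python. This sketch assumes the wrong decoder is Windows-1252, one common single-byte code page; other code pages would produce different garbage from the same bytes.

```python
original = "caf\u00e9"                 # "café"
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# Wrong interpretation: each of the two bytes for é becomes its own character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # cafÃ©

# Because the bytes were never damaged, reversing the wrong step
# and decoding correctly recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The repair only works while the underlying bytes are intact; once garbled text has been saved and re-encoded again, recovery gets harder.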

Another source of confusion is the byte order mark, or BOM, a special invisible sequence sometimes placed at the start of a file to signal its encoding. In UTF-8 the BOM is unnecessary and can cause trouble, occasionally appearing as stray characters at the beginning of a file or breaking scripts that expect the very first byte to be meaningful. Knowing about the BOM helps explain otherwise baffling glitches.
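
In UTF-8 the BOM is the three bytes EF BB BF. The Python sketch below builds a small BOM-prefixed byte string in memory (the content is invented for the example) and compares the standard "utf-8" codec, which keeps the BOM as an invisible first character, with "utf-8-sig", which strips it.

```python
import codecs

# Simulated file contents: a UTF-8 BOM followed by ordinary text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

naive = data.decode("utf-8")      # BOM survives as U+FEFF at index 0
clean = data.decode("utf-8-sig")  # BOM is removed if present

print(len(naive), len(clean))       # 6 5
print(naive.startswith("\ufeff"))   # True
```

That extra invisible character is what breaks comparisons such as checking whether a file starts with a particular keyword.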

A Worked Example: From Character to Bytes

Consider the letter A. Its Unicode code point is U+0041, which is 65 in decimal and 01000001 in binary. In UTF-8 it occupies a single byte, the same one ASCII always used, so A is simply the byte 65.

Now take é, the e with an acute accent in café, whose code point is U+00E9, decimal 233. This sits beyond the ASCII range, so UTF-8 stores it in two bytes rather than one: C3 A9 in hexadecimal. An emoji goes further still: the smiling face U+1F600 needs all four bytes in UTF-8 (F0 9F 98 80). In UTF-16 the counts come out differently: A takes two bytes instead of one, é also takes two, and the emoji takes four. This is exactly why the count of characters in a string can differ from the count of bytes it occupies, a distinction that matters whenever you measure text length.
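
The same walk-through can be checked in Python. In this sketch, len() on a string counts code points, while len() on the encoded bytes counts bytes; the little-endian UTF-16 codec is used so that no BOM inflates the numbers.

```python
# Each entry: one piece of text to measure.
texts = ["A", "\u00e9", "\U0001F600", "caf\u00e9"]

sizes = {}
for text in texts:
    sizes[text] = (
        len(text),                      # characters (code points)
        len(text.encode("utf-8")),      # bytes in UTF-8
        len(text.encode("utf-16-le")),  # bytes in UTF-16
    )
    print(text, sizes[text])

# The letter A as a single byte, written out in binary.
a_bits = format("A".encode("utf-8")[0], "08b")
print(a_bits)  # 01000001
```

For café the result is 4 characters but 5 bytes in UTF-8, which is the gap to keep in mind when a length limit is specified in bytes rather than characters.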

See Text as Bytes with ToolboxHub

The easiest way to build intuition for all of this is to watch it happen. The free ToolboxHub Binary Text Converter takes any text you type and shows it as binary, hexadecimal, and decimal, with full UTF-8 support so that accented letters and emoji reveal their multi-byte representations. Type the letter A and see 65; type an emoji and watch it expand into several bytes. It runs entirely in your browser with no sign-up, making it a quick way to demystify how your words become numbers.

Key Takeaways

Computers store only numbers and bits, so every character of text relies on an encoding that maps characters to numbers and numbers to bytes. ASCII came first, using 7 bits for 128 characters that cover English, but its limits spawned a chaos of incompatible 8-bit code pages for the rest of the world.

Unicode fixed this by giving every character a unique code point, while encodings like UTF-8 decide how those code points become bytes. UTF-8 won the web because it is backward compatible with ASCII, compact for English, and able to represent every character using 1 to 4 bytes. Mismatched encodings cause mojibake, and the BOM can introduce subtle glitches. To see the whole process for yourself, try the free ToolboxHub Binary Text Converter.
