A developer creates a database column: VARCHAR(100). The form says "Enter your name (max 100 characters)." Everything works perfectly—until a Japanese user tries to register with a 30-character name and gets an error.
How can 30 characters exceed a limit of 100?
The answer lies in understanding that "character" and "byte" are not the same thing. The database measured in bytes; the form measured in characters. The Japanese name, using multi-byte encoding, consumed more bytes than expected.
This confusion between characters, code points, bytes, and encoded sizes is responsible for countless bugs. By the end of this page, you will understand why character size varies by encoding, how fixed-width and variable-width encodings differ, the byte costs of different character types in UTF-8 and UTF-16, and how to reason accurately about text storage size. That intuition for how characters actually occupy memory and storage lets you avoid these traps entirely.
At the most basic level, computers store everything in bytes. A byte is 8 bits, capable of representing 256 different values (0-255). The fundamental question of character encoding is:
How many bytes does each character require?
This seems simple, but the answer depends on two things: which character you are encoding, and which encoding you use.
Let's build a comprehensive mental model.
| Character | UTF-8 Bytes | UTF-16 Bytes | UTF-32 Bytes |
|---|---|---|---|
| A (U+0041) | 1 | 2 | 4 |
| é (U+00E9) | 2 | 2 | 4 |
| 中 (U+4E2D) | 3 | 2 | 4 |
| 😀 (U+1F600) | 4 | 4 (surrogate pair) | 4 |
| 👨‍👩‍👧 (family) | 18 (ZWJ sequence) | 16 | 20 |
The shocking variety:
Notice that the simple letter 'A' takes 1 byte in UTF-8 but 4 bytes in UTF-32. The family emoji takes 18 bytes in UTF-8—for what appears as a single "character" to the user.
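You can check the UTF-8 column yourself in JavaScript. TextEncoder is the standard UTF-8 encoder; the snippet below simply measures each character from the table:

```javascript
// Verify the UTF-8 byte counts from the table above
const enc = new TextEncoder();
for (const ch of ["A", "é", "中", "😀"]) {
  console.log(ch, enc.encode(ch).length); // A 1, é 2, 中 3, 😀 4
}
```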
This variability is not a flaw; it's a deliberate design choice. Different encodings optimize for different use cases:

- UTF-8 keeps ASCII-heavy text compact and stays byte-compatible with ASCII
- UTF-16 gives most BMP characters, including CJK, a flat 2 bytes
- UTF-32 trades space for fixed-width simplicity and O(1) indexing
There is no single answer to 'How many bytes is a character?' The answer always depends on which character and which encoding. Treating all characters as equal-sized is the source of countless bugs.
In a fixed-width encoding, every character uses the same number of bytes. This seems ideal—simple, predictable, easy to work with. But it comes with significant tradeoffs.
ASCII (7 bits, effectively 1 byte):

ASCII is the simplest fixed-width encoding: every character occupies exactly 1 byte, covering only the 128 code points U+0000 through U+007F. Size math is trivial and indexing is O(1), but nothing beyond basic English text can be represented.

UTF-32 (4 bytes):

UTF-32 is a fixed-width Unicode encoding: every code point, from 'A' to the rarest emoji, occupies exactly 4 bytes. This preserves O(1) indexing across all of Unicode, at the cost of quadrupling the size of ASCII-only text.
```javascript
// UTF-32 fixed-width example (conceptual)
const text = "Hello"; // 5 characters

// In UTF-32, this occupies:
// 5 characters × 4 bytes = 20 bytes

// Memory layout (hex, big-endian):
// 00 00 00 48  (H = U+0048)
// 00 00 00 65  (e = U+0065)
// 00 00 00 6C  (l = U+006C)
// 00 00 00 6C  (l = U+006C)
// 00 00 00 6F  (o = U+006F)

// Random access is O(1):
// To find the character at index 3:
//   byte_offset = 3 × 4 = 12
//   read 4 bytes starting at offset 12

// Compare to ASCII for the same text:
// UTF-32: 5 chars in 20 bytes
// ASCII:  5 chars in 5 bytes (4× more efficient for ASCII text)

// For Chinese "你好世界" (Hello World):
// UTF-32: 4 characters × 4 bytes = 16 bytes
// UTF-8:  4 characters × 3 bytes = 12 bytes (more efficient)
```

UTF-32 is rarely used for storage or transmission due to space waste. However, some programs convert to UTF-32 internally for processing convenience, then convert back to UTF-8 for storage. Python 3.3+ uses a flexible internal representation that may use UTF-32 for strings with characters beyond the BMP.
In a variable-width encoding, different characters use different numbers of bytes. This seems complicated, but it enables dramatic space savings for common text.
The variable-width principle:
Assign fewer bytes to common characters and more bytes to rare ones. ASCII characters (overwhelmingly common in English text and source code) use 1 byte, while rarer scripts use 3 to 4 bytes.
UTF-8 byte allocation:
UTF-8 uses 1 to 4 bytes per code point, determined by the code point's value:
| Code Point Range | Bytes | Bit Pattern | Usable Bits | Example Characters |
|---|---|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx | 7 | ASCII: A-Z, a-z, 0-9, basic punctuation |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx | 11 | Latin extensions: é, ñ, ü; Greek, Cyrillic, Hebrew, Arabic basics |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 16 | CJK: Chinese, Japanese, Korean; most of BMP |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 | Emoji, historic scripts, rare CJK |
```javascript
// UTF-8 encoding examples

// 1-byte character: 'A' (U+0041)
// Binary: 0100 0001
// UTF-8: 41 (just the byte itself)

// 2-byte character: 'é' (U+00E9)
// Binary: 0000 0000 1110 1001
// UTF-8 pattern: 110xxxxx 10xxxxxx
// Fill in bits:  110 00011  10 101001
// UTF-8: C3 A9 (2 bytes)

// 3-byte character: '中' (U+4E2D)
// Binary: 0100 1110 0010 1101
// UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
// Fill in bits:  1110 0100  10 111000  10 101101
// UTF-8: E4 B8 AD (3 bytes)

// 4-byte character: '😀' (U+1F600)
// Binary: 0001 1111 0110 0000 0000
// UTF-8 pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
// Fill in bits:  11110 000  10 011111  10 011000  10 000000
// UTF-8: F0 9F 98 80 (4 bytes)

// Practical size calculations:
const english = "Hello World"; // 11 chars × 1 byte = 11 bytes
const chinese = "你好世界";     // 4 chars × 3 bytes = 12 bytes
const emoji = "Hi! 👋";        // 4×1 + 1×4 = 8 bytes
const mixed = "Café";          // 3×1 + 1×2 = 5 bytes
```

UTF-16's approach:
UTF-16 uses 2 or 4 bytes per code point: characters in the Basic Multilingual Plane (U+0000 to U+FFFF) fit in a single 16-bit code unit (2 bytes), while supplementary characters (U+10000 and above, including most emoji) require a surrogate pair (4 bytes).

This makes UTF-16 more efficient than UTF-8 for CJK-heavy text (2 bytes vs 3), but less efficient for ASCII text (2 bytes vs 1).
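As a minimal sketch of this two-tier rule (the helper name `utf16Bytes` is an illustrative assumption, not a standard API):

```javascript
// UTF-16 byte cost of a single code point
function utf16Bytes(codePoint) {
  // BMP code points fit in one 16-bit code unit; everything else
  // needs a surrogate pair (two code units).
  return codePoint <= 0xffff ? 2 : 4;
}

console.log(utf16Bytes(0x0041));  // 2 ('A')
console.log(utf16Bytes(0x4e2d));  // 2 ('中' is in the BMP)
console.log(utf16Bytes(0x1f600)); // 4 ('😀' needs a surrogate pair)
```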
Variable-width encodings sacrifice O(1) character indexing. To find the Nth character, you must scan from the beginning, decoding as you go; you cannot simply multiply an index by a fixed size. For large strings with random-access needs this matters, but most string operations are sequential, so the tradeoff is usually worthwhile.
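Here is a small illustration (the helper name `codePointAtIndex` is made up) of finding the Nth code point by scanning:

```javascript
// Finding the Nth code point requires a linear scan, not arithmetic
function codePointAtIndex(str, n) {
  let i = 0;
  for (const ch of str) { // for...of iterates by code point, decoding as it goes
    if (i === n) return ch;
    i++;
  }
  return undefined;
}

console.log(codePointAtIndex("a中😀b", 2)); // "😀", found by scanning
```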
The choice between UTF-8 and UTF-16 affects storage size significantly, depending on the content.
Size comparison by text type:
| Text Type (~1,000 chars) | UTF-8 Size | UTF-16 Size | More Efficient |
|---|---|---|---|
| English prose | ~1,000 bytes | ~2,000 bytes | UTF-8 (2× smaller) |
| Source code (ASCII) | ~1,000 bytes | ~2,000 bytes | UTF-8 (2× smaller) |
| German/French text | ~1,100 bytes | ~2,000 bytes | UTF-8 (~1.8× smaller) |
| Russian (Cyrillic) | ~2,000 bytes | ~2,000 bytes | Equal |
| Chinese text | ~3,000 bytes | ~2,000 bytes | UTF-16 (1.5× smaller) |
| Japanese text | ~2,500 bytes | ~2,000 bytes | UTF-16 (1.25× smaller) |
| Emoji-heavy text | ~3,000 bytes | ~3,200 bytes | Roughly equal |
Why platforms choose differently:

The choices made by major platforms reflect their historical contexts. Windows NT, Java, and JavaScript committed to 16-bit strings in the early 1990s, when Unicode was still a 16-bit code (UCS-2), and they kept UTF-16 for backward compatibility. The web, Unix-like systems, and most network protocols converged on UTF-8, which is byte-compatible with ASCII.
```javascript
// Calculating actual byte sizes in different languages

// JavaScript: strings are UTF-16 internally, but TextEncoder gives UTF-8 size
const encoder = new TextEncoder();

function getUTF8Size(str) {
  return encoder.encode(str).length;
}

function getUTF16Size(str) {
  // Each code unit is 2 bytes in UTF-16
  return str.length * 2; // Simplified; doesn't account for a BOM
}

console.log(getUTF8Size("Hello"));  // 5 bytes
console.log(getUTF16Size("Hello")); // 10 bytes

console.log(getUTF8Size("你好"));  // 6 bytes (2 chars × 3 bytes)
console.log(getUTF16Size("你好")); // 4 bytes (2 chars × 2 bytes)

console.log(getUTF8Size("👍"));  // 4 bytes
console.log(getUTF16Size("👍")); // 4 bytes (surrogate pair = 2 code units × 2 bytes)

// Python equivalent:
// string = "Hello 你好 👍"
// utf8_size = len(string.encode('utf-8'))
// utf16_size = len(string.encode('utf-16'))
```

For new projects, use UTF-8 everywhere. The web standard, command-line tools, and modern APIs all default to UTF-8. UTF-16's advantages (smaller CJK) rarely outweigh UTF-8's broad compatibility and simpler tooling. When in doubt, UTF-8.
Even understanding byte counts per code point doesn't tell the full story. Users see grapheme clusters—visual characters that may consist of multiple code points combined together.
Single code point = single grapheme:
For simple characters, one code point equals one visual character: 'A' is just U+0041 and '中' is just U+4E2D, so the grapheme's size is simply that code point's encoded size.
Multiple code points = single grapheme:
Many visual characters are composed of multiple code points:
| Visual Character | Code Points | UTF-8 Bytes | User Perception |
|---|---|---|---|
| é | U+0065 + U+0301 (or U+00E9) | 3 or 2 | One letter |
| 🇺🇸 | U+1F1FA + U+1F1F8 | 8 | One flag |
| 👨👩👧👦 | 7 code points (persons + ZWJ) | 25 | One family |
| 👩🏿‍🚀 | 4 code points (person + skin + ZWJ + rocket) | 15 | One astronaut |
| नमस्ते | 6 code points (with combining marks) | 18 | 'Namaste' in Hindi |
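Note the two encodings of é in the first row of the table. A quick check in JavaScript shows how the same visible letter can be one code point or two (this sketch uses only the standard normalize() and TextEncoder APIs):

```javascript
// The same visible "é" can be one code point (NFC) or two (NFD)
const precomposed = "\u00e9"; // é as a single code point
const combining = "e\u0301";  // e + combining acute accent

console.log(precomposed === combining);                  // false
console.log(precomposed === combining.normalize("NFC")); // true

const enc = new TextEncoder();
console.log(enc.encode(precomposed).length); // 2 bytes in UTF-8
console.log(enc.encode(combining).length);   // 3 bytes in UTF-8
```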
```javascript
// The danger of naive character counting

const family = "👨‍👩‍👧‍👦";

// Wrong ways to measure:
console.log(family.length);      // 11 (UTF-16 code units)
console.log([...family].length); // 7 (code points)

// Correct way (grapheme clusters):
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = [...segmenter.segment(family)];
console.log(graphemes.length); // 1 ✓ (what the user sees)

// Byte size:
const encoder = new TextEncoder();
console.log(encoder.encode(family).length); // 25 bytes for "one character"!

// Why this matters:
// A user types "👨‍👩‍👧‍👦" in a field limited to "10 characters"
// Old validation: family.length = 11 > 10, REJECTED!
// The user sees ONE character and gets "input too long"

// Flag emoji:
const flag = "🇯🇵"; // Japanese flag
console.log(flag.length);      // 4 (2 code units × 2)
console.log([...flag].length); // 2 (regional indicator symbols)
// User sees: 1 flag

// Skin-toned emoji:
const wave = "👋🏽"; // Waving hand, medium skin tone
console.log(wave.length);      // 4
console.log([...wave].length); // 2 (hand + skin modifier)
// User sees: 1 hand
```

Every string has four different 'lengths': byte length (storage), code unit count (JavaScript .length), code point count (spread operator), and grapheme count (user-perceived). Using the wrong measure for validation, truncation, or display causes bugs. For user-facing limits, use grapheme count.
Understanding character sizing has practical implications for system design.
Database column sizing:
When databases define VARCHAR(N), what does N mean?
| Database | VARCHAR(100) Means | Storage Limit |
|---|---|---|
| MySQL (utf8mb4) | 100 characters | Up to 400 bytes |
| PostgreSQL | 100 characters | Up to 400 bytes (with UTF-8) |
| SQL Server | 100 bytes (VARCHAR) or 100 chars (NVARCHAR) | Varies |
| Oracle | 100 bytes (default) or VARCHAR2(100 CHAR) for chars | Varies by mode |
| SQLite | Soft limit only | No hard enforcement |
Practical sizing example:

Designing a username field that allows up to 30 characters: in the worst case, every character is a 4-byte emoji or supplementary code point, so the budget is 30 characters × 4 bytes = 120 bytes. To safely store 30 characters of any script, allocate space for 120 bytes.
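Here is a minimal validation sketch tying the two limits together; the constant values and function name are illustrative assumptions, not a prescribed API:

```javascript
// Validate a username against both the user-facing and the storage limit.
// MAX_GRAPHEMES is what the user is told; MAX_BYTES is the column budget.
const MAX_GRAPHEMES = 30;
const MAX_BYTES = 120; // 30 × 4-byte worst case

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const encoder = new TextEncoder();

function validateUsername(name) {
  const graphemes = [...segmenter.segment(name)].length; // what the user sees
  const bytes = encoder.encode(name).length;             // what the database stores
  return graphemes <= MAX_GRAPHEMES && bytes <= MAX_BYTES;
}

console.log(validateUsername("ながいにほんごのなまえ")); // true: 11 graphemes, 33 bytes
console.log(validateUsername("a".repeat(31)));            // false: 31 graphemes
```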
Memory in programming languages:
Different languages store strings differently:
```python
# Python's flexible string representation

# Python uses the smallest representation that fits all characters
s1 = "hello"   # Latin-1: 5 bytes (1 byte per char)
s2 = "héllo"   # Still Latin-1: 5 bytes (é fits in Latin-1)
s3 = "hεllo"   # UCS-2: 10 bytes (Greek ε requires 2 bytes per char)
s4 = "h😀llo"  # UCS-4: 20 bytes of character data (4 bytes per char)

import sys
print(sys.getsizeof(s1))  # ~54 bytes (5 chars + object overhead)
print(sys.getsizeof(s4))  # ~76 bytes (5 chars × 4 + overhead)

# The entire string is upgraded to the widest character's requirement.
# This is why mixing emoji with ASCII uses more memory.
```

```go
// Go's UTF-8 approach
s := "Hello 你好"
len(s)         // 12 (bytes: 6 ASCII + 6 UTF-8 for Chinese)
len([]rune(s)) // 8 (code points: 6 + 2)

// Iterating by bytes vs code points
for i := 0; i < len(s); i++ { _ = s[i] } // iterates 12 bytes
for _, r := range s { _ = r }            // iterates 8 code points
```

Before manipulating strings, understand your language's internal representation. What does .length return? What happens when you index into a string? What does slicing do? These behaviors vary dramatically across languages, and assuming one behavior in another language causes bugs.
Let's consolidate practical rules of thumb for estimating and working with character sizes.
UTF-8 sizing heuristics:

- ASCII (English letters, digits, basic punctuation): 1 byte per character
- Latin extensions, Greek, Cyrillic, Hebrew, Arabic: 2 bytes per character
- CJK and most other BMP scripts: 3 bytes per character
- Emoji, historic scripts, everything beyond the BMP: 4 bytes per code point

The sketch below encodes these thresholds directly.
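The names `utf8Bytes` and `estimate` here are illustrative, not standard APIs:

```javascript
// UTF-8 byte cost of a single code point, per the ranges above
function utf8Bytes(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // Latin extensions, Cyrillic, Arabic...
  if (codePoint <= 0xffff) return 3; // rest of the BMP, including CJK
  return 4;                          // supplementary planes, including emoji
}

// Estimate a whole string's UTF-8 size by summing per code point
const estimate = (s) =>
  [...s].reduce((n, ch) => n + utf8Bytes(ch.codePointAt(0)), 0);

console.log(estimate("Café 中 😀")); // 1+1+1+2+1+3+1+4 = 14
```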
Safe capacity planning:

When planning storage for international text:

- Multiply the user-facing character limit by 4 bytes to get the byte budget
- Validate user input by grapheme count, not by .length or byte count
- Confirm whether your database's VARCHAR(N) counts bytes or characters
Common mistakes to avoid:

- Assuming one character equals one byte (the VARCHAR bug that opened this page)
- Using .length for user-facing limits; it counts UTF-16 code units, not what users see
- Truncating at a fixed offset, which can split a multi-byte sequence and corrupt the text, as the sketch below shows
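A short demonstration of the truncation hazard (Intl.Segmenter is the standard segmentation API; the `truncate` helper is an illustrative sketch):

```javascript
// Naive .slice() counts UTF-16 code units and can split a surrogate pair
const s = "Hi😀";
console.log(s.slice(0, 3)); // "Hi\ud83d": ends in a lone surrogate, renders as �

// Safer: truncate by grapheme clusters
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const truncate = (str, max) =>
  [...seg.segment(str)].slice(0, max).map((g) => g.segment).join('');

console.log(truncate("Hi😀", 3)); // "Hi😀" (3 graphemes, kept intact)
console.log(truncate("Hi😀", 2)); // "Hi"
```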
When uncertain, assume worst case: 4 bytes per user-perceived character. This accommodates all Unicode including emoji sequences. Storage is cheap; data corruption is expensive. Better to over-allocate than to truncate user data.
We've built a comprehensive understanding of how characters are sized and stored. Let's consolidate the key insights:

- Characters and bytes are different units; a character's byte cost depends on which character and which encoding
- UTF-8 uses 1 to 4 bytes per code point, UTF-16 uses 2 or 4, and UTF-32 always uses 4
- Users perceive grapheme clusters, which can span many code points and many bytes
- Every string has four lengths: bytes, code units, code points, and graphemes; pick the measure that matches the task
- For capacity planning, budget 4 bytes per user-perceived character
Module complete: Character Data Types
You've now completed a comprehensive exploration of character data types, from code points and encodings through grapheme clusters to practical storage sizing.
With this foundation, you understand how text is represented at the primitive level—essential knowledge for any software engineer working with real-world data.
You now have solid intuition for character sizing and encoding—knowledge that prevents entire classes of bugs around truncation, storage, and validation. You understand the tradeoffs between fixed and variable-width encodings, and you know how to reason about text storage accurately.