A developer creates a database column: VARCHAR(100). The form says "Enter your name (max 100 characters)." Everything works perfectly—until a Japanese user tries to register with a 30-character name and gets an error.
How can 30 characters exceed a limit of 100?
The answer lies in understanding that "character" and "byte" are not the same thing. The database measured in bytes; the form measured in characters. The Japanese name, using multi-byte encoding, consumed more bytes than expected.
This confusion between characters, code points, bytes, and encoded sizes is responsible for countless bugs. By the end of this page, you will understand why character size varies by encoding, how fixed-width and variable-width encodings differ, the byte costs of different character types in UTF-8 and UTF-16, and how to reason accurately about text storage size. That intuition for how characters actually occupy memory and storage lets you avoid these traps entirely.
At the most basic level, computers store everything in bytes. A byte is 8 bits, capable of representing 256 different values (0-255). The fundamental question of character encoding is:
How many bytes does each character require?
This seems simple, but the answer depends on two things: which character you are encoding, and which encoding you use.
Let's build a comprehensive mental model.
| Character | UTF-8 Bytes | UTF-16 Bytes | UTF-32 Bytes |
|---|---|---|---|
| A (U+0041) | 1 | 2 | 4 |
| é (U+00E9) | 2 | 2 | 4 |
| 中 (U+4E2D) | 3 | 2 | 4 |
| 😀 (U+1F600) | 4 | 4 (surrogate pair) | 4 |
| 👨‍👩‍👧 (family) | 18 (ZWJ sequence) | 16 | 20 |
The shocking variety:
Notice that the simple letter 'A' takes 1 byte in UTF-8 but 4 bytes in UTF-32. The family emoji takes 18 bytes in UTF-8—for what appears as a single "character" to the user.
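You can check the UTF-8 column yourself in JavaScript. TextEncoder is the standard UTF-8 encoder; the snippet below simply measures each character from the table:

```javascript
// Verify the UTF-8 byte counts from the table above
const enc = new TextEncoder();
for (const ch of ["A", "é", "中", "😀"]) {
  console.log(ch, enc.encode(ch).length); // A 1, é 2, 中 3, 😀 4
}
```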
This variability is not a flaw; it's a deliberate design choice. Different encodings optimize for different use cases:

- UTF-8 keeps ASCII-heavy text compact and stays byte-compatible with ASCII
- UTF-16 gives most BMP characters, including CJK, a flat 2 bytes
- UTF-32 trades space for fixed-width simplicity and O(1) indexing
There is no single answer to 'How many bytes is a character?' The answer always depends on which character and which encoding. Treating all characters as equal-sized is the source of countless bugs.
In a fixed-width encoding, every character uses the same number of bytes. This seems ideal—simple, predictable, easy to work with. But it comes with significant tradeoffs.
ASCII (7 bits, effectively 1 byte):

ASCII is the simplest fixed-width encoding: every character occupies exactly 1 byte, covering only the 128 code points U+0000 through U+007F. Size math is trivial and indexing is O(1), but nothing beyond basic English text can be represented.

UTF-32 (4 bytes):

UTF-32 is a fixed-width Unicode encoding: every code point, from 'A' to the rarest emoji, occupies exactly 4 bytes. This preserves O(1) indexing across all of Unicode, at the cost of quadrupling the size of ASCII-only text.
```javascript
// UTF-32 fixed-width example (conceptual)
const text = "Hello"; // 5 characters

// In UTF-32, this occupies:
// 5 characters × 4 bytes = 20 bytes

// Memory layout (hex, big-endian):
// 00 00 00 48  (H = U+0048)
// 00 00 00 65  (e = U+0065)
// 00 00 00 6C  (l = U+006C)
// 00 00 00 6C  (l = U+006C)
// 00 00 00 6F  (o = U+006F)

// Random access is O(1):
// To find the character at index 3:
//   byte_offset = 3 × 4 = 12
//   read 4 bytes starting at offset 12

// Compare to ASCII for the same text:
// UTF-32: 5 chars in 20 bytes
// ASCII:  5 chars in 5 bytes (4× more efficient for ASCII text)

// For Chinese "你好世界" (Hello World):
// UTF-32: 4 characters × 4 bytes = 16 bytes
// UTF-8:  4 characters × 3 bytes = 12 bytes (more efficient)
```

UTF-32 is rarely used for storage or transmission due to space waste. However, some programs convert to UTF-32 internally for processing convenience, then convert back to UTF-8 for storage. Python 3.3+ uses a flexible internal representation that may use UTF-32 for strings with characters beyond the BMP.
In a variable-width encoding, different characters use different numbers of bytes. This seems complicated, but it enables dramatic space savings for common text.
The variable-width principle:
Assign fewer bytes to common characters and more bytes to rare ones. ASCII characters (overwhelmingly common in English text and source code) use 1 byte, while rarer scripts use 3 to 4 bytes.
UTF-8 byte allocation:
UTF-8 uses 1 to 4 bytes per code point, determined by the code point's value:
| Code Point Range | Bytes | Bit Pattern | Usable Bits | Example Characters |
|---|---|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx | 7 | ASCII: A-Z, a-z, 0-9, basic punctuation |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx | 11 | Latin extensions: é, ñ, ü; Greek, Cyrillic, Hebrew, Arabic basics |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 16 | CJK: Chinese, Japanese, Korean; most of BMP |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 | Emoji, historic scripts, rare CJK |
```javascript
// UTF-8 encoding examples

// 1-byte character: 'A' (U+0041)
// Binary: 0100 0001
// UTF-8: 41 (just the byte itself)

// 2-byte character: 'é' (U+00E9)
// Binary: 0000 0000 1110 1001
// UTF-8 pattern: 110xxxxx 10xxxxxx
// Fill in bits:  110 00011  10 101001
// UTF-8: C3 A9 (2 bytes)

// 3-byte character: '中' (U+4E2D)
// Binary: 0100 1110 0010 1101
// UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
// Fill in bits:  1110 0100  10 111000  10 101101
// UTF-8: E4 B8 AD (3 bytes)

// 4-byte character: '😀' (U+1F600)
// Binary: 0001 1111 0110 0000 0000
// UTF-8 pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
// Fill in bits:  11110 000  10 011111  10 011000  10 000000
// UTF-8: F0 9F 98 80 (4 bytes)

// Practical size calculations:
const english = "Hello World"; // 11 chars × 1 byte = 11 bytes
const chinese = "你好世界";     // 4 chars × 3 bytes = 12 bytes
const emoji = "Hi! 👋";        // 4×1 + 1×4 = 8 bytes
const mixed = "Café";          // 3×1 + 1×2 = 5 bytes
```

UTF-16's approach:
UTF-16 uses 2 or 4 bytes per code point: characters in the Basic Multilingual Plane (U+0000 to U+FFFF) fit in a single 16-bit code unit (2 bytes), while supplementary characters (U+10000 and above, including most emoji) require a surrogate pair (4 bytes).

This makes UTF-16 more efficient than UTF-8 for CJK-heavy text (2 bytes vs 3), but less efficient for ASCII text (2 bytes vs 1).
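As a minimal sketch of this two-tier rule (the helper name `utf16Bytes` is an illustrative assumption, not a standard API):

```javascript
// UTF-16 byte cost of a single code point
function utf16Bytes(codePoint) {
  // BMP code points fit in one 16-bit code unit; everything else
  // needs a surrogate pair (two code units).
  return codePoint <= 0xffff ? 2 : 4;
}

console.log(utf16Bytes(0x0041));  // 2 ('A')
console.log(utf16Bytes(0x4e2d));  // 2 ('中' is in the BMP)
console.log(utf16Bytes(0x1f600)); // 4 ('😀' needs a surrogate pair)
```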
Variable-width encodings sacrifice O(1) character indexing. To find the Nth character, you must scan from the beginning, decoding as you go; you cannot simply multiply an index by a fixed size. For large strings with random-access needs this matters, but most string operations are sequential, so the tradeoff is usually worthwhile.
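Here is a small illustration (the helper name `codePointAtIndex` is made up) of finding the Nth code point by scanning:

```javascript
// Finding the Nth code point requires a linear scan, not arithmetic
function codePointAtIndex(str, n) {
  let i = 0;
  for (const ch of str) { // for...of iterates by code point, decoding as it goes
    if (i === n) return ch;
    i++;
  }
  return undefined;
}

console.log(codePointAtIndex("a中😀b", 2)); // "😀", found by scanning
```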
The choice between UTF-8 and UTF-16 affects storage size significantly, depending on the content.
Size comparison by text type:
| Text Type (~1,000 chars) | UTF-8 Size | UTF-16 Size | More Efficient |
|---|---|---|---|
| English prose | ~1,000 bytes | ~2,000 bytes | UTF-8 (2× smaller) |
| Source code (ASCII) | ~1,000 bytes | ~2,000 bytes | UTF-8 (2× smaller) |
| German/French text | ~1,100 bytes | ~2,000 bytes | UTF-8 (~1.8× smaller) |
| Russian (Cyrillic) | ~2,000 bytes | ~2,000 bytes | Equal |
| Chinese text | ~3,000 bytes | ~2,000 bytes | UTF-16 (1.5× smaller) |
| Japanese text | ~2,500 bytes | ~2,000 bytes | UTF-16 (1.25× smaller) |
| Emoji-heavy text | ~3,000 bytes | ~3,200 bytes | Roughly equal |
Why platforms choose differently:

The choices made by major platforms reflect their historical contexts. Windows NT, Java, and JavaScript committed to 16-bit strings in the early 1990s, when Unicode was still a 16-bit code (UCS-2), and they kept UTF-16 for backward compatibility. The web, Unix-like systems, and most network protocols converged on UTF-8, which is byte-compatible with ASCII.
```javascript
// Calculating actual byte sizes in different languages

// JavaScript: strings are UTF-16 internally, but TextEncoder gives UTF-8 size
const encoder = new TextEncoder();

function getUTF8Size(str) {
  return encoder.encode(str).length;
}

function getUTF16Size(str) {
  // Each code unit is 2 bytes in UTF-16
  return str.length * 2; // Simplified; doesn't account for a BOM
}

console.log(getUTF8Size("Hello"));  // 5 bytes
console.log(getUTF16Size("Hello")); // 10 bytes

console.log(getUTF8Size("你好"));  // 6 bytes (2 chars × 3 bytes)
console.log(getUTF16Size("你好")); // 4 bytes (2 chars × 2 bytes)

console.log(getUTF8Size("👍"));  // 4 bytes
console.log(getUTF16Size("👍")); // 4 bytes (surrogate pair = 2 code units × 2 bytes)

// Python equivalent:
// string = "Hello 你好 👍"
// utf8_size = len(string.encode('utf-8'))
// utf16_size = len(string.encode('utf-16'))
```

For new projects, use UTF-8 everywhere. The web standard, command-line tools, and modern APIs all default to UTF-8. UTF-16's advantages (smaller CJK) rarely outweigh UTF-8's broad compatibility and simpler tooling. When in doubt, UTF-8.
Even understanding byte counts per code point doesn't tell the full story. Users see grapheme clusters—visual characters that may consist of multiple code points combined together.
Single code point = single grapheme:
For simple characters, one code point equals one visual character: 'A' is just U+0041 and '中' is just U+4E2D, so the grapheme's size is simply that code point's encoded size.
Multiple code points = single grapheme:
Many visual characters are composed of multiple code points:
| Visual Character | Code Points | UTF-8 Bytes | User Perception |
|---|---|---|---|
| é | U+0065 + U+0301 (or U+00E9) | 3 or 2 | One letter |
| 🇺🇸 | U+1F1FA + U+1F1F8 | 8 | One flag |
| 👨👩👧👦 | 7 code points (persons + ZWJ) | 25 | One family |
| 👩🏿‍🚀 | 4 code points (person + skin + ZWJ + rocket) | 15 | One astronaut |
| नमस्ते | 6 code points (with combining marks) | 18 | 'Namaste' in Hindi |
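Note the two encodings of é in the first row of the table. A quick check in JavaScript shows how the same visible letter can be one code point or two (this sketch uses only the standard normalize() and TextEncoder APIs):

```javascript
// The same visible "é" can be one code point (NFC) or two (NFD)
const precomposed = "\u00e9"; // é as a single code point
const combining = "e\u0301";  // e + combining acute accent

console.log(precomposed === combining);                  // false
console.log(precomposed === combining.normalize("NFC")); // true

const enc = new TextEncoder();
console.log(enc.encode(precomposed).length); // 2 bytes in UTF-8
console.log(enc.encode(combining).length);   // 3 bytes in UTF-8
```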
```javascript
// The danger of naive character counting

const family = "👨‍👩‍👧‍👦";

// Wrong ways to measure:
console.log(family.length);      // 11 (UTF-16 code units)
console.log([...family].length); // 7 (code points)

// Correct way (grapheme clusters):
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = [...segmenter.segment(family)];
console.log(graphemes.length); // 1 ✓ (what the user sees)

// Byte size:
const encoder = new TextEncoder();
console.log(encoder.encode(family).length); // 25 bytes for "one character"!

// Why this matters:
// A user types "👨‍👩‍👧‍👦" in a field limited to "10 characters"
// Old validation: family.length = 11 > 10, REJECTED!
// The user sees ONE character and gets "input too long"

// Flag emoji:
const flag = "🇯🇵"; // Japanese flag
console.log(flag.length);      // 4 (2 code units × 2)
console.log([...flag].length); // 2 (regional indicator symbols)
// User sees: 1 flag

// Skin-toned emoji:
const wave = "👋🏽"; // Waving hand, medium skin tone
console.log(wave.length);      // 4
console.log([...wave].length); // 2 (hand + skin modifier)
// User sees: 1 hand
```

Every string has four different 'lengths': byte length (storage), code unit count (JavaScript .length), code point count (spread operator), and grapheme count (user-perceived). Using the wrong measure for validation, truncation, or display causes bugs. For user-facing limits, use grapheme count.
Understanding character sizing has practical implications for system design.
Database column sizing:
When databases define VARCHAR(N), what does N mean?
| Database | VARCHAR(100) Means | Storage Limit |
|---|---|---|
| MySQL (utf8mb4) | 100 characters | Up to 400 bytes |
| PostgreSQL | 100 characters | Up to 400 bytes (with UTF-8) |
| SQL Server | 100 bytes (VARCHAR) or 100 chars (NVARCHAR) | Varies |
| Oracle | 100 bytes (default) or VARCHAR2(100 CHAR) for chars | Varies by mode |
| SQLite | Soft limit only | No hard enforcement |
Practical sizing example:

Designing a username field that allows up to 30 characters: in the worst case, every character is a 4-byte emoji or supplementary code point, so the budget is 30 characters × 4 bytes = 120 bytes. To safely store 30 characters of any script, allocate space for 120 bytes.
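Here is a minimal validation sketch tying the two limits together; the constant values and function name are illustrative assumptions, not a prescribed API:

```javascript
// Validate a username against both the user-facing and the storage limit.
// MAX_GRAPHEMES is what the user is told; MAX_BYTES is the column budget.
const MAX_GRAPHEMES = 30;
const MAX_BYTES = 120; // 30 × 4-byte worst case

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const encoder = new TextEncoder();

function validateUsername(name) {
  const graphemes = [...segmenter.segment(name)].length; // what the user sees
  const bytes = encoder.encode(name).length;             // what the database stores
  return graphemes <= MAX_GRAPHEMES && bytes <= MAX_BYTES;
}

console.log(validateUsername("ながいにほんごのなまえ")); // true: 11 graphemes, 33 bytes
console.log(validateUsername("a".repeat(31)));            // false: 31 graphemes
```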
Memory in programming languages:
Different languages store strings differently:
```python
# Python's flexible string representation

# Python uses the smallest representation that fits all characters
s1 = "hello"   # Latin-1: 5 bytes (1 byte per char)
s2 = "héllo"   # Still Latin-1: 5 bytes (é fits in Latin-1)
s3 = "hεllo"   # UCS-2: 10 bytes (Greek ε requires 2 bytes per char)
s4 = "h😀llo"  # UCS-4: 20 bytes of character data (4 bytes per char)

import sys
print(sys.getsizeof(s1))  # ~54 bytes (5 chars + object overhead)
print(sys.getsizeof(s4))  # ~76 bytes (5 chars × 4 + overhead)

# The entire string is upgraded to the widest character's requirement.
# This is why mixing emoji with ASCII uses more memory.
```

```go
// Go's UTF-8 approach
s := "Hello 你好"
len(s)         // 12 (bytes: 6 ASCII + 6 UTF-8 for Chinese)
len([]rune(s)) // 8 (code points: 6 + 2)

// Iterating by bytes vs code points
for i := 0; i < len(s); i++ { _ = s[i] } // iterates 12 bytes
for _, r := range s { _ = r }            // iterates 8 code points
```

Before manipulating strings, understand your language's internal representation. What does .length return? What happens when you index into a string? What does slicing do? These behaviors vary dramatically across languages, and assuming one behavior in another language causes bugs.
Let's consolidate practical rules of thumb for estimating and working with character sizes.
UTF-8 sizing heuristics:

- ASCII (English letters, digits, basic punctuation): 1 byte per character
- Latin extensions, Greek, Cyrillic, Hebrew, Arabic: 2 bytes per character
- CJK and most other BMP scripts: 3 bytes per character
- Emoji, historic scripts, everything beyond the BMP: 4 bytes per code point

The sketch below encodes these thresholds directly.
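The names `utf8Bytes` and `estimate` here are illustrative, not standard APIs:

```javascript
// UTF-8 byte cost of a single code point, per the ranges above
function utf8Bytes(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // Latin extensions, Cyrillic, Arabic...
  if (codePoint <= 0xffff) return 3; // rest of the BMP, including CJK
  return 4;                          // supplementary planes, including emoji
}

// Estimate a whole string's UTF-8 size by summing per code point
const estimate = (s) =>
  [...s].reduce((n, ch) => n + utf8Bytes(ch.codePointAt(0)), 0);

console.log(estimate("Café 中 😀")); // 1+1+1+2+1+3+1+4 = 14
```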
Safe capacity planning:

When planning storage for international text:

- Multiply the user-facing character limit by 4 bytes to get the byte budget
- Validate user input by grapheme count, not by .length or byte count
- Confirm whether your database's VARCHAR(N) counts bytes or characters
Common mistakes to avoid:

- Assuming one character equals one byte (the VARCHAR bug that opened this page)
- Using .length for user-facing limits; it counts UTF-16 code units, not what users see
- Truncating at a fixed offset, which can split a multi-byte sequence and corrupt the text, as the sketch below shows
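A short demonstration of the truncation hazard (Intl.Segmenter is the standard segmentation API; the `truncate` helper is an illustrative sketch):

```javascript
// Naive .slice() counts UTF-16 code units and can split a surrogate pair
const s = "Hi😀";
console.log(s.slice(0, 3)); // "Hi\ud83d": ends in a lone surrogate, renders as �

// Safer: truncate by grapheme clusters
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const truncate = (str, max) =>
  [...seg.segment(str)].slice(0, max).map((g) => g.segment).join('');

console.log(truncate("Hi😀", 3)); // "Hi😀" (3 graphemes, kept intact)
console.log(truncate("Hi😀", 2)); // "Hi"
```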
When uncertain, assume worst case: 4 bytes per user-perceived character. This accommodates all Unicode including emoji sequences. Storage is cheap; data corruption is expensive. Better to over-allocate than to truncate user data.
We've built a comprehensive understanding of how characters are sized and stored. Let's consolidate the key insights:

- Characters and bytes are different units; a character's byte cost depends on which character and which encoding
- UTF-8 uses 1 to 4 bytes per code point, UTF-16 uses 2 or 4, and UTF-32 always uses 4
- Users perceive grapheme clusters, which can span many code points and many bytes
- Every string has four lengths: bytes, code units, code points, and graphemes; pick the measure that matches the task
- For capacity planning, budget 4 bytes per user-perceived character
Module complete: Character Data Types
You've now completed a comprehensive exploration of character data types, from code points and encodings through grapheme clusters to practical storage sizing.
With this foundation, you understand how text is represented at the primitive level—essential knowledge for any software engineer working with real-world data.
You now have solid intuition for character sizing and encoding—knowledge that prevents entire classes of bugs around truncation, storage, and validation. You understand the tradeoffs between fixed and variable-width encodings, and you know how to reason about text storage accurately.