Names, addresses, emails, product descriptions, comments, JSON documents, XML payloads, log messages—textual data pervades every database system. Yet text storage is deceptively complex. A single 'character' can be 1 byte or 4 bytes. A 'simple' string comparison can produce different results depending on collation settings. Storage space for VARCHAR versus CHAR can differ by orders of magnitude.
Character types are not a solved problem—they're a design decision with cascading implications. This page provides comprehensive coverage of SQL character types, from fundamental concepts to advanced Unicode considerations and vendor-specific optimizations.
By the end of this page, you will master fixed-length versus variable-length character types, understand the critical role of character sets and collations, navigate Unicode storage decisions, and apply best practices for text storage in production databases.
SQL provides several character types, each designed for specific use cases. Understanding the distinctions is fundamental to effective schema design.
Core SQL Standard Character Types:
| Type | Syntax | Length Behavior | Storage | Primary Use |
|---|---|---|---|---|
| CHAR | CHAR(n) | Fixed-length, right-padded | Exactly n chars | Codes, IDs with fixed format |
| VARCHAR | VARCHAR(n) | Variable-length, up to n | Actual length + overhead | Most text data |
| TEXT/CLOB | TEXT or CLOB | Variable, very large | Up to GB of text | Documents, logs, JSON |
| NCHAR | NCHAR(n) | Fixed-length Unicode | n × 2-4 bytes | International fixed text |
| NVARCHAR | NVARCHAR(n) | Variable-length Unicode | Actual × 2-4 bytes | International variable text |
The Fundamental Choice: Fixed vs. Variable Length
The distinction between CHAR and VARCHAR represents a fundamental storage trade-off:
CHAR(n): Always stores exactly n characters. Shorter values are right-padded with spaces. Every row uses identical storage, enabling predictable performance but potentially wasting space.
VARCHAR(n): Stores only the actual characters plus a small length prefix (typically 1-2 bytes). Efficient for strings of varying length but requires dynamic storage management.
In contemporary database design, VARCHAR is the default choice for nearly all text columns. CHAR is reserved for truly fixed-format data like ISO country codes (CHAR(2)), currency codes (CHAR(3)), or legacy system codes. Storage is cheap; wasted bytes in CHAR columns add up across billions of rows.
CHAR(n) stores exactly n characters, regardless of the actual string length. Values shorter than n are right-padded with spaces; values longer than n cause an error (or silent truncation in some configurations).
Characteristics:
- Storage is always exactly n characters; shorter values are right-padded with spaces
- Values longer than n raise an error (or are silently truncated, depending on configuration)
- Padding can leak into comparisons, length functions, and concatenation
```sql
-- CHAR type declarations and behavior
CREATE TABLE country_codes (
    country_code CHAR(2) PRIMARY KEY,   -- 'US', 'GB', 'JP'
    country_name VARCHAR(100) NOT NULL
);

CREATE TABLE currencies (
    currency_code CHAR(3) PRIMARY KEY,  -- 'USD', 'EUR', 'GBP'
    currency_name VARCHAR(50) NOT NULL,
    symbol CHAR(1)                      -- '$', '€', '£'
);

-- Fixed-format identifiers
CREATE TABLE products (
    -- SKU format: 3 letters + 5 digits + 2 letters (e.g., 'ABC12345XY')
    sku CHAR(10) PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL
);

-- Demonstrating padding behavior
INSERT INTO country_codes (country_code, country_name)
VALUES ('US', 'United States');
-- The stored value is 'US' (no padding needed - exact length)

INSERT INTO country_codes (country_code, country_name)
VALUES ('A', 'Test');
-- Stored as 'A ' (right-padded with space)

-- Comparison behavior varies by database!
-- Standard SQL: 'A' = 'A ' returns TRUE (spaces ignored)
-- Some configurations: 'A' = 'A ' returns FALSE

-- Length functions may or may not include padding
SELECT country_code, LENGTH(country_code), CHAR_LENGTH(country_code)
FROM country_codes;
-- Results vary: some DBs return 2, others return actual length without padding

-- Concatenation includes padding!
SELECT country_code || '-' || country_name
    -- 'US-United States' or 'US -United States'?
FROM country_codes
WHERE country_code = 'US';
```

CHAR padding behavior causes subtle bugs. If you store 'AB' in CHAR(5) and compare it to 'AB' (without padding), different databases handle this differently. PostgreSQL pads during comparison (equal). MySQL in some modes does not (not equal). Always test comparison behavior in your specific database configuration.
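The PAD SPACE comparison rule can be modeled outside the database. This minimal Python sketch (the function name `pad_space_equal` is illustrative, not any database's API) shows why `'AB'` stored in CHAR(5) can compare equal or unequal to `'AB'` depending on semantics:

```python
# A minimal model of SQL comparison semantics for CHAR values.
def pad_space_equal(a: str, b: str) -> bool:
    # PAD SPACE collations ignore trailing spaces when comparing
    return a.rstrip(" ") == b.rstrip(" ")

stored = "AB".ljust(5)                 # what CHAR(5) physically holds: 'AB   '
print(pad_space_equal(stored, "AB"))   # True  (PAD SPACE, e.g. PostgreSQL CHAR)
print(stored == "AB")                  # False (NO PAD / binary semantics)
```

Both results are "correct" under their respective rules, which is exactly why cross-database CHAR comparisons need testing.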
VARCHAR(n) stores variable-length character data up to a maximum of n characters. Only the actual characters plus a small length prefix are stored, making it efficient for strings of varying length.
Storage Mechanics:
VARCHAR(100) storing 'Hello'
├── Length prefix: 1-2 bytes (stores value 5)
├── Data: 5 bytes ('Hello')
└── Total: 6-7 bytes (vs. 100 bytes for CHAR(100))
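The storage mechanics above can be turned into back-of-the-envelope arithmetic. This Python sketch is a simplified cost model (the helper names `char_bytes` and `varchar_bytes` are illustrative; real engines add row headers, alignment, and other overhead):

```python
def char_bytes(n: int) -> int:
    """CHAR(n) in a single-byte character set always occupies n bytes."""
    return n

def varchar_bytes(value: str, prefix: int = 1) -> int:
    """VARCHAR stores the actual encoded bytes plus a 1-2 byte length prefix."""
    return prefix + len(value.encode("utf-8"))

print(char_bytes(100))         # 100 -> CHAR(100) costs 100 bytes regardless of value
print(varchar_bytes("Hello"))  # 6   -> 1-byte prefix + 5 data bytes
```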
The n in VARCHAR(n):
The limit n is a validation cap, not a storage allocation: a 20-character string occupies the same space whether the column is VARCHAR(100) or VARCHAR(1000). Choose n to document the expected shape of the data.
```sql
-- VARCHAR type declarations
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,   -- Few names exceed 50 chars
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(255) NOT NULL,       -- Email spec allows up to 254
    phone VARCHAR(20),                 -- International formats vary
    address_line1 VARCHAR(100),
    address_line2 VARCHAR(100),
    city VARCHAR(50),
    postal_code VARCHAR(20),           -- Some countries have long codes
    notes VARCHAR(2000)                -- Extended comments
);

-- Email-specific considerations
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    username VARCHAR(30) NOT NULL,     -- Enforce reasonable limits
    email VARCHAR(255) NOT NULL UNIQUE,
    -- RFC 5321 specifies 254 max, but 255 is commonly used
    display_name VARCHAR(100)
);

-- URL storage
CREATE TABLE web_pages (
    page_id INTEGER PRIMARY KEY,
    url VARCHAR(2083) NOT NULL,        -- IE historical limit
    -- Modern browsers support 64K+, but 2083 is a safe default
    title VARCHAR(500),
    meta_description VARCHAR(160)      -- SEO best practice
);

-- Demonstrating variable-length storage efficiency
-- Assume average name length is 8 characters
-- CHAR(50): 50 bytes per name × 1 million rows = 50 MB
-- VARCHAR(50): ~9 bytes per name × 1 million rows = 9 MB
-- Savings: 41 MB (82% reduction in storage for names alone)

-- Length validation in application or database
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    short_desc VARCHAR(500),
    CONSTRAINT chk_name_min_length CHECK (LENGTH(product_name) >= 3)
);
```

In modern databases, VARCHAR(100) and VARCHAR(1000) use the same storage for a 20-character string. The difference is in validation (rejecting >n characters) and memory allocation during queries. Choose meaningful limits that document expected data, not micro-optimized byte counts.
For very large text content—documents, log files, JSON payloads, HTML content—SQL provides large object types. Terminology and limits vary by database:
| Database | Type | Maximum Size |
|---|---|---|
| PostgreSQL | TEXT | Unlimited (1 GB practical) |
| MySQL | TEXT | 65,535 bytes |
| MySQL | MEDIUMTEXT | 16 MB |
| MySQL | LONGTEXT | 4 GB |
| SQL Server | VARCHAR(MAX) | 2 GB |
| SQL Server | TEXT (deprecated) | 2 GB |
| Oracle | CLOB | 4 GB |
TEXT vs. VARCHAR: TEXT types are designed for large content that may exceed typical VARCHAR limits. They often have different storage mechanisms (out-of-row storage), indexing limitations, and performance characteristics.
```sql
-- TEXT/CLOB use cases

-- Document content storage
CREATE TABLE articles (
    article_id INTEGER PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    summary VARCHAR(1000),
    content TEXT NOT NULL,             -- Full article body
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- JSON storage (before native JSON types)
CREATE TABLE api_logs (
    log_id INTEGER PRIMARY KEY,
    endpoint VARCHAR(500) NOT NULL,
    request_body TEXT,                 -- JSON request payload
    response_body TEXT,                -- JSON response payload
    logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Log aggregation
CREATE TABLE audit_logs (
    log_id BIGINT PRIMARY KEY,
    log_level VARCHAR(10) NOT NULL,
    message TEXT NOT NULL,             -- Variable-length log messages
    stack_trace TEXT,                  -- Exception details
    context_json TEXT                  -- Additional structured data
);

-- HTML content
CREATE TABLE email_templates (
    template_id INTEGER PRIMARY KEY,
    template_name VARCHAR(100) NOT NULL,
    subject VARCHAR(500) NOT NULL,
    html_body TEXT NOT NULL,           -- Full HTML email
    text_body TEXT                     -- Plain-text version
);

-- PostgreSQL: TEXT is effectively unlimited (1 GB practical limit)
-- No need for MEDIUMTEXT/LONGTEXT variants

-- MySQL: Choose based on expected size
CREATE TABLE mysql_content (
    id INT PRIMARY KEY,
    short_content TEXT,                -- Up to 64 KB
    medium_content MEDIUMTEXT,         -- Up to 16 MB
    large_content LONGTEXT             -- Up to 4 GB
);

-- SQL Server: VARCHAR(MAX) is preferred over deprecated TEXT
CREATE TABLE sqlserver_content (
    id INT PRIMARY KEY,
    content VARCHAR(MAX)               -- Up to 2 GB, modern approach
);
```

TEXT columns often cannot be used as primary keys or in indexes (or have length restrictions for indexing). Some databases store TEXT out-of-row, impacting query performance. Functions like comparison and sorting may behave differently. Always verify TEXT type behavior in your specific database system.
A character set defines the set of characters that can be stored; encoding defines how those characters are represented as bytes. This distinction is critical for international applications storing non-ASCII text.
Common Character Encodings:
| Encoding | Bytes per Character | ASCII Compatible | Use Case |
|---|---|---|---|
| ASCII | 1 | Yes | Legacy systems, English-only |
| Latin-1 (ISO-8859-1) | 1 | Yes | Western European languages |
| UTF-8 | 1-4 | Yes | Web, modern applications, international |
| UTF-16 | 2-4 | No | Windows internals, Java strings |
| UTF-32 | 4 | No | Internal processing (rarely stored) |
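The per-character byte costs in the table are easy to verify. The Python snippet below uses `str.encode`, which produces the same byte sequences a database would store for each encoding:

```python
# Byte cost of the same characters under different encodings
for ch in ["A", "é", "中", "😀"]:
    sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(ch, sizes)
# 'A' is 1 byte in UTF-8; 'é' is 2; '中' is 3; '😀' is 4.
# UTF-16 uses 2 bytes for all of these except '😀' (4, a surrogate pair).

# Single-byte encodings cover only their own repertoire:
print(len("é".encode("latin-1")))  # 1 byte in Latin-1, vs. 2 in UTF-8
```

Note that `'中'.encode("latin-1")` would raise an error, which is the encoding-level analogue of a database rejecting characters outside its character set.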
```sql
-- Database-level character set configuration

-- PostgreSQL: Set encoding at database creation
CREATE DATABASE myapp_db
    WITH ENCODING = 'UTF8'
    LC_COLLATE = 'en_US.UTF-8'
    LC_CTYPE = 'en_US.UTF-8';

-- MySQL: Database and table character sets
CREATE DATABASE myapp_db
    CHARACTER SET utf8mb4              -- Full UTF-8 (4-byte support)
    COLLATE utf8mb4_unicode_ci;        -- Case-insensitive Unicode collation

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50) CHARACTER SET utf8mb4,
    bio TEXT CHARACTER SET utf8mb4
);

-- SQL Server: Uses NVARCHAR for Unicode
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username NVARCHAR(50),             -- Unicode-capable
    legacy_code VARCHAR(10)            -- Non-Unicode (uses database collation)
);

-- Oracle: AL32UTF8 is recommended for Unicode
ALTER DATABASE CHARACTER SET AL32UTF8;

-- Storage implications of encoding
-- ASCII text: 1 byte per character in any ASCII-compatible encoding
-- European accents (é, ü): 1 byte in Latin-1, 2 bytes in UTF-8
-- Asian characters (中, 日): 3 bytes in UTF-8
-- Emoji (😀): 4 bytes in UTF-8

-- Example: Japanese text storage
-- 'こんにちは' (5 characters)
-- UTF-8: 15 bytes (3 bytes per hiragana character)
-- NVARCHAR with UTF-16: 10 bytes (2 bytes per character)
```

MySQL's 'utf8' is actually a 3-byte subset of UTF-8 that cannot store 4-byte characters like emoji (😀) or some rare Chinese characters. Always use 'utf8mb4' for true UTF-8 support. This is one of the most common MySQL configuration mistakes, causing silent data corruption for emoji and special characters.
A collation defines rules for comparing and sorting character data. Collations determine:
- Whether comparisons are case sensitive ('Laptop' vs. 'LAPTOP')
- Whether accents are significant ('é' vs. 'e')
- Sort order, including where locale-specific characters such as 'ä' fall
Collation Naming Conventions:
Most databases use structured collation names encoding their properties:
Collation: en_US_ci_ai
├── en_US - Locale (English, United States)
├── ci - Case Insensitive
└── ai - Accent Insensitive
Common Suffixes:
- `_ci` / `_cs`: Case Insensitive / Case Sensitive
- `_ai` / `_as`: Accent Insensitive / Accent Sensitive
- `_bin`: Binary comparison (byte-by-byte, locale-unaware)
- `_unicode`: Unicode Collation Algorithm
```sql
-- Collation effects on comparison and sorting

-- PostgreSQL: Collation specification
CREATE TABLE users (
    email VARCHAR(255) COLLATE "en_US" UNIQUE
);

-- Case-insensitive comparison (PostgreSQL)
SELECT * FROM users
WHERE email = 'John@Example.com' COLLATE "en_US";
-- With a case-insensitive collation: matches 'john@example.com'

-- MySQL: Column-level collation
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(200)
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci,    -- Case-insensitive
    product_code VARCHAR(20)
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_bin            -- Binary, case-sensitive
);

-- Comparison behavior demonstration
-- With utf8mb4_unicode_ci:
SELECT * FROM products WHERE product_name = 'LAPTOP';
-- Matches: 'Laptop', 'laptop', 'LAPTOP', 'LaPtOp'

-- With utf8mb4_bin:
SELECT * FROM products WHERE product_code = 'ABC123';
-- Only matches exactly 'ABC123', not 'abc123'

-- SQL Server: Collation in queries
SELECT *
FROM customers
WHERE last_name = 'müller' COLLATE Latin1_General_CI_AI;
-- CI_AI = Case Insensitive, Accent Insensitive
-- Matches: 'Müller', 'MULLER', 'muller', 'Muller'

-- Sorting behavior varies by collation
-- German collation: ä sorts with a (Müller before Nyman)
-- Swedish collation: ä sorts after z (Müller after Zylstra)

SELECT last_name FROM customers
ORDER BY last_name COLLATE "de_DE";    -- German sorting

SELECT last_name FROM customers
ORDER BY last_name COLLATE "sv_SE";    -- Swedish sorting

-- Binary collation for performance
-- Faster comparisons (no linguistic rules)
-- But: 'A' ≠ 'a', 'é' ≠ 'e'
CREATE INDEX idx_username_bin ON users (username COLLATE "C");
```

Joining or comparing columns with different collations causes errors or implicit conversions. Consistency is essential: choose one collation per database and stick to it. If you must mix, use explicit COLLATE clauses in queries.
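The CI/AI behavior above can be modeled to build intuition. This Python sketch derives a rough case-insensitive, accent-insensitive comparison key (the function `ci_ai_key` is illustrative; real collations follow the full Unicode Collation Algorithm, which handles far more than combining marks):

```python
import unicodedata

def ci_ai_key(s: str) -> str:
    """Rough CI_AI collation key: decompose (NFD), drop combining
    accent marks (category Mn), then casefold."""
    decomposed = unicodedata.normalize("NFD", s)
    no_accents = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return no_accents.casefold()

print(ci_ai_key("LAPTOP") == ci_ai_key("laptop"))  # True  (case-insensitive)
print(ci_ai_key("Müller") == ci_ai_key("MULLER"))  # True  (accent-insensitive)
print("ABC123" == "abc123")                        # False (binary/_bin semantics)
```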
NCHAR and NVARCHAR are Unicode-specific character types found in SQL Server and Oracle (and supported for compatibility in other databases). The 'N' prefix stands for 'National' (as in national character set, i.e., Unicode).
SQL Server Unicode Types:
| Type | Storage | Encoding | Maximum |
|---|---|---|---|
| NCHAR(n) | 2n bytes | UTF-16 | n = 4000 chars |
| NVARCHAR(n) | 2× actual | UTF-16 | n = 4000 chars |
| NVARCHAR(MAX) | 2× actual | UTF-16 | ~1 billion chars |
When NVARCHAR is Required:
In SQL Server, use NVARCHAR whenever values may contain characters outside the database collation's code page: international names, CJK or Cyrillic text, emoji, or any data whose language you cannot predict.
```sql
-- SQL Server: VARCHAR vs NVARCHAR comparison

-- Non-Unicode (uses the database collation's code page)
CREATE TABLE legacy_customers (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(50),   -- Cannot store characters outside the code page
    last_name VARCHAR(50)
);

-- Unicode (supports all languages)
CREATE TABLE international_customers (
    customer_id INT PRIMARY KEY,
    first_name NVARCHAR(50),  -- Can store any Unicode character
    last_name NVARCHAR(50),
    display_name NVARCHAR(100)
);

-- Inserting international data
INSERT INTO international_customers (customer_id, first_name, last_name, display_name)
VALUES
    (1, N'José', N'García', N'José García'),    -- Spanish
    (2, N'山田', N'太郎', N'山田 太郎'),         -- Japanese
    (3, N'Müller', N'Hans', N'Hans Müller'),    -- German
    (4, N'Иван', N'Петров', N'Иван Петров');    -- Russian

-- The N prefix is crucial for Unicode string literals
INSERT INTO international_customers VALUES (5, N'😀', N'Emoji', N'😀 Emoji');

-- Without the N prefix, Unicode characters may be corrupted or cause errors
INSERT INTO international_customers VALUES (6, '中文', 'Test', 'Test中文');    -- WRONG!
INSERT INTO international_customers VALUES (6, N'中文', N'Test', N'Test中文'); -- Correct

-- Storage comparison
-- VARCHAR: 'Hello' = 5 bytes
-- NVARCHAR: N'Hello' = 10 bytes (2 bytes per character)

-- Performance consideration: NVARCHAR uses 2× storage
-- For pure ASCII data, this doubles table and index sizes

-- Oracle: NVARCHAR2 and NCHAR
CREATE TABLE oracle_international (
    id NUMBER(10) PRIMARY KEY,
    name NVARCHAR2(100)       -- Unicode variable-length
);

-- PostgreSQL: No separate NVARCHAR (TEXT/VARCHAR with UTF-8 covers all)
-- Just ensure the database encoding is UTF8
CREATE TABLE pg_international (
    id INTEGER PRIMARY KEY,
    name VARCHAR(100)         -- UTF-8 handles all Unicode
);
```

PostgreSQL uses UTF-8 encoding for all text types, making NVARCHAR unnecessary. SQL Server's VARCHAR uses the database collation's code page (potentially single-byte), requiring NVARCHAR for full Unicode. MySQL with utf8mb4 also handles all Unicode in regular VARCHAR types.
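The NVARCHAR storage costs discussed above follow directly from UTF-16 encoding, and can be checked with a short sketch (the helper `nvarchar_bytes` is an illustrative model, not a SQL Server API):

```python
def nvarchar_bytes(value: str) -> int:
    """Model NVARCHAR storage: UTF-16 uses 2 bytes per code unit, and
    supplementary characters like emoji take two code units (4 bytes)."""
    return len(value.encode("utf-16-le"))

print(nvarchar_bytes("Hello"))  # 10 -> double VARCHAR's 5 bytes for ASCII
print(nvarchar_bytes("山田"))    # 4  -> two BMP characters, 2 bytes each
print(nvarchar_bytes("😀"))     # 4  -> one surrogate pair
```

This also shows why NVARCHAR(n) counts code units, not user-perceived characters: a single emoji consumes two of the n.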
Character type selection impacts data integrity, storage efficiency, query performance, and internationalization capability. Apply these principles for effective text storage:
| Data Domain | Recommended Type | Key Considerations |
|---|---|---|
| Fixed codes (ISO country) | CHAR(n) | Exact length, predictable storage |
| Names, addresses | VARCHAR(50-100) | Most fit under 50; buffer for edge cases |
| Email addresses | VARCHAR(255) | RFC 5321 spec, case-insensitive collation |
| URLs | VARCHAR(2000+) | Browser limits vary, consider TEXT |
| Descriptions | VARCHAR(500-2000) | UI-driven, searchable |
| Articles, documents | TEXT | Large content, consider full-text indexing |
| JSON data | JSON type or TEXT | Prefer native JSON where available |
| International text | UTF-8/NVARCHAR | Full Unicode support required |
You now have comprehensive knowledge of SQL character types, from basic VARCHAR/CHAR distinctions through Unicode encoding and collation semantics. Next, we'll explore date and time types—critical for temporal data and often a source of subtle bugs.