Names, addresses, emails, product descriptions, comments, JSON documents, XML payloads, log messages—textual data pervades every database system. Yet text storage is deceptively complex. A single 'character' can be 1 byte or 4 bytes. A 'simple' string comparison can produce different results depending on collation settings. Storage space for VARCHAR versus CHAR can differ by orders of magnitude.
Character types are not a solved problem—they're a design decision with cascading implications. This page provides comprehensive coverage of SQL character types, from fundamental concepts to advanced Unicode considerations and vendor-specific optimizations.
By the end of this page, you will master fixed-length versus variable-length character types, understand the critical role of character sets and collations, navigate Unicode storage decisions, and apply best practices for text storage in production databases.
SQL provides several character types, each designed for specific use cases. Understanding the distinctions is fundamental to effective schema design.
Core SQL Standard Character Types:
| Type | Syntax | Length Behavior | Storage | Primary Use |
|---|---|---|---|---|
| CHAR | CHAR(n) | Fixed-length, right-padded | Exactly n chars | Codes, IDs with fixed format |
| VARCHAR | VARCHAR(n) | Variable-length, up to n | Actual length + overhead | Most text data |
| TEXT/CLOB | TEXT or CLOB | Variable, very large | Up to GB of text | Documents, logs, JSON |
| NCHAR | NCHAR(n) | Fixed-length Unicode | n × 2-4 bytes | International fixed text |
| NVARCHAR | NVARCHAR(n) | Variable-length Unicode | Actual × 2-4 bytes | International variable text |
The Fundamental Choice: Fixed vs. Variable Length
The distinction between CHAR and VARCHAR represents a fundamental storage trade-off:
CHAR(n): Always stores exactly n characters. Shorter values are right-padded with spaces. Every row uses identical storage, enabling predictable performance but potentially wasting space.
VARCHAR(n): Stores only the actual characters plus a small length prefix (typically 1-2 bytes). Efficient for strings of varying length but requires dynamic storage management.
In contemporary database design, VARCHAR is the default choice for nearly all text columns. CHAR is reserved for truly fixed-format data like ISO country codes (CHAR(2)), currency codes (CHAR(3)), or legacy system codes. Storage is cheap; wasted bytes in CHAR columns add up across billions of rows.
CHAR(n) stores exactly n characters, regardless of the actual string length. Values shorter than n are right-padded with spaces; values longer than n cause an error (or silent truncation in some configurations).
Characteristics:
- Storage is always exactly n characters; shorter values are right-padded with spaces
- Values longer than n raise an error (or are silently truncated, depending on configuration)
- Padding can leak into comparisons, length functions, and concatenation
```sql
-- CHAR type declarations and behavior
CREATE TABLE country_codes (
    country_code CHAR(2) PRIMARY KEY,   -- 'US', 'GB', 'JP'
    country_name VARCHAR(100) NOT NULL
);

CREATE TABLE currencies (
    currency_code CHAR(3) PRIMARY KEY,  -- 'USD', 'EUR', 'GBP'
    currency_name VARCHAR(50) NOT NULL,
    symbol CHAR(1)                      -- '$', '€', '£'
);

-- Fixed-format identifiers
CREATE TABLE products (
    -- SKU format: 3 letters + 5 digits + 2 letters (e.g., 'ABC12345XY')
    sku CHAR(10) PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL
);

-- Demonstrating padding behavior
INSERT INTO country_codes (country_code, country_name)
VALUES ('US', 'United States');
-- The stored value is 'US' (no padding needed - exact length)

INSERT INTO country_codes (country_code, country_name)
VALUES ('A', 'Test');
-- Stored as 'A ' (right-padded with space)

-- Comparison behavior varies by database!
-- Standard SQL: 'A' = 'A ' returns TRUE (spaces ignored)
-- Some configurations: 'A' = 'A ' returns FALSE

-- Length functions may or may not include padding
SELECT country_code, LENGTH(country_code), CHAR_LENGTH(country_code)
FROM country_codes;
-- Results vary: some DBs return 2, others return actual length without padding

-- Concatenation includes padding!
SELECT country_code || '-' || country_name
    -- 'US-United States' or 'US -United States'?
FROM country_codes
WHERE country_code = 'US';
```

CHAR padding behavior causes subtle bugs. If you store 'AB' in CHAR(5) and compare it to 'AB' (without padding), different databases handle this differently. PostgreSQL pads during comparison (equal). MySQL in some modes does not (not equal). Always test comparison behavior in your specific database configuration.
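The PAD SPACE comparison rule can be modeled outside the database. This minimal Python sketch (the function name `pad_space_equal` is illustrative, not any database's API) shows why `'AB'` stored in CHAR(5) can compare equal or unequal to `'AB'` depending on semantics:

```python
# A minimal model of SQL comparison semantics for CHAR values.
def pad_space_equal(a: str, b: str) -> bool:
    # PAD SPACE collations ignore trailing spaces when comparing
    return a.rstrip(" ") == b.rstrip(" ")

stored = "AB".ljust(5)                 # what CHAR(5) physically holds: 'AB   '
print(pad_space_equal(stored, "AB"))   # True  (PAD SPACE, e.g. PostgreSQL CHAR)
print(stored == "AB")                  # False (NO PAD / binary semantics)
```

Both results are "correct" under their respective rules, which is exactly why cross-database CHAR comparisons need testing.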
VARCHAR(n) stores variable-length character data up to a maximum of n characters. Only the actual characters plus a small length prefix are stored, making it efficient for strings of varying length.
Storage Mechanics:
VARCHAR(100) storing 'Hello'
├── Length prefix: 1-2 bytes (stores value 5)
├── Data: 5 bytes ('Hello')
└── Total: 6-7 bytes (vs. 100 bytes for CHAR(100))
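The storage mechanics above can be turned into back-of-the-envelope arithmetic. This Python sketch is a simplified cost model (the helper names `char_bytes` and `varchar_bytes` are illustrative; real engines add row headers, alignment, and other overhead):

```python
def char_bytes(n: int) -> int:
    """CHAR(n) in a single-byte character set always occupies n bytes."""
    return n

def varchar_bytes(value: str, prefix: int = 1) -> int:
    """VARCHAR stores the actual encoded bytes plus a 1-2 byte length prefix."""
    return prefix + len(value.encode("utf-8"))

print(char_bytes(100))         # 100 -> CHAR(100) costs 100 bytes regardless of value
print(varchar_bytes("Hello"))  # 6   -> 1-byte prefix + 5 data bytes
```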
The n in VARCHAR(n):
The limit n is a validation cap, not a storage allocation: a 20-character string occupies the same space whether the column is VARCHAR(100) or VARCHAR(1000). Choose n to document the expected shape of the data.
```sql
-- VARCHAR type declarations
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,   -- Few names exceed 50 chars
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(255) NOT NULL,       -- Email spec allows up to 254
    phone VARCHAR(20),                 -- International formats vary
    address_line1 VARCHAR(100),
    address_line2 VARCHAR(100),
    city VARCHAR(50),
    postal_code VARCHAR(20),           -- Some countries have long codes
    notes VARCHAR(2000)                -- Extended comments
);

-- Email-specific considerations
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    username VARCHAR(30) NOT NULL,     -- Enforce reasonable limits
    email VARCHAR(255) NOT NULL UNIQUE,
    -- RFC 5321 specifies 254 max, but 255 is commonly used
    display_name VARCHAR(100)
);

-- URL storage
CREATE TABLE web_pages (
    page_id INTEGER PRIMARY KEY,
    url VARCHAR(2083) NOT NULL,        -- IE historical limit
    -- Modern browsers support 64K+, but 2083 is a safe default
    title VARCHAR(500),
    meta_description VARCHAR(160)      -- SEO best practice
);

-- Demonstrating variable-length storage efficiency
-- Assume average name length is 8 characters
-- CHAR(50): 50 bytes per name × 1 million rows = 50 MB
-- VARCHAR(50): ~9 bytes per name × 1 million rows = 9 MB
-- Savings: 41 MB (82% reduction in storage for names alone)

-- Length validation in application or database
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    short_desc VARCHAR(500),
    CONSTRAINT chk_name_min_length CHECK (LENGTH(product_name) >= 3)
);
```

In modern databases, VARCHAR(100) and VARCHAR(1000) use the same storage for a 20-character string. The difference is in validation (rejecting >n characters) and memory allocation during queries. Choose meaningful limits that document expected data, not micro-optimized byte counts.
For very large text content—documents, log files, JSON payloads, HTML content—SQL provides large object types. Terminology and limits vary by database:
| Database | Type | Maximum Size |
|---|---|---|
| PostgreSQL | TEXT | Unlimited (1 GB practical) |
| MySQL | TEXT | 65,535 bytes |
| MySQL | MEDIUMTEXT | 16 MB |
| MySQL | LONGTEXT | 4 GB |
| SQL Server | VARCHAR(MAX) | 2 GB |
| SQL Server | TEXT (deprecated) | 2 GB |
| Oracle | CLOB | 4 GB |
TEXT vs. VARCHAR: TEXT types are designed for large content that may exceed typical VARCHAR limits. They often have different storage mechanisms (out-of-row storage), indexing limitations, and performance characteristics.
```sql
-- TEXT/CLOB use cases

-- Document content storage
CREATE TABLE articles (
    article_id INTEGER PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    summary VARCHAR(1000),
    content TEXT NOT NULL,             -- Full article body
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- JSON storage (before native JSON types)
CREATE TABLE api_logs (
    log_id INTEGER PRIMARY KEY,
    endpoint VARCHAR(500) NOT NULL,
    request_body TEXT,                 -- JSON request payload
    response_body TEXT,                -- JSON response payload
    logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Log aggregation
CREATE TABLE audit_logs (
    log_id BIGINT PRIMARY KEY,
    log_level VARCHAR(10) NOT NULL,
    message TEXT NOT NULL,             -- Variable-length log messages
    stack_trace TEXT,                  -- Exception details
    context_json TEXT                  -- Additional structured data
);

-- HTML content
CREATE TABLE email_templates (
    template_id INTEGER PRIMARY KEY,
    template_name VARCHAR(100) NOT NULL,
    subject VARCHAR(500) NOT NULL,
    html_body TEXT NOT NULL,           -- Full HTML email
    text_body TEXT                     -- Plain-text version
);

-- PostgreSQL: TEXT is effectively unlimited (1 GB practical limit)
-- No need for MEDIUMTEXT/LONGTEXT variants

-- MySQL: Choose based on expected size
CREATE TABLE mysql_content (
    id INT PRIMARY KEY,
    short_content TEXT,                -- Up to 64 KB
    medium_content MEDIUMTEXT,         -- Up to 16 MB
    large_content LONGTEXT             -- Up to 4 GB
);

-- SQL Server: VARCHAR(MAX) is preferred over deprecated TEXT
CREATE TABLE sqlserver_content (
    id INT PRIMARY KEY,
    content VARCHAR(MAX)               -- Up to 2 GB, modern approach
);
```

TEXT columns often cannot be used as primary keys or in indexes (or have length restrictions for indexing). Some databases store TEXT out-of-row, impacting query performance. Functions like comparison and sorting may behave differently. Always verify TEXT type behavior in your specific database system.
A character set defines the set of characters that can be stored; encoding defines how those characters are represented as bytes. This distinction is critical for international applications storing non-ASCII text.
Common Character Encodings:
| Encoding | Bytes per Character | ASCII Compatible | Use Case |
|---|---|---|---|
| ASCII | 1 | Yes | Legacy systems, English-only |
| Latin-1 (ISO-8859-1) | 1 | Yes | Western European languages |
| UTF-8 | 1-4 | Yes | Web, modern applications, international |
| UTF-16 | 2-4 | No | Windows internals, Java strings |
| UTF-32 | 4 | No | Internal processing (rarely stored) |
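The per-character byte costs in the table are easy to verify. The Python snippet below uses `str.encode`, which produces the same byte sequences a database would store for each encoding:

```python
# Byte cost of the same characters under different encodings
for ch in ["A", "é", "中", "😀"]:
    sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(ch, sizes)
# 'A' is 1 byte in UTF-8; 'é' is 2; '中' is 3; '😀' is 4.
# UTF-16 uses 2 bytes for all of these except '😀' (4, a surrogate pair).

# Single-byte encodings cover only their own repertoire:
print(len("é".encode("latin-1")))  # 1 byte in Latin-1, vs. 2 in UTF-8
```

Note that `'中'.encode("latin-1")` would raise an error, which is the encoding-level analogue of a database rejecting characters outside its character set.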
```sql
-- Database-level character set configuration

-- PostgreSQL: Set encoding at database creation
CREATE DATABASE myapp_db
    WITH ENCODING = 'UTF8'
    LC_COLLATE = 'en_US.UTF-8'
    LC_CTYPE = 'en_US.UTF-8';

-- MySQL: Database and table character sets
CREATE DATABASE myapp_db
    CHARACTER SET utf8mb4              -- Full UTF-8 (4-byte support)
    COLLATE utf8mb4_unicode_ci;        -- Case-insensitive Unicode collation

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50) CHARACTER SET utf8mb4,
    bio TEXT CHARACTER SET utf8mb4
);

-- SQL Server: Uses NVARCHAR for Unicode
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username NVARCHAR(50),             -- Unicode-capable
    legacy_code VARCHAR(10)            -- Non-Unicode (uses database collation)
);

-- Oracle: AL32UTF8 is recommended for Unicode
ALTER DATABASE CHARACTER SET AL32UTF8;

-- Storage implications of encoding
-- ASCII text: 1 byte per character in any ASCII-compatible encoding
-- European accents (é, ü): 1 byte in Latin-1, 2 bytes in UTF-8
-- Asian characters (中, 日): 3 bytes in UTF-8
-- Emoji (😀): 4 bytes in UTF-8

-- Example: Japanese text storage
-- 'こんにちは' (5 characters)
-- UTF-8: 15 bytes (3 bytes per hiragana character)
-- NVARCHAR with UTF-16: 10 bytes (2 bytes per character)
```

MySQL's 'utf8' is actually a 3-byte subset of UTF-8 that cannot store 4-byte characters like emoji (😀) or some rare Chinese characters. Always use 'utf8mb4' for true UTF-8 support. This is one of the most common MySQL configuration mistakes, causing silent data corruption for emoji and special characters.
A collation defines rules for comparing and sorting character data. Collations determine:
- Whether comparisons are case sensitive ('Laptop' vs. 'LAPTOP')
- Whether accents are significant ('é' vs. 'e')
- Sort order, including where locale-specific characters such as 'ä' fall
Collation Naming Conventions:
Most databases use structured collation names encoding their properties:
Collation: en_US_ci_ai
├── en_US - Locale (English, United States)
├── ci - Case Insensitive
└── ai - Accent Insensitive
Common Suffixes:
- `_ci` / `_cs`: Case Insensitive / Case Sensitive
- `_ai` / `_as`: Accent Insensitive / Accent Sensitive
- `_bin`: Binary comparison (byte-by-byte, locale-unaware)
- `_unicode`: Unicode Collation Algorithm
```sql
-- Collation effects on comparison and sorting

-- PostgreSQL: Collation specification
CREATE TABLE users (
    email VARCHAR(255) COLLATE "en_US" UNIQUE
);

-- Case-insensitive comparison (PostgreSQL)
SELECT * FROM users
WHERE email = 'John@Example.com' COLLATE "en_US";
-- With a case-insensitive collation: matches 'john@example.com'

-- MySQL: Column-level collation
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(200)
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci,    -- Case-insensitive
    product_code VARCHAR(20)
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_bin            -- Binary, case-sensitive
);

-- Comparison behavior demonstration
-- With utf8mb4_unicode_ci:
SELECT * FROM products WHERE product_name = 'LAPTOP';
-- Matches: 'Laptop', 'laptop', 'LAPTOP', 'LaPtOp'

-- With utf8mb4_bin:
SELECT * FROM products WHERE product_code = 'ABC123';
-- Only matches exactly 'ABC123', not 'abc123'

-- SQL Server: Collation in queries
SELECT *
FROM customers
WHERE last_name = 'müller' COLLATE Latin1_General_CI_AI;
-- CI_AI = Case Insensitive, Accent Insensitive
-- Matches: 'Müller', 'MULLER', 'muller', 'Muller'

-- Sorting behavior varies by collation
-- German collation: ä sorts with a (Müller before Nyman)
-- Swedish collation: ä sorts after z (Müller after Zylstra)

SELECT last_name FROM customers
ORDER BY last_name COLLATE "de_DE";    -- German sorting

SELECT last_name FROM customers
ORDER BY last_name COLLATE "sv_SE";    -- Swedish sorting

-- Binary collation for performance
-- Faster comparisons (no linguistic rules)
-- But: 'A' ≠ 'a', 'é' ≠ 'e'
CREATE INDEX idx_username_bin ON users (username COLLATE "C");
```

Joining or comparing columns with different collations causes errors or implicit conversions. Consistency is essential: choose one collation per database and stick to it. If you must mix, use explicit COLLATE clauses in queries.
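The CI/AI behavior above can be modeled to build intuition. This Python sketch derives a rough case-insensitive, accent-insensitive comparison key (the function `ci_ai_key` is illustrative; real collations follow the full Unicode Collation Algorithm, which handles far more than combining marks):

```python
import unicodedata

def ci_ai_key(s: str) -> str:
    """Rough CI_AI collation key: decompose (NFD), drop combining
    accent marks (category Mn), then casefold."""
    decomposed = unicodedata.normalize("NFD", s)
    no_accents = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return no_accents.casefold()

print(ci_ai_key("LAPTOP") == ci_ai_key("laptop"))  # True  (case-insensitive)
print(ci_ai_key("Müller") == ci_ai_key("MULLER"))  # True  (accent-insensitive)
print("ABC123" == "abc123")                        # False (binary/_bin semantics)
```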
NCHAR and NVARCHAR are Unicode-specific character types found in SQL Server and Oracle (and supported for compatibility in other databases). The 'N' prefix stands for 'National' (as in national character set, i.e., Unicode).
SQL Server Unicode Types:
| Type | Storage | Encoding | Maximum |
|---|---|---|---|
| NCHAR(n) | 2n bytes | UTF-16 | n = 4000 chars |
| NVARCHAR(n) | 2× actual | UTF-16 | n = 4000 chars |
| NVARCHAR(MAX) | 2× actual | UTF-16 | ~1 billion chars |
When NVARCHAR is Required:
In SQL Server, use NVARCHAR whenever values may contain characters outside the database collation's code page: international names, CJK or Cyrillic text, emoji, or any data whose language you cannot predict.
```sql
-- SQL Server: VARCHAR vs NVARCHAR comparison

-- Non-Unicode (uses the database collation's code page)
CREATE TABLE legacy_customers (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(50),   -- Cannot store characters outside the code page
    last_name VARCHAR(50)
);

-- Unicode (supports all languages)
CREATE TABLE international_customers (
    customer_id INT PRIMARY KEY,
    first_name NVARCHAR(50),  -- Can store any Unicode character
    last_name NVARCHAR(50),
    display_name NVARCHAR(100)
);

-- Inserting international data
INSERT INTO international_customers (customer_id, first_name, last_name, display_name)
VALUES
    (1, N'José', N'García', N'José García'),    -- Spanish
    (2, N'山田', N'太郎', N'山田 太郎'),         -- Japanese
    (3, N'Müller', N'Hans', N'Hans Müller'),    -- German
    (4, N'Иван', N'Петров', N'Иван Петров');    -- Russian

-- The N prefix is crucial for Unicode string literals
INSERT INTO international_customers VALUES (5, N'😀', N'Emoji', N'😀 Emoji');

-- Without the N prefix, Unicode characters may be corrupted or cause errors
INSERT INTO international_customers VALUES (6, '中文', 'Test', 'Test中文');    -- WRONG!
INSERT INTO international_customers VALUES (6, N'中文', N'Test', N'Test中文'); -- Correct

-- Storage comparison
-- VARCHAR: 'Hello' = 5 bytes
-- NVARCHAR: N'Hello' = 10 bytes (2 bytes per character)

-- Performance consideration: NVARCHAR uses 2× storage
-- For pure ASCII data, this doubles table and index sizes

-- Oracle: NVARCHAR2 and NCHAR
CREATE TABLE oracle_international (
    id NUMBER(10) PRIMARY KEY,
    name NVARCHAR2(100)       -- Unicode variable-length
);

-- PostgreSQL: No separate NVARCHAR (TEXT/VARCHAR with UTF-8 covers all)
-- Just ensure the database encoding is UTF8
CREATE TABLE pg_international (
    id INTEGER PRIMARY KEY,
    name VARCHAR(100)         -- UTF-8 handles all Unicode
);
```

PostgreSQL uses UTF-8 encoding for all text types, making NVARCHAR unnecessary. SQL Server's VARCHAR uses the database collation's code page (potentially single-byte), requiring NVARCHAR for full Unicode. MySQL with utf8mb4 also handles all Unicode in regular VARCHAR types.
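The NVARCHAR storage costs discussed above follow directly from UTF-16 encoding, and can be checked with a short sketch (the helper `nvarchar_bytes` is an illustrative model, not a SQL Server API):

```python
def nvarchar_bytes(value: str) -> int:
    """Model NVARCHAR storage: UTF-16 uses 2 bytes per code unit, and
    supplementary characters like emoji take two code units (4 bytes)."""
    return len(value.encode("utf-16-le"))

print(nvarchar_bytes("Hello"))  # 10 -> double VARCHAR's 5 bytes for ASCII
print(nvarchar_bytes("山田"))    # 4  -> two BMP characters, 2 bytes each
print(nvarchar_bytes("😀"))     # 4  -> one surrogate pair
```

This also shows why NVARCHAR(n) counts code units, not user-perceived characters: a single emoji consumes two of the n.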
Character type selection impacts data integrity, storage efficiency, query performance, and internationalization capability. Apply these principles for effective text storage:
| Data Domain | Recommended Type | Key Considerations |
|---|---|---|
| Fixed codes (ISO country) | CHAR(n) | Exact length, predictable storage |
| Names, addresses | VARCHAR(50-100) | Most fit under 50; buffer for edge cases |
| Email addresses | VARCHAR(255) | RFC 5321 spec, case-insensitive collation |
| URLs | VARCHAR(2000+) | Browser limits vary, consider TEXT |
| Descriptions | VARCHAR(500-2000) | UI-driven, searchable |
| Articles, documents | TEXT | Large content, consider full-text indexing |
| JSON data | JSON type or TEXT | Prefer native JSON where available |
| International text | UTF-8/NVARCHAR | Full Unicode support required |
You now have comprehensive knowledge of SQL character types, from basic VARCHAR/CHAR distinctions through Unicode encoding and collation semantics. Next, we'll explore date and time types—critical for temporal data and often a source of subtle bugs.