Features tell you what a file system can do. Performance tells you how fast it does it. In production environments where milliseconds matter—databases serving queries, build systems compiling code, streaming services delivering content—file system performance directly translates to user experience and operational costs.
But file system performance isn't a single number. It varies dramatically based on:
- The workload pattern: sequential versus random access, and data versus metadata operations
- The underlying storage medium: HDD, SATA SSD, or NVMe
- The degree of concurrency: one thread or hundreds of cores issuing I/O at once
- Configuration and tuning: mount options, record sizes, caching, and journaling modes
This page dissects performance characteristics across these dimensions, providing the analytical framework you need to predict how file systems will behave under your specific workloads.
By the end of this page, you will understand how different file systems perform under various workload patterns, recognize which architectural decisions drive performance differences, appreciate how storage media type affects relative performance, and develop intuition for predicting file system behavior in production scenarios.
File system performance encompasses multiple distinct dimensions. Understanding each dimension—and how they interact—is essential for meaningful performance analysis.
A file system with 2GB/s throughput but poor IOPS is perfect for video editing but terrible for email servers. A system with excellent metadata performance handles source code repositories well but may be overkill for archival storage. Always match performance priorities to actual workload characteristics.
Sequential I/O—reading or writing data in contiguous order—is where file systems approach the theoretical limits of underlying storage. Performance differences primarily stem from:
- Allocation strategy: extent-based allocators keep large files contiguous, minimizing seeks and per-block bookkeeping
- Caching and read-ahead: how aggressively the file system batches writes and prefetches reads
- Structural overhead: journaling, checksumming, and copy-on-write all add work per write
| File System | Sequential Read | Sequential Write | Key Factors | Best Practices |
|---|---|---|---|---|
| FAT32 | Good | Good | Simple, minimal overhead | Acceptable for basic throughput when features aren't needed |
| NTFS | Excellent | Very Good | Extent-based allocation, efficient caching | Enable write caching; use large cluster sizes for throughput |
| ext4 | Excellent | Excellent | Extent-based, delayed allocation | Delayed allocation optimizes layout; use nobarrier cautiously for speed |
| XFS | Excellent | Excellent | Allocation groups enable parallelism | Excels with large files; consider stripe alignment on RAID |
| ZFS | Very Good | Good* | Checksums add overhead; tunable record size | *Compression often compensates; tune recordsize for workload |
| Btrfs | Good | Good* | COW overhead on writes | *COW fragmentation can degrade over time; defragment periodically |
Understanding Delayed Allocation (ext4, XFS):
Delayed allocation postpones block assignment until data is actually flushed to disk. Benefits include:
- The allocator sees the file's near-final size and can choose one large contiguous extent instead of many small ones
- Less fragmentation and fewer allocator invocations under bursty write patterns
- Short-lived temporary files deleted before writeback never consume on-disk allocations at all
The tradeoff is a small window where data exists only in memory—hence the default ordered journaling mode that ensures metadata updates only commit after data reaches disk.
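For application code that cannot tolerate that window, the usual answer is an explicit fsync. The following is a minimal Python sketch (the path and payload are placeholders); it forces both the file and its parent directory entry to stable storage, independent of the file system's allocation strategy.

```python
import os

def write_durably(path: str, data: bytes) -> None:
    """Write data and force it to stable storage.

    Delayed allocation means the kernel may hold dirty data in memory
    for many seconds; fsync() closes that window for data you cannot
    afford to lose, regardless of file system.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # flush file data and metadata to disk
    finally:
        os.close(fd)

    # For a newly created file, the parent directory must also be synced
    # so the directory entry itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

write_durably("/tmp/example.dat", b"critical bytes")  # placeholder path
```

Databases and mail servers do essentially this on every committed transaction, which is why synchronous-write latency features so prominently in the sections below.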
Read-Ahead Mechanisms:
All modern file systems detect sequential access patterns and prefetch upcoming data before the application asks for it. On Linux, the per-device read-ahead window is tunable via /sys/block/<device>/queue/read_ahead_kb. For streaming workloads, increasing read-ahead dramatically improves throughput by ensuring data arrives before it's requested.
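As a rough illustration, here is a small Python sketch for inspecting and raising that tunable via sysfs. The device name is an assumption, writing requires root, and the change does not persist across reboots (a udev rule or tuned profile handles persistence).

```python
from pathlib import Path

def read_ahead_kb(device: str) -> int:
    """Return the current read-ahead window (in KiB) for a block device,
    e.g. device='sda' or 'nvme0n1'."""
    return int(Path(f"/sys/block/{device}/queue/read_ahead_kb").read_text())

def set_read_ahead_kb(device: str, kb: int) -> None:
    """Raise or lower the read-ahead window; requires root and is not
    persistent across reboots."""
    Path(f"/sys/block/{device}/queue/read_ahead_kb").write_text(str(kb))

if __name__ == "__main__":
    dev = "sda"  # assumption: adjust to your device
    print(f"{dev}: read_ahead_kb = {read_ahead_kb(dev)}")
    # set_read_ahead_kb(dev, 4096)  # e.g. 4 MiB for streaming workloads
```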
ZFS and Btrfs perform copy-on-write for all modifications. For sequential overwrites of existing data, this means reading existing blocks, modifying them, and writing to new locations—potentially doubling actual I/O. For new file creation, this effect is minimal. For database-like workloads with in-place updates, it's significant.
Practical Sequential Performance (Representative Benchmarks):
The following represents typical relative performance on enterprise SSDs. Actual numbers vary with hardware, configuration, and kernel version:
| File System | Sequential Read (% of raw) | Sequential Write (% of raw) |
|---|---|---|
| ext4 | 95-99% | 90-95% |
| XFS | 95-99% | 90-95% |
| NTFS | 90-95% | 85-92% |
| ZFS (no compression) | 85-92% | 75-85% |
| ZFS (LZ4) | 90-98%* | 85-95%* |
| Btrfs | 88-95% | 75-85% |
*ZFS with LZ4 compression often exceeds the uncompressed configuration because less physical data is transferred to and from the device.
Key insight: All modern file systems achieve near-raw performance for sequential I/O. Differences become significant only at extreme throughput levels or when COW overhead compounds.
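If you want numbers for your own hardware, purpose-built tools such as fio are the right choice; still, a rough sequential-write probe is easy to sketch. The snippet below is an illustrative Python approximation, not a rigorous benchmark: the target path is an assumption, and a single fsync at the end means some caching effects remain.

```python
import os
import time

def sequential_write_mb_s(path: str, total_mb: int = 1024, chunk_mb: int = 1) -> float:
    """Rough sequential-write throughput: write total_mb in chunk_mb chunks,
    then fsync so the number reflects the device, not only the page cache."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)   # incompressible payload
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.perf_counter()
    try:
        for _ in range(total_mb // chunk_mb):
            os.write(fd, chunk)
        os.fsync(fd)
    finally:
        os.close(fd)
    return total_mb / (time.perf_counter() - start)

if __name__ == "__main__":
    target = "/mnt/test/seqwrite.bin"   # assumption: a file on the file system under test
    print(f"sequential write: {sequential_write_mb_s(target):.0f} MB/s")
    os.unlink(target)
```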
Random I/O—accessing data at arbitrary locations—exposes fundamental differences in file system architecture. Where sequential I/O approaches raw device performance, random I/O highlights overhead from:
- Metadata lookups needed to map each file offset to a disk block
- Allocation decisions made on every small write
- Journaling or copy-on-write machinery invoked per operation
- Lock contention on shared structures under concurrent access
| File System | Random Read | Random Write | IOPS Potential | Limiting Factors |
|---|---|---|---|---|
| FAT32 | Poor | Poor | Low | FAT table lookups; fragmentation accumulation |
| NTFS | Good | Good | Moderate-High | MFT locality maintained; efficient B-tree lookups |
| ext4 | Very Good | Very Good | High | HTree directories; extent-based allocation reduces lookups |
| XFS | Excellent | Excellent | Very High | Parallel allocation groups; designed for high IOPS |
| ZFS | Good | Moderate | Moderate | Checksum verification; transaction group commits |
| Btrfs | Good | Moderate | Moderate | COW overhead; B-tree traversal for every operation |
Why XFS Excels at Random I/O:
XFS was designed at SGI for high-end workstations and servers requiring maximum IOPS. Its architectural advantages:
Allocation Groups: The filesystem is divided into independent allocation groups, each with its own metadata. This enables:
- Concurrent allocations in different AGs without shared locks
- Independent free-space and inode management per AG
- I/O spread across the device (or RAID members) rather than funneled through one region
B+ Tree Everywhere: Inodes, free space, and directory entries all use B+ trees optimized for block I/O. Lookups are O(log n) with small constants.
Delayed Logging: XFS's delayed logging mechanism batches metadata changes, reducing the frequency of journal commits and improving write IOPS.
For database workloads, XFS consistently ranks among the highest-performing options on both spinning disks and SSDs.
On HDDs, random I/O is dominated by seek time (~10ms) regardless of file system. The difference between 100 IOPS and 120 IOPS rarely matters. On SSDs with 100,000+ IOPS potential, file system overhead becomes the primary differentiator. File system choice matters far more on flash storage.
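A complementary probe for random I/O: the sketch below issues 4 KiB preads at random aligned offsets and reports IOPS. It is only indicative; the test file path is an assumption, and unless the file is much larger than RAM (or caches are dropped first) you will largely measure the page cache rather than the device.

```python
import os
import random
import time

def random_read_iops(path: str, duration_s: float = 5.0, block: int = 4096) -> float:
    """Rough random-read IOPS: 4 KiB pread() calls at random aligned offsets."""
    size = os.path.getsize(path)
    max_block = size // block
    fd = os.open(path, os.O_RDONLY)
    ops = 0
    start = time.perf_counter()
    deadline = start + duration_s
    try:
        while time.perf_counter() < deadline:
            offset = random.randrange(max_block) * block
            os.pread(fd, block, offset)
            ops += 1
    finally:
        os.close(fd)
    return ops / (time.perf_counter() - start)

if __name__ == "__main__":
    # assumption: a large test file already exists on the file system under test
    print(f"{random_read_iops('/mnt/test/seqwrite.bin'):.0f} IOPS")
```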
The ZFS Random Write Challenge:
ZFS's random write performance deserves special attention because its architecture creates inherent overhead:
Transaction Groups (TXGs): ZFS batches writes into transaction groups that commit atomically every few seconds (tunable). This provides:
- An always-consistent on-disk state without a traditional journal
- The chance to aggregate many small random writes into larger, more sequential pool writes
- A natural point to apply compression and checksumming in bulk
The catch is that synchronous writes cannot wait seconds for the next TXG, which is where the intent log comes in.
SLOG (Separate Intent Log): For workloads requiring low-latency synchronous writes (databases), ZFS supports a dedicated log device (SLOG). This absorbs sync writes while the main pool handles async operations.
Record Size Tuning: ZFS's default 128KB record size is optimal for large files but causes write amplification for small random writes (modifying 4KB triggers 128KB write). Tuning recordsize per dataset to match workload (e.g., 16KB for databases) significantly improves random write performance.
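The amplification is simple arithmetic, shown here as a small Python helper; it ignores compression, metadata, and RAID-Z parity, so treat it as a lower bound on the extra work rather than a precise model.

```python
def write_amplification(io_kb: float, recordsize_kb: float) -> float:
    """A sub-record overwrite rewrites at least one full record."""
    return max(recordsize_kb, io_kb) / io_kb

# A 4 KiB database page update under different recordsize settings:
for rs in (128, 16, 8):
    print(f"recordsize={rs}K -> ~{write_amplification(4, rs):.0f}x write amplification")
# 128K -> ~32x, 16K -> ~4x, 8K -> ~2x
```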
With proper tuning (SLOG, appropriate recordsize, ARC sizing), ZFS random I/O performance approaches traditional file systems. Without tuning, it can lag significantly.
Metadata operations—file create, delete, rename, stat, directory listing—are often overlooked in performance discussions but dominate many real-world workloads: source-code checkouts and builds, mail spools, package installation, web caches, and anything else that touches thousands of small files.
Metadata performance can vary by orders of magnitude between file systems.
| Operation | FAT32 | NTFS | ext4 | XFS | ZFS | Btrfs |
|---|---|---|---|---|---|---|
| File Create | Poor | Very Good | Excellent | Excellent | Good | Good |
| File Delete | Poor | Very Good | Very Good | Excellent | Good | Good |
| Directory List | Poor* | Excellent | Excellent | Excellent | Excellent | Very Good |
| File Stat | Good | Excellent | Excellent | Excellent | Good | Good |
| Rename | Good | Excellent | Excellent | Excellent | Excellent | Excellent |
| Large Directory | Terrible | Excellent | Excellent | Excellent | Excellent | Very Good |
*FAT32 directories are linear lists; performance degrades linearly with entry count
The Directory Entry Problem:
Directory listing performance depends critically on directory structure implementation:
Linear Lists (FAT, early Unix): Listing requires reading entire directory. Finding a file requires scanning until match. O(n) for all operations. Directories with thousands of entries become unusable.
Hash-Based (ext4 HTree): ext4's HTree provides hash-based lookup with O(1) average case. A directory with 100,000 entries performs nearly identically to one with 100 entries for lookups. Listing still requires reading all entries but benefits from hash ordering for locality.
B-Tree (NTFS, XFS, Btrfs): B-tree directories provide O(log n) operations with excellent cache behavior. NTFS and XFS handle millions of entries per directory efficiently.
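The scaling difference is easy to probe empirically. The sketch below (Python, illustrative only) creates directories of different sizes and times os.stat() on random entries; point base_dir at the file system under test, and note that the kernel's dentry cache will hide much of the on-disk difference unless you drop caches between the create and lookup phases.

```python
import os
import random
import tempfile
import time

def lookup_latency_us(entry_count: int, base_dir: str = "/tmp",
                      lookups: int = 10_000) -> float:
    """Create entry_count empty files in one directory, then time os.stat()
    on randomly chosen names; returns microseconds per lookup."""
    with tempfile.TemporaryDirectory(dir=base_dir) as d:
        for i in range(entry_count):
            open(os.path.join(d, f"f{i:08d}"), "w").close()
        names = [os.path.join(d, f"f{random.randrange(entry_count):08d}")
                 for _ in range(lookups)]
        start = time.perf_counter()
        for name in names:
            os.stat(name)
        return (time.perf_counter() - start) / lookups * 1e6

if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(f"{n:>7} entries: {lookup_latency_us(n):.1f} µs/stat")
```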
The Postmark benchmark simulates email server workloads with small file creation, appending, reading, and deletion. It's particularly good at revealing metadata performance differences. XFS and ext4 typically lead, with FAT32 orders of magnitude slower.
File Creation Cost Analysis:
Creating a file involves multiple metadata operations: allocating and initializing an inode, inserting the name into the parent directory, updating free-space accounting, and journaling (or otherwise making durable) each of those changes.
Each step has associated costs:
| File System | Inode Allocation | Directory Update | Journal Cost | Total Overhead |
|---|---|---|---|---|
| ext4 | Bitmap lookup | HTree insert | Log write | Low |
| XFS | AG BTree insert | BTree insert | Deferred | Low |
| ZFS | On-demand | ZAP insert | TXG sync | Moderate |
| Btrfs | BTree insert | BTree insert | Log tree | Moderate |
For workloads creating thousands of files per second, these differences compound significantly.
Real-world example: Extracting a Linux kernel tarball (70,000+ files) completes in ~20 seconds on XFS, ~25 seconds on ext4, ~60 seconds on ZFS (default settings), and 10+ minutes on FAT32.
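You can approximate this kind of metadata-heavy workload with a small file-creation probe. The Python sketch below is illustrative rather than rigorous: base_dir and the file count are assumptions, and because only the directory is fsynced at the end, much of the file data may still be in the page cache.

```python
import os
import tempfile
import time

def files_per_second(count: int = 20_000, size: int = 1024,
                     base_dir: str = "/tmp") -> float:
    """Create `count` small files of `size` bytes each and report files/sec.
    A crude stand-in for metadata-heavy workloads like tarball extraction."""
    payload = b"x" * size
    with tempfile.TemporaryDirectory(dir=base_dir) as d:
        start = time.perf_counter()
        for i in range(count):
            with open(os.path.join(d, f"file{i:06d}"), "wb") as f:
                f.write(payload)
        # One directory fsync so the creations actually reach disk;
        # per-file fsync would give a stricter (and much slower) number.
        dfd = os.open(d, os.O_RDONLY)
        os.fsync(dfd)
        os.close(dfd)
        return count / (time.perf_counter() - start)

if __name__ == "__main__":
    print(f"{files_per_second():.0f} files/sec")
```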
Modern servers have many CPU cores executing concurrent file operations. File system scalability—the ability to maintain performance as parallelism increases—varies substantially based on internal locking strategies and data structure designs.
| File System | Parallel Read | Parallel Write | Lock Granularity | Scaling Limit |
|---|---|---|---|---|
| FAT32 | Poor | Poor | Coarse (per-volume) | Single-threaded effectively |
| NTFS | Good | Good | Per-file | Good up to ~16 cores |
| ext4 | Very Good | Good | Per-inode + extent tree | Excellent up to ~32 cores |
| XFS | Excellent | Excellent | Per-AG, per-inode | Designed for 100+ cores |
| ZFS | Very Good | Good | Per-dataset, TXG | RAM-limited, not CPU-limited |
| Btrfs | Good | Moderate | Per-subvolume, global | Improving; some lock contention issues |
XFS Parallelism Design:
XFS was designed at SGI for systems with hundreds of processors. Its parallelism features include:
Allocation Groups (AGs): Each AG manages its own inodes and free space, so allocations in different AGs proceed in parallel without a shared lock.
Per-Inode Locking: Operations on different files take different locks; read and write paths on separate inodes rarely contend.
Logging Scalability: Delayed logging batches many metadata changes into far fewer, larger journal writes, keeping the log from becoming a serialization point.
Result: XFS maintains near-linear scaling up to very high core counts on workloads with sufficient parallelism.
No file system can parallelize operations on a single file beyond certain limits. Appending to a log file from 64 threads will serialize on that file's locks regardless of file system. Workload design matters—multiple files, sharding, or application-level coordination is required for true parallelism.
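A common application-level pattern is to shard the output: one file per writer, merged later. The Python sketch below illustrates the idea with threads (os.write releases the GIL during the system call, so the writes genuinely overlap); in a real service you might shard by worker process or partition key instead.

```python
import os
import tempfile
import threading

def append_worker(path: str, records: int, payload: bytes) -> None:
    # O_APPEND keeps each write atomic with respect to other writers.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for _ in range(records):
            os.write(fd, payload)
    finally:
        os.close(fd)

def sharded_append(n_threads: int = 8, records: int = 10_000) -> None:
    """One log file per thread instead of one shared file, so writers don't
    serialize on a single inode's locks; shards can be merged afterwards."""
    payload = b"log line\n"
    with tempfile.TemporaryDirectory() as d:
        threads = [
            threading.Thread(
                target=append_worker,
                args=(os.path.join(d, f"shard{i}.log"), records, payload))
            for i in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

sharded_append()
```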
ZFS ARC (Adaptive Replacement Cache):
ZFS uses a sophisticated caching system that affects both read performance and apparent concurrency:
ARC Design: The ARC balances a recently-used list against a frequently-used list and adapts the split to the workload, so both streaming and hot-set access patterns cache well; an optional L2ARC on SSD extends it.
Concurrency Implications: Reads that hit the ARC are served from RAM without touching the pool, so many concurrent readers scale well; misses fall through to disk and contend for pool bandwidth.
The RAM Equation: ZFS performance is highly RAM-dependent. With sufficient RAM for ARC, ZFS performs excellently. When RAM is constrained, performance drops significantly because:
- The ARC hit rate falls, turning cached reads into pool I/O
- Metadata (and dedup tables, if enabled) no longer fit in memory, adding extra reads to nearly every operation
- Eviction pressure itself consumes CPU and memory bandwidth
Rule of thumb: ZFS wants 1GB RAM per TB of storage for basic use, more for dedup or high-IOPS workloads.
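Turning that rule of thumb into arithmetic is straightforward. The dedup figure below (about 5 GB per TB of deduplicated data) is a commonly quoted estimate rather than a ZFS guarantee, so treat the whole helper as a starting point, not a sizing tool.

```python
def suggested_arc_ram_gb(pool_tb: float, dedup: bool = False,
                         dedup_gb_per_tb: float = 5.0) -> float:
    """Back-of-the-envelope ARC sizing from the 1 GB-per-TB rule of thumb.
    dedup_gb_per_tb is a commonly quoted estimate for dedup tables,
    not a guarantee; measure with your own data."""
    base = pool_tb * 1.0
    return base + (pool_tb * dedup_gb_per_tb if dedup else 0.0)

print(suggested_arc_ram_gb(48))              # ~48 GB for a 48 TB pool
print(suggested_arc_ram_gb(48, dedup=True))  # substantially more with dedup
```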
The underlying storage medium dramatically affects which file system performs best. Optimizations that help with spinning disks may be irrelevant or counterproductive for flash storage, and vice versa.
On HDDs, file system choice might mean a 10% performance difference because seeks dominate. On SSDs, file system overhead becomes the primary factor, and differences of 2-5x are common for metadata-heavy workloads. SSD adoption has made file system selection more impactful, not less.
TRIM/Discard Support:
SSDs need to know when blocks are freed so they can optimize wear leveling and garbage collection:
| File System | TRIM Support | Configuration |
|---|---|---|
| ext4 | Full | discard mount option or fstrim |
| XFS | Full | discard mount option or fstrim |
| NTFS | Full (Win7+) | Automatic on recognized SSDs |
| ZFS | Full | zpool set autotrim=on |
| Btrfs | Full | discard=async mount option |
| FAT32 | Limited | OS-dependent |
Continuous vs. Batched TRIM:
- Continuous (discard mount option): immediate TRIM on delete, small overhead per operation
- Batched (fstrim via cron or a systemd timer): periodic TRIM pass, no per-operation overhead
For most workloads, a weekly fstrim run provides the optimal balance of SSD health and performance.
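Most distributions ship a ready-made fstrim.timer unit for exactly this, so you rarely need to script it yourself; if you do, a batched pass is a one-line invocation, sketched here in Python (root required).

```python
import subprocess

def batched_trim(mountpoint: str) -> None:
    """One batched TRIM pass over a single mounted file system.
    `fstrim -v` reports how many bytes were discarded."""
    subprocess.run(["fstrim", "-v", mountpoint], check=True)

if __name__ == "__main__":
    batched_trim("/")                       # trim the root file system
    # or all mounted file systems that support discard:
    # subprocess.run(["fstrim", "-av"], check=True)
```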
NVMe-Specific Considerations:
NVMe storage introduces performance levels that stress file systems in new ways:
Multi-Queue Block Layer: Linux's blk-mq provides per-CPU queues that map onto NVMe's parallel hardware command queues. File systems must avoid serialization to exploit this: per-CPU or per-allocation-group structures, fine-grained locking, and journal designs that batch commits keep submissions from different cores from funneling through a single lock.
I/O Scheduler Selection:
For NVMe devices, the none scheduler (no software scheduling) often provides the best performance because:
- The drive's firmware already reorders and parallelizes commands across its internal channels
- Software queueing and merging add CPU overhead and latency without improving device utilization
- Seek-avoidance heuristics designed for spinning disks are irrelevant on flash
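Checking and switching the scheduler is a one-line sysfs operation; the Python sketch below assumes a device named nvme0n1, needs root to write, and does not persist across reboots (use a udev rule for that).

```python
from pathlib import Path

def current_scheduler(device: str) -> str:
    """Return the active I/O scheduler; the active one appears in brackets,
    e.g. 'mq-deadline kyber bfq [none]'."""
    text = Path(f"/sys/block/{device}/queue/scheduler").read_text().strip()
    return text.split("[")[1].split("]")[0] if "[" in text else text

def set_scheduler(device: str, scheduler: str = "none") -> None:
    """Switch schedulers at runtime (root required, not persistent)."""
    Path(f"/sys/block/{device}/queue/scheduler").write_text(scheduler)

print(current_scheduler("nvme0n1"))   # assumption: adjust device name
```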
NUMA Awareness: NVMe devices attach to specific NUMA nodes. Accessing storage from remote NUMA nodes adds latency. For maximum performance, pin I/O-heavy processes and interrupt handling to CPUs on the device's local node and allocate I/O buffers from local memory.
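As a sketch of the pinning half of that advice: once you know the device's NUMA node (for example from lspci -vv or the device's sysfs numa_node attribute), you can restrict a process to that node's CPUs. The node number below is an assumption.

```python
import os
from pathlib import Path

def cpus_of_numa_node(node: int) -> set[int]:
    """Parse /sys/devices/system/node/node<N>/cpulist (e.g. '0-15,32-47')."""
    cpus: set[int] = set()
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text().strip()
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_node(node: int) -> None:
    """Restrict this process to the CPUs local to the given NUMA node."""
    os.sched_setaffinity(0, cpus_of_numa_node(node))

pin_to_node(0)   # assumption: the NVMe device sits on node 0
```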
Let's synthesize these performance characteristics into a practical comparison framework. The following represents typical relative performance across common workload categories:
| Workload Type | Best | Very Good | Good | Poor |
|---|---|---|---|---|
| Large File Sequential | ext4, XFS | NTFS, ZFS (compressed) | Btrfs | FAT32 |
| Database (OLTP) | XFS | ext4, NTFS | ZFS (tuned) | Btrfs, FAT32 |
| Many Small Files | XFS, ext4 | NTFS | Btrfs, ZFS | FAT32 |
| Build Systems | ext4, XFS | Btrfs, NTFS | ZFS | FAT32 |
| Virtualization | XFS, ext4 | ZFS (zvols) | Btrfs, NTFS | FAT32 |
| Media Streaming | XFS, ext4, ZFS | NTFS, Btrfs | — | FAT32 |
| Backup Repository | ZFS, XFS | ext4, Btrfs | NTFS | FAT32 |
These rankings represent general patterns from common benchmarks. Your specific workload, hardware, configuration, and kernel version can shift relative performance significantly. Always test with representative workloads before production deployment.
We've examined file system performance across multiple dimensions. Let's consolidate the key insights:
- Sequential I/O runs close to raw device speed on every modern file system; differences only matter at extreme throughput or when COW overhead compounds
- Random I/O and metadata operations expose architectural differences: XFS and ext4 generally lead, while ZFS and Btrfs need tuning to keep up
- Concurrency scaling is determined by lock granularity; XFS's allocation groups were built for high core counts
- Storage media changes what matters: on HDDs seeks dominate, while on SSDs and NVMe the file system's own overhead becomes the differentiator
- Tuning (recordsize, SLOG, read-ahead, TRIM, scheduler choice) can close much of the gap between default and optimal configurations
What's Next:
Performance matters, but so do limits. The next page examines Maximum Sizes—the ceiling constraints on file sizes, volume capacities, filename lengths, and other scalability limits that define what each file system can theoretically accommodate.
You now understand how file system architecture translates to performance characteristics across different workload patterns and storage media. This knowledge enables informed file system selection based on actual performance requirements rather than feature lists alone.