In the previous page, we explored HDFS—a groundbreaking distributed file system that scaled storage to petabytes. But HDFS has an architectural constraint that limits its scalability: the NameNode. All metadata operations funnel through a single master node, creating a performance ceiling and potential single point of failure.
What if we could eliminate the centralized metadata server entirely?
This question drove the creation of Ceph, an open-source distributed storage platform that takes a radically different approach. Instead of a central node knowing where every piece of data lives, Ceph uses a mathematical algorithm (CRUSH) that computes data locations on demand. Any client, given the cluster map, can independently determine where data should be stored—no central lookup required.
Ceph was born from Sage Weil's PhD research at UC Santa Cruz, designed from the ground up to solve the metadata scalability problem while providing unified storage: the ability to expose the same underlying storage cluster as objects, block devices, or a file system simultaneously.
By the end of this page, you will understand Ceph's distributed architecture, the CRUSH algorithm that eliminates metadata bottlenecks, how RADOS provides reliable object storage, and how RBD (block), CephFS (file), and RGW (object) interfaces build on RADOS to provide unified storage capabilities.
Ceph was designed with several ambitious goals that distinguish it from earlier distributed storage systems. Understanding these goals reveals why Ceph's architecture differs so dramatically from HDFS and traditional SAN/NAS systems.
The key insight: Traditional storage systems rely on metadata servers to maintain a mapping from files/objects to physical locations. This mapping must be consulted on every access, making the metadata server a bottleneck and single point of failure.
Ceph eliminates this by using a deterministic placement algorithm. Given an object name and the cluster topology, any node can compute exactly where that object should be stored—without asking anyone. This is like having a formula that converts any address into GPS coordinates without needing a directory lookup.
Ceph is named after 'Cephalopod'—the class of marine animals including octopus and squid. Like these creatures with distributed nervous systems and no centralized brain, Ceph distributes intelligence throughout the cluster. Every node can make autonomous decisions using the same algorithm.
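Ceph's architecture is built in layers. At the foundation is RADOS (Reliable Autonomic Distributed Object Store), a distributed object storage system. On top of RADOS, Ceph provides three interface layers: RBD (block devices), CephFS (file system), and RGW (S3/Swift-compatible object storage). The daemons that make up a cluster are: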
| Component | Role | Count in Cluster |
|---|---|---|
| OSD (Object Storage Daemon) | Stores objects, handles replication, recovery, scrubbing | One per disk (hundreds to thousands) |
| Monitor (MON) | Maintains cluster map, provides consensus, authentication | Odd number for quorum (3, 5, 7) |
| Manager (MGR) | Metrics, dashboard, module host (alerts, orchestration) | 2+ for HA |
| MDS (Metadata Server) | Manages CephFS namespace, file metadata | 1+ only if using CephFS |
| RADOS Gateway (RGW) | S3/Swift compatible REST API | Deployed as needed behind LB |
Key architectural insight: OSDs are the workhorses of Ceph. Each OSD typically manages a single physical disk (HDD or SSD) and is responsible for storing objects, replicating data to peer OSDs, detecting failures, and recovering from them. The intelligence is distributed: OSDs coordinate directly with each other for replication rather than going through a central controller.
The Monitors maintain the authoritative cluster map (CRUSH map + OSD status) but don't handle any data operations. Clients and OSDs fetch the map from Monitors and then operate independently.
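The odd monitor counts in the table follow from majority voting. As a minimal sketch (the function names are illustrative, not part of any Ceph API), the arithmetic below shows why a fourth monitor adds no extra fault tolerance over three:

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Monitors that can fail while a majority is still reachable."""
    return monitors - quorum_size(monitors)


for n in (3, 4, 5, 7):
    print(f"{n} monitors: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Three and four monitors both tolerate a single failure, so the even count only adds cost and one more node that can fail.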
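CRUSH (Controlled Replication Under Scalable Hashing) is Ceph's placement algorithm and the innovation that removes the need for a centralized location lookup. CRUSH computes where objects should be stored based on the object name, the cluster map (topology and device weights), and the placement rule of the target pool.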
How CRUSH Works (Conceptual):
    # CRUSH Placement Example (Simplified)

    Input:
      - object_name: "user/photos/vacation.jpg"
      - replication_factor: 3
      - failure_domain: "rack"

    Step 1: Hash object name to PG
      PG_ID = hash("user/photos/vacation.jpg") % total_pg_count
      PG_ID = 2847

    Step 2: Run CRUSH algorithm with PG_ID
      For each replica (1 to 3):
        - Select a root in topology tree
        - Descend tree, selecting rack ≠ previous replicas
        - Within rack, select host
        - Within host, select OSD with capacity weight

    Step 3: Output
      Replica 1: OSD.42  (Rack A, Host 7)
      Replica 2: OSD.156 (Rack B, Host 23)
      Replica 3: OSD.89  (Rack C, Host 12)

    # ANY client can compute this independently!

Why CRUSH is revolutionary:
| Traditional Approach | CRUSH Approach |
|---|---|
| Central metadata server maintains location mapping | Algorithm computes locations—no central lookup |
| Metadata server is bottleneck and SPOF | All clients compute independently |
| Adding storage requires metadata updates | Algorithm adapts automatically to topology changes |
| Failure domain placement requires manual configuration | Topology-aware rules ensure rack/datacenter spread |
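
The idea of computing placement instead of looking it up can be sketched in a few lines of Python. This is not the real CRUSH implementation: it is a simplified stand-in that uses weighted hash draws to pick distinct racks and then one OSD per rack. The topology, `PG_COUNT`, and all function names are hypothetical.

```python
import hashlib
import math

# Hypothetical topology: rack -> {osd_id: capacity weight in TB}.
TOPOLOGY = {
    "rack-a": {0: 1.0, 1: 4.0},
    "rack-b": {2: 1.0, 3: 4.0},
    "rack-c": {4: 4.0, 5: 4.0},
    "rack-d": {6: 1.0, 7: 1.0},
}
PG_COUNT = 128  # assumed number of placement groups in the pool


def _unit_hash(*parts) -> float:
    """Map the given parts to a stable float in the interval (0, 1]."""
    digest = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def _draw(weight: float, *parts) -> float:
    """Weighted pseudo-random draw: heavier items win proportionally more often."""
    return math.log(_unit_hash(*parts)) / weight


def object_to_pg(name: str) -> int:
    """Step 1: hash the object name to a placement group."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % PG_COUNT


def place(pg: int, replicas: int = 3) -> list[int]:
    """Step 2: choose `replicas` distinct racks, then one OSD inside each."""
    rack_weight = {rack: sum(osds.values()) for rack, osds in TOPOLOGY.items()}
    racks = sorted(TOPOLOGY, key=lambda r: _draw(rack_weight[r], pg, r), reverse=True)
    chosen = []
    for rack in racks[:replicas]:
        osds = TOPOLOGY[rack]
        chosen.append(max(osds, key=lambda o: _draw(osds[o], pg, rack, o)))
    return chosen


pg = object_to_pg("user/photos/vacation.jpg")
print(f"PG {pg} -> OSDs {place(pg)}")
```

Any process holding the same `TOPOLOGY` computes the same answer, which is the property that lets Ceph clients skip a central lookup. Real CRUSH additionally handles arbitrary hierarchy depth, rule steps, and devices that are marked out.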
CRUSH Map Structure:
The CRUSH map encodes your cluster topology as a hierarchy:
    root default
    ├── datacenter dc1
    │   ├── room room1
    │   │   ├── rack rack1
    │   │   │   ├── host node001
    │   │   │   │   ├── osd.0 (1TB SSD)
    │   │   │   │   └── osd.1 (4TB HDD)
    │   │   │   └── host node002
    │   │   │       ├── osd.2 (1TB SSD)
    │   │   │       └── osd.3 (4TB HDD)
    │   │   └── rack rack2
    │   │       └── ... (more hosts)
    │   └── room room2
    │       └── ... (more racks)
    └── datacenter dc2
        └── ... (second site for DR)

CRUSH rules define placement policy: 'replicate 3 times, each replica on a different rack, prefer SSDs for the first replica.' Different pools can have different rules: a fast SSD pool for hot data, an HDD pool with erasure coding for archives, a cross-datacenter pool for critical data.
RADOS is the core of Ceph—a flat object store where everything is stored as objects with unique names. Files, block device data, and S3 objects all become RADOS objects ultimately.
Object Structure in RADOS:
Pools and Placement Groups:
Objects are organized into pools, which are logical partitions with independent configuration:
Placement Groups (PGs):
PGs are the unit of replication and recovery. Instead of tracking billions of objects individually, Ceph tracks thousands of PGs. Each PG:
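Typical PG count: 100-300 PGs per OSD, counting every replica of a PG that lands on that OSD. A 100-OSD cluster therefore hosts roughly 10,000-30,000 PG replicas, which with 3x replication corresponds to about 3,300-10,000 PGs in total.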
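Because the per-OSD figure counts every replica of a PG, a pool's PG total can be estimated from the OSD count and the replication factor. The sketch below applies the common rule of thumb; rounding up to the next power of two and the default target of 100 are assumptions of this example, not values taken from the text.

```python
def suggested_pg_count(osds: int, replicas: int, target_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count for a single pool, rounded up to a power of two."""
    raw = osds * target_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


# 100 OSDs * 100 PG replicas each / 3 replicas per PG is about 3,333 PGs,
# which rounds up to 4,096.
print(suggested_pg_count(osds=100, replicas=3))
```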
Replication (3x) uses 200% overhead but handles failures gracefully with minimal compute. Erasure coding (e.g., 4+2 = 4 data chunks + 2 parity) uses only 50% overhead but requires CPU for encoding/decoding. Use replication for hot data needing low latency; use EC for cold/archival data where capacity matters more.
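The overhead figures above are simple ratios. As a quick worked example (function names are illustrative), the following computes the overhead of each scheme and the raw capacity needed to store 100 TB of user data:

```python
def replication_overhead(copies: int) -> float:
    """Extra raw capacity per unit of user data, as a fraction (2.0 == 200%)."""
    return float(copies - 1)


def erasure_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Parity capacity relative to data capacity (0.5 == 50%)."""
    return parity_chunks / data_chunks


def raw_capacity_needed(usable_tb: float, overhead: float) -> float:
    """Raw capacity required to hold `usable_tb` of user data."""
    return usable_tb * (1 + overhead)


print(raw_capacity_needed(100, replication_overhead(3)))  # 300.0 TB raw
print(raw_capacity_needed(100, erasure_overhead(4, 2)))   # 150.0 TB raw
```

Both layouts survive the loss of two OSDs holding a given object, but the 4+2 erasure-coded pool needs half the raw capacity.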
Understanding how OSDs handle operations reveals Ceph's distributed consistency model and failure handling. Let's trace a write operation:
Strong Consistency:
This write protocol ensures strong consistency: all replicas have the data before the client receives success. There's no window where a read could see stale data. Compare this to eventual consistency systems where a successful write doesn't guarantee all replicas are updated.
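The acknowledgement rule behind this guarantee can be modeled in a few lines. This is a toy in-memory sketch, not Ceph code: the `OSD` class and `write` function are hypothetical, and a real cluster would re-peer the placement group and retry rather than simply report failure.

```python
class OSD:
    """Toy stand-in for an object storage daemon."""

    def __init__(self, osd_id: int):
        self.osd_id = osd_id
        self.store: dict[str, bytes] = {}
        self.up = True

    def apply(self, name: str, data: bytes) -> bool:
        """Persist the object locally; return False if this OSD is down."""
        if not self.up:
            return False
        self.store[name] = data
        return True


def write(acting_set: list[OSD], name: str, data: bytes) -> bool:
    """Acknowledge the client only after every OSD in the set has applied the write."""
    primary, *replicas = acting_set
    if not primary.apply(name, data):
        return False
    # The primary forwards the write and waits for all replica acknowledgements.
    acks = [replica.apply(name, data) for replica in replicas]
    return all(acks)


osds = [OSD(i) for i in range(3)]
print(write(osds, "obj-1", b"hello"))  # True: all three replicas hold the data
osds[2].up = False
print(write(osds, "obj-2", b"world"))  # False: one replica did not acknowledge
```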
Read Path:
Failure Detection and Recovery:
Ceph uses heartbeats between OSDs and monitors to detect failures:
| Detection Method | Purpose | Timeout |
|---|---|---|
| OSD ↔ OSD heartbeats | Detect peer failures during operations | Peers pinged every 6 seconds by default |
| OSD → Monitor reports | Report unresponsive peers and OSD liveness to the cluster | Sent after a peer exceeds the grace period |
| Heartbeat grace period | Confirm OSD is truly down | 20 seconds default |
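
Grace-period detection can be sketched as a table of last-seen timestamps. The `HeartbeatTracker` class below is a hypothetical illustration that uses the 20-second grace value from the table; timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat seen from each peer OSD."""

    def __init__(self, grace: float = 20.0):
        self.grace = grace
        self.last_seen: dict[int, float] = {}

    def record(self, osd_id: int, now: float) -> None:
        """Note that a heartbeat from `osd_id` arrived at time `now` (seconds)."""
        self.last_seen[osd_id] = now

    def suspected_down(self, now: float) -> list[int]:
        """Return OSDs whose last heartbeat is older than the grace period."""
        return sorted(
            osd for osd, seen in self.last_seen.items() if now - seen > self.grace
        )


tracker = HeartbeatTracker()
tracker.record(1, now=0.0)
tracker.record(2, now=0.0)
tracker.record(1, now=18.0)             # OSD 1 keeps responding
print(tracker.suspected_down(now=25.0))  # [2]
```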
Recovery Workflow:
This entire process is automatic—no operator intervention needed.
Recovery competes with client I/O for disk and network bandwidth. Ceph provides tuning parameters (such as osd_recovery_max_active and osd_recovery_sleep) to throttle recovery and minimize the performance impact on production workloads. In 24/7 environments, you might slow recovery to maintain SLAs.
RBD presents RADOS as block devices—virtual disks that can be attached to virtual machines, containers, or bare metal servers. This enables Ceph to replace traditional SAN storage for applications requiring block semantics.
How RBD Works:
An RBD image (virtual disk) is striped across many RADOS objects:
    RBD Image: "vm-disk-001" (100 GB)
    Object Size: 4 MB (default)

    Object 0:     rbd_data.12345.0000000000000000 (bytes 0 - 4MB)
    Object 1:     rbd_data.12345.0000000000000001 (bytes 4MB - 8MB)
    Object 2:     rbd_data.12345.0000000000000002 (bytes 8MB - 12MB)
    ...
    Object 25599: rbd_data.12345.00000000000063ff (last 4MB)

    # Each object is independently placed by CRUSH
    # Parallel I/O across hundreds of OSDs possible

RBD Features:
| Feature | Description | Use Case |
|---|---|---|
| Thin Provisioning | Space allocated only as data is written | Overcommit storage, pay for what you use |
| Snapshots | Copy-on-write point-in-time captures | Backup, testing, rollback |
| Cloning | Writable copies from snapshots (instant) | VM templating, rapid deployment |
| Layering | Images can have parent images (COW chains) | Golden image hierarchies |
| Mirroring | Async replication to remote cluster | Disaster recovery across sites |
| Exclusive Lock | Only one client can write at a time | Prevents corruption from concurrent access |
| Live Migration | Move RBD between pools or clusters | Storage tiering, cloud migration |
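
The striping layout described earlier reduces to integer division. As a minimal sketch, assuming simple striping with the default 4 MB object size and the illustrative `rbd_data.12345` prefix from the example, a byte offset in the image maps to an object name and an offset inside that object:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB default object size


def locate(offset: int, prefix: str = "rbd_data.12345",
           object_size: int = OBJECT_SIZE) -> tuple[str, int]:
    """Map a byte offset in the image to (object name, offset within object)."""
    index, within = divmod(offset, object_size)
    return f"{prefix}.{index:016x}", within


image_size = 100 * 1024**3  # 100 GB image
print(locate(0))               # first byte of the image
print(locate(image_size - 1))  # last byte lands in object 25599 (hex 63ff)
```

The object index is written as 16 hexadecimal digits, which is why the last of the 25,600 objects ends in `63ff` rather than `6400`.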
RBD Integration:
RBD is widely supported:
Traditional SANs require expensive Fibre Channel infrastructure and dedicated storage arrays. RBD runs over commodity Ethernet (10/25/100GbE) and commodity servers. For many workloads, RBD on NVMe-backed OSDs can deliver performance comparable to a SAN at a fraction of the cost.
CephFS provides a POSIX-compliant distributed file system on top of RADOS. Unlike HDFS (which has limited POSIX support), CephFS supports full file system semantics including permissions, hard/soft links, and atomic rename.
CephFS Architecture:
CephFS introduces one additional component: the Metadata Server (MDS). The MDS manages the directory tree, file names, and attributes—but crucially, not the data. Data is still stored directly in RADOS.
Key Insight: Separating Metadata from Data
By separating metadata (handled by MDS) from data (stored directly in RADOS), CephFS achieves:
Dynamic Subtree Partitioning:
With multiple active MDS nodes, CephFS dynamically distributes the namespace tree:
    Namespace Tree Distribution (2 Active MDS):

    MDS.0 handles:        MDS.1 handles:
    ├── /home             ├── /data
    │   ├── alice/        │   ├── logs/
    │   ├── bob/          │   ├── metrics/
    │   └── charlie/      │   └── uploads/
    └── /etc              └── /tmp

    # Hot directories automatically migrate to less-loaded MDS
    # If MDS.0 is overloaded by /home/alice activity,
    # /home/alice subtree can migrate to MDS.1

CephFS also exposes snapshots through the .snap/ virtual directory.

CephFS provides richer semantics (full POSIX) but with higher metadata overhead. HDFS's simpler model (write-once, no random writes) enables optimizations like single-block-per-file metadata. Choose based on workload: CephFS for general-purpose shared storage, HDFS for Hadoop/Spark batch processing.
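The subtree assignment in the namespace diagram above can be modeled as a longest-prefix match from path to MDS rank. This is a toy model with hypothetical names; the real MDS balancer decides migrations from load metrics that are not represented here.

```python
# Hypothetical subtree-to-MDS assignment matching the two-MDS example.
SUBTREE_OWNER = {"/": 0, "/home": 0, "/etc": 0, "/data": 1, "/tmp": 1}


def mds_for(path: str, owners: dict[str, int] = SUBTREE_OWNER) -> int:
    """Return the MDS rank that owns the deepest subtree containing `path`."""
    best = "/"
    for subtree in owners:
        if subtree == "/":
            continue
        if path == subtree or path.startswith(subtree + "/"):
            if len(subtree) > len(best):
                best = subtree
    return owners[best]


print(mds_for("/home/alice/notes.txt"))  # 0: covered by the /home subtree

# Migrating a hot subtree only adds a more specific entry to the map.
after_migration = {**SUBTREE_OWNER, "/home/alice": 1}
print(mds_for("/home/alice/notes.txt", after_migration))  # 1
print(mds_for("/home/bob/todo.txt", after_migration))     # 0
```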
RADOS Gateway (RGW) exposes Ceph as an S3-compatible and Swift-compatible object storage service. This enables applications written for AWS S3 to work with on-premises Ceph storage without modification.
RGW Architecture:
RGW runs as a RESTful web service (typically behind a load balancer) that translates S3/Swift API calls into RADOS operations:
    S3 Request:
    PUT /bucket/object HTTP/1.1
    Content-Length: 1048576
    Authorization: AWS4-HMAC-SHA256 ...
    Body: <1MB of data>

    RGW Translation:
    1. Authenticate request against RGW user database
    2. Check bucket exists in index pool
    3. Create RADOS object: <bucket>/<object>
    4. Write 1MB data to object
    5. Update bucket index (object metadata)
    6. Return HTTP 200 OK

    Underlying RADOS operations:
    - rados put "default.rgw.buckets.data/bucket.12345/object"
    - rados omap set "default.rgw.buckets.index/bucket.12345"

RGW Features Beyond Basic S3:
| Feature | Description |
|---|---|
| Multisite Replication | Active-active or active-passive sync across datacenters |
| Bucket Versioning | Keep multiple versions of objects |
| Object Lifecycle | Automatic transition to cold storage, expiration |
| Bucket Policies | IAM-style access control |
| Bucket Notifications | Trigger webhooks on object events |
| STS (Security Token Service) | Temporary credentials, role assumption |
| Server-Side Encryption | SSE-C, SSE-KMS encryption options |
| Multipart Uploads | Large object uploads with resumability |
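
The PUT translation steps listed earlier can be traced with a toy in-memory gateway. Everything here is hypothetical: `ToyGateway` is not an RGW API, and comparing a stored secret stands in for real S3 signature verification.

```python
class ToyGateway:
    """In-memory model of the PUT path: authenticate, check bucket, write, index."""

    def __init__(self):
        self.users: dict[str, str] = {}                    # access key -> secret
        self.bucket_index: dict[str, dict[str, int]] = {}  # bucket -> {key: size}
        self.data_pool: dict[str, bytes] = {}              # "<bucket>/<key>" -> data

    def put_object(self, access_key: str, secret: str,
                   bucket: str, key: str, body: bytes) -> int:
        """Return an HTTP-style status code for the request."""
        if self.users.get(access_key) != secret:      # step 1: authenticate
            return 403
        if bucket not in self.bucket_index:           # step 2: bucket must exist
            return 404
        self.data_pool[f"{bucket}/{key}"] = body      # steps 3-4: write the object
        self.bucket_index[bucket][key] = len(body)    # step 5: update bucket index
        return 200                                    # step 6: success


gw = ToyGateway()
gw.users["AKIDEXAMPLE"] = "s3cr3t"
gw.bucket_index["photos"] = {}
print(gw.put_object("AKIDEXAMPLE", "s3cr3t", "photos", "cat.jpg", b"\x00" * 1024))
```

The data write and the index update touch different pools, which mirrors the two underlying RADOS operations in the request trace.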
Multisite Deployment:
RGW supports complex multi-datacenter topologies:
This makes RGW suitable for global object storage with disaster recovery and data locality.
Use RGW when you need S3/Swift API compatibility, multi-tenancy with access controls, or HTTP-based access for web applications. Use native RADOS (librados) when you need maximum performance and control—RGW adds HTTP overhead and authentication processing.
Deploying and operating Ceph at scale requires understanding key operational challenges and best practices.
Common Operational Tasks:
| Task | Frequency | Impact |
|---|---|---|
| Adding OSDs | As needed | Low impact, automatic rebalancing |
| Removing OSDs | Rare | Medium impact, data migrates out first |
| Upgrading Ceph | Quarterly | Rolling upgrades possible, plan carefully |
| Replacing failed disks | As failures occur | Automatic recovery, minimal intervention |
| Pool configuration changes | Rare | Some changes require data movement |
| CRUSH map updates | Rare | Can trigger data migration |
Monitoring Essentials:
`ceph health` should return `HEALTH_OK`.

When OSDs approach full, Ceph first warns and then blocks writes to protect data integrity (defaults: warning at 85% nearfull, writes blocked at 95% full). A truly full cluster is an operational nightmare: recovery becomes impossible without adding capacity or deleting data. Set aggressive alerts and never operate above 80%.
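The thresholds above translate directly into an alerting check. The sketch below is a hypothetical helper, not a Ceph tool; it uses the 85% and 95% defaults and the 80% operating ceiling recommended in the text.

```python
def fullness_state(used_bytes: int, total_bytes: int,
                   alert: float = 0.80, nearfull: float = 0.85,
                   full: float = 0.95) -> str:
    """Classify an OSD's utilization against the alert, nearfull, and full ratios."""
    ratio = used_bytes / total_bytes
    if ratio >= full:
        return "full"       # writes are blocked at this point
    if ratio >= nearfull:
        return "nearfull"   # cluster reports a health warning
    if ratio >= alert:
        return "alert"      # operator-defined early warning
    return "ok"


TB = 10**12
for used in (3 * TB, 3.3 * TB, 3.5 * TB, 3.9 * TB):
    print(f"{used / (4 * TB):.0%} used: {fullness_state(used, 4 * TB)}")
```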
We've explored Ceph's unique approach to distributed storage—eliminating metadata bottlenecks through algorithmic placement while providing unified object, block, and file interfaces. Let's consolidate the key concepts:
What's Next:
While Ceph provides a comprehensive unified storage solution with sophisticated distribution algorithms, there are simpler alternatives for specific use cases. In the next page, we'll explore GlusterFS, a scale-out file system that takes a different approach—emphasizing simplicity and native file system semantics over algorithmic complexity.
You now understand Ceph's architecture at a level sufficient for evaluating it as a storage solution and making informed design decisions. You can explain CRUSH placement, RADOS operations, and how the unified interfaces work together.