Distributed File Systems - Learning Module

Ceph: Unified Storage

The Quest to Eliminate Metadata Bottlenecks

In the previous page, we explored HDFS—a groundbreaking distributed file system that scaled storage to petabytes. But HDFS has an architectural constraint that limits its scalability: the NameNode. All metadata operations funnel through a single master node, creating a performance ceiling and potential single point of failure.

What if we could eliminate the centralized metadata server entirely?

This question drove the creation of Ceph, an open-source distributed storage platform that takes a radically different approach. Instead of a central node knowing where every piece of data lives, Ceph uses a mathematical algorithm (CRUSH) that computes data locations on demand. Any client, given the cluster map, can independently determine where data should be stored—no central lookup required.

Ceph was born from Sage Weil's PhD research at UC Santa Cruz, designed from the ground up to solve the metadata scalability problem while providing unified storage: the ability to expose the same underlying storage cluster as objects, block devices, or a file system simultaneously.

What You Will Learn

By the end of this page, you will understand Ceph's distributed architecture, the CRUSH algorithm that eliminates metadata bottlenecks, how RADOS provides reliable object storage, and how RBD (block), CephFS (file), and RGW (object) interfaces build on RADOS to provide unified storage capabilities.

Ceph Philosophy and Design Goals

Ceph was designed with several ambitious goals that distinguish it from earlier distributed storage systems. Understanding these goals reveals why Ceph's architecture differs so dramatically from HDFS and traditional SAN/NAS systems.

Ceph Design Principles

•No single point of failure — Every component must be redundant. No single node failure should cause data unavailability or loss.
•Horizontal scalability — Adding nodes should linearly increase capacity and performance. No architectural ceilings.
•Self-managing — The cluster should detect failures, heal itself, and rebalance data automatically without human intervention.
•Software-defined — Run on commodity hardware without specialized controllers or appliances. All intelligence in software.
•Unified storage — Provide object, block, and file interfaces from a single storage pool. No data silos.
•Strong consistency — All clients see the same data at the same time. No eventual consistency surprises.

The key insight: Traditional storage systems rely on metadata servers to maintain a mapping from files/objects to physical locations. This mapping must be consulted on every access, making the metadata server a bottleneck and single point of failure.

Ceph eliminates this by using a deterministic placement algorithm. Given an object name and the cluster topology, any node can compute exactly where that object should be stored—without asking anyone. This is like having a formula that converts any address into GPS coordinates without needing a directory lookup.

Why 'Ceph'?

Ceph is named after 'Cephalopod'—the class of marine animals including octopus and squid. Like these creatures with distributed nervous systems and no centralized brain, Ceph distributes intelligence throughout the cluster. Every node can make autonomous decisions using the same algorithm.

Architecture Overview: RADOS and Beyond

Ceph's architecture is built in layers. At the foundation is RADOS (Reliable Autonomic Distributed Object Store), a distributed object storage system. On top of RADOS, Ceph provides three interface layers: RBD for block devices, CephFS for a POSIX file system, and RGW for S3/Swift-compatible object access.

Ceph Component Overview
Component	Role	Count in Cluster
OSD (Object Storage Daemon)	Stores objects, handles replication, recovery, scrubbing	One per disk (hundreds to thousands)
Monitor (MON)	Maintains cluster map, provides consensus, authentication	Odd number for quorum (3, 5, 7)
Manager (MGR)	Metrics, dashboard, module host (alerts, orchestration)	2+ for HA
MDS (Metadata Server)	Manages CephFS namespace, file metadata	1+ only if using CephFS
RADOS Gateway (RGW)	S3/Swift compatible REST API	Deployed as needed behind LB

Key architectural insight: The OSD daemons are the workhorses of Ceph. Each OSD typically manages a single physical disk (or SSD) and is responsible for storing objects, replicating data to peer OSDs, detecting failures, and recovering from them. The intelligence is distributed—OSDs coordinate directly with each other for replication, not through a central controller.

The Monitors maintain the authoritative cluster map (CRUSH map + OSD status) but don't handle any data operations. Clients and OSDs fetch the map from Monitors and then operate independently.

The CRUSH Algorithm: Controlled Replication Under Scalable Hashing

CRUSH is Ceph's revolutionary placement algorithm—the innovation that eliminates centralized metadata servers. CRUSH computes where objects should be stored based on:

The object's name (deterministic input)
The cluster topology (CRUSH map)
Placement rules (replication and failure domain requirements)

How CRUSH Works (Conceptual):

•Object Naming — Each object has a unique name. Hash the name to get a numeric placement group (PG) ID.
•Placement Groups — PGs are logical groupings of objects, each typically holding hundreds to thousands of objects or more. They simplify tracking: instead of tracking billions of objects, Ceph tracks thousands of PGs.
•CRUSH Computation — Given a PG ID and CRUSH map, the algorithm deterministically selects OSDs. Anyone with the same inputs computes the same result.
•Failure Domain Awareness — CRUSH understands topology (hosts, racks, datacenters) and distributes replicas across failure domains as specified.

crush-pseudocode.md
# CRUSH Placement Example (Simplified)
 
Input: 
  - object_name: "user/photos/vacation.jpg"
  - replication_factor: 3
  - failure_domain: "rack"
 
Step 1: Hash object name to PG
  PG_ID = hash("user/photos/vacation.jpg") % total_pg_count
  PG_ID = 2847
 
Step 2: Run CRUSH algorithm with PG_ID
  For each replica (1 to 3):
    - Select a root in topology tree
    - Descend tree, selecting rack ≠ previous replicas
    - Within rack, select host
    - Within host, select OSD with capacity weight
 
Step 3: Output
  Replica 1: OSD.42  (Rack A, Host 7)
  Replica 2: OSD.156 (Rack B, Host 23)
  Replica 3: OSD.89  (Rack C, Host 12)
  
# ANY client can compute this independently!
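The placement steps above can be sketched in a few lines of Python. This is a toy model, not Ceph's implementation: real CRUSH uses its own hash function and bucket types, and the topology, `PG_COUNT`, and all function names below are hypothetical. The sketch uses weighted rendezvous hashing to show the key property, namely that placement is a pure function of the object name and the map.

```python
import hashlib
import math

# Hypothetical toy topology: rack -> host -> list of (osd_id, capacity weight).
TOPOLOGY = {
    "rack-a": {"host-1": [(0, 1.0), (1, 4.0)], "host-2": [(2, 1.0), (3, 4.0)]},
    "rack-b": {"host-3": [(4, 1.0), (5, 4.0)]},
    "rack-c": {"host-4": [(6, 1.0), (7, 4.0)]},
    "rack-d": {"host-5": [(8, 1.0), (9, 4.0)]},
}
PG_COUNT = 4096  # assumed pool PG count


def stable_hash(*parts):
    """Deterministic 64-bit hash (SHA-256 here; Ceph uses a different hash)."""
    digest = hashlib.sha256("/".join(str(p) for p in parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name, pg_count=PG_COUNT):
    """Step 1: hash the object name to a placement group ID."""
    return stable_hash(object_name) % pg_count


def score(pg_id, item, weight=1.0):
    """Weighted rendezvous score; the candidate with the highest score wins."""
    u = (stable_hash(pg_id, item) + 1) / 2**64  # uniform in (0, 1]
    return math.log(u) / weight


def place_pg(pg_id, replicas=3):
    """Step 2: pick one OSD in each of `replicas` distinct racks."""
    racks = sorted(TOPOLOGY, key=lambda r: score(pg_id, r), reverse=True)[:replicas]
    chosen = []
    for rack in racks:
        host = max(TOPOLOGY[rack], key=lambda h: score(pg_id, h))
        osd_id, _ = max(
            TOPOLOGY[rack][host],
            key=lambda o: score(pg_id, f"osd.{o[0]}", o[1]),
        )
        chosen.append((osd_id, rack, host))
    return chosen


pg = object_to_pg("user/photos/vacation.jpg")
print(pg, place_pg(pg))
```

Any process holding the same `TOPOLOGY` computes the same result without contacting a server, which is the property the pseudocode is illustrating. For brevity the sketch weights only OSDs; a fuller model would also weight racks and hosts by their total capacity.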

Why CRUSH is revolutionary:

Traditional Approach	CRUSH Approach
Central metadata server maintains location mapping	Algorithm computes locations—no central lookup
Metadata server is bottleneck and SPOF	All clients compute independently
Adding storage requires metadata updates	Algorithm adapts automatically to topology changes
Failure domain placement requires manual configuration	Topology-aware rules ensure rack/datacenter spread

CRUSH Map Structure:

The CRUSH map encodes your cluster topology as a hierarchy:

crush-topology.txt
root default
├── datacenter dc1
│   ├── room room1
│   │   ├── rack rack1
│   │   │   ├── host node001
│   │   │   │   ├── osd.0 (1TB SSD)
│   │   │   │   └── osd.1 (4TB HDD)
│   │   │   └── host node002
│   │   │       ├── osd.2 (1TB SSD)
│   │   │       └── osd.3 (4TB HDD)
│   │   └── rack rack2
│   │       └── ... (more hosts)
│   └── room room2
│       └── ... (more racks)
└── datacenter dc2
    └── ... (second site for DR)

CRUSH Rules: Encoding Policy

CRUSH rules define placement policy: 'replicate 3 times, each replica on a different rack, prefer SSDs for the first replica.' Different pools can have different rules—fast SSD pool for hot data, HDD pool with erasure coding for archives, cross-datacenter pool for critical data.

RADOS: Reliable Autonomic Distributed Object Store

RADOS is the core of Ceph—a flat object store where everything is stored as objects with unique names. Files, block device data, and S3 objects all become RADOS objects ultimately.

Object Structure in RADOS:

•Object ID — Unique identifier (up to 4KB string, often includes pool/namespace prefix)
•Data — Binary payload up to object size limit (default ~128MB, configurable)
•Extended Attributes (xattrs) — Key-value metadata stored with object (size limited)
•Omap — Key-value store for larger metadata (stored separately, can be large)

Pools and Placement Groups:

Objects are organized into pools, which are logical partitions with independent configuration:

Replication factor — How many copies (e.g., 3 for critical data, 2 for less important)
Erasure coding profile — Alternatively, use EC for space efficiency (like RAID)
CRUSH rule — Which failure domains to use, which device classes
PG count — How many placement groups (affects parallelism and overhead)

Placement Groups (PGs):

PGs are the unit of replication and recovery. Instead of tracking billions of objects individually, Ceph tracks thousands of PGs. Each PG:

Contains multiple objects (100s to 1000s typically)
Maps to a set of OSDs (primary + replicas)
Is the unit of peering, recovery, and scrubbing

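Typical PG count: roughly 100-200 PG replicas per OSD. Because each PG is stored on several OSDs, the total is about (OSD count × target per OSD) / replica count. A 100-OSD cluster with 3x replication therefore lands at roughly 3,300-6,700 PGs, usually rounded to a power of two such as 4,096.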
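The sizing arithmetic can be written down as a small helper. This is a rule-of-thumb sketch only: the function name is hypothetical, the per-OSD target counts PG replicas (so the total is divided by the replica count), and rounding up to the next power of two is an assumption made here for simplicity.

```python
def suggested_pg_count(num_osds, replicas, target_pgs_per_osd=100):
    """Rule-of-thumb PG total for a cluster, rounded up to a power of two.

    Each PG is stored on `replicas` OSDs, so the per-OSD target is
    divided by the replica count to get the number of distinct PGs.
    """
    raw = num_osds * target_pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


print(suggested_pg_count(100, 3))        # 100 OSDs, 3x replication
print(suggested_pg_count(100, 3, 200))   # same cluster, higher per-OSD target
```

Recent Ceph releases can also adjust PG counts automatically, so a calculation like this is mainly useful for sanity-checking a design.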

Replication vs. Erasure Coding

Replication (3x) uses 200% overhead but handles failures gracefully with minimal compute. Erasure coding (e.g., 4+2 = 4 data chunks + 2 parity) uses only 50% overhead but requires CPU for encoding/decoding. Use replication for hot data needing low latency; use EC for cold/archival data where capacity matters more.
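The overhead figures follow directly from the redundancy parameters. A minimal sketch of the arithmetic, with hypothetical function names:

```python
def replication_profile(copies):
    """Storage cost of keeping `copies` full replicas of every object."""
    return {
        "overhead": copies - 1,            # extra bytes stored per byte of data
        "usable_fraction": 1 / copies,     # share of raw capacity that is usable
        "failures_tolerated": copies - 1,
    }


def erasure_profile(k, m):
    """Storage cost of k data chunks plus m parity chunks."""
    return {
        "overhead": m / k,
        "usable_fraction": k / (k + m),
        "failures_tolerated": m,
    }


print(replication_profile(3))   # overhead 2 means 200%
print(erasure_profile(4, 2))    # overhead 0.5 means 50%
```

Note that 3x replication and 4+2 erasure coding both survive the loss of two chunks or copies; the difference is that erasure coding pays for the saved capacity with CPU work and more complex reads and recovery.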

OSD Operations: Read, Write, and Recovery

Understanding how OSDs handle operations reveals Ceph's distributed consistency model and failure handling. Let's trace a write operation:

Write Path

•Client computes placement — Using CRUSH, client determines which OSDs hold the object's PG. Contacts the primary OSD.
•Primary receives write — Primary OSD for this PG handles the request. Assigns a version/epoch to the transaction.
•Primary forwards to replicas — Primary writes the transaction locally and sends it to all replica OSDs in parallel.
•Replicas persist and ACK — Each replica durably commits the write (to its journal or write-ahead log) and ACKs the primary.
•Primary commits — The write counts as committed once the primary's local write and every replica in the acting set have been persisted.
•Primary ACKs client — Only after all replicas confirm is the write considered complete.
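The write path can be modelled with a small in-memory simulation. This is an illustrative sketch, not Ceph code: the `OSD` and `PrimaryOSD` classes are hypothetical, there is no network or disk, and a failed replica simply causes the write to go unacknowledged, whereas a real cluster would re-peer the PG and continue with the surviving OSDs.

```python
class OSD:
    """A toy OSD that stores objects in a dict."""

    def __init__(self, osd_id, up=True):
        self.osd_id = osd_id
        self.up = up
        self.store = {}

    def persist(self, name, version, data):
        """Durably store one object version; returns True as the ACK."""
        if not self.up:
            return False
        self.store[name] = (version, data)
        return True


class PrimaryOSD(OSD):
    """The primary for a PG: orders writes and fans them out to replicas."""

    def __init__(self, osd_id, replicas):
        super().__init__(osd_id)
        self.replicas = replicas
        self.version = 0

    def client_write(self, name, data):
        self.version += 1  # the primary assigns the version
        acks = [r.persist(name, self.version, data) for r in self.replicas]
        local = self.persist(name, self.version, data)
        # ACK the client only when every copy has been persisted.
        return local and all(acks)


replicas = [OSD(156), OSD(89)]
primary = PrimaryOSD(42, replicas)
print(primary.client_write("vacation.jpg", b"..."))  # True: all copies stored
```

Because the client hears success only after every copy is persisted, a later read from the primary cannot return an older version than the one just acknowledged.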

Strong Consistency:

This write protocol ensures strong consistency: all replicas have the data before the client receives success. There's no window where a read could see stale data. Compare this to eventual consistency systems where a successful write doesn't guarantee all replicas are updated.

Read Path

•Client computes placement — Determines primary OSD for the object's PG.
•Read from primary — By default, reads go to primary OSD only (ensures seeing latest version).
•Primary returns data — Primary checks it's still authoritative and returns data.
•Alternative: Replica reads — Can be configured to read from any replica for higher throughput (eventual consistency).

Failure Detection and Recovery:

Ceph uses heartbeats between OSDs and monitors to detect failures:

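Detection Method	Purpose	Timeout
OSD ↔ OSD heartbeats	Detect peer failures; peers report an unresponsive OSD to the monitors	Sent every 6 seconds by default (osd_heartbeat_interval)
Heartbeat grace period	How long a peer may stay silent before it is reported as down	20 seconds default (osd_heartbeat_grace)
OSD → Monitor reports	Report OSD liveness and status to the cluster	Periodic; monitors apply a longer timeout before marking a silent OSD down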

Recovery Workflow:

Monitor detects OSD failure, updates cluster map
All OSDs receive updated map
PGs previously on failed OSD are now degraded
CRUSH selects new replica locations
Surviving replicas copy data to new locations (backfilling)
PGs return to fully replicated state

This entire process is automatic—no operator intervention needed.
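The recovery workflow can be illustrated by recomputing placement before and after a failure. The sketch below is a simplification under stated assumptions: it uses plain rendezvous hashing over a flat list of OSDs instead of CRUSH, ignores failure domains, and all names are hypothetical. It shows that only PGs that lived on the failed OSD get a new location.

```python
import hashlib


def score(pg_id, osd_id):
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_id}/{osd_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def acting_set(pg_id, up_osds, replicas=3):
    """The `replicas` highest-scoring OSDs hold this PG."""
    return sorted(up_osds, key=lambda o: score(pg_id, o), reverse=True)[:replicas]


def plan_recovery(pg_ids, osds, failed_osd, replicas=3):
    """Map each degraded PG to (surviving source OSD, new target OSD)."""
    survivors = [o for o in osds if o != failed_osd]
    plan = {}
    for pg in pg_ids:
        before = acting_set(pg, osds, replicas)
        if failed_osd not in before:
            continue  # PG was not on the failed OSD, so nothing moves
        after = acting_set(pg, survivors, replicas)
        target = next(o for o in after if o not in before)
        source = next(o for o in before if o != failed_osd)
        plan[pg] = (source, target)
    return plan


osds = list(range(10))
for pg, (src, dst) in sorted(plan_recovery(range(64), osds, failed_osd=3).items()):
    print(f"pg {pg}: copy from osd.{src} to osd.{dst}")
```

Every node that receives the updated map can compute the same plan independently, which is why no central coordinator has to schedule the copies.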

Recovery Impact on Performance

Recovery competes with client I/O for disk and network bandwidth. Ceph provides tuning parameters (osd_recovery_max_active, osd_recovery_sleep) to throttle recovery and minimize performance impact on production workloads. In 24/7 environments, you might slow recovery to maintain SLAs.

RBD: RADOS Block Device

RBD presents RADOS as block devices—virtual disks that can be attached to virtual machines, containers, or bare metal servers. This enables Ceph to replace traditional SAN storage for applications requiring block semantics.

How RBD Works:

An RBD image (virtual disk) is striped across many RADOS objects:

rbd-striping.txt
RBD Image: "vm-disk-001" (100 GB)
Object Size: 4 MB (default)
 
Object 0: rbd_data.12345.0000000000000000  (bytes 0 - 4MB)
Object 1: rbd_data.12345.0000000000000001  (bytes 4MB - 8MB)
Object 2: rbd_data.12345.0000000000000002  (bytes 8MB - 12MB)
...
Object 25599: rbd_data.12345.00000000000063ff (last 4MB)
 
# Each object is independently placed by CRUSH
# Parallel I/O across hundreds of OSDs possible
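The mapping from a byte offset in the image to a RADOS object is simple integer arithmetic. A minimal sketch, assuming the default layout with no custom striping and the example image ID `12345` from the listing above (the function name is hypothetical):

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB default object size


def locate(image_id, byte_offset, object_size=OBJECT_SIZE):
    """Return (RADOS object name, offset inside that object) for an image offset."""
    index, offset_in_object = divmod(byte_offset, object_size)
    # Object names end in the object index as 16 lowercase hex digits.
    return f"rbd_data.{image_id}.{index:016x}", offset_in_object


print(locate("12345", 0))
print(locate("12345", 9 * 1024 * 1024))  # 9 MiB falls 1 MiB into object 2
```

A 100 GiB image divided into 4 MiB objects spans 25,600 objects, with indices 0 through 25599. Only objects that have actually been written exist in RADOS, which is how thin provisioning falls out of this layout.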

RBD Features:

RBD Capabilities
Feature	Description	Use Case
Thin Provisioning	Space allocated only as data is written	Overcommit storage, pay for what you use
Snapshots	Copy-on-write point-in-time captures	Backup, testing, rollback
Cloning	Writable copies from snapshots (instant)	VM templating, rapid deployment
Layering	Images can have parent images (COW chains)	Golden image hierarchies
Mirroring	Async replication to remote cluster	Disaster recovery across sites
Exclusive Lock	Only one client can write at a time	Prevents corruption from concurrent access
Live Migration	Move RBD between pools or clusters	Storage tiering, cloud migration

RBD Integration:

RBD is widely supported:

Linux kernel module (krbd): Native block device, no dependencies
QEMU/KVM: Direct integration for VM disks
librbd: User-space library for applications
Kubernetes (CSI): Dynamic volume provisioning for containers
OpenStack Cinder: Default backend for many deployments

RBD vs. Traditional SAN

Traditional SANs require expensive Fibre Channel infrastructure and dedicated storage arrays. RBD runs over commodity Ethernet (10/25/100GbE) and commodity servers. For many workloads, RBD on NVMe OSDs matches or exceeds SAN performance at a fraction of the cost.

CephFS: Distributed POSIX File System

CephFS provides a POSIX-compliant distributed file system on top of RADOS. Unlike HDFS (which has limited POSIX support), CephFS supports full file system semantics including permissions, hard/soft links, and atomic rename.

CephFS Architecture:

CephFS introduces one additional component: the Metadata Server (MDS). The MDS manages the directory tree, file names, and attributes—but crucially, not the data. Data is still stored directly in RADOS.

Key Insight: Separating Metadata from Data

By separating metadata (handled by MDS) from data (stored directly in RADOS), CephFS achieves:

Scalable data throughput: Clients read/write data directly to OSDs, not through MDS
Metadata scalability: Multiple active MDS servers can handle different parts of namespace
Independent scaling: Add MDS nodes for metadata-heavy workloads, add OSDs for data capacity

Dynamic Subtree Partitioning:

With multiple active MDS nodes, CephFS dynamically distributes the namespace tree:

mds-subtree-partitioning.txt
Namespace Tree Distribution (2 Active MDS):
 
MDS.0 handles:                MDS.1 handles:
├── /home                     ├── /data
│   ├── alice/                │   ├── logs/
│   ├── bob/                  │   ├── metrics/
│   └── charlie/              │   └── uploads/
└── /etc                      └── /tmp
 
# Hot directories automatically migrate to less-loaded MDS
# If MDS.0 is overloaded by /home/alice activity,
# /home/alice subtree can migrate to MDS.1
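One way to picture subtree partitioning is as a longest-prefix lookup from a path to the MDS rank that owns it. The sketch below is only a mental model: the table and function names are hypothetical, and the real MDS balancer tracks load and migrates subtrees on its own rather than consulting a static dictionary.

```python
# Hypothetical ownership table matching the diagram above: subtree root -> MDS rank.
SUBTREE_OWNERS = {"/": 0, "/home": 0, "/etc": 0, "/data": 1, "/tmp": 1}


def mds_for_path(path, owners=SUBTREE_OWNERS):
    """Find the MDS rank for a path by walking up to the nearest owned subtree."""
    parts = [p for p in path.split("/") if p]
    for depth in range(len(parts), -1, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in owners:
            return owners[prefix]
    raise KeyError(path)


print(mds_for_path("/home/alice/notes.txt"))  # served by MDS.0

# Migrating a hot subtree amounts to adding a more specific entry.
rebalanced = {**SUBTREE_OWNERS, "/home/alice": 1}
print(mds_for_path("/home/alice/notes.txt", rebalanced))  # now served by MDS.1
print(mds_for_path("/home/bob/todo.txt", rebalanced))     # still MDS.0
```

The point of the model is that a migration changes ownership for one subtree only; sibling directories stay where they were.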

CephFS Features

•Full POSIX semantics — Permissions, ownership, timestamps, extended attributes, ACLs
•Snapshots — Directory-level copy-on-write snapshots (.snap/ virtual directory)
•Quotas — Per-directory capacity and file count limits
•Multiple file systems — Single Ceph cluster can host multiple independent CephFS instances
•Client caching — MDS issues capabilities (caps) allowing clients to cache metadata and data
•Standby replay MDS — Hot standby that replays journal, enables instant failover

CephFS vs. HDFS Trade-offs

CephFS provides richer semantics (full POSIX) but with higher metadata overhead. HDFS's simpler model (write-once files with no random writes) enables optimizations such as very large blocks, which keep per-file metadata small. Choose based on workload: CephFS for general-purpose shared storage, HDFS for Hadoop/Spark batch processing.

RADOS Gateway: S3 and Swift Compatible Object Storage

RADOS Gateway (RGW) exposes Ceph as an S3-compatible and Swift-compatible object storage service. This enables applications written for AWS S3 to work with on-premises Ceph storage without modification.

RGW Architecture:

RGW runs as a RESTful web service (typically behind a load balancer) that translates S3/Swift API calls into RADOS operations:

rgw-translation.txt
S3 Request:
PUT /bucket/object HTTP/1.1
Content-Length: 1048576
Authorization: AWS4-HMAC-SHA256 ...
Body: <1MB of data>
 
RGW Translation:
1. Authenticate request against RGW user database
2. Check bucket exists in index pool
3. Create RADOS object: <bucket>/<object>
4. Write 1MB data to object
5. Update bucket index (object metadata)
6. Return HTTP 200 OK
 
Underlying RADOS operations:
- rados put "default.rgw.buckets.data/bucket.12345/object" 
- rados omap set "default.rgw.buckets.index/bucket.12345"

RGW Features Beyond Basic S3:

RGW Capabilities
Feature	Description
Multisite Replication	Active-active or active-passive sync across datacenters
Bucket Versioning	Keep multiple versions of objects
Object Lifecycle	Automatic transition to cold storage, expiration
Bucket Policies	IAM-style access control
Bucket Notifications	Trigger webhooks on object events
STS (Security Token Service)	Temporary credentials, role assumption
Server-Side Encryption	SSE-C, SSE-KMS encryption options
Multipart Uploads	Large object uploads with resumability

Multisite Deployment:

RGW supports complex multi-datacenter topologies:

Active-Passive: One site handles writes, async replicates to standby site
Active-Active: Both sites accept writes, sync bidirectionally (eventual consistency)
Multi-Zone: Multiple zones in one zonegroup; object data syncs asynchronously between zones, while metadata changes are coordinated through the master zone

This makes RGW suitable for global object storage with disaster recovery and data locality.

When to Use RGW vs. Native RADOS

Use RGW when you need S3/Swift API compatibility, multi-tenancy with access controls, or HTTP-based access for web applications. Use native RADOS (librados) when you need maximum performance and control—RGW adds HTTP overhead and authentication processing.

Operational Considerations

Deploying and operating Ceph at scale requires understanding key operational challenges and best practices.

Critical Planning Considerations

•OSD Disk Choice — SSDs for journals/metadata, NVMe for high-performance pools, HDDs for capacity. Don't mix workloads on same OSDs.
•Network Design — Separate public (client) and cluster (replication) networks. 10GbE minimum; 25-100GbE for NVMe clusters.
•PG Count Tuning — Too few PGs = poor distribution; too many = excessive memory/CPU overhead. Target 100-200 PGs per OSD.
•Failure Domain Planning — Design CRUSH topology before deployment. Retrofitting is complex.
•Monitor Placement — Spread monitors across failure domains. Never place majority in one rack/datacenter.
•Capacity Planning — Keep clusters under 80% full. Recovery slows dramatically on full clusters.

Common Operational Tasks:

Task	Frequency	Impact
Adding OSDs	As needed	Low impact, automatic rebalancing
Removing OSDs	Rare	Medium impact, data migrates out first
Upgrading Ceph	Quarterly	Rolling upgrades possible, plan carefully
Replacing failed disks	As failures occur	Automatic recovery, minimal intervention
Pool configuration changes	Rare	Some changes require data movement
CRUSH map updates	Rare	Can trigger data migration

Monitoring Essentials:

Health status: ceph health should return HEALTH_OK
OSD status: Monitor failed/slow/full OSDs
PG states: Check for degraded, undersized, or inconsistent PGs
Recovery progress: Track backfill/recovery operations
Performance metrics: IOPS, throughput, latency per pool

The Dreaded 'Full' Cluster

As OSDs fill up, Ceph first warns at the nearfull ratio (default 85%) and then blocks writes at the full ratio (default 95%) to protect data integrity. A truly full cluster is an operational nightmare: recovery becomes impossible without adding capacity or deleting data. Set aggressive alerts and never operate above 80%.
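An alerting check built on these thresholds might look like the following sketch. The 85% and 95% values are the defaults quoted above; the 80% alert level, the function name, and the status labels are assumptions for illustration, not Ceph output.

```python
NEARFULL_RATIO = 0.85  # Ceph warns at this utilization by default
FULL_RATIO = 0.95      # Ceph blocks writes at this utilization by default


def classify_osd(used_bytes, total_bytes, alert_ratio=0.80):
    """Classify one OSD's utilization against the warning thresholds."""
    ratio = used_bytes / total_bytes
    if ratio >= FULL_RATIO:
        return "full"
    if ratio >= NEARFULL_RATIO:
        return "nearfull"
    if ratio >= alert_ratio:
        return "alert"  # hypothetical early-warning level below nearfull
    return "ok"


TIB = 1024**4
for used in (2.0, 3.3, 3.6, 3.9):
    print(f"{used} TiB of 4 TiB used: {classify_osd(used * TIB, 4 * TIB)}")
```

Checking each OSD individually matters because data is never perfectly balanced: a single OSD can reach the full ratio while the cluster average still looks comfortable.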

Summary: Ceph Unified Storage

We've explored Ceph's unique approach to distributed storage—eliminating metadata bottlenecks through algorithmic placement while providing unified object, block, and file interfaces. Let's consolidate the key concepts:

Key Takeaways

•CRUSH eliminates centralized metadata — Any client can compute object placement independently using the CRUSH algorithm and cluster map.
•RADOS is the foundation — A flat object store with strong consistency, automatic failure recovery, and replication or erasure coding.
•Unified storage interfaces — RBD (block), CephFS (file), and RGW (S3/Swift) all build on the same RADOS foundation.
•Self-healing architecture — Automatic failure detection, recovery, and rebalancing without operator intervention.
•Topology-aware placement — CRUSH rules ensure data is distributed across failure domains (racks, datacenters).
•Commodity hardware — Ceph runs on standard servers with standard disks, no proprietary hardware required.
•Operational complexity — Ceph requires careful planning and monitoring; it's not a 'set and forget' system.

What's Next:

While Ceph provides a comprehensive unified storage solution with sophisticated distribution algorithms, there are simpler alternatives for specific use cases. In the next page, we'll explore GlusterFS, a scale-out file system that takes a different approach—emphasizing simplicity and native file system semantics over algorithmic complexity.

Page Complete

You now understand Ceph's architecture at a level sufficient for evaluating it as a storage solution and making informed design decisions. You can explain CRUSH placement, RADOS operations, and how the unified interfaces work together.

Ceph: Unified Storage

The Quest to Eliminate Metadata Bottlenecks

What if we could eliminate the centralized metadata server entirely?

What You Will Learn

Ceph Philosophy and Design Goals

Ceph Design Principles

•No single point of failure — Every component must be redundant. No single node failure should cause data unavailability or loss.
•Horizontal scalability — Adding nodes should linearly increase capacity and performance. No architectural ceilings.
•Self-managing — The cluster should detect failures, heal itself, and rebalance data automatically without human intervention.
•Software-defined — Run on commodity hardware without specialized controllers or appliances. All intelligence in software.
•Unified storage — Provide object, block, and file interfaces from a single storage pool. No data silos.
•Strong consistency — All clients see the same data at the same time. No eventual consistency surprises.

Why 'Ceph'?

Architecture Overview: RADOS and Beyond

Converting Mermaid diagram...

Ceph Component Overview
Component	Role	Count in Cluster
OSD (Object Storage Daemon)	Stores objects, handles replication, recovery, scrubbing	One per disk (hundreds to thousands)
Monitor (MON)	Maintains cluster map, provides consensus, authentication	Odd number for quorum (3, 5, 7)
Manager (MGR)	Metrics, dashboard, module host (alerts, orchestration)	2+ for HA
MDS (Metadata Server)	Manages CephFS namespace, file metadata	1+ only if using CephFS
RADOS Gateway (RGW)	S3/Swift compatible REST API	Deployed as needed behind LB

The Monitors maintain the authoritative cluster map (CRUSH map + OSD status) but don't handle any data operations. Clients and OSDs fetch the map from Monitors and then operate independently.

The CRUSH Algorithm: Controlled Replication Under Scalable Hashing

CRUSH is Ceph's revolutionary placement algorithm—the innovation that eliminates centralized metadata servers. CRUSH computes where objects should be stored based on:

The object's name (deterministic input)
The cluster topology (CRUSH map)
Placement rules (replication and failure domain requirements)

How CRUSH Works (Conceptual):

•Object Naming — Each object has a unique name. Hash the name to get a numeric placement group (PG) ID.
•Placement Groups — PGs are logical groupings (~100 objects per PG). They simplify tracking—instead of tracking billions of objects, track thousands of PGs.
•CRUSH Computation — Given a PG ID and CRUSH map, the algorithm deterministically selects OSDs. Anyone with the same inputs computes the same result.
•Failure Domain Awareness — CRUSH understands topology (hosts, racks, datacenters) and distributes replicas across failure domains as specified.

crush-pseudocode.md
# CRUSH Placement Example (Simplified)
 
Input: 
  - object_name: "user/photos/vacation.jpg"
  - replication_factor: 3
  - failure_domain: "rack"
 
Step 1: Hash object name to PG
  PG_ID = hash("user/photos/vacation.jpg") % total_pg_count
  PG_ID = 2847
 
Step 2: Run CRUSH algorithm with PG_ID
  For each replica (1 to 3):
    - Select a root in topology tree
    - Descend tree, selecting rack ≠ previous replicas
    - Within rack, select host
    - Within host, select OSD with capacity weight
 
Step 3: Output
  Replica 1: OSD.42  (Rack A, Host 7)
  Replica 2: OSD.156 (Rack B, Host 23)
  Replica 3: OSD.89  (Rack C, Host 12)
  
# ANY client can compute this independently!

Why CRUSH is revolutionary:

Traditional Approach	CRUSH Approach
Central metadata server maintains location mapping	Algorithm computes locations—no central lookup
Metadata server is bottleneck and SPOF	All clients compute independently
Adding storage requires metadata updates	Algorithm adapts automatically to topology changes
Failure domain placement requires manual configuration	Topology-aware rules ensure rack/datacenter spread

CRUSH Map Structure:

The CRUSH map encodes your cluster topology as a hierarchy:

crush-topology.txt
root default
├── datacenter dc1
│   ├── room room1
│   │   ├── rack rack1
│   │   │   ├── host node001
│   │   │   │   ├── osd.0 (1TB SSD)
│   │   │   │   └── osd.1 (4TB HDD)
│   │   │   ├── host node002
│   │   │   │   ├── osd.2 (1TB SSD)
│   │   │   │   └── osd.3 (4TB HDD)
│   │   ├── rack rack2
│   │   │   └── ... (more hosts)
│   ├── room room2
│       └── ... (more racks)
├── datacenter dc2
    └── ... (second site for DR)

CRUSH Rules: Encoding Policy

RADOS: Reliable Autonomic Distributed Object Store

RADOS is the core of Ceph—a flat object store where everything is stored as objects with unique names. Files, block device data, and S3 objects all become RADOS objects ultimately.

Object Structure in RADOS:

•Object ID — Unique identifier (up to 4KB string, often includes pool/namespace prefix)
•Data — Binary payload up to object size limit (default ~128MB, configurable)
•Extended Attributes (xattrs) — Key-value metadata stored with object (size limited)
•Omap — Key-value store for larger metadata (stored separately, can be large)

Pools and Placement Groups:

Objects are organized into pools, which are logical partitions with independent configuration:

Replication factor — How many copies (e.g., 3 for critical data, 2 for less important)
Erasure coding profile — Alternatively, use EC for space efficiency (like RAID)
CRUSH rule — Which failure domains to use, which device classes
PG count — How many placement groups (affects parallelism and overhead)

Placement Groups (PGs):

PGs are the unit of replication and recovery. Instead of tracking billions of objects individually, Ceph tracks thousands of PGs. Each PG:

Contains multiple objects (100s to 1000s typically)
Maps to a set of OSDs (primary + replicas)
Is the unit of peering, recovery, and scrubbing

Typical PG count: 100-300 PGs per OSD. A 100-OSD cluster might have ~10,000-30,000 PGs total.

Converting Mermaid diagram...

Replication vs. Erasure Coding

OSD Operations: Read, Write, and Recovery

Understanding how OSDs handle operations reveals Ceph's distributed consistency model and failure handling. Let's trace a write operation:

Write Path

•Client computes placement — Using CRUSH, client determines which OSDs hold the object's PG. Contacts the primary OSD.
•Primary receives write — Primary OSD for this PG handles the request. Assigns a version/epoch to the transaction.
•Primary forwards to replicas — Primary sends the write to all replica OSDs in parallel.
•Replicas persist and ACK — Each replica writes to its journal (WAL) and ACKs the primary.
•Primary commits — After receiving ACKs from all replicas (by default), primary commits its own copy.
•Primary ACKs client — Only after all replicas confirm is the write considered complete.

Strong Consistency:

Read Path:

Read Path

•Client computes placement — Determines primary OSD for the object's PG.
•Read from primary — By default, reads go to primary OSD only (ensures seeing latest version).
•Primary returns data — Primary checks it's still authoritative and returns data.
•Alternative: Replica reads — Can be configured to read from any replica for higher throughput (eventual consistency).

Failure Detection and Recovery:

Ceph uses heartbeats between OSDs and monitors to detect failures:

Detection Method	Purpose	Timeout
OSD ↔ OSD heartbeats	Detect peer failures during operations	Seconds
OSD → Monitor heartbeats	Report OSD liveness to cluster	6 seconds default
Monitor grace period	Confirm OSD is truly down	20 seconds default

Recovery Workflow:

Monitor detects OSD failure, updates cluster map
All OSDs receive updated map
PGs previously on failed OSD are now degraded
CRUSH selects new replica locations
Surviving replicas copy data to new locations (backfilling)
PGs return to fully replicated state

This entire process is automatic—no operator intervention needed.

Recovery Impact on Performance

RBD: RADOS Block Device

How RBD Works:

An RBD image (virtual disk) is striped across many RADOS objects:

rbd-striping.txt
RBD Image: "vm-disk-001" (100 GB)
Object Size: 4 MB (default)
 
Object 0: rbd_data.12345.0000000000000000  (bytes 0 - 4MB)
Object 1: rbd_data.12345.0000000000000001  (bytes 4MB - 8MB)
Object 2: rbd_data.12345.0000000000000002  (bytes 8MB - 12MB)
...
Object 25599: rbd_data.12345.0000000000006400 (last 4MB)
 
# Each object is independently placed by CRUSH
# Parallel I/O across hundreds of OSDs possible

RBD Features:

RBD Capabilities
Feature	Description	Use Case
Thin Provisioning	Space allocated only as data is written	Overcommit storage, pay for what you use
Snapshots	Copy-on-write point-in-time captures	Backup, testing, rollback
Cloning	Writable copies from snapshots (instant)	VM templating, rapid deployment
Layering	Images can have parent images (COW chains)	Golden image hierarchies
Mirroring	Async replication to remote cluster	Disaster recovery across sites
Exclusive Lock	Only one client can write at a time	Prevents corruption from concurrent access
Live Migration	Move RBD between pools or clusters	Storage tiering, cloud migration

RBD Integration:

RBD is widely supported:

Linux kernel module (krbd): Native block device, no dependencies
QEMU/KVM: Direct integration for VM disks
librbd: User-space library for applications
Kubernetes (CSI): Dynamic volume provisioning for containers
OpenStack Cinder: Default backend for many deployments

RBD vs. Traditional SAN

CephFS: Distributed POSIX File System

CephFS Architecture:

Converting Mermaid diagram...

Key Insight: Separating Metadata from Data

By separating metadata (handled by MDS) from data (stored directly in RADOS), CephFS achieves:

Scalable data throughput: Clients read/write data directly to OSDs, not through MDS
Metadata scalability: Multiple active MDS servers can handle different parts of namespace
Independent scaling: Add MDS nodes for metadata-heavy workloads, add OSDs for data capacity

Dynamic Subtree Partitioning:

With multiple active MDS nodes, CephFS dynamically distributes the namespace tree:

mds-subtree-partitioning.txt
Namespace Tree Distribution (2 Active MDS):
 
MDS.0 handles:                MDS.1 handles:
├── /home                     ├── /data
│   ├── alice/                │   ├── logs/
│   ├── bob/                  │   ├── metrics/
│   └── charlie/              │   └── uploads/
└── /etc                      └── /tmp
 
# Hot directories automatically migrate to less-loaded MDS
# If MDS.0 is overloaded by /home/alice activity,
# /home/alice subtree can migrate to MDS.1
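
The assignment above behaves like a longest-prefix lookup from a path to an MDS rank. The sketch below is a simplified model under that assumption. `SubtreeMap`, `authority`, and `migrate` are hypothetical names, and the real MDS balancer bases its migration decisions on load metrics that are not modeled here.

```python
from pathlib import PurePosixPath


class SubtreeMap:
    """Maps subtree roots to the MDS rank that is authoritative for them."""

    def __init__(self, assignments: dict[str, int], default_rank: int = 0):
        self.assignments = {PurePosixPath(p): r for p, r in assignments.items()}
        self.default_rank = default_rank

    def authority(self, path: str) -> int:
        """Return the rank owning the deepest subtree root that contains path."""
        p = PurePosixPath(path)
        for candidate in (p, *p.parents):  # deepest candidate first
            if candidate in self.assignments:
                return self.assignments[candidate]
        return self.default_rank

    def migrate(self, subtree: str, new_rank: int) -> None:
        """Hand a subtree (and everything below it) to another rank."""
        self.assignments[PurePosixPath(subtree)] = new_rank


if __name__ == "__main__":
    tree = SubtreeMap({"/home": 0, "/etc": 0, "/data": 1, "/tmp": 1})
    print(tree.authority("/home/alice/notes.txt"))
    tree.migrate("/home/alice", 1)  # hot subtree moves to the other rank
    print(tree.authority("/home/alice/notes.txt"))
    print(tree.authority("/home/bob/todo.txt"))
```

Migrating `/home/alice` adds a more specific entry, so lookups under that directory switch ranks while the rest of `/home` stays where it was.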

CephFS Features

•Full POSIX semantics — Permissions, ownership, timestamps, extended attributes, ACLs
•Snapshots — Directory-level copy-on-write snapshots (.snap/ virtual directory)
•Quotas — Per-directory capacity and file count limits
•Multiple file systems — Single Ceph cluster can host multiple independent CephFS instances
•Client caching — MDS issues capabilities (caps) allowing clients to cache metadata and data
•Standby replay MDS — Hot standby that continuously replays the active MDS journal, shortening failover time

CephFS vs. HDFS Trade-offs

RADOS Gateway: S3 and Swift Compatible Object Storage

RGW Architecture:

RGW runs as a RESTful web service (typically behind a load balancer) that translates S3/Swift API calls into RADOS operations:

rgw-translation.txt
S3 Request:
PUT /bucket/object HTTP/1.1
Content-Length: 1048576
Authorization: AWS4-HMAC-SHA256 ...
Body: <1MB of data>
 
RGW Translation:
1. Authenticate request against RGW user database
2. Check bucket exists in index pool
3. Create RADOS object: <bucket>/<object>
4. Write 1MB data to object
5. Update bucket index (object metadata)
6. Return HTTP 200 OK
 
Underlying RADOS operations (conceptual, not literal rados CLI syntax):
- rados put "default.rgw.buckets.data/bucket.12345/object"
- rados omap set "default.rgw.buckets.index/bucket.12345"
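
The six translation steps can be traced with a small in-memory model. This is a sketch only: `ToyGateway`, its dictionaries, and the simplified status codes are hypothetical stand-ins for RGW's user database, bucket index pool, and data pool, and signature verification is reduced to a key lookup.

```python
class ToyGateway:
    """In-memory stand-in for the PUT path described above."""

    def __init__(self):
        self.users: dict[str, str] = {}               # access key -> user name
        self.index: dict[str, dict[str, dict]] = {}   # bucket -> {object key -> metadata}
        self.data: dict[str, bytes] = {}              # "<bucket>/<object>" -> payload

    def create_bucket(self, bucket: str) -> None:
        self.index.setdefault(bucket, {})

    def put_object(self, access_key: str, bucket: str, key: str, body: bytes) -> int:
        # 1. Authenticate the request.
        if access_key not in self.users:
            return 403
        # 2. Check that the bucket exists.
        if bucket not in self.index:
            return 404
        # 3-4. Create the backing object and write the payload.
        self.data[f"{bucket}/{key}"] = body
        # 5. Record the object's metadata in the bucket index.
        self.index[bucket][key] = {"size": len(body), "owner": self.users[access_key]}
        # 6. Report success.
        return 200


if __name__ == "__main__":
    gw = ToyGateway()
    gw.users["EXAMPLEKEY"] = "alice"  # made-up credential
    gw.create_bucket("bucket")
    print(gw.put_object("EXAMPLEKEY", "bucket", "object", b"x" * 1048576))
```

The model follows the order of the steps listed above. It makes no attempt to reproduce how RGW keeps the data write and the index update consistent with each other.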

RGW Features Beyond Basic S3:

RGW Capabilities
Feature	Description
Multisite Replication	Active-active or active-passive sync across datacenters
Bucket Versioning	Keep multiple versions of objects
Object Lifecycle	Automatic transition to cold storage, expiration
Bucket Policies	IAM-style access control
Bucket Notifications	Publish object events to HTTP, AMQP, or Kafka endpoints
STS (Security Token Service)	Temporary credentials, role assumption
Server-Side Encryption	SSE-C, SSE-KMS encryption options
Multipart Uploads	Large object uploads with resumability

Multisite Deployment:

RGW supports complex multi-datacenter topologies:

Active-Passive: One site handles writes, async replicates to standby site
Active-Active: Both sites accept writes, sync bidirectionally (eventual consistency)
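Multi-Zone: Multiple zones in one zonegroup (formerly called a region); object data replicates asynchronously between zones, and metadata changes are coordinated through the master zone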

This makes RGW suitable for global object storage with disaster recovery and data locality.

When to Use RGW vs. Native RADOS

Operational Considerations

Deploying and operating Ceph at scale requires understanding key operational challenges and best practices.

Critical Planning Considerations

•OSD Disk Choice — SSDs for journals/metadata, NVMe for high-performance pools, HDDs for capacity. Don't mix workloads on same OSDs.
•Network Design — Separate public (client) and cluster (replication) networks. 10GbE minimum; 25-100GbE for NVMe clusters.
•PG Count Tuning — Too few PGs = poor distribution; too many = excessive memory/CPU overhead. Target 100-200 PGs per OSD.
•Failure Domain Planning — Design CRUSH topology before deployment. Retrofitting is complex.
•Monitor Placement — Spread monitors across failure domains. Never place majority in one rack/datacenter.
•Capacity Planning — Keep clusters under 80% full. Recovery slows dramatically on full clusters.
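
The PG guidance can be turned into a quick estimate. The sketch below uses a commonly cited rule of thumb (OSD count times target PGs per OSD, divided by the replica count, rounded up to a power of two) for the total across all pools. The function names and the example cluster size are illustrative, and clusters that run the PG autoscaler may not need a manual calculation at all.

```python
def suggest_pg_count(num_osds: int, replicas: int = 3, target_per_osd: int = 100) -> int:
    """Estimate a cluster-wide PG count, rounded up to a power of two."""
    if num_osds <= 0 or replicas <= 0 or target_per_osd <= 0:
        raise ValueError("all arguments must be positive")
    raw = num_osds * target_per_osd / replicas
    pg_count = 1
    while pg_count < raw:
        pg_count *= 2
    return pg_count


def pgs_per_osd(pg_count: int, num_osds: int, replicas: int = 3) -> float:
    """Average number of PG replicas each OSD ends up hosting."""
    return pg_count * replicas / num_osds


if __name__ == "__main__":
    total = suggest_pg_count(num_osds=100)  # 100 OSDs, 3 replicas
    print(total, pgs_per_osd(total, num_osds=100))
```

Rounding up at most doubles the raw value, so starting from a target of 100 always leaves the per-OSD average at or above 100 and below 200.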

Common Operational Tasks:

Task	Frequency	Impact
Adding OSDs	As needed	Low impact, automatic rebalancing
Removing OSDs	Rare	Medium impact, data migrates out first
Upgrading Ceph	Quarterly	Rolling upgrades possible, plan carefully
Replacing failed disks	As failures occur	Automatic recovery, minimal intervention
Pool configuration changes	Rare	Some changes require data movement
CRUSH map updates	Rare	Can trigger data migration

Monitoring Essentials:

Health status: ceph health should return HEALTH_OK
OSD status: Monitor failed/slow/full OSDs
PG states: Check for degraded, undersized, or inconsistent PGs
Recovery progress: Track backfill/recovery operations
Performance metrics: IOPS, throughput, latency per pool
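
PG state checks lend themselves to a small script. The sketch below assumes PG states arrive as '+'-joined strings such as `active+clean`, together with a count per state. The sample counts and the `problem_pgs` helper are made up for illustration, and fetching real data from the cluster is left out.

```python
WATCHED = {"degraded", "undersized", "inconsistent"}


def problem_pgs(state_counts: dict[str, int]) -> dict[str, int]:
    """Sum PG counts per watched flag from '+'-joined state strings."""
    totals = {flag: 0 for flag in sorted(WATCHED)}
    for state, count in state_counts.items():
        # A PG in "active+undersized+degraded" counts toward both flags.
        for flag in WATCHED.intersection(state.split("+")):
            totals[flag] += count
    return totals


if __name__ == "__main__":
    sample = {
        "active+clean": 1000,
        "active+undersized+degraded": 20,
        "active+clean+inconsistent": 1,
    }
    print(problem_pgs(sample))
```

A result with every total at zero corresponds to the healthy case; any non-zero total is a cue to look at recovery progress.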

The Dreaded 'Full' Cluster

Summary: Ceph Unified Storage

Key Takeaways

•CRUSH eliminates centralized placement lookups — Any client can compute object placement independently using the CRUSH algorithm and cluster map.
•RADOS is the foundation — A flat object store with strong consistency, automatic failure recovery, and replication or erasure coding.
•Unified storage interfaces — RBD (block), CephFS (file), and RGW (S3/Swift) all build on the same RADOS foundation.
•Self-healing architecture — Automatic failure detection, recovery, and rebalancing without operator intervention.
•Topology-aware placement — CRUSH rules ensure data is distributed across failure domains (racks, datacenters).
•Commodity hardware — Ceph runs on standard servers with standard disks, no proprietary hardware required.
•Operational complexity — Ceph requires careful planning and monitoring; it's not a 'set and forget' system.

What's Next:
