Design a distributed stream processing system like Apache Flink or Spark Streaming that continuously processes unbounded data streams with low latency. The system supports windowed aggregations (tumbling, sliding, and session windows) over event time with watermark-based progress tracking, and maintains large distributed keyed per-event state (backed by RocksDB at TB scale). It achieves exactly-once semantics via Chandy-Lamport-style distributed snapshot checkpointing (sources capture their offsets and inject barriers → operators snapshot state as the barriers flow through), distributes computation as parallel subtasks across a cluster, and integrates end-to-end exactly-once with external systems via two-phase-commit sinks or idempotent writes.
| Metric | Value |
|---|---|
| Events processed per second | 1–10 million |
| End-to-end latency | Milliseconds (Flink) to seconds (Spark micro-batch) |
| Checkpoint interval | 30 seconds – 5 minutes |
| Checkpoint duration | 1–30 seconds |
| State size (per job) | GB to TB (RocksDB backend) |
| Parallelism (per job) | 10–10,000 subtasks |
| TaskManagers per cluster | 10–1,000 |
| Task slots per TaskManager | 4–16 |
| Jobs per cluster (multi-tenant) | 10–100 |
| Watermark lag (event time) | Seconds to minutes |
Continuous data processing: ingest unbounded streams of events (no beginning or end — events arrive continuously) from sources like Kafka, Kinesis, or Pub/Sub; process each event with low latency (milliseconds to seconds); emit results to sinks (databases, dashboards, downstream topics) in near real-time
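As a rough, framework-agnostic sketch of the core ingest → process → emit loop each parallel task runs (the `SourceReader` and `Sink` interfaces below are hypothetical stand-ins for a Kafka consumer and a downstream producer, not any real engine's API):

```java
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of one parallel pipeline instance: poll a source,
// transform each event, emit to a sink. Real engines wrap this core
// loop with batching, checkpointing, and backpressure.
public class PipelineTask {

    interface SourceReader { List<String> poll(); }   // e.g. a Kafka consumer
    interface Sink extends Consumer<String> {}        // e.g. a downstream producer

    private final SourceReader source;
    private final Sink sink;

    PipelineTask(SourceReader source, Sink sink) {
        this.source = source;
        this.sink = sink;
    }

    public void run() {
        while (true) {                        // unbounded stream: no natural end
            for (String event : source.poll()) {
                String result = process(event);
                sink.accept(result);          // emit with low per-event latency
            }
        }
    }

    private String process(String event) {
        return event.toUpperCase();           // placeholder per-event transformation
    }
}
```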
Windowed aggregation: group events into time-based windows — tumbling (fixed, non-overlapping), sliding (overlapping, specified by size + slide), session (activity-based, gap-triggered close); compute aggregations (count, sum, avg, min, max, top-K) per window; emit results when window closes
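A minimal sketch of per-event window assignment for the three window types; the class and window representation below are illustrative assumptions, not any engine's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative window assignment: given an event timestamp, which
// window(s) does it belong to? Windows are [start, end) in millis.
public class WindowAssigners {

    record Window(long start, long end) {}

    // Tumbling: exactly one fixed, non-overlapping window per event.
    static Window tumbling(long ts, long sizeMs) {
        long start = ts - (ts % sizeMs);
        return new Window(start, start + sizeMs);
    }

    // Sliding: one window per slide step that still covers the event
    // (roughly size / slide overlapping windows per event).
    static List<Window> sliding(long ts, long sizeMs, long slideMs) {
        List<Window> windows = new ArrayList<>();
        long lastStart = ts - (ts % slideMs);
        for (long start = lastStart; start > ts - sizeMs; start -= slideMs) {
            windows.add(new Window(start, start + sizeMs));
        }
        return windows;
    }

    // Session: each event opens a candidate window [ts, ts + gap);
    // overlapping candidates are merged, and the session closes once
    // no event arrives within the gap.
    static Window session(long ts, long gapMs) {
        return new Window(ts, ts + gapMs);
    }

    static Window mergeIfOverlapping(Window a, Window b) {
        if (a.end() >= b.start() && b.end() >= a.start()) {
            return new Window(Math.min(a.start(), b.start()), Math.max(a.end(), b.end()));
        }
        return null;  // disjoint: separate sessions
    }
}
```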
Event time processing: process events based on when they occurred (event time), not when they arrive (processing time); handle out-of-order events; support watermarks — a mechanism to track event-time progress and determine when a window can be closed
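A small sketch of the common bounded-out-of-orderness watermark strategy, under the assumption that the watermark trails the maximum observed event time by a fixed allowed lateness (all names here are illustrative):

```java
// Sketch of a bounded-out-of-orderness watermark: the watermark trails
// the max event time seen so far by a fixed allowed lateness, and a
// window can be closed once the watermark passes its end timestamp.
public class BoundedOutOfOrdernessWatermark {

    private final long maxOutOfOrdernessMs;
    private long maxEventTime = Long.MIN_VALUE;

    public BoundedOutOfOrdernessWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Called for every event; watermarks are typically emitted periodically.
    public void onEvent(long eventTimestamp) {
        maxEventTime = Math.max(maxEventTime, eventTimestamp);
    }

    // Meaning: "no event with timestamp <= watermark is expected anymore."
    public long currentWatermark() {
        return maxEventTime == Long.MIN_VALUE ? Long.MIN_VALUE
                                              : maxEventTime - maxOutOfOrdernessMs;
    }

    // An event-time window [start, end) can fire once the watermark
    // reaches or passes its end.
    public boolean canCloseWindow(long windowEnd) {
        return currentWatermark() >= windowEnd;
    }
}
```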
Exactly-once semantics: guarantee that each input event affects the output exactly once — no duplicates, no lost events; achieved end-to-end (source → processing → sink); critical for financial calculations, billing, and accurate counting
Stateful processing: operators maintain state across events — e.g., running count per user, session state, ML model state; state persisted durably (survives failures); state can grow large (terabytes across the cluster); efficient state access (in-memory with disk spill) for high-throughput processing
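A simplified sketch of keyed operator state as a per-task map with snapshot/restore hooks; a production backend would use RocksDB (hot entries in memory, the rest spilled to local disk) rather than an in-memory map, but the shape of the idea is the same:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of keyed operator state: each parallel task owns the state for
// its share of the key space (here a plain in-memory map).
public class RunningCountOperator {

    private final Map<String, Long> countPerKey = new HashMap<>();

    // Process one keyed event and emit the updated running count.
    public long onEvent(String key) {
        return countPerKey.merge(key, 1L, Long::sum);
    }

    // Snapshot hook used by checkpointing: this map is what gets
    // persisted to durable storage (e.g. S3/HDFS).
    public Map<String, Long> snapshotState() {
        return new HashMap<>(countPerKey);
    }

    // Restore hook used after a failure: reload the checkpointed map.
    public void restoreState(Map<String, Long> snapshot) {
        countPerKey.clear();
        countPerKey.putAll(snapshot);
    }
}
```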
Fault tolerance via checkpointing: periodically snapshot the entire distributed state (operator state + source offsets) to durable storage (S3/HDFS); on failure → restart from latest checkpoint → replay source from checkpointed offsets → recover exact pre-failure state; no data loss, no duplicates
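A coordinator-level sketch of the checkpoint/recovery cycle, assuming hypothetical `Source`, `Operator`, and `DurableStore` interfaces; the key invariant is that state snapshots and source offsets are persisted together, so replay after restore is consistent:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the checkpoint/recovery cycle from a coordinator's point of
// view: a checkpoint pairs the operators' state snapshots with the source
// offsets that produced that state.
public class CheckpointCoordinator {

    interface Source   { Map<Integer, Long> currentOffsets(); void seek(Map<Integer, Long> offsets); }
    interface Operator { Map<String, Long> snapshotState(); void restoreState(Map<String, Long> s); }
    interface DurableStore { void write(long checkpointId, Object payload); Object readLatest(); }

    record Checkpoint(long id, Map<Integer, Long> sourceOffsets, Map<String, Long> operatorState) {}

    private long nextId = 1;

    // Periodically: capture offsets and state together, persist durably.
    Checkpoint takeCheckpoint(Source source, Operator operator, DurableStore store) {
        Checkpoint cp = new Checkpoint(nextId++,
                new HashMap<>(source.currentOffsets()),
                operator.snapshotState());
        store.write(cp.id(), cp);   // only complete checkpoints are usable for recovery
        return cp;
    }

    // On failure: restore state, rewind the source, and reprocess.
    void recover(Source source, Operator operator, DurableStore store) {
        Checkpoint latest = (Checkpoint) store.readLatest();
        operator.restoreState(latest.operatorState());
        source.seek(latest.sourceOffsets());  // replay from the checkpointed offsets
    }
}
```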
Parallel and distributed execution: a stream processing job is a DAG (directed acyclic graph) of operators; each operator parallelised across multiple tasks running on different machines; data partitioned (keyed) across parallel tasks; the framework handles distribution, communication, and load balancing
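A sketch of stable key-to-subtask routing via key groups (similar in spirit to Flink's key-group scheme, though the exact formulas here are illustrative):

```java
// Sketch of keyed data partitioning: a key is hashed into a fixed number
// of key groups (max parallelism), and key groups are assigned to the
// currently running parallel subtasks, which keeps routing stable when
// a job is rescaled.
public class KeyPartitioner {

    private final int maxParallelism;   // number of key groups, fixed per job
    private final int parallelism;      // number of running subtasks

    public KeyPartitioner(int maxParallelism, int parallelism) {
        this.maxParallelism = maxParallelism;
        this.parallelism = parallelism;
    }

    // key → key group (stable, independent of current parallelism)
    public int keyGroupFor(Object key) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // key group → subtask index (contiguous ranges of key groups per subtask)
    public int subtaskFor(Object key) {
        int keyGroup = keyGroupFor(key);
        return keyGroup * parallelism / maxParallelism;
    }
}
```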
Stream-table joins and enrichment: join a fast-moving event stream with a slowly-changing reference table (e.g., join click events with user profile data); support temporal joins (join with table state at event time); broadcast tables to all parallel tasks for efficient lookup
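A sketch of broadcast-state enrichment, with illustrative `UserProfile`/`Click` types: the reference table is replicated to every parallel task and updated from a changelog stream, so each event is enriched with a purely local lookup:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a broadcast stream-table join: every parallel task holds a
// local copy of the slowly-changing reference table and enriches the
// fast-moving event stream without remote calls.
public class BroadcastEnrichmentOperator {

    record UserProfile(String userId, String country, String tier) {}
    record Click(String userId, String url, long eventTime) {}
    record EnrichedClick(Click click, UserProfile profile) {}

    // Broadcast state: replicated on every parallel task.
    private final Map<String, UserProfile> profileTable = new ConcurrentHashMap<>();

    // Called for each update on the broadcast (table/changelog) side.
    public void onProfileUpdate(UserProfile profile) {
        profileTable.put(profile.userId(), profile);
    }

    // Called for each event on the keyed/high-throughput side.
    public EnrichedClick onClick(Click click) {
        UserProfile profile = profileTable.get(click.userId());  // local lookup, no RPC
        return new EnrichedClick(click, profile);                // profile may be null if unknown
    }
}
```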
Backpressure handling: if a downstream operator is slower than upstream → the system must slow down upstream rather than dropping events or running out of memory; backpressure propagated from sink to source; prevents data loss during load spikes
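A minimal sketch of backpressure via bounded buffers between operators; real engines use network buffer pools and credit-based flow control, but the blocking behaviour below is the essential idea:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of backpressure via bounded buffers: each operator pair is
// connected by a fixed-capacity queue, so a slow consumer makes the
// producer's put() block instead of dropping events or growing memory.
// The blocking propagates hop by hop back to the source, which then
// stops pulling from Kafka/Kinesis.
public class BackpressuredChannel<T> {

    private final BlockingQueue<T> buffer;

    public BackpressuredChannel(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Upstream side: blocks when the buffer is full (backpressure signal).
    public void send(T event) throws InterruptedException {
        buffer.put(event);
    }

    // Downstream side: blocks when the buffer is empty.
    public T receive() throws InterruptedException {
        return buffer.take();
    }
}
```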
Exactly-once sinks: output results to external systems with exactly-once guarantees — write to Kafka (transactional producer), write to databases (idempotent upserts), write to file systems (commit on checkpoint); two-phase commit protocol for transactional sinks; at-least-once + idempotent = exactly-once
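A sketch of a checkpoint-aligned two-phase-commit sink, with a hypothetical `TransactionalClient` standing in for, e.g., a transactional Kafka producer; writes become visible to consumers only after the checkpoint that covers them completes:

```java
// Sketch of a two-phase-commit sink aligned with checkpoints: writes go
// into an open transaction, the transaction is pre-committed (flushed)
// when a checkpoint barrier arrives, and committed only after the
// checkpoint is confirmed complete. On failure, the uncommitted
// transaction is aborted and the data is replayed from the checkpoint.
public class TwoPhaseCommitSink {

    interface TransactionalClient {
        void begin();
        void write(String record);
        void flush();     // pre-commit: make the data durable but not visible
        void commit();    // make the data visible to consumers
        void abort();     // discard the in-flight transaction
    }

    private final TransactionalClient client;

    public TwoPhaseCommitSink(TransactionalClient client) {
        this.client = client;
        client.begin();
    }

    public void invoke(String record) {
        client.write(record);                 // buffered in the open transaction
    }

    // Phase 1: on checkpoint barrier, pre-commit (flush) the open transaction.
    // A real sink would also open a fresh transaction for post-barrier records;
    // this sketch keeps one transaction at a time for brevity.
    public void preCommitOnCheckpoint() {
        client.flush();
    }

    // Phase 2: once the checkpoint is globally complete, commit and start
    // the next transaction.
    public void commitOnCheckpointComplete() {
        client.commit();
        client.begin();
    }

    // On failure/restart: abort anything not yet committed; the source
    // replays from the last checkpoint, so nothing is lost or duplicated.
    public void abortOnFailure() {
        client.abort();
    }
}
```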
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'End-to-end latency under one second from event ingestion to emitted result' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?