Architecture diagrams show what components exist and how they're connected. But they don't tell the complete story. They're like a city map showing buildings and roads—useful, but they don't show the actual traffic: where it originates, how it flows through intersections, and where it ultimately arrives.
Data flow diagrams (DFDs) answer the question: 'What happens to information as it moves through the system?' They trace the journey of data from user input through transformations, validations, enrichments, and ultimately to storage or response.
For system designers, understanding data flow is crucial because:
This page teaches you to model data flows effectively, communicate them clearly, and use them to identify design issues before implementation.
By the end of this page, you will understand the difference between architecture diagrams and data flow diagrams, master DFD notation and modeling techniques, learn to trace synchronous and asynchronous data paths, and recognize flow patterns that indicate design problems.
A data flow diagram visualizes how data moves through a system. While architecture diagrams focus on components and their relationships, DFDs focus on information—where it comes from, how it's transformed, and where it ends up.
Architecture Diagram: Shows components, services, databases, and their connections
Data Flow Diagram: Shows information movement and transformation
Both are essential. Architecture diagrams are the structural blueprint; data flow diagrams are the plumbing diagram showing how information flows through those structures.
| Aspect | Architecture Diagram | Data Flow Diagram |
|---|---|---|
| Focus | Components and connections | Information movement and transformation |
| Nodes represent | Services, databases, infrastructure | Processes, data stores, external entities |
| Edges represent | API calls, protocols, dependencies | Data in motion, information transfer |
| Answers | 'What exists and how is it connected?' | 'What happens to data as it moves?' |
| Best for | Deployment, scaling, technology choices | Processing logic, latency analysis, data integrity |
Data flow diagrams are particularly valuable for:
1. End-to-end request tracing: Showing how a user request transforms through the system
2. Data pipeline visualization: Modeling ETL, stream processing, and analytics flows
3. Latency analysis: Identifying which paths add the most processing time
4. Consistency analysis: Understanding where data might become stale or inconsistent
5. Error propagation: Tracing how failures in one area affect downstream processing
6. Compliance mapping: Showing how sensitive data flows for GDPR, HIPAA, PCI compliance
Use architecture diagrams to explain 'what we're building.' Use data flow diagrams to explain 'how it processes information.' In interviews, you'll typically draw architecture first, then trace specific data flows when the interviewer asks about particular scenarios.
Traditional DFD notation (Gane-Sarson or Yourdon-DeMarco) uses four fundamental elements. Modern system design often adapts these while preserving the core concepts.
External entities: Sources or destinations of data outside the system boundary
Representation: Rectangles or squares (sometimes with shadows)
Examples: Users, external APIs, partner systems, IoT devices
In modern systems: Mobile apps, web browsers, third-party services
Processes: Transformations that change, validate, or route data
Representation: Circles (Yourdon-DeMarco) or rounded rectangles (Gane-Sarson)
Examples: Validate input, calculate price, enrich with metadata, format response
In modern systems: Microservices, functions, workers, processing stages
Data stores: Repositories where data is persisted or cached
Representation: Open-ended rectangles (Gane-Sarson) or two parallel lines (Yourdon-DeMarco)
Examples: Databases, caches, file systems, message logs
In modern systems: PostgreSQL, Redis, S3, Kafka (when used for storage)
Data flows: Movement of data between other elements
Representation: Arrows, labeled with the data being transferred
Examples: Order data, User credentials, Payment confirmation, Event payload
Critical: Flows must always be labeled—unlabeled arrows are meaningless
| Element | Symbol | Naming Convention | Examples |
|---|---|---|---|
| External Entity | Rectangle | Noun (who/what) | Customer, Partner API, Mobile App |
| Process | Circle/Rounded Rect | Verb phrase (what it does) | Validate Order, Calculate Tax, Send Notification |
| Data Store | Open rectangle (two parallel lines) | Noun (what it stores) | Orders, User Profiles, Event Log |
| Data Flow | Arrow | Noun (what moves) | Order request, Validated order, Confirmation |
┌──────────┐                                             ┌──────────┐
│ Customer │                                             │  Email   │
│          │                                             │ Service  │
└────┬─────┘                                             └────▲─────┘
     │                                                        │
     │ Order request                                          │ Order confirmation
     │ (items, address, payment)                              │ (orderId, summary)
     ▼                                                        │
┌─────────────────┐   Validated order    ┌─────────────────┐  │
│  1.0 Validate   │─────────────────────►│   2.0 Process   │──┘
│      Order      │                      │     Payment     │
└────────┬────────┘                      └────────┬────────┘
         │                                        │
         │ Validation result                      │ Payment result
         │                                        │
         ▼                                        ▼
  ┌─────────────┐                          ┌─────────────┐
  ║ Validation  ║                          ║   Payment   ║
  ║     Log     ║                          ║   Records   ║
  └─────────────┘                          └─────────────┘

In practice, system designers often blend DFD notation with architecture diagram elements. The key is showing data transformation and movement clearly; the exact notation matters less than consistency and clarity.
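The four elements and the "always label your flows" rule can also be expressed as data. The following is a minimal sketch, not a standard tool: the names `Kind`, `Node`, `Flow`, and `check_flows` are hypothetical, and the "at least one end must be a process" check is a common DFD convention that the text above does not state, so treat it as an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    EXTERNAL_ENTITY = "external entity"
    PROCESS = "process"
    DATA_STORE = "data store"


@dataclass(frozen=True)
class Node:
    name: str
    kind: Kind


@dataclass(frozen=True)
class Flow:
    source: Node
    target: Node
    label: str  # noun phrase naming what moves, e.g. "Order request"


def check_flows(flows):
    """Return notation problems found in a list of flows (empty list = none)."""
    problems = []
    for flow in flows:
        arrow = f"{flow.source.name} -> {flow.target.name}"
        if not flow.label.strip():
            problems.append(f"unlabeled flow: {arrow}")
        # Assumed convention (not stated in the text): data only moves through
        # a process, so at least one end of every flow must be a process.
        if Kind.PROCESS not in (flow.source.kind, flow.target.kind):
            problems.append(f"no process on either end: {arrow}")
    return problems


customer = Node("Customer", Kind.EXTERNAL_ENTITY)
email = Node("Email Service", Kind.EXTERNAL_ENTITY)
validate = Node("1.0 Validate Order", Kind.PROCESS)
pay = Node("2.0 Process Payment", Kind.PROCESS)
validation_log = Node("Validation Log", Kind.DATA_STORE)
payment_records = Node("Payment Records", Kind.DATA_STORE)

# The order-processing example, one Flow per labeled arrow.
order_flows = [
    Flow(customer, validate, "Order request"),
    Flow(validate, pay, "Validated order"),
    Flow(validate, validation_log, "Validation result"),
    Flow(pay, payment_records, "Payment result"),
    Flow(pay, email, "Order confirmation"),
]

print(check_flows(order_flows))  # an empty list means no problems were found
```

Keeping the diagram as data like this makes it possible to lint it in review, though a whiteboard sketch is usually enough for interviews.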
Like architecture diagrams with C4 levels, data flow diagrams can be decomposed into levels of increasing detail.
Level 0 (the context diagram) is the highest level: it shows the entire system as a single process, with all external entities and the data flows between them.
Purpose: Establish system scope and external interfaces
Content:
Questions answered:
Level 1 decomposes the Level 0 system into major processes, showing how data flows between them.
Purpose: Show major processing stages and internal data stores
Content:
Guideline: 5-9 processes typically (cognitive limit for comprehension)
At Level 2 and below, each Level 1 process can be further decomposed to show detailed processing steps.
Purpose: Detailed understanding of specific processes
Content:
When to go deeper: When a process is too complex to understand as a single box
| Level | Focus | Audience | Typical Element Count |
|---|---|---|---|
| 0 (Context) | System boundary | Everyone | 1 process, 3-5 external entities |
| 1 (System) | Major functions | Architects, leads | 5-9 processes, 2-4 data stores |
| 2+ (Detail) | Specific processes | Implementers | 3-7 sub-processes per parent |
When decomposing a process, all data flows into and out of the parent process must appear in the child diagram. This 'balancing' ensures that decomposition is consistent—no data appears or disappears when zooming in.
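The balancing rule is mechanical enough to check in code. The sketch below assumes flows are written as `(source, target, label)` tuples and that labels are spelled identically at both levels; `boundary_labels` and `balancing_errors` are illustrative names, not part of any DFD tool.

```python
def boundary_labels(flows, inside):
    """Return (inputs, outputs): labels of flows crossing the boundary of `inside`."""
    inputs = {label for src, dst, label in flows
              if src not in inside and dst in inside}
    outputs = {label for src, dst, label in flows
               if src in inside and dst not in inside}
    return inputs, outputs


def balancing_errors(parent_flows, parent, child_flows, child_nodes):
    """Compare the flows around one parent process with its child diagram."""
    parent_in, parent_out = boundary_labels(parent_flows, {parent})
    child_in, child_out = boundary_labels(child_flows, set(child_nodes))
    errors = []
    errors += [f"input missing in child: {x}" for x in sorted(parent_in - child_in)]
    errors += [f"input not in parent: {x}" for x in sorted(child_in - parent_in)]
    errors += [f"output missing in child: {x}" for x in sorted(parent_out - child_out)]
    errors += [f"output not in parent: {x}" for x in sorted(child_out - parent_out)]
    return errors


# Level 0: the whole system is one process.
level0 = [
    ("Customer", "Order System", "Order request"),
    ("Order System", "Customer", "Order confirmation"),
]
# Level 1: the same system decomposed; "Orders" is an internal data store.
level1 = [
    ("Customer", "Validate Order", "Order request"),
    ("Validate Order", "Process Payment", "Validated order"),
    ("Process Payment", "Orders", "Paid order"),
    ("Process Payment", "Customer", "Order confirmation"),
]
inside = {"Validate Order", "Process Payment", "Orders"}

print(balancing_errors(level0, "Order System", level1, inside))
```

Internal data stores are listed in `inside` so that flows to them count as internal rather than as outputs of the parent.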
Synchronous flows represent request-response patterns where the caller waits for completion. These are the most common patterns in user-facing operations.
Show both request and response paths explicitly:
    ┌────────────┐
    │  Customer  │
    └─────┬──────┘
          │ ① Order Request
          │ (items, address)
          ▼
┌─────────────────┐   ② Inventory check    ┌─────────────────┐
│    Order API    │───────────────────────►│  Inventory Svc  │
│                 │◄───────────────────────│                 │
└─────────┬───────┘   ③ Available items    └────────┬────────┘
          │                                         │ read
          │ ④ Validated order                       ▼
          │ (priced items)                 ┌─────────────────┐
          ▼                                ║  Inventory DB   ║
┌─────────────────┐                        └─────────────────┘
│   Payment Svc   │
└─────────┬───────┘
          │ ⑤ Payment request
          │ (amount, card token)
          ▼
┌─────────────────┐    ⑥ Charge    ┌─────────────────┐
│     Payment     │───────────────►│   Stripe API    │
│    Processor    │◄───────────────│   (external)    │
└─────────┬───────┘    ⑦ Success   └─────────────────┘
          │
          │ ⑧ Payment confirmed
          ▼
┌─────────────────┐
│    Order API    │──────┐ ⑨ Create order
└─────────────────┘      │
                         ▼
                ┌─────────────────┐
                ║    Orders DB    ║
                └─────────────────┘
          │ ⑩ Order confirmation
          │ (orderId, ETA)
          ▼
   ┌─────────────┐
   │  Customer   │
   └─────────────┘

Latency analysis: With numbers on each step, calculate total latency:
Failure points: Each synchronous call is a potential failure. In this flow:
Opportunity for optimization:
Long synchronous chains are fragile. If you're showing more than 4-5 synchronous hops in a user-facing flow, consider whether some steps could be asynchronous. Each hop adds its latency to the total and compounds the probability of timeout or failure, because the request succeeds only if every hop succeeds.
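To make the latency and failure reasoning concrete, here is a small back-of-the-envelope sketch. Every number in it is invented for illustration (the page gives no measurements), and it assumes hops run strictly in sequence and fail independently.

```python
import math

# (step, latency in ms, availability): all values are made up for illustration.
hops = [
    ("Inventory check", 40, 0.999),
    ("Payment request", 20, 0.999),
    ("Stripe charge", 300, 0.999),
    ("Create order", 15, 0.999),
    ("Order confirmation", 10, 0.999),
]


def total_latency_ms(hops):
    """Sequential hops: the caller waits for the sum of all latencies."""
    return sum(latency for _, latency, _ in hops)


def chain_availability(hops):
    """The request succeeds only if every hop succeeds (independence assumed)."""
    return math.prod(availability for _, _, availability in hops)


print(total_latency_ms(hops))              # 385
print(round(chain_availability(hops), 4))  # 0.995
```

Five hops at "three nines" each already leave the chain at roughly 99.5%, which is the quantitative reason to move non-critical steps off the synchronous path.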
Asynchronous flows decouple producers from consumers, allowing independent processing and improved resilience.
Asynchronous flows should clearly show the decoupling:
SYNCHRONOUS PORTION (customer-facing):
═══════════════════════════════════════

┌────────────┐  Order Request   ┌─────────────────┐
│  Customer  │─────────────────►│    Order API    │
└────────────┘                  └────────┬────────┘
                                         │ Create order (PENDING)
       ┌─────────────────────────────────┤
       │                                 │
       ▼                                 │
┌─────────────┐                          │
║  Orders DB  ║                          │
└─────────────┘                          │
                                         │ Ack (orderId)
                                         ▼
                                  ┌────────────┐
                                  │  Customer  │ ← Response: "Order received, processing..."
                                  └────────────┘

ASYNCHRONOUS PORTION (background):
═══════════════════════════════════════

┌─────────────────┐                   ┌─────────────────┐
│    Order API    │                   │   Fulfillment   │
└────────┬────────┘                   │     Worker      │
         │                            └────────▲────────┘
         │ OrderCreated event                  │ consume
         │ (orderId, items)                    │
         ▼                                     │
┌═══════════════════════════════════════════════════════┐
║                  ORDER EVENTS TOPIC                   ║
║   ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                     ║
║   │msg 1│ │msg 2│ │msg 3│ │msg 4│ ...                 ║
║   └─────┘ └─────┘ └─────┘ └─────┘                     ║
└═══════════════════════════════════════════════════════┘
                                  │
                                  │ consume
                                  ▼
         ┌────────────────────────┬────────────────────────┐
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  Inventory Svc  │      │   Payment Svc   │      │  Notification   │
│    (reserve)    │      │    (charge)     │      │     Service     │
└─────────────────┘      └─────────────────┘      └─────────────────┘

Fan-out: One event triggers multiple consumers
Saga/Choreography: Sequence of events forming a workflow
CQRS (Command Query Responsibility Segregation):
Event Sourcing:
Don't draw a dashed arrow directly between producer and consumer. Always show the intermediate queue or topic—it's where messages live during transit and it's a critical component for reliability, ordering, and replay.
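The role of the intermediate topic is easier to see in a toy model. The sketch below is an in-memory stand-in, not a real broker client: the `Topic` class and its methods are hypothetical, and it only illustrates that the producer appends to a log while each consumer reads at its own offset (which is what makes fan-out and replay possible).

```python
class Topic:
    """In-memory stand-in for a message topic (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.log = []        # retained messages, so consumers can replay
        self.offsets = {}    # consumer name -> index of next message to read
        self.handlers = {}   # consumer name -> callback

    def subscribe(self, consumer, handler):
        self.handlers[consumer] = handler
        self.offsets[consumer] = 0  # new consumers start from the beginning

    def publish(self, event):
        # The producer only appends; it never calls a consumer directly.
        self.log.append(event)

    def deliver(self):
        # Each consumer advances its own offset independently (fan-out).
        for consumer, handler in self.handlers.items():
            while self.offsets[consumer] < len(self.log):
                handler(self.log[self.offsets[consumer]])
                self.offsets[consumer] += 1


received = {"inventory": [], "payment": [], "notification": []}
topic = Topic("order-events")
for name in received:
    topic.subscribe(name, received[name].append)

topic.publish({"type": "OrderCreated", "orderId": "o-1", "items": ["sku-1"]})
topic.deliver()
print({name: len(events) for name, events in received.items()})
```

A consumer that subscribes later still sees the earlier event on the next `deliver()`, which is the replay property a direct producer-to-consumer arrow would hide.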
Real systems combine synchronous and asynchronous patterns. Understanding common hybrid patterns helps you design and diagram effectively.
Use case: Return acknowledgment quickly, process in background
Flow:
Examples: File upload processing, report generation, bulk operations
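A minimal sketch of this acknowledge-then-process pattern, using only the Python standard library. The `submit` and `worker` functions and the `PENDING`/`DONE` statuses are illustrative assumptions; a production system would typically use a durable queue rather than an in-process one.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
status = {}  # job id -> "PENDING" or "DONE"


def submit(payload):
    """Synchronous part: record the job, enqueue it, and acknowledge at once."""
    job_id = uuid.uuid4().hex
    status[job_id] = "PENDING"
    jobs.put((job_id, payload))
    return job_id  # the caller gets this without waiting for processing


def worker():
    """Asynchronous part: drain the queue in the background."""
    while True:
        job_id, payload = jobs.get()
        # Stand-in for real work (parsing a file, rendering a report, ...).
        status[job_id] = "DONE"
        jobs.task_done()


job_id = submit("report.csv")
print(status[job_id])  # PENDING: no worker has started yet

threading.Thread(target=worker, daemon=True).start()
jobs.join()            # wait here only so the example can show the final state
print(status[job_id])  # DONE
```

In a diagram, `submit` belongs to the synchronous path and `worker` to the asynchronous one, with the queue drawn explicitly between them.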
Use case: Aggregate multiple backend calls for one client request
Flow:
Diagram: Show fan-out from BFF to services, then aggregation
Use case: Prefer events but need sync for consistency
Flow:
Diagram: Show both paths—sync line and async event flow
Use case: Distributed transaction alternative
Flow:
Diagram: Show happy path forward, compensation path backward (often in different color/style)
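The forward and compensation paths can be sketched as a list of steps, each paired with the action that undoes it. This is a simplified, single-process illustration with hypothetical names (`run_saga` and the step functions); real sagas coordinate across services and must also handle failed compensations, which this sketch ignores.

```python
def run_saga(steps, log):
    """Run (name, action, compensation) steps; undo completed ones on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Compensate in reverse order of completion.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True


def ok():
    pass


def ship_fails():
    raise RuntimeError("carrier unavailable")


log = []
steps = [
    ("reserve_inventory", ok, ok),     # compensation: release inventory
    ("charge_payment", ok, ok),        # compensation: refund payment
    ("create_shipment", ship_fails, ok),
]
succeeded = run_saga(steps, log)
print(succeeded)  # False
print(log)
```

With the failure at step 3, the log shows the refund running before the inventory release, matching the backward direction of the compensation path.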
HAPPY PATH (solid arrows):
──────────────────────────

  ┌─────────┐     ┌───────────────┐     ┌───────────────┐     ┌────────────┐
  │ Client  │────►│ 1. Reserve    │────►│ 2. Charge     │────►│ 3. Ship    │
  └─────────┘     │    Inventory  │     │    Payment    │     └────────────┘
                  └───────────────┘     └───────────────┘

COMPENSATION PATH (dashed arrows):
──────────────────────────────────

                  ┌───────────────┐     ┌───────────────┐
                  │ 1c. Release   │◄────│ 2c. Refund    │
                  │     Inventory │     │     Payment   │
                  └───────────────┘     └───────────────┘
                          ▲                     ▲
                          │                     │
                  failure at step 2     failure at step 3

COMBINED VIEW:
─────────────

         ┌─────────────┐
         │   Client    │
         └──────┬──────┘
                │ Start order
                ▼
       ┌─────────────────┐
       │ 1. Reserve      │ ←──────────────────┐
       │    Inventory    │                    │
       └────────┬────────┘                    │
        success │                             │
                ▼                  ┌─────────────────────┐
       ┌─────────────────┐         │ 1c. Release         │
       │ 2. Charge       │ ←───────│     Inventory       │
       │    Payment      │  fail   └─────────────────────┘
       └────────┬────────┘
        success │
                ▼                  ┌─────────────────────┐
       ┌─────────────────┐         │ 2c. Refund Payment  │
       │ 3. Create       │ ←───────│ + Release Inventory │
       │    Shipment     │  fail   └─────────────────────┘
       └────────┬────────┘
        success │
                ▼
       ┌─────────────────┐
       │ Order Complete  │
       └─────────────────┘

A key insight from data flow analysis is understanding where and how data transforms. Each transformation is a potential source of bugs, latency, and complexity.
Validation: Checking data correctness
Enrichment: Adding information
Format conversion: Changing representation
Aggregation: Combining multiple inputs
Filtering: Selecting subsets
Normalization/Denormalization:
| Transformation | Latency Impact | Failure Risk | Consistency Risk |
|---|---|---|---|
| Validation | Low | Medium (rejection) | Low |
| Enrichment (local) | Low | Low | Medium (stale refs) |
| Enrichment (external call) | High | High | High |
| Format conversion | Low | Medium (schema issues) | Low |
| Aggregation (real-time) | Medium-High | Medium | Medium |
| Aggregation (batch) | Low (async) | Low | High (lag) |
Every time your flow shows 'enrich by calling another service,' you're adding latency and a failure point. Consider: Can this data be cached? Can it be published via events instead of fetched? Can it be embedded at write time rather than joined at read time?
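As one illustration of the "can this data be cached?" question, the sketch below puts a local cache in front of an external lookup and counts how many remote calls actually happen. The `CachedEnricher` class and `fake_customer_service` function are hypothetical stand-ins, and the cache deliberately has no expiry so the stale-reference risk stays visible.

```python
class CachedEnricher:
    """Enrich orders with customer data, calling the remote source once per id."""

    def __init__(self, fetch_customer):
        self._fetch = fetch_customer  # the slow, failure-prone external call
        self._cache = {}
        self.remote_calls = 0

    def enrich(self, order):
        customer_id = order["customer_id"]
        if customer_id not in self._cache:
            self.remote_calls += 1
            self._cache[customer_id] = self._fetch(customer_id)
        # No expiry here: cached entries can go stale (the consistency risk).
        return {**order, "customer": self._cache[customer_id]}


def fake_customer_service(customer_id):
    # Stand-in for a network call to another service.
    return {"id": customer_id, "tier": "gold"}


enricher = CachedEnricher(fake_customer_service)
first = enricher.enrich({"order_id": 1, "customer_id": "c-7"})
second = enricher.enrich({"order_id": 2, "customer_id": "c-7"})
print(enricher.remote_calls)  # 1: the second order was enriched from the cache
```

This trades the latency and failure risk of "enrichment (external call)" for the staleness risk of "enrichment (local)", which is the same trade the table above describes.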
Certain data flow patterns indicate design problems. Recognizing these in your diagrams helps catch issues before implementation.
One service consumes data from many sources and becomes a bottleneck:
[A] ─┐
[B] ─┼─► [Central Service] ─► [Output]
[C] ─┤
[D] ─┘
Problems: Single point of failure, scaling bottleneck, unrelated changes affect all flows
Solution: Decompose by bounded context, let consumers pull what they need
Data flows through a service that adds no value—just passes it along:
[A] ─► [Proxy/Router] ─► [B]
Problems: Added latency with no benefit, unnecessary coupling, operational overhead
Solution: Direct communication where appropriate, or ensure the intermediate service adds genuine value (auth, transformation, rate limiting)
Data flows in cycles through services:
[A] ─► [B] ─► [C]
 ▲             │
 └─────────────┘
Problems: Infinite loops possible, unclear source of truth, debugging nightmare
Solution: Identify the authoritative source, break cycle with events or clear hierarchy
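If flows are recorded as a graph, cycles like the one above can be found automatically. The following is a sketch using a standard depth-first search; the adjacency-dict representation and the `find_cycle` name are assumptions made for this example.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes (first == last), or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color = {node: WHITE for node in graph}
    for targets in graph.values():
        for target in targets:
            color.setdefault(target, WHITE)
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                # Back edge: nxt is already on the current path.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(color):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


flows = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(find_cycle(flows))  # ['A', 'B', 'C', 'A']
```

The recursion is fine for diagram-sized graphs; a very large flow graph would need an iterative traversal instead.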
One request triggers synchronous calls to many downstream services:
           ┌─► [Svc1]
           ├─► [Svc2]
[Request] ─┼─► [Svc3] ← waiting for all
           ├─► [Svc4]
           └─► [Svc5]
Problems: Latency is at least that of the slowest service (or the sum of all services if the calls run sequentially), any failure fails all, all-or-nothing semantics
Solution: Parallelize where possible, timeout aggressively, consider async for non-critical paths
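The "parallelize and timeout aggressively" advice can be sketched with `asyncio`. Here `call_service` is a hypothetical stand-in that simply sleeps, and the delays and timeout are invented values; the point is that a slow dependency yields `None` instead of failing the whole request.

```python
import asyncio


async def call_service(name, delay_s):
    # Stand-in for a network call; the delay simulates service latency.
    await asyncio.sleep(delay_s)
    return f"{name}:ok"


async def fan_out(calls, timeout_s):
    """Call every service in parallel; a service that times out yields None."""

    async def guarded(name, delay_s):
        try:
            return await asyncio.wait_for(call_service(name, delay_s), timeout_s)
        except asyncio.TimeoutError:
            return None  # degrade gracefully instead of failing everything

    results = await asyncio.gather(*(guarded(n, d) for n, d in calls.items()))
    return dict(zip(calls, results))


calls = {"Svc1": 0.01, "Svc2": 0.02, "Svc3": 0.5}
responses = asyncio.run(fan_out(calls, timeout_s=0.1))
print(responses)
```

Whether `None` is acceptable depends on the path: it suits optional data such as recommendations, while a critical dependency should still fail the request.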
Let's model the complete data flow for an e-commerce checkout, combining synchronous and asynchronous patterns.
Scenario: Customer clicks 'Place Order' with cart items and payment info.
External entities:
┌──────────────────────────────────────────────────────────────────────────────┐
│                              SYNCHRONOUS PHASE                               │
│                              (Customer Waiting)                              │
│                                                                              │
│ ┌──────────┐ Cart + Payment  ┌───────────────┐                               │
│ │ Customer │────────────────►│ Checkout API  │                               │
│ └──────────┘                 └───────┬───────┘                               │
│                                      │                                       │
│             ┌────────────────────────┼────────────────────────┐              │
│             │                        │                        │              │
│             ▼                        ▼                        ▼              │
│   ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐    │
│   │   Validate Cart   │    │ Validate Address  │    │ Apply Promotions  │    │
│   │  (items, prices)  │    │  (delivery zone)  │    │ (discounts, tax)  │    │
│   └─────────┬─────────┘    └─────────┬─────────┘    └─────────┬─────────┘    │
│             │                        │                        │              │
│             └────────────────────────┴────────────────────────┘              │
│                                      │                                       │
│                                      ▼                                       │
│                            ┌───────────────────┐                             │
│                            │   Create Order    │──────┐                      │
│                            │ (status: PENDING) │      │ persist              │
│                            └─────────┬─────────┘      ▼                      │
│                                      │         ┌─────────────┐               │
│                                      │         ║  Orders DB  ║               │
│                              charge  │         └─────────────┘               │
│                              request ▼                                       │
│                            ┌───────────────────┐                             │
│                            │  Payment Service  │                             │
│                            └─────────┬─────────┘                             │
│                                      │                                       │
│                                      ▼                                       │
│                            ┌───────────────────┐                             │
│                            │    Stripe API     │ ← external                  │
│                            └─────────┬─────────┘                             │
│                              success │                                       │
│                                      ▼                                       │
│                            ┌───────────────────┐                             │
│                            │   Update Order    │                             │
│                            │  (status: PAID)   │                             │
│                            └─────────┬─────────┘                             │
│                                      │                                       │
│                                      ▼                                       │
│                            ┌───────────────────┐                             │
│                            │  Return confirm   │──────────►┌──────────┐      │
│                            │  (orderId, ETA)   │           │ Customer │      │
│                            └─────────┬─────────┘           └──────────┘      │
│                                      │                                       │
└──────────────────────────────────────┼───────────────────────────────────────┘
                                       │ publish event
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                              ASYNCHRONOUS PHASE                              │
│                               (Post-Checkout)                                │
│                                                                              │
│   ╔══════════════════════════════════════════════════════════════════════╗   │
│   ║                           ORDER_PAID TOPIC                           ║   │
│   ║           { orderId, items, customer, address, paymentId }           ║   │
│   ╚═══════════════════════════════════════════════╤══════════════════════╝   │
│                                                   │                          │
│           ┌───────────────────┬───────────────────┼────────────────┐         │
│           │                   │                   │                │         │
│           ▼                   ▼                   ▼                ▼         │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌────────────┐  │
│  │  Inventory Svc  │ │ Fulfillment Svc │ │  Email Service  │ │ Analytics  │  │
│  │   (decrement)   │ │  (create pick)  │ │ (confirmation)  │ │  (event)   │  │
│  └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ └────────────┘  │
│           │                   │                   │                          │
│           ▼                   ▼                   ▼                          │
│   ╔══════════════╗    ╔══════════════╗    ┌───────────────┐                  │
│   ║ Inventory DB ║    ║ Warehouse DB ║    │  Mailgun API  │                  │
│   ╚══════════════╝    ╚══════════════╝    └───────────────┘                  │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Synchronous phase (customer waiting, ~400-800ms):
Asynchronous phase (background, seconds to minutes):
Transformation points:
Data flow diagrams reveal how information moves and transforms through your system. Let's consolidate the key principles:
What's next:
With components identified, architecture diagrammed, and data flows traced, we turn to the interfaces between components: API design. The next page covers how to design APIs that are intuitive, consistent, and evolvable.
You now understand how to model and analyze data flows through distributed systems. This skill enables you to identify bottlenecks, failure points, and consistency risks before implementation—essential for designing reliable systems at scale.