
Requirements Gathering

Level: Beginner

Duration: 60 mins

Topic: Requirements Gathering


Non-Functional Requirements — How It Performs

The Hidden Half of Requirements

Consider two messaging applications. Both let you send text messages. Both support image sharing. Both offer group chats. Functionally, they're identical. Yet one has 500 million users while the other struggles with a few thousand. The difference? Non-functional requirements.

One delivers messages in 100 milliseconds; the other takes 3 seconds. One stays online during traffic spikes; the other crashes whenever a celebrity tweets. One protects your data with end-to-end encryption; the other has suffered three data breaches. One works seamlessly on a 2G connection in rural India; the other demands fiber-optic internet.

Non-functional requirements (NFRs) define the quality attributes that determine whether a system is merely functional or truly excellent. They answer not 'what does the system do?' but 'how well does it do it?' — and this 'how well' often decides whether a product succeeds or fails in the market.

What You Will Learn

By the end of this page, you will understand the entire landscape of non-functional requirements, how to identify and specify them, their critical importance in system design, quantification techniques, trade-offs between competing NFRs, and how to leverage them in system design interviews.

Defining Non-Functional Requirements

Non-functional requirements (NFRs), also called quality attributes, system qualities, or -ilities (because many end in 'ility' — scalability, reliability, usability), describe the properties and constraints that govern HOW the system performs its functions.

While functional requirements define the features and behaviors (the verbs of your system), non-functional requirements define the adjectives: how fast, how secure, how available, how scalable, how usable, how maintainable.

Formal Definition:

A non-functional requirement is a specification of a quality criterion that the system must satisfy, typically expressed as a measurable attribute with an acceptable range or threshold. NFRs constrain the design space by eliminating solutions that would otherwise satisfy functional requirements but fail to meet quality standards.

Functional Requirement Examples

•Users can log in with email and password
•The system processes payment transactions
•Users can search for products by name
•The system sends email notifications
•Users can upload profile pictures

Non-Functional Requirement Examples

•Login completes within 500ms for 99% of requests
•Payment processing has 99.99% availability
•Search returns results within 200ms at the 95th percentile
•Notifications sent within 30 seconds of trigger
•Image uploads complete within 5 seconds for 10MB files

Notice the critical difference: functional requirements state what happens; non-functional requirements quantify how well it happens. This quantification is essential—vague NFRs like 'the system should be fast' are nearly useless. 'Response time < 200ms at p99 under 10,000 concurrent users' is actionable.

The Iceberg Metaphor:

Functional requirements are the visible tip of the system iceberg—the features users consciously interact with. Non-functional requirements are the massive underwater portion—invisible when working correctly, catastrophically apparent when they fail. Users don't praise latency of 50ms; they just expect it. But a latency spike to 5 seconds triggers complaints, churn, and viral Twitter threads.

The Invisibility Problem

NFRs are often overlooked because they're invisible in requirements discussions that focus on features. Product managers ask 'What should users be able to do?' but rarely 'How fast should each action be?' or 'What happens when a datacenter fails?' This invisibility makes NFRs the most common source of post-launch crises.

The Complete Non-Functional Requirements Taxonomy

NFRs span a wide landscape of quality attributes. Understanding this taxonomy ensures you don't overlook critical requirements. Let's explore each category in depth:

Performance measures how quickly and efficiently the system responds to user actions and processes workloads.

Key Metrics:

Latency/Response Time — Time between request and response. Measured at various percentiles (p50, p90, p95, p99).
Throughput — Number of operations per unit time (requests/second, transactions/minute).
Resource Utilization — CPU, memory, network, disk usage under load.

Examples of Well-Specified Performance Requirements:

API endpoints respond in < 100ms at p50, < 500ms at p99 under normal load
System handles 10,000 concurrent users with < 5% degradation
Batch processing completes 1 million records in < 1 hour
Page load time < 3 seconds on 3G mobile connections

Why It Matters:

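Widely cited industry reports tie small latency increases to measurable losses in conversion:
Amazon reported that every additional 100ms of latency cost about 1% in sales
Google reported that an extra 500ms of delay on search results pages dropped traffic by about 20%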

Additional NFR Categories:

Extended Non-Functional Requirements
Category	Description	Example Requirement
Usability	Ease of use, learnability, accessibility	New users complete core task within 5 minutes without training
Portability	Ability to run on different platforms/environments	Application runs on Chrome, Firefox, Safari, Edge without modification
Interoperability	Ability to exchange data with other systems	REST APIs follow OpenAPI 3.0 specification
Compliance	Adherence to laws, regulations, standards	System compliant with HIPAA, GDPR, SOC 2 Type II
Observability	Ability to understand system state from outputs	All services emit structured logs, metrics, and distributed traces
Cost Efficiency	Resource utilization relative to cost	Infrastructure cost < $0.01 per 1000 transactions
Extensibility	Ability to add new features with minimal changes	Plugin architecture for custom integrations
Recoverability	Ability to restore after failure	Full system restoration from backup within 4 hours

Quantifying Non-Functional Requirements

The cardinal sin of NFR specification is vagueness. 'The system should be fast' is not a requirement—it's a wish. Quantification transforms wishes into verifiable specifications.

The SMART Framework for NFRs:

Apply SMART criteria to every non-functional requirement:

Specific — Clearly defined, no ambiguity
Measurable — Quantified with metrics and thresholds
Achievable — Technically feasible given constraints
Relevant — Aligned with business objectives
Time-bound — Applies to a specific context or condition
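One way to make the SMART criteria concrete is to store each requirement as a structured record rather than a free-text sentence. The sketch below is a minimal Python example; the `LatencyNFR` class and its field names are hypothetical, not a standard schema. Achievability has no field because it is a judgment made during review, not something you can store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyNFR:
    """Hypothetical record for one latency requirement, one field per SMART element."""
    name: str            # Specific: which operation is covered
    percentile: float    # Measurable: which percentile is checked
    threshold_ms: float  # Measurable: the pass/fail limit in milliseconds
    load_condition: str  # Time-bound: the operating condition it applies under
    window: str          # Time-bound: the measurement window
    rationale: str       # Relevant: the business reason it exists

    def describe(self) -> str:
        """Render the requirement as a single verifiable sentence."""
        return (f"{self.name}: p{self.percentile:g} < {self.threshold_ms:g}ms "
                f"{self.load_condition}, measured over {self.window}")

    def is_met(self, measured_ms: float) -> bool:
        """True if the measured value at this percentile is under the threshold."""
        return measured_ms < self.threshold_ms


# 'The system should be fast' becomes:
search = LatencyNFR(
    name="Product search",
    percentile=99,
    threshold_ms=200,
    load_condition="under 10,000 concurrent users",
    window="a 30-day rolling window",
    rationale="Search latency affects conversion",
)
print(search.describe())
# Product search: p99 < 200ms under 10,000 concurrent users, measured over a 30-day rolling window
print(search.is_met(180.0), search.is_met(250.0))  # True False
```

Because every element is an explicit field, a missing load condition or measurement window is visible at a glance, and the same record can double as an automated acceptance check.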


Percentiles vs Averages:

A critical distinction in performance requirements is the difference between averages and percentiles:

Average (mean) latency — Hides outliers. If 99% of requests take 50ms but 1% take 10 seconds, average might look acceptable at 150ms.
Percentile latency (p50, p90, p95, p99) — The latency that N% of requests meet or beat.
- p50 (median): half of requests complete this fast or faster
- p99: 99% of requests complete this fast or faster; 1% take longer
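To see the difference concretely, here is a small Python sketch using the nearest-rank percentile definition. This is one of several conventions; monitoring tools often interpolate, so their exact values can differ slightly. The sample data is the hypothetical 99%/1% split described above. Note that with exactly 1% of requests slow, nearest-rank p99 still lands on 50ms and the 10-second tail only surfaces at p99.9, which is one reason teams track several percentiles rather than one.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 99% of requests take 50ms, 1% take 10 seconds.
latencies_ms = [50.0] * 990 + [10_000.0] * 10

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean  = {mean:.1f}ms")                            # 149.5ms
print(f"p50   = {percentile(latencies_ms, 50):.0f}ms")    # 50ms
print(f"p99   = {percentile(latencies_ms, 99):.0f}ms")    # 50ms
print(f"p99.9 = {percentile(latencies_ms, 99.9):.0f}ms")  # 10000ms
```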

Why p99 matters:

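Power users generate disproportionate traffic, so they run into the slowest 1% of responses far more often than casual users; if your p99 is bad, your most valuable users suffer most. Tail latency also compounds in microservices architectures. If a request calls 10 services in sequence, each with a 100ms p99, the end-to-end latency can approach 1 second when several slow responses coincide. And if each service independently has a 1% chance of a slow response, nearly one request in ten hits at least one service's tail.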
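The compounding effect follows from basic probability. The Python sketch below assumes each service's slow responses are independent, which is a simplification since real failures are often correlated, and computes how often a request lands in at least one service's slowest 1%:

```python
def prob_hits_tail(num_services: int, tail_fraction: float = 0.01) -> float:
    """Chance that a request touching num_services services lands in at least one
    service's slowest tail_fraction of responses, assuming independent services."""
    return 1 - (1 - tail_fraction) ** num_services


for n in (1, 10, 100):
    print(f"{n:>3} services: {prob_hits_tail(n):.1%} of requests hit a p99 tail")
# with n = 1, 10, 100: 1.0%, 9.6%, 63.4%
```

In other words, what is a rare p99 event for one service becomes roughly a p90 experience for a request that fans out to ten.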

SLI, SLO, SLA — The Reliability Vocabulary

SLI (Service Level Indicator): A metric that measures some aspect of service (e.g., latency, error rate).
SLO (Service Level Objective): A target value for an SLI (e.g., latency < 200ms for 99% of requests).
SLA (Service Level Agreement): A contract with consequences if SLOs are violated (e.g., refunds, credits).

NFRs typically become SLOs, which may be formalized into SLAs.

slo-specification.md
# Service Level Objectives: Payment Service
 
## SLO 1: Availability
- SLI: Percentage of successful HTTP responses (non-5xx)
- Target: 99.95% over 30-day rolling window
- Measurement: Synthetic monitoring every 60 seconds from 5 global locations
 
## SLO 2: Latency
- SLI: Time from request received to response sent
- Targets:
  - p50 < 100ms
  - p95 < 250ms
  - p99 < 500ms
- Measurement: Application metrics, aggregated per minute
 
## SLO 3: Error Rate
- SLI: Percentage of requests resulting in user-facing errors
- Target: < 0.1% over 24-hour rolling window
- Measurement: Application error logs, excluding client errors (4xx)
 
## SLO 4: Throughput
- SLI: Successful transactions per second
- Target: Sustain 10,000 TPS during peak hours (9 AM - 6 PM local time)
- Measurement: Transaction counter with 1-minute resolution
 
## Error Budget Policy
- When error budget < 25% remaining, initiate reliability sprint
- When error budget exhausted, freeze feature deployments until recovered
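The error budget in this policy is simply the unavailability the SLO permits. The Python sketch below converts an availability target into allowed downtime minutes and applies the two policy thresholds. It treats the budget as minutes of full outage, which is a simplification: SLO 1 is defined as a ratio of successful responses, so a production implementation would count failed requests instead. The function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage an availability SLO permits over the window."""
    return (1 - slo) * window_days * MINUTES_PER_DAY


def budget_action(slo: float, bad_minutes: float, window_days: int = 30) -> str:
    """Apply the spec's Error Budget Policy to the downtime consumed so far."""
    remaining = 1 - bad_minutes / error_budget_minutes(slo, window_days)
    if remaining <= 0:
        return "freeze feature deployments"
    if remaining < 0.25:
        return "initiate reliability sprint"
    return "normal operations"


for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} availability -> {error_budget_minutes(slo):.1f} min per 30 days")

print(budget_action(0.9995, bad_minutes=10))  # normal operations
print(budget_action(0.9995, bad_minutes=18))  # initiate reliability sprint
print(budget_action(0.9995, bad_minutes=25))  # freeze feature deployments
```

For the payment service's 99.95% target, the budget works out to about 21.6 minutes per 30 days; 99.9% allows about 43.2 minutes and 99.99% only about 4.3.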

NFR Trade-offs — The Engineering Art

Here's the uncomfortable truth: you cannot optimize for all NFRs simultaneously. Improving one quality attribute often degrades another. System design is fundamentally about navigating these trade-offs intelligently.

Common NFR Trade-off Pairs:

NFR Trade-off Matrix
Optimizing For	Often Conflicts With	Example
Performance	Cost Efficiency	Faster hardware and more replicas cost more
Consistency	Availability	CAP theorem — partitions force a choice
Security	Usability	Multi-factor auth and frequent re-authentication increase friction
Scalability	Consistency	Distributed writes are harder to keep consistent
Maintainability	Performance	Clean abstractions may add overhead
Availability	Cost	Redundancy for high availability is expensive
Latency	Throughput	Batching improves throughput but increases latency
Flexibility	Simplicity	Highly configurable systems are harder to understand
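The latency/throughput row can be made concrete with a simple cost model, assumed here purely for illustration: every batch pays a fixed overhead (for example, a network round trip or a disk flush) plus a per-item cost. Larger batches amortize the overhead, raising throughput, but each batch takes longer to complete. The function and its default costs are hypothetical.

```python
def batch_metrics(batch_size: int, overhead_ms: float = 10.0,
                  per_item_ms: float = 1.0) -> tuple[float, float]:
    """Hypothetical model: each batch costs a fixed overhead plus a per-item cost.
    Returns (throughput in items/second, processing time of one batch in ms)."""
    batch_time_ms = overhead_ms + batch_size * per_item_ms
    throughput_per_s = batch_size / batch_time_ms * 1000
    return throughput_per_s, batch_time_ms


for size in (1, 10, 100):
    tput, latency = batch_metrics(size)
    print(f"batch={size:>3}: {tput:6.0f} items/s, {latency:5.0f} ms per batch")
```

With these defaults, batch sizes of 1, 10, and 100 yield roughly 91, 500, and 909 items/s at 11, 20, and 110ms per batch. The model ignores the time an item waits for its batch to fill, which adds further latency when traffic is light.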

The CAP Theorem — The Canonical NFR Trade-off:

The CAP theorem states that a distributed data store cannot guarantee all three of the following at once:

Consistency (C) — Every read receives the most recent write
Availability (A) — Every request receives a response
Partition Tolerance (P) — System continues operating despite network partitions

Since network partitions are inevitable in distributed systems, you effectively choose between CP (consistent, but may be unavailable during a partition) and AP (available, but may serve stale data during a partition).

Example Trade-off Decisions:

Banking System — Chooses CP. Account balances must be consistent; it's better to reject a transaction than allow an overdraft due to stale data.
Social Media Feed — Chooses AP. Showing a slightly stale feed is acceptable; being unavailable is not.
E-commerce Inventory — Often a hybrid. Shopping cart can be AP (eventual consistency acceptable), but final checkout must be CP (can't sell item twice).
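The CP/AP choice can be illustrated with a toy model of one value replicated on two nodes. This is a hypothetical teaching sketch, not a real replication protocol: it has no partition healing or conflict resolution. In CP mode the store rejects requests it cannot serve consistently during a partition; in AP mode it keeps answering from the local replica, which may be stale.

```python
class Unavailable(Exception):
    """Raised when the CP store refuses a request it cannot serve consistently."""


class TwoReplicaStore:
    """Hypothetical toy: one value replicated on nodes 'a' and 'b'."""

    def __init__(self, mode: str):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.values: dict[str, object] = {"a": None, "b": None}
        self.partitioned = False

    def write(self, node: str, value: object) -> None:
        if not self.partitioned:
            # Normal operation: replicate synchronously to both nodes.
            self.values = {"a": value, "b": value}
        elif self.mode == "CP":
            # With two replicas, neither side of a partition holds a majority.
            raise Unavailable("partition: write rejected")
        else:
            # AP: accept locally; the replicas diverge until the partition heals.
            self.values[node] = value

    def read(self, node: str) -> object:
        if self.partitioned and self.mode == "CP":
            raise Unavailable("partition: read rejected")
        return self.values[node]


ap = TwoReplicaStore("AP")
ap.write("a", 100)
ap.partitioned = True
ap.write("a", 90)      # accepted on node a only
print(ap.read("b"))    # 100: available, but stale

cp = TwoReplicaStore("CP")
cp.write("a", 100)
cp.partitioned = True
try:
    cp.write("a", 90)
except Unavailable as err:
    print(err)         # partition: write rejected
```

Read as account balances, the CP store refuses the withdrawal rather than let node b report an outdated balance, while the AP store stays up but answers 100 from node b after node a has already recorded 90.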

PACELC — Beyond CAP

PACELC extends CAP: if Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency. This acknowledges that even without partitions, there's a latency-consistency trade-off in distributed systems.
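The "Else" half of PACELC is visible even without partitions. In Dynamo-style quorum replication, assumed here as the illustrative mechanism, a read waits for R of N replicas and a write for W. When R + W > N, every read quorum overlaps every write quorum, so the read reaches at least one replica holding the latest acknowledged write; waiting for more replicas, however, costs latency. The per-replica latencies below are hypothetical.

```python
def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """With N replicas, read quorum R and write quorum W intersect when R + W > N,
    so every read contacts at least one replica that acknowledged the latest write."""
    return r + w > n


def quorum_latency_ms(replica_latencies_ms: list[float], k: int) -> float:
    """Querying all replicas in parallel and waiting for k replies takes
    as long as the k-th fastest reply."""
    return sorted(replica_latencies_ms)[k - 1]


latencies = [40.0, 5.0, 12.0]  # hypothetical response times of N = 3 replicas
N, W = 3, 2
for R in (1, 2):
    print(f"R={R}: consistent reads={quorum_overlaps(N, R, W)}, "
          f"read latency={quorum_latency_ms(latencies, R):.0f}ms")
# R=1: consistent reads=False, read latency=5ms
# R=2: consistent reads=True, read latency=12ms
```

Under this simplified model, choosing R=1 favors latency (EL) and choosing R=2 favors consistency (EC), with no partition involved.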

Making Trade-off Decisions:

When NFRs conflict, use this framework:

Understand business criticality — Which quality matters most for the core business function?
Identify user impact — How does each option affect user experience?
Evaluate cost — What are the implementation and operational costs of each choice?
Consider reversibility — Can you change this decision later, or is it locked in?
Document explicitly — Record the trade-off decision and rationale for future reference.

The worst outcome is an implicit, undocumented trade-off that the team doesn't realize they've made until production incidents reveal it.

Deriving NFRs from Business Context

NFRs don't emerge from thin air—they derive from the business context, user expectations, regulatory requirements, and operational constraints. Let's explore how to systematically derive them:

Sources of Non-Functional Requirements

•Business Objectives — Revenue targets, user growth goals, and market positioning directly imply NFRs. 'We need to support 10 million users by year-end' implies specific scalability requirements.
•User Expectations — Industry standards set baselines. E-commerce users expect sub-second page loads because Amazon trained them. Video users expect buffering-free playback because Netflix trained them.
•Competitive Landscape — If competitors' products respond faster, your latency NFRs must match or beat them.
•Regulatory Compliance — GDPR, HIPAA, PCI-DSS, SOC 2 mandate specific security, privacy, and auditability requirements.
•Operational Constraints — Budget limits, team size, existing infrastructure, and operational maturity constrain what's achievable.
•Service Level Agreements — Contractual commitments to customers define minimum thresholds.
•Risk Tolerance — Fintech systems have near-zero tolerance for data loss; social media can accept occasional inconsistency.

Traffic and Scale Estimation:

Many NFRs require estimating traffic and data volumes. Use back-of-envelope calculations:

Example: Estimating for a Social Media Platform

DAU (Daily Active Users): 100 million
Actions per user per day: 20 (posts, likes, comments, views)
Total daily actions: 2 billion
Peak to average ratio: 3x (evening hours)
Peak QPS: 2B / 86,400 seconds × 3 ≈ 70,000 QPS

From this:

Performance NFR: Handle 70,000 QPS with < 200ms latency
Scalability NFR: Auto-scale from 20K to 100K QPS within 5 minutes
Availability NFR: 99.9% (at 2 billion actions per day, even 0.1% unavailability affects about 2 million actions daily)
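The traffic arithmetic is easy to script, which makes it simple to rerun under different assumptions. A minimal Python sketch; the function name is hypothetical:

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(dau: int, actions_per_user: int, peak_ratio: float) -> tuple[float, float]:
    """Back-of-envelope average and peak queries per second."""
    daily_actions = dau * actions_per_user
    average_qps = daily_actions / SECONDS_PER_DAY
    return average_qps, average_qps * peak_ratio


avg, peak = estimate_qps(dau=100_000_000, actions_per_user=20, peak_ratio=3)
print(f"average ≈ {avg:,.0f} QPS, peak ≈ {peak:,.0f} QPS")
# average ≈ 23,148 QPS, peak ≈ 69,444 QPS
```

Rounding these gives the roughly 70,000 QPS peak used for the performance NFR.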

Data Storage Estimation:

Posts per day: 10 million
Average post size: 1 KB text + 500 KB media (averaged)
Daily storage growth: ~5 TB
Annual storage: ~2 PB

From this:

Scalability NFR: Storage system scales to petabyte range
Cost NFR: Storage cost < $X per TB (given budget constraints)
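The same approach works for storage. This sketch uses decimal units (1 KB = 1,000 bytes), an assumption that matches the rough figures above:

```python
KB = 1_000          # decimal units, as is common in capacity estimates
TB = KB ** 4
PB = KB ** 5

posts_per_day = 10_000_000
bytes_per_post = 1 * KB + 500 * KB   # text + averaged media

daily_bytes = posts_per_day * bytes_per_post
annual_bytes = daily_bytes * 365
print(f"daily ≈ {daily_bytes / TB:.2f} TB, annual ≈ {annual_bytes / PB:.2f} PB")
# daily ≈ 5.01 TB, annual ≈ 1.83 PB
```

These are raw figures before replication and backups, which multiply the physical storage actually provisioned.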

The 10x Planning Rule

Design for 10x your current scale, plan for 100x. Your architecture shouldn't require fundamental changes to handle 10x traffic. You should have a documented path (even if it requires significant work) to handle 100x.

NFRs in System Design Interviews

In system design interviews, NFRs are where candidates differentiate themselves. Asking about and reasoning through NFRs signals senior-level thinking.

The NFR Interview Framework:

After understanding functional requirements, systematically probe NFRs:

interview-nfr-questions.md
# NFR Clarification Questions for System Design Interviews
 
## Scale
- "How many users do we expect? DAU/MAU?"
- "What's the read vs write ratio?"
- "Are there predictable traffic patterns (time zones, events)?"
- "What's the expected growth rate?"
 
## Performance
- "What latency is acceptable for critical paths?"
- "Are there batch processing requirements with specific SLAs?"
- "Any real-time requirements (chat, gaming, financial)?"
 
## Availability
- "What's the availability target? (99.9%? 99.99%?)"
- "Is downtime during maintenance acceptable?"
- "Are certain features more critical than others?"
 
## Consistency
- "Does the system need strong consistency or is eventual okay?"
- "For which operations is consistency critical?"
- "What's an acceptable staleness window?"
 
## Security
- "What data is sensitive? PII? Financial? Health?"
- "Any compliance requirements? (GDPR, HIPAA, PCI)"
- "Multi-tenancy isolation requirements?"
 
## Geography
- "Is this single-region or multi-region?"
- "Where are users located?"
- "Are there data residency requirements?"

Demonstrating NFR Mastery in Interviews:

•State your assumptions — 'I'll assume we need 99.9% availability for the core path, but background jobs can have lower SLAs.'
•Connect NFRs to architecture — 'Given the 99.99% availability requirement, we'll need active-active deployment across at least two regions with health-check-based failover.'
•Acknowledge trade-offs — 'Choosing eventual consistency here gives us better availability and lower latency, but we need to handle the case where a user's post isn't immediately visible to all followers.'
•Quantify when possible — 'With 100M DAU and 20 actions per user, we're looking at roughly 23,000 requests per second on average and about 70,000 at peak.'
•Consider edge cases — 'The celebrity problem: when a single user has 10 million followers and posts, we need a different fanout strategy than for regular users.'
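The active-active claim rests on how availabilities combine. As a rough model that assumes independent failures, components that must all work (such as a synchronous call chain) multiply their availabilities, while redundant components fail only if all of them fail. A minimal Python sketch with hypothetical function names:

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up (e.g., a synchronous call chain)."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one redundant component suffices; assumes failures are independent."""
    return 1 - prod(1 - a for a in availabilities)


two_regions = parallel(0.999, 0.999)         # active-active, 99.9% each
three_deps = series(0.9999, 0.9999, 0.9999)  # three required dependencies
print(f"two active-active regions: {two_regions:.4%}")
print(f"three 99.99% dependencies in series: {three_deps:.4%}")
```

Two 99.9% regions in active-active give about 99.9999% in this model, while three 99.99% dependencies in series give only about 99.97%. The parallel figure is optimistic: regions share code, configuration, and deployment pipelines, so their failures are correlated, and the failover path itself must work.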

Senior Signal

When you proactively discuss NFR trade-offs without being prompted—explaining why you're choosing eventual consistency for the feed but strong consistency for the purchase flow—you demonstrate principal engineer-level thinking.

Documenting Non-Functional Requirements

NFR documentation requires particular care because these requirements drive critical architectural decisions and serve as acceptance criteria for system validation.

nfr-documentation-template.md
# Non-Functional Requirements: [System Name]
 
## 1. Performance Requirements
 
### PRF-001: API Response Time
- **Requirement**: Core API endpoints respond within latency thresholds
- **Thresholds**:
  - p50 (median): < 50ms
  - p95: < 200ms
  - p99: < 500ms
- **Measurement**: Application Performance Monitoring (APM) tool, sampled at 100%
- **Scope**: All public API endpoints except bulk operations
- **Load Condition**: Under normal load (0-80% of capacity)
- **Priority**: Must Have
 
### PRF-002: Throughput
- **Requirement**: System sustains target transaction rate
- **Threshold**: 10,000 requests per second
- **Measurement**: Load testing with production-representative traffic mix
- **Scope**: Aggregate across all API endpoints
- **Priority**: Must Have
 
## 2. Availability Requirements
 
### AVL-001: Service Availability
- **Requirement**: Core services maintain availability SLO
- **Threshold**: 99.95% measured monthly
- **Measurement**: Synthetic monitoring from 5 geographic locations, 60-second intervals
- **Exclusions**: Scheduled maintenance (max 4 hours/month, announced 72 hours ahead)
- **Priority**: Must Have
 
### AVL-002: Disaster Recovery
- **Requirement**: Full recovery from regional outage
- **RTO (Recovery Time Objective)**: 4 hours
- **RPO (Recovery Point Objective)**: 15 minutes (max data loss)
- **Measurement**: Annual DR drill
- **Priority**: Must Have
 
## 3. Scalability Requirements
 
### SCL-001: Horizontal Scaling
- **Requirement**: System scales horizontally to meet demand
- **Threshold**: Scale from baseline to 10x capacity within 30 minutes
- **Measurement**: Load testing with automated scaling triggers
- **Priority**: Should Have
 
### SCL-002: Data Growth
- **Requirement**: System handles projected data growth
- **Threshold**: 5 years of data (~10PB) without architecture change
- **Measurement**: Capacity planning reviews quarterly
- **Priority**: Must Have
 
## 4. Security Requirements
 
### SEC-001: Encryption at Rest
- **Requirement**: All sensitive data encrypted at rest
- **Standard**: AES-256
- **Scope**: All databases, file storage, backups
- **Audit**: Annual third-party security audit
- **Priority**: Must Have
 
### SEC-002: Encryption in Transit
- **Requirement**: All network traffic encrypted
- **Standard**: TLS 1.3 preferred, TLS 1.2 minimum
- **Scope**: All external traffic, internal service-to-service traffic
- **Priority**: Must Have
 
## Appendix: NFR Trade-off Decisions
 
| Decision | Option A | Option B | Chosen | Rationale |
|----------|----------|----------|--------|-----------|
| Feed Consistency | Strong (CP) | Eventual (AP) | Eventual | Prioritize availability; stale feeds acceptable for seconds |
| Payment Consistency | Strong (CP) | Eventual (AP) | Strong | Financial correctness mandatory |

Living Documentation

NFR documentation must evolve with the system. As you learn more about actual system behavior, refine thresholds. As business needs change, update requirements. Outdated NFRs are worse than none—they create false confidence.

Summary: Mastering Non-Functional Requirements

We've explored the critical world of non-functional requirements—the quality attributes that determine whether your system is merely functional or truly excellent. Let's consolidate:

Key Takeaways

•NFRs define HOW, not WHAT — They specify quality attributes: performance, scalability, availability, security, maintainability.
•Quantification is mandatory — Vague NFRs are useless. Use SMART criteria: specific, measurable, achievable, relevant, time-bound.
•Master the taxonomy — Performance, scalability, availability, reliability, security, maintainability, usability, and more.
•Understand the nines — 99.9% availability permits about 43 minutes of downtime per 30-day month, 99.99% only about 4.3 minutes, and that gap has massive operational implications.
•Trade-offs are inevitable — You cannot maximize all NFRs. Make informed, documented choices.
•Derive from context — Business objectives, user expectations, regulations, and constraints drive NFR values.
•Interview differentiator — Proactively discussing NFRs and trade-offs signals senior-level thinking.
•Document rigorously — NFRs drive architecture; treat them with the same rigor as functional requirements.

What's Next:

We now understand what the system does (functional requirements) and how well it does it (non-functional requirements). But how do we extract these requirements effectively? The next page explores the art of asking the right questions—the systematic inquiry process that separates thorough requirements gathering from superficial fact-finding.


You now possess a comprehensive understanding of non-functional requirements. You can identify them, quantify them, navigate their trade-offs, and communicate them effectively. This knowledge is essential for every architectural decision you'll make.

2 / 5

Loading learning content...

System DesignRequirements Gathering

Requirements Gathering

LevelBeginner

Duration60 mins

TopicRequirements Gathering

2 / 5

Non-Functional Requirements — How It Performs

The Hidden Half of Requirements

What You Will Learn

Defining Non-Functional Requirements

Formal Definition:

Functional Requirement Examples

•Users can log in with email and password
•The system processes payment transactions
•Users can search for products by name
•The system sends email notifications
•Users can upload profile pictures

Non-Functional Requirement Examples

•Login completes within 500ms for 99% of requests
•Payment processing has 99.99% availability
•Search returns results within 200ms for 95th percentile
•Notifications sent within 30 seconds of trigger
•Image uploads complete within 5 seconds for 10MB files

The Iceberg Metaphor:

The Invisibility Problem

The Complete Non-Functional Requirements Taxonomy

NFRs span a wide landscape of quality attributes. Understanding this taxonomy ensures you don't overlook critical requirements. Let's explore each category in depth:

Performance measures how quickly and efficiently the system responds to user actions and processes workloads.

Key Metrics:

Latency/Response Time — Time between request and response. Measured at various percentiles (p50, p90, p95, p99).
Throughput — Number of operations per unit time (requests/second, transactions/minute).
Resource Utilization — CPU, memory, network, disk usage under load.

Examples of Well-Specified Performance Requirements:

API endpoints respond in < 100ms at p50, < 500ms at p99 under normal load
System handles 10,000 concurrent users with < 5% degradation
Batch processing completes 1 million records in < 1 hour
Page load time < 3 seconds on 3G mobile connections

Why It Matters:

Studies show 100ms latency increase reduces conversion by 1%
Amazon found every 100ms of latency cost 1% in sales
Google found 500ms delay dropped traffic by 20%

Additional NFR Categories:

Extended Non-Functional Requirements
Category	Description	Example Requirement
Usability	Ease of use, learnability, accessibility	New users complete core task within 5 minutes without training
Portability	Ability to run on different platforms/environments	Application runs on Chrome, Firefox, Safari, Edge without modification
Interoperability	Ability to exchange data with other systems	REST APIs follow OpenAPI 3.0 specification
Compliance	Adherence to laws, regulations, standards	System compliant with HIPAA, GDPR, SOC 2 Type II
Observability	Ability to understand system state from outputs	All services emit structured logs, metrics, and distributed traces
Cost Efficiency	Resource utilization relative to cost	Infrastructure cost < $0.01 per 1000 transactions
Extensibility	Ability to add new features with minimal changes	Plugin architecture for custom integrations
Recoverability	Ability to restore after failure	Full system restoration from backup within 4 hours

Quantifying Non-Functional Requirements

The cardinal sin of NFR specification is vagueness. 'The system should be fast' is not a requirement—it's a wish. Quantification transforms wishes into verifiable specifications.

The SMART Framework for NFRs:

Apply SMART criteria to every non-functional requirement:

Specific — Clearly defined, no ambiguity
Measurable — Quantified with metrics and thresholds
Achievable — Technically feasible given constraints
Relevant — Aligned with business objectives
Time-bound — Applies to a specific context or condition

Transforming Vague NFRs into SMART RequirementsFrom wishes to specifications

Input

Output

Percentiles vs Averages:

A critical distinction in performance requirements is the difference between averages and percentiles:

Average (mean) latency — Hides outliers. If 99% of requests take 50ms but 1% take 10 seconds, average might look acceptable at 150ms.
Percentile latency (p50, p90, p95, p99) — Shows the experience of the Nth percentile user.
- p50 (median): Half of users experience this or better
- p99: 99% of users experience this or better; 1% have it worse

Why p99 matters:

SLI, SLO, SLA — The Reliability Vocabulary

slo-specification.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# Service Level Objectives: Payment Service
 
## SLO 1: Availability
- SLI: Percentage of successful HTTP responses (non-5xx)
- Target: 99.95% over 30-day rolling window
- Measurement: Synthetic monitoring every 60 seconds from 5 global locations
 
## SLO 2: Latency
- SLI: Time from request received to response sent
- Targets:
  - p50 < 100ms
  - p95 < 250ms
  - p99 < 500ms
- Measurement: Application metrics, aggregated per minute
 
## SLO 3: Error Rate
- SLI: Percentage of requests resulting in user-facing errors
- Target: < 0.1% over 24-hour rolling window
- Measurement: Application error logs, excluding client errors (4xx)
 
## SLO 4: Throughput
- SLI: Successful transactions per second
- Target: Sustain 10,000 TPS during peak hours (9 AM - 6 PM local time)
- Measurement: Transaction counter with 1-minute resolution
 
## Error Budget Policy
- When error budget < 25% remaining, initiate reliability sprint
- When error budget exhausted, freeze feature deployments until recovered

NFR Trade-offs — The Engineering Art

Common NFR Trade-off Pairs:

NFR Trade-off Matrix
Optimizing For	Often Conflicts With	Example
Performance	Cost Efficiency	Faster hardware and more replicas cost more
Consistency	Availability	CAP theorem — partitions force a choice
Security	Usability	Multi-factor auth, frequent re-authentication increases friction
Scalability	Consistency	Distributed writes are harder to keep consistent
Maintainability	Performance	Clean abstractions may add overhead
Availability	Cost	Redundancy for high availability is expensive
Latency	Throughput	Batching improves throughput but increases latency
Flexibility	Simplicity	Highly configurable systems are harder to understand

The CAP Theorem — The Canonical NFR Trade-off:

In distributed systems, you must choose between:

Consistency (C) — Every read receives the most recent write
Availability (A) — Every request receives a response
Partition Tolerance (P) — System continues operating despite network partitions

Example Trade-off Decisions:

Banking System — Chooses CP. Account balances must be consistent; it's better to reject a transaction than allow an overdraft due to stale data.
Social Media Feed — Chooses AP. Showing a slightly stale feed is acceptable; being unavailable is not.
E-commerce Inventory — Often a hybrid. Shopping cart can be AP (eventual consistency acceptable), but final checkout must be CP (can't sell item twice).

PACELC — Beyond CAP

Making Trade-off Decisions:

When NFRs conflict, use this framework:

Understand business criticality — Which quality matters most for the core business function?
Identify user impact — How does each option affect user experience?
Evaluate cost — What are the implementation and operational costs of each choice?
Consider reversibility — Can you change this decision later, or is it locked in?
Document explicitly — Record the trade-off decision and rationale for future reference.

The worst outcome is an implicit, undocumented trade-off that the team doesn't realize they've made until production incidents reveal it.

Deriving NFRs from Business Context

NFRs don't emerge from thin air—they derive from the business context, user expectations, regulatory requirements, and operational constraints. Let's explore how to systematically derive them:

Sources of Non-Functional Requirements

•Business Objectives — Revenue targets, user growth goals, and market positioning directly imply NFRs. 'We need to support 10 million users by year-end' implies specific scalability requirements.
•User Expectations — Industry standards set baselines. E-commerce users expect sub-second page loads because Amazon trained them. Video users expect buffering-free playback because Netflix trained them.
•Competitive Landscape — If competitors deliver features faster, your latency NFRs must match or beat them.
•Regulatory Compliance — GDPR, HIPAA, PCI-DSS, SOC 2 mandate specific security, privacy, and auditability requirements.
•Operational Constraints — Budget limits, team size, existing infrastructure, and operational maturity constrain what's achievable.
•Service Level Agreements — Contractual commitments to customers define minimum thresholds.
•Risk Tolerance — Fintech systems have near-zero tolerance for data loss; social media can accept occasional inconsistency.

Traffic and Scale Estimation:

Many NFRs require estimating traffic and data volumes. Use back-of-envelope calculations:

Example: Estimating for a Social Media Platform

DAU (Daily Active Users): 100 million
Actions per user per day: 20 (posts, likes, comments, views)
Total daily actions: 2 billion
Peak to average ratio: 3x (evening hours)
Peak QPS: 2B / 86,400 seconds × 3 ≈ 70,000 QPS

From this:

Performance NFR: Handle 70,000 QPS with < 200ms latency
Scalability NFR: Auto-scale from 20K to 100K QPS within 5 minutes
Availability NFR: 99.9% (can't afford losing 2 billion actions)

Data Storage Estimation:

Posts per day: 10 million
Average post size: 1 KB text + 500 KB media (averaged)
Daily storage growth: ~5 TB
Annual storage: ~2 PB

From this:

Scalability NFR: Storage system scales to petabyte range
Cost NFR: Storage cost < $X per TB (given budget constraints)

The 10x Planning Rule

NFRs in System Design Interviews

In system design interviews, NFRs are where candidates differentiate themselves. Asking about and reasoning through NFRs signals senior-level thinking.

The NFR Interview Framework:

After understanding functional requirements, systematically probe NFRs:

interview-nfr-questions.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# NFR Clarification Questions for System Design Interviews
 
## Scale
- "How many users do we expect? DAU/MAU?"
- "What's the read vs write ratio?"
- "Are there predictable traffic patterns (time zones, events)?"
- "What's the expected growth rate?"
 
## Performance
- "What latency is acceptable for critical paths?"
- "Are there batch processing requirements with specific SLAs?"
- "Any real-time requirements (chat, gaming, financial)?"
 
## Availability
- "What's the availability target? (99.9%? 99.99%?)"
- "Is downtime during maintenance acceptable?"
- "Are certain features more critical than others?"
 
## Consistency
- "Does the system need strong consistency or is eventual okay?"
- "For which operations is consistency critical?"
- "What's an acceptable staleness window?"
 
## Security
- "What data is sensitive? PII? Financial? Health?"
- "Any compliance requirements? (GDPR, HIPAA, PCI)"
- "Multi-tenancy isolation requirements?"
 
## Geography
- "Is this single-region or multi-region?"
- "Where are users located?"
- "Are there data residency requirements?"

Demonstrating NFR Mastery in Interviews:

•State your assumptions — 'I'll assume we need 99.9% availability for the core path, but background jobs can have lower SLAs.'
•Connect NFRs to architecture — 'Given the 99.99% availability requirement, we'll need active-active deployment across at least two regions with health-check-based failover.'
•Acknowledge trade-offs — 'Choosing eventual consistency here gives us better availability and lower latency, but we need to handle the case where a user's post isn't immediately visible to all followers.'
•Quantify when possible — 'With 100M DAU and 20 actions per user, we're looking at roughly 23,000 requests per second on average and about 70,000 at peak.'
•Consider edge cases — 'The celebrity problem: when a single user has 10 million followers and posts, we need a different fanout strategy than for regular users.'
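One widely used answer to the celebrity problem is hybrid fanout. For ordinary authors, new posts are pushed into followers' precomputed timelines (fanout-on-write). Posts from very high-follower authors are stored once and merged in when a follower reads the feed (fanout-on-read). The toy in-memory sketch below illustrates the idea. The `HybridFanout` class and the 1,000,000-follower default cutoff are hypothetical, and the sketch omits persistence, timestamp ordering, and pagination.

```python
from collections import defaultdict


class HybridFanout:
    """Toy in-memory feed: push for regular authors, pull for high-follower authors."""

    def __init__(self, celebrity_threshold: int = 1_000_000) -> None:
        # Hypothetical cutoff; real systems would tune this from load data.
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)    # author -> users following them
        self.following = defaultdict(set)    # user -> authors they follow
        self.timelines = defaultdict(list)   # user -> post ids pushed at write time
        self.pull_posts = defaultdict(list)  # celebrity author -> post ids read on demand

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= self.celebrity_threshold

    def publish(self, author: str, post_id: str) -> None:
        if self.is_celebrity(author):
            # Fanout-on-read: a single write, merged into feeds at read time.
            self.pull_posts[author].append(post_id)
        else:
            # Fanout-on-write: copy the post into every follower's timeline now.
            for user in self.followers[author]:
                self.timelines[user].append(post_id)

    def read_feed(self, user: str) -> list[str]:
        # Pushed posts first, then posts pulled from followed celebrities.
        # A real feed would merge by timestamp; this sketch does not.
        feed = list(self.timelines[user])
        for author in self.following[user]:
            feed.extend(self.pull_posts[author])
        return feed


# Usage with a tiny threshold so the example stays small.
service = HybridFanout(celebrity_threshold=2)
service.follow("alice", "bob")     # bob ends up with 1 follower: regular author
service.follow("alice", "star")
service.follow("carol", "star")    # star ends up with 2 followers: celebrity
service.publish("bob", "p1")       # pushed into alice's timeline
service.publish("star", "p2")      # stored once, pulled at read time
print(service.read_feed("alice"))  # ['p1', 'p2']
print(service.read_feed("carol"))  # ['p2']
```

This is the trade-off the bullet points at: a celebrity post costs one write instead of millions, at the price of extra work on every feed read by that celebrity's followers.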


Documenting Non-Functional Requirements

NFR documentation requires particular care because these requirements drive critical architectural decisions and serve as acceptance criteria for system validation.

nfr-documentation-template.md
# Non-Functional Requirements: [System Name]
 
## 1. Performance Requirements
 
### PRF-001: API Response Time
- **Requirement**: Core API endpoints respond within latency thresholds
- **Thresholds**:
  - p50 (median): < 50ms
  - p95: < 200ms
  - p99: < 500ms
- **Measurement**: Application Performance Monitoring (APM) tool, sampled at 100%
- **Scope**: All public API endpoints except bulk operations
- **Load Condition**: Under normal load (0-80% of capacity)
- **Priority**: Must Have
 
### PRF-002: Throughput
- **Requirement**: System sustains target transaction rate
- **Threshold**: 10,000 requests per second
- **Measurement**: Load testing with production-representative traffic mix
- **Scope**: Aggregate across all API endpoints
- **Priority**: Must Have
 
## 2. Availability Requirements
 
### AVL-001: Service Availability
- **Requirement**: Core services maintain availability SLO
- **Threshold**: 99.95% measured monthly
- **Measurement**: Synthetic monitoring from 5 geographic locations, 60-second intervals
- **Exclusions**: Scheduled maintenance (max 4 hours/month, announced 72 hours ahead)
- **Priority**: Must Have
 
### AVL-002: Disaster Recovery
- **Requirement**: Full recovery from regional outage
- **RTO (Recovery Time Objective)**: 4 hours
- **RPO (Recovery Point Objective)**: 15 minutes (max data loss)
- **Measurement**: Annual DR drill
- **Priority**: Must Have
 
## 3. Scalability Requirements
 
### SCL-001: Horizontal Scaling
- **Requirement**: System scales horizontally to meet demand
- **Threshold**: Scale from baseline to 10x capacity within 30 minutes
- **Measurement**: Load testing with automated scaling triggers
- **Priority**: Should Have
 
### SCL-002: Data Growth
- **Requirement**: System handles projected data growth
- **Threshold**: 5 years of data (~10PB) without architecture change
- **Measurement**: Capacity planning reviews quarterly
- **Priority**: Must Have
 
## 4. Security Requirements
 
### SEC-001: Encryption at Rest
- **Requirement**: All sensitive data encrypted at rest
- **Standard**: AES-256
- **Scope**: All databases, file storage, backups
- **Audit**: Annual third-party security audit
- **Priority**: Must Have
 
### SEC-002: Encryption in Transit
- **Requirement**: All network traffic encrypted
- **Standard**: TLS 1.3 preferred; TLS 1.2 minimum
- **Scope**: All external traffic, internal service-to-service traffic
- **Priority**: Must Have
 
## Appendix: NFR Trade-off Decisions
 
| Decision | Option A | Option B | Chosen | Rationale |
|----------|----------|----------|--------|-----------|
| Feed Consistency | Strong (CP) | Eventual (AP) | Eventual | Prioritize availability; stale feeds acceptable for seconds |
| Payment Consistency | Strong (CP) | Eventual (AP) | Strong | Financial correctness mandatory |
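Thresholds such as PRF-001 are most useful when a load test or monitoring job checks them automatically. The following is a minimal sketch. It assumes raw latency samples in milliseconds and the nearest-rank percentile definition. APM tools may interpolate percentiles differently, so small differences from a dashboard are expected. `percentile` and `check_latency` are hypothetical helpers.

```python
import math

# Thresholds copied from PRF-001 in the template, in milliseconds (strict "less than").
PRF_001_THRESHOLDS_MS = {50: 50.0, 95: 200.0, 99: 500.0}


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def check_latency(samples_ms: list[float], thresholds: dict[int, float]) -> dict[int, bool]:
    """Map each percentile to True if its measured latency is under the threshold."""
    return {p: percentile(samples_ms, p) < limit for p, limit in thresholds.items()}


# Hypothetical load-test result: 100 requests with a slow tail.
samples = [20.0] * 90 + [150.0] * 8 + [600.0, 900.0]
print(check_latency(samples, PRF_001_THRESHOLDS_MS))
```

With this sample data, p50 (20 ms) and p95 (150 ms) pass, while p99 (600 ms) breaches the 500 ms limit. The check therefore reports `{50: True, 95: True, 99: False}`.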


Summary: Mastering Non-Functional Requirements

We've explored the critical world of non-functional requirements—the quality attributes that determine whether your system is merely functional or truly excellent. Let's consolidate:

Key Takeaways

•NFRs define HOW, not WHAT — They specify quality attributes: performance, scalability, availability, security, maintainability.
•Quantification is mandatory — Vague NFRs are useless. Use SMART criteria: specific, measurable, achievable, relevant, time-bound.
•Master the taxonomy — Performance, scalability, availability, reliability, security, maintainability, usability, and more.
•Understand the nines — 99.9% availability allows about 8.8 hours of downtime per year, while 99.99% allows only about 53 minutes, which demands far more redundancy and automation.
•Trade-offs are inevitable — You cannot maximize all NFRs. Make informed, documented choices.
•Derive from context — Business objectives, user expectations, regulations, and constraints drive NFR values.
•Interview differentiator — Proactively discussing NFRs and trade-offs signals senior-level thinking.
•Document rigorously — NFRs drive architecture; treat them with the same rigor as functional requirements.
