The working set model's elegance lies in its simplicity: track the pages used in the last Δ references, and allocate frames accordingly. But this simplicity conceals a profound challenge—the selection and tracking of Δ itself.
The window size Δ isn't just a tuning knob; it's the linchpin that determines whether the working set model succeeds or fails in practice. Too small, and processes suffer excessive page faults as needed pages are prematurely dropped. Too large, and memory is wasted on stale pages while other processes starve. Worse still, tracking exact working sets requires knowing the precise order of all memory references—information that's expensive to collect at the hardware level.
This page explores the working set window in comprehensive depth: its theoretical properties, the tradeoffs in its selection, practical tracking mechanisms, and the implementation challenges that motivated alternative approaches.
By the end of this page, you will understand: (1) How Δ affects working set size and system behavior, (2) The fundamental tradeoff spectrum in Δ selection, (3) Timer-based approximations to the reference-counting window, (4) Hardware and software techniques for tracking working sets, and (5) Why exact working set tracking is impractical and what alternatives exist.
In the formal working set model, the window Δ is measured in memory references—each load or store operation advances the virtual time counter by one. This reference-based measurement has important properties:
Virtual Time vs. Wall-Clock Time:
| Aspect | Reference-Based Δ | Wall-Clock-Based Δ |
|---|---|---|
| Unit | Memory references | Seconds or milliseconds |
| Advances when | Process accesses memory | Time passes (regardless of activity) |
| Blocked process | Virtual time stops | Time continues passing |
| High CPU burst | Many references counted | Same calendar time |
| I/O-bound process | Few references | Same calendar time |
Why Reference-Based is Theoretically Correct:
Imagine two processes:
- Process A is CPU-bound, issuing hundreds of millions of memory references per second.
- Process B is I/O-bound, spending most of each second blocked and issuing comparatively few references.

With wall-clock Δ = 1 second:
- Process A's window spans hundreds of millions of references, enough to cover several program phases.
- Process B's window spans only the handful of references it managed between I/O waits, and its pages age out of the window while it sits blocked.
This disparity means the same Δ represents vastly different amounts of actual program behavior. Reference-based counting ensures both processes have working sets computed over the same amount of behavioral history, not elapsed time.
While reference-based Δ is theoretically superior, it requires counting every memory reference—an operation that occurs billions of times per second on modern processors. No practical system can afford this overhead, which is why real implementations use timer-based approximations.
The Relationship Between Δ and Working Set Size:
As Δ increases, the working set size |W(t, Δ)| can only grow (or stay the same). This relationship is governed by the working set size function WSS(Δ), which for a given process at time t is:
WSS(Δ) = |W(t, Δ)|
This function has characteristic properties:
- Monotonic: WSS(Δ) never decreases as Δ grows; widening the window can only admit more pages.
- Bounded: WSS(Δ) cannot exceed the number of distinct pages the process has ever touched.
- Diminishing returns: each additional unit of Δ tends to admit fewer new pages than the last.

The shape of this curve varies by program but typically shows:
- A steep initial rise, since even a small window captures the hottest pages of the current locality.
- A knee where the curve flattens, once the window spans the current locality region.
- A long plateau, where further increases in Δ add pages only by reaching back into earlier phases.
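To make the WSS(Δ) curve concrete, here is a small, self-contained C sketch. It is not a kernel mechanism, just the definition applied to a recorded trace: the page numbers and the toy reference string are invented for illustration.

```c
/*
 * Hypothetical sketch: computing the exact working set size WSS(Δ)
 * from a recorded reference trace. Real systems cannot afford to log
 * every reference; this only illustrates the definition.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PAGES 64

/* Count distinct pages among trace[t-delta+1 .. t] (clamped at 0). */
int wss(const int *trace, int t, int delta) {
    bool seen[MAX_PAGES] = { false };
    int count = 0;
    int start = (t - delta + 1 > 0) ? t - delta + 1 : 0;
    for (int i = start; i <= t; i++) {
        if (!seen[trace[i]]) {
            seen[trace[i]] = true;
            count++;
        }
    }
    return count;
}

int main(void) {
    /* A phase change: pages {1,2} then pages {3,4,5}. */
    int trace[] = {1, 2, 1, 2, 1, 3, 4, 5, 3, 4, 5, 3};
    int t = 11;  /* current virtual time (index of latest reference) */
    for (int delta = 1; delta <= 12; delta++)
        printf("WSS(%2d) = %d\n", delta, wss(trace, t, delta));
    return 0;
}
```

Running it shows the characteristic shape: WSS grows quickly while Δ is small, plateaus once the window covers the current phase ({3, 4, 5}), and grows again only when the window reaches back into the previous phase.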
Selecting Δ involves navigating a fundamental tradeoff spectrum. Let's explore the extremes and the balanced middle ground.
Case 1: Δ Too Small
Symptom: Working sets are smaller than actual memory needs
Mechanism:
- Pages in the active locality are referenced less often than once per Δ references.
- Between consecutive uses, such a page ages out of the window and appears unneeded.
- The OS reclaims its frame even though the process will touch it again shortly.

Consequences:
- The page fault rate climbs sharply (the "edge effect").
- The process re-faults on pages it just lost, generating redundant I/O.
- The system may raise the multiprogramming level based on the underestimated working sets, worsening memory pressure and inviting thrashing.
When Δ is too small, pages 'fall off the edge' of the window while still being part of the active locality. This creates oscillation: page enters working set → gets allocated → ages out → gets evicted → is needed → page fault → enters working set again. This cycle consumes enormous I/O without accomplishing useful work.
Case 2: Δ Too Large
Symptom: Working sets include many pages no longer needed
Mechanism:
- Pages from a completed phase remain inside the long window even though the process has moved on.
- The working set estimate includes this obsolete history alongside the pages actually in use.

Consequences:
- Memory is tied up by stale pages, so fewer processes fit in memory.
- The multiprogramming level is artificially limited, underutilizing the CPU.
- The estimate adapts sluggishly at phase transitions, holding old pages long after they matter.
With overly large Δ, working sets become 'haunted' by ghost pages—pages that were important in previous phases but are now irrelevant. These ghosts occupy frames and allocation budgets, preventing the system from responding to current needs.
Case 3: Δ Optimal
Characteristic: Δ spans exactly one 'locality region'—the set of pages needed for the current program phase
Properties:
- The working set contains the current locality and little else.
- Page faults occur mainly at phase transitions, where they are unavoidable.
- Frames are neither wasted on stale pages nor reclaimed from active ones.

The Challenge: The optimal Δ varies:
- Across programs: a tight numeric kernel and a pointer-chasing database have very different locality footprints.
- Across phases within one program: initialization, computation, and output phases have different localities.
- With input data: the same program may need a different window for a different workload.
No single Δ is optimal for all situations, which is why some systems use adaptive Δ or abandon explicit windows entirely (using page fault frequency instead).
| Aspect | Δ Too Small | Δ Optimal | Δ Too Large |
|---|---|---|---|
| Working Set Size | Underestimates needs | Matches needs | Overestimates needs |
| Page Fault Rate | High (edge effect) | Low (phase transitions only) | Low but wasteful |
| Memory Efficiency | Poor (constant swapping) | Excellent | Poor (wasted frames) |
| Multiprogramming | High but thrashing | Optimal balance | Artificially limited |
| Adaptation Speed | Too fast (unstable) | Appropriate | Too slow (sluggish) |
Given that counting every memory reference is prohibitively expensive, practical systems use timer-based approximations. Instead of tracking exact reference counts, they use elapsed time as a proxy.
The Approximation:
Replace the reference-based window Δ (in memory references) with a time-based window τ (in time units, typically timer interrupts).
W(t, τ) ≈ set of pages referenced during the last τ timer ticks
How Timer-Based Tracking Works:
1. A hardware timer interrupts at a fixed period (e.g., every 10 ms).
2. On each tick, the OS examines the reference bit of every resident page.
3. Pages whose bit is set were referenced at some point during the elapsed interval; the OS records this and clears the bit.
4. The working set is approximated as the set of pages referenced within the last τ intervals.
Multi-Interval Approach:
A single interval is too coarse. Most implementations use multiple intervals:
| Interval | Reference Bit History | Working Set Status |
|---|---|---|
| Current | 1 (just accessed) | Definitely in |
| Last | 1 (accessed last interval) | Probably in |
| 2 intervals ago | 1 | Marginally in |
| 3+ intervals ago | 0, 0, 0, ... | Outside working set |
By tracking a history of reference bits across multiple intervals, the system builds a more accurate picture of the working set.
```c
// Timer-based working set tracking (conceptual implementation)
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_LENGTH 8    // Track 8 intervals of history
#define TIMER_INTERVAL 10   // 10 ms between timer ticks

typedef struct {
    uint8_t reference_history;  // One bit per interval:
                                // bit 7 = most recent, bit 0 = oldest
} PageMetadata;

PageMetadata page_metadata[MAX_PAGES];

// Called on every timer interrupt
void timer_interrupt_handler(Process* proc) {
    for (int page = 0; page < proc->num_pages; page++) {
        PageTableEntry* pte = &proc->page_table[page];
        PageMetadata* meta = &page_metadata[page];

        // Shift history right (age the bits)
        meta->reference_history >>= 1;

        // If page was referenced this interval, set the most-recent bit
        if (pte->reference_bit) {
            meta->reference_history |= 0x80;  // Set bit 7
            pte->reference_bit = 0;           // Clear for next interval
        }
    }
}

// Determine if page is in working set
bool is_in_working_set(int page, int window_intervals) {
    PageMetadata* meta = &page_metadata[page];

    // Check whether any of the last 'window_intervals' bits are set
    uint8_t mask = (1 << window_intervals) - 1;    // e.g., 0x0F for 4 intervals
    mask <<= (HISTORY_LENGTH - window_intervals);  // Shift to cover the newest bits
    return (meta->reference_history & mask) != 0;
}

// Calculate working set size
int calculate_working_set_size(Process* proc, int window_intervals) {
    int wss = 0;
    for (int page = 0; page < proc->num_pages; page++) {
        if (is_in_working_set(page, window_intervals)) {
            wss++;
        }
    }
    return wss;
}
```

The code above uses an 8-bit history per page, allowing tracking of 8 intervals (~80 ms with a 10 ms timer). Each bit records whether the page was accessed during that interval. This technique provides granularity between 'accessed ever' and 'accessed right now' without storing complete reference traces.
Accuracy of Timer-Based Approximation:
The timer approach introduces two sources of error:
1. Quantization Error:
- All references within a single interval collapse into one bit, so the window boundary is accurate only to the nearest timer tick.
- A page touched once at the start of an interval is indistinguishable from one touched a million times throughout it.

2. Rate Sensitivity:
- A fixed time window covers vastly different numbers of references depending on how fast the process executes, which is exactly the wall-clock distortion discussed earlier.
- A process blocked on I/O ages its pages without generating any new behavioral history.

Despite these limitations, timer-based working set tracking is accurate enough for practical use because:
- Locality regions typically persist for many timer intervals, so interval-granularity boundaries rarely change the outcome.
- Allocation decisions need only an approximate working set size, not exact membership.
- The residual error is small compared with the cost of the alternative: counting every reference.
Working set tracking relies heavily on hardware support. The reference bit (also called the access bit or used bit) in each page table entry is the fundamental enabler.
The Reference Bit:
| Property | Description |
|---|---|
| Location | Each page table entry (PTE) contains this bit |
| Set by | Hardware (MMU), automatically on any page access |
| Cleared by | Software (OS), typically on timer interrupt |
| Purpose | Indicates whether page was accessed since last clear |
| Cost | Zero runtime overhead — hardware sets it automatically |
x86/x64 Page Table Entry Structure:
 63    62-52     51-12    11-9   8 7 6 5 4 3 2 1 0
+----+--------+---------+------+-+-+-+-+-+-+-+-+-+
| NX |  AVL   |   PFN   | AVL  |G|.|D|A|.|.|U|W|P|
+----+--------+---------+------+-+-+-+-+-+-+-+-+-+
        ↑                           ↑ ↑
        |                           | └─ Accessed bit (A, bit 5) ← Reference bit
        |                           └─── Dirty bit (D, bit 6)
        └─────────────────────────────── Available for OS use
This division of labor is crucial for efficiency. Hardware can set bits at memory-access speed (every few nanoseconds) without OS involvement. Software clears bits periodically (every few milliseconds) in batches. If software had to set bits, every memory access would require an OS trap — making the system 1000x slower.
TLB Complications:
The Translation Lookaside Buffer (TLB) caches page table entries for fast address translation. This creates complications for reference bit tracking:
Problem: When a page is accessed via a TLB hit, the PTE in the page table might not be updated immediately. Depending on the design, architectures may:
- set the accessed bit only when the translation is first loaded into the TLB, so later TLB hits never touch the PTE;
- write the bit through to the in-memory PTE as part of the TLB fill (the x86 approach);
- buffer updates and write them back only when the TLB entry is evicted.

Consequence: When the OS clears reference bits, it must also:
- invalidate the corresponding TLB entries, so that the next access takes a miss and sets the bit afresh;
- on multiprocessors, issue TLB shootdowns so that other cores drop their cached copies.
This adds overhead but is necessary for accurate tracking.
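The following sketch shows what that pairing looks like on x86-64. The helper name harvest_reference_bit and the surrounding structure are hypothetical; the A-bit position and the invlpg instruction are the real architectural pieces. One deliberate simplification: the read-modify-write of the PTE is not atomic here, and a production kernel would use an atomic operation plus cross-core shootdowns.

```c
/*
 * Conceptual x86-64 sketch (hypothetical helper names): harvesting and
 * clearing Accessed bits must be paired with a TLB invalidation, or a
 * cached translation lets later accesses bypass the page walk and the
 * bit is never set again.
 */
#include <stdbool.h>
#include <stdint.h>

#define PTE_ACCESSED (1ULL << 5)   /* A bit: x86-64 PTE bit 5 */

static inline void invlpg(void *vaddr) {
    __asm__ volatile("invlpg (%0)" :: "r"(vaddr) : "memory");
}

/* Returns true if the page was referenced since the last harvest. */
bool harvest_reference_bit(volatile uint64_t *pte, void *vaddr) {
    bool referenced = (*pte & PTE_ACCESSED) != 0;
    if (referenced) {
        *pte &= ~PTE_ACCESSED;  /* clear A in the page table ...          */
        invlpg(vaddr);          /* ... and drop the cached translation    */
        /* On multi-core, a TLB shootdown IPI to other cores that may
         * cache this translation would also be required.                */
    }
    return referenced;
}
```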
Software-Managed TLB Architectures:
Some architectures (notably MIPS) use software-managed TLBs where the OS handles TLB misses. On these systems:
- There is no hardware reference bit; the in-memory page table format is entirely the OS's choice.
- The OS emulates reference bits: it drops a page's TLB entry, and the next access traps into the software refill handler, which records the reference before reinstalling the mapping. A sketch of this pattern follows the comparison table below.
| Architecture | Reference Bit Name | TLB Behavior | Dirty Bit Handling |
|---|---|---|---|
| x86/x64 | Accessed (A) | Hardware-managed, writes through to PTE | Dirty (D) bit, hardware-set |
| ARM | Access Flag (AF) | Hardware-managed or software-managed | Dirty Bit modifier (DBM) |
| MIPS | None (software) | Software-managed TLB | OS tracks via faults |
| RISC-V | Accessed (A) | Configurable | Dirty (D) bit available |
| PowerPC | Referenced (R) | Software-managed | Changed (C) bit |
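For the software-managed-TLB rows above, a minimal sketch of reference-bit emulation might look like the following. All types and helper names here (SoftPTE, tlb_invalidate_entry, tlb_install_entry) are hypothetical stand-ins for whatever a given port defines.

```c
/*
 * Sketch of software reference-bit emulation on an architecture with a
 * software-managed TLB (MIPS-style). Clearing the bit means evicting
 * the TLB entry; the next access faults into the refill handler, which
 * records the reference before reinstalling the mapping.
 */
#include <stdbool.h>

extern void tlb_invalidate_entry(void *vaddr);              /* hypothetical */
extern void tlb_install_entry(void *vaddr, unsigned pfn);   /* hypothetical */

typedef struct {
    unsigned pfn;            /* physical frame number */
    bool     valid;          /* page is resident */
    bool     sw_referenced;  /* software-maintained reference bit */
} SoftPTE;

/* Timer path: "clear" the reference bit by evicting the TLB entry. */
void clear_reference(SoftPTE *pte, void *vaddr) {
    pte->sw_referenced = false;
    tlb_invalidate_entry(vaddr);
}

/* TLB refill handler: runs on every TLB miss for a resident page. */
void tlb_refill_handler(SoftPTE *pte, void *vaddr) {
    pte->sw_referenced = true;           /* the reference we wanted to see */
    tlb_install_entry(vaddr, pte->pfn);  /* reinstall the mapping */
}
```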
Implementing working set tracking in a production operating system involves numerous practical challenges beyond the basic algorithm.
Challenge 1: Page Table Scanning Overhead
To calculate working set sizes, the OS must scan page tables:
- A large process can have millions of page table entries; touching each one on every timer tick costs significant CPU time.
- The scan itself pollutes the caches, disturbing the very working set it is measuring.
- The cost scales with address-space size, not with actual memory activity.

Mitigation Strategies:
- Incremental scanning: examine only a bounded number of pages per tick, completing a full pass over several ticks (sketched below).
- Skipping non-resident pages, since only resident pages hold frames.
- Statistical sampling: scan a random subset of pages and extrapolate the working set size.
```c
// Incremental page table scanning to reduce overhead
#define PAGES_PER_SCAN 1024   // Scan 1024 pages per timer tick
#define WINDOW_INTERVALS 4    // Intervals that count as "recently used"

extern void age_reference_history(int page);  // Shifts the page's history bits

typedef struct {
    int scan_position;      // Where to resume scanning
    int working_set_size;   // Accumulated count
    int scan_complete;      // Full scan finished this cycle?
} ProcessScanState;

void incremental_working_set_scan(Process* proc) {
    ProcessScanState* state = &proc->scan_state;
    int pages_scanned = 0;

    // Scan up to PAGES_PER_SCAN pages
    while (pages_scanned < PAGES_PER_SCAN &&
           state->scan_position < proc->num_pages) {
        int page = state->scan_position;

        if (proc->page_table[page].present) {  // Only scan resident pages
            if (is_in_working_set(page, WINDOW_INTERVALS)) {
                state->working_set_size++;
            }
            // Age the reference history
            age_reference_history(page);
        }

        state->scan_position++;
        pages_scanned++;
    }

    // Check if we've completed a full scan
    if (state->scan_position >= proc->num_pages) {
        state->scan_complete = 1;
        state->scan_position = 0;

        // Report working set size
        proc->current_wss = state->working_set_size;
        state->working_set_size = 0;  // Reset for next cycle
    }
}
```

Challenge 2: Context Switch Handling
When a process is context-switched out, its reference bits stop changing, but wall-clock time continues passing:
Question: Does the descheduled process's working set still include the pages it referenced before the switch?
Options:
- Freeze virtual time while the process is descheduled, preserving its working set exactly (theoretically correct: no references means no aging).
- Let wall-clock aging continue, accepting that a long-blocked process's working set drains away and must be re-faulted in at resume.
- Discard the working set on swap-out and restore it wholesale on swap-in, as some systems do for suspended processes.
Challenge 3: Shared Pages
Pages shared between processes complicate working set tracking:
- A shared page may belong to several processes' working sets at once; does each process pay for it out of its own frame allocation?
- Reference bits live in each process's page table, so each mapping ages independently.
- A shared frame can be reclaimed only when the page is outside every sharer's working set, not just one.
Typically, working sets are per-process with independent reference bits per mapping, but physical frame allocation considers sharing.
On multi-core systems, a process may run on different cores over time, and different cores cache TLB entries separately. Clearing reference bits requires TLB shootdowns (inter-processor interrupts to invalidate remote TLBs). This synchronization overhead can be significant on systems with many cores.
Challenge 4: Determining When to Act
Knowing working set sizes is only useful if the OS uses this information effectively:
- If a process's WSS exceeds its allocation, should frames be granted immediately or only after the pressure persists?
- If the sum of all working sets exceeds physical memory, which process should be suspended?
- How quickly should frames freed by a shrinking process be redistributed?

Too aggressive reallocation causes thrashing during temporary WSS fluctuations. Too conservative reallocation wastes memory. Finding the right balance requires:
- Damping: smooth raw WSS measurements (for example with an exponentially weighted average) before acting on them.
- Hysteresis: act only when the smoothed estimate differs from the allocation by a clear margin, as in the sketch below.
- Rate limiting: bound how many frames move per decision cycle.
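A minimal sketch of that balance, assuming a hypothetical Process layout and allocator hooks (request_frames, release_frames), with invented constants for the smoothing factor and the dead band:

```c
/*
 * Sketch: damped, hysteresis-based frame allocation driven by WSS
 * estimates. The point is the smoothing (EWMA) and the dead band that
 * together keep the allocator from chasing transient fluctuations.
 */
#define GROW_THRESHOLD   1.10  /* act only if smoothed WSS > 110% of frames */
#define SHRINK_THRESHOLD 0.75  /* reclaim only if < 75% of frames */
#define ALPHA            0.25  /* EWMA smoothing factor */

typedef struct {
    double smoothed_wss;       /* exponentially smoothed WSS estimate */
    int    allocated_frames;
} Process;

extern void request_frames(Process *p, int n);  /* hypothetical hooks */
extern void release_frames(Process *p, int n);

/* Called each time a full working-set scan completes. */
void update_allocation(Process *p, int measured_wss) {
    /* Smooth the raw measurement to damp transient spikes. */
    p->smoothed_wss = ALPHA * measured_wss + (1.0 - ALPHA) * p->smoothed_wss;

    if (p->smoothed_wss > p->allocated_frames * GROW_THRESHOLD) {
        request_frames(p, (int)(p->smoothed_wss - p->allocated_frames));
    } else if (p->smoothed_wss < p->allocated_frames * SHRINK_THRESHOLD) {
        release_frames(p, (int)(p->allocated_frames - p->smoothed_wss));
    }
    /* Inside the dead band: do nothing (hysteresis). */
}
```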
Challenge 5: Global vs. Local Decisions
The working set model tells us each process's needs, but global decisions must balance:
- Fairness: every process deserves progress, not just those with small working sets.
- Throughput: favoring processes whose working sets fit entirely in memory keeps the CPU busy.
- Priorities: a high-priority process may warrant its full working set even if others must be suspended.
These policy decisions, while informed by working set data, require additional heuristics and priorities.
Several algorithms have been developed to track working sets with varying degrees of accuracy and overhead.
The WSClock Algorithm:
A particularly elegant approach combines clock-style replacement with working set semantics:
- Resident pages sit on a circular list swept by a clock hand, as in the classic clock algorithm.
- Each entry records the virtual time of the page's last observed use.
- On a page fault, the hand sweeps: a page with its reference bit set gets a fresh timestamp and a cleared bit; a page whose age exceeds τ lies outside the working set and becomes an eviction candidate.
- Clean candidates are evicted on the spot; dirty ones have a writeback scheduled while the sweep continues. A minimal sketch follows.
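Here is a minimal WSClock sketch in the same conceptual style as the earlier code. The Frame layout and the I/O hooks (schedule_writeback, current_virtual_time) are hypothetical, and the wrap-around case is simplified to evicting the frame at the hand rather than waiting for a writeback to complete.

```c
/* Minimal WSClock sketch over a circular list of resident frames. */
#include <stdbool.h>
#include <stdint.h>

typedef struct Frame {
    struct Frame *next;          /* circular list */
    uint64_t time_of_last_use;   /* virtual time of last observed use */
    bool referenced;             /* R bit, copied from the PTE */
    bool dirty;                  /* D bit */
    int  page;
} Frame;

extern void schedule_writeback(Frame *f);      /* hypothetical async write */
extern uint64_t current_virtual_time(void);    /* hypothetical clock */

/* Pick a victim frame; tau is the working set window in virtual time. */
Frame *wsclock_evict(Frame **hand, uint64_t tau) {
    uint64_t now = current_virtual_time();
    Frame *start = *hand;
    Frame *f = start;

    do {
        if (f->referenced) {
            /* Recently used: record the use, clear R, keep the page. */
            f->referenced = false;
            f->time_of_last_use = now;
        } else if (now - f->time_of_last_use > tau) {
            /* Outside the working set window. */
            if (!f->dirty) {
                *hand = f->next;
                return f;               /* clean: evict immediately */
            }
            schedule_writeback(f);      /* dirty: write, keep scanning */
            f->dirty = false;           /* simplification: async write assumed done */
        }
        f = f->next;
    } while (f != start);

    /* Wrapped with no clean, old page: evict the frame at the hand
     * (simplification; real WSClock would wait for a writeback). */
    *hand = start->next;
    return start;
}
```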
Advantages of WSClock:
- No periodic full scan: work happens only when a replacement is needed, and only until a victim is found.
- Unifies replacement and working set maintenance in a single mechanism.
- Honors working set semantics: pages inside the window are never chosen for eviction.
- Schedules writebacks opportunistically, so dirty pages rarely stall an eviction.
Comparison of Tracking Approaches:
| Algorithm | Accuracy | Overhead | Granularity | Best For |
|---|---|---|---|---|
| Timer Sampling | Medium | Low | Timer interval | General purpose |
| Bit Aging | Good | Medium | Timer interval | When history matters |
| WSClock | Good | Low | Per-eviction | Unified approach |
| Sampling | Statistical | Very Low | Configurable | Large address spaces |
| Fault Counting | Indirect | Very Low | Per-fault | Adaptive allocation |
Modern systems increasingly favor indirect measurement (like page fault rates) over direct working set tracking. The insight: if a process isn't faulting, it has enough frames — regardless of what the 'true' working set might be. This leads to Page Fault Frequency (PFF) allocation, which we'll explore in detail in a later page.
Despite its elegance and influence, the pure working set model has significant limitations that have led to alternative approaches in practice.
The Fundamental Tension:
The working set model embodies a tension between:
- Theoretical precision: an exact, per-reference window for every process, which demands bookkeeping on every memory access.
- Implementation feasibility: coarse timer sampling, bit-shifted histories, and bounded scans that only approximate the model.
Real implementations compromise on purity to achieve feasibility. But once we're approximating anyway, alternative models (like PFF) may achieve better results with simpler mechanisms.
Why PFF Often Wins:
Page Fault Frequency (PFF) addresses many working set limitations:
| Issue | Working Set Approach | PFF Approach |
|---|---|---|
| Parameter selection | Must choose Δ | Uses observed fault rate |
| Tracking overhead | Scan all pages | Only count faults |
| Correctness metric | Indirect (WSS ≈ frames needed) | Direct (fault rate = pain) |
| Implementation | Complex reference tracking | Simple counter |
| Self-adjusting | No | Yes (fault rate drives adjustment) |
PFF's insight: We don't need to know the working set; we just need to know if the process is suffering. If it's faulting too much, give it more frames. If it's faulting rarely, take some frames away. This behavioral approach sidesteps much of the working set model's complexity.
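As a preview of the behavioral approach, a PFF-style controller can be sketched in a few lines; the thresholds and allocator hooks here are invented for illustration.

```c
/*
 * Sketch of the PFF idea: adjust a process's frame allocation from its
 * observed fault rate alone, with no working set tracking at all.
 */
#include <stdint.h>

#define PFF_UPPER 10.0   /* faults/sec: process is suffering, add frames */
#define PFF_LOWER  1.0   /* faults/sec: process is comfortable, reclaim  */

typedef struct {
    uint64_t fault_count;        /* incremented by the page fault handler */
    uint64_t last_fault_count;
    uint64_t last_check_ms;
} PffState;

extern void add_frame(void);     /* hypothetical allocator hooks */
extern void remove_frame(void);

void pff_check(PffState *s, uint64_t now_ms) {
    double elapsed = (now_ms - s->last_check_ms) / 1000.0;
    if (elapsed <= 0) return;
    double rate = (s->fault_count - s->last_fault_count) / elapsed;

    if (rate > PFF_UPPER)      add_frame();     /* faulting too much */
    else if (rate < PFF_LOWER) remove_frame();  /* frames going unused */

    s->last_check_ms = now_ms;
    s->last_fault_count = s->fault_count;
}
```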
Despite its practical limitations, the working set model remains foundational. It provided the conceptual framework—locality, phase behavior, per-process allocation based on need—that informs all modern memory management. Even systems that don't implement working set tracking explicitly are designed with its principles in mind.
We've explored the working set window in comprehensive depth, from its theoretical properties to practical implementation challenges. Let's consolidate the key insights:
- Δ is properly measured in virtual time (memory references), but practical systems substitute wall-clock timer intervals because per-reference counting is unaffordable.
- Δ selection is a tradeoff spectrum: too small causes re-fault oscillation at the window's edge; too large hoards ghost pages; the optimum spans exactly the current locality and varies by program, phase, and input.
- Hardware reference bits plus software aging (shifted bit histories) make approximate tracking cheap, though TLB caching complicates bit harvesting.
- Production implementations must also handle scan overhead, context switches, shared pages, and global allocation policy.
- These costs and approximations motivated behavioral alternatives, chiefly Page Fault Frequency.
What's Next:
In the next page, we'll explore Page Fault Frequency (PFF)—an alternative approach to dynamic allocation that sidesteps many working set limitations by focusing directly on observable behavior (fault rates) rather than inferred state (working set membership).
You now understand the working set window in depth—its definition, the tradeoffs in selecting it, how systems track it in practice, and why this complexity motivated simpler alternatives. This understanding prepares you for the next evolution in memory management: the Page Fault Frequency approach.