Having the right scheduling policy is necessary but not sufficient for real-time performance. The scheduler can only dispatch a task when it's ready to run—but numerous system factors can delay that readiness or introduce jitter in execution.
Consider a real-time task scheduled at priority 99 with SCHED_FIFO. Even with an empty runqueue, this task might experience delays from: interrupt handling, memory allocation page faults, timer granularity, CPU power state transitions, cache effects, lock contention, and system management interrupts. Each source adds microseconds to milliseconds of unpredictable latency.
Latency reduction is the systematic elimination or minimization of these delay sources. It requires understanding the complete path from event occurrence to task response, identifying every potential delay, and applying targeted mitigations.
By the end of this page, you will understand: (1) The complete latency path from hardware event to task response; (2) Kernel configuration options that reduce latency; (3) Hardware and BIOS optimizations for determinism; (4) Application-level techniques to minimize jitter; (5) How to measure and verify latency guarantees; and (6) Trade-offs between average performance and worst-case latency.
Before reducing latency, we must understand what comprises it. The end-to-end response time from an external event to application response includes multiple distinct phases:
```
Event-to-Response Latency Path:

External Event (e.g., sensor input)
        │
        ▼
┌─────────────────────────────────────────────┐
│ Hardware Latency                            │
│  - Signal propagation                       │
│  - Interrupt controller processing          │
│  - CPU interrupt recognition                │
└─────────────────────────────────────────────┘
        │  Typical: 1-10 μs
        ▼
┌─────────────────────────────────────────────┐
│ Interrupt Latency                           │
│  - Interrupts-disabled sections             │
│  - Interrupt priority resolution            │
│  - Hardirq handler start                    │
└─────────────────────────────────────────────┘
        │  Typical: 1-100 μs (can be ms without PREEMPT_RT)
        ▼
┌─────────────────────────────────────────────┐
│ Handler Execution                           │
│  - ISR or threaded IRQ handler              │
│  - Wake waiting task                        │
└─────────────────────────────────────────────┘
        │  Typical: 1-50 μs
        ▼
┌─────────────────────────────────────────────┐
│ Scheduling Latency                          │
│  - Scheduler invocation                     │
│  - Task selection                           │
│  - Context switch                           │
└─────────────────────────────────────────────┘
        │  Typical: 1-20 μs (PREEMPT_RT)
        ▼
┌─────────────────────────────────────────────┐
│ Task Wakeup Latency                         │
│  - Cache warmup                             │
│  - TLB repopulation                         │
│  - Branch predictor warmup                  │
└─────────────────────────────────────────────┘
        │
        ▼
Application Code Runs
        │
        ▼
Response Action (e.g., motor command)

TOTAL: Sum of all components + variability of each
```

The Jitter Problem:
For real-time systems, jitter (variation in latency) is often more problematic than absolute latency. A control system can adapt to a consistent 100μs delay, but random variations between 10μs and 500μs make control loop tuning impossible and can cause instability.
| Characteristic | Consistent 100μs Latency | Variable 10-500μs Latency |
|---|---|---|
| Average Latency | 100 μs | ~200 μs |
| Worst-Case Latency | 100 μs | 500 μs |
| Control Loop Tuning | Straightforward compensation | Difficult, may be unstable |
| Timing Predictability | Fully predictable | Unpredictable |
| System Design | Simple, reliable | Complex, oversized margins |
When evaluating latency reduction techniques, always measure worst-case latency under load, not average latency. A technique that improves average by 50% but doesn't affect worst-case has zero value for real-time guarantees.
Beyond selecting PREEMPT_RT, numerous kernel configuration options affect latency. Proper configuration can mean the difference between 50μs and 500μs worst-case latency.
```
# ============================================
# ESSENTIAL: Preemption Mode
# ============================================
CONFIG_PREEMPT_RT=y            # Full real-time preemption

# ============================================
# ESSENTIAL: Timer Configuration
# ============================================
CONFIG_HIGH_RES_TIMERS=y       # Nanosecond-resolution timers
CONFIG_HZ_1000=y               # 1000Hz timer tick (1ms resolution)
                               # or CONFIG_NO_HZ_FULL for tickless

# NO_HZ options (choose one):
# CONFIG_HZ_PERIODIC=y         # Traditional periodic tick
# CONFIG_NO_HZ_IDLE=y          # Tickless when idle (common default)
# CONFIG_NO_HZ_FULL=y          # Full tickless for isolated CPUs

# ============================================
# RECOMMENDED: Reduce Interrupt Latency
# ============================================
CONFIG_IRQSOFF_TRACER=y        # Track IRQs-off latency (debug/tune)
CONFIG_PREEMPTIRQ_EVENTS=y     # Preemption/IRQ event tracing
CONFIG_IRQ_FORCED_THREADING=y  # Force threading of IRQ handlers

# ============================================
# MEMORY: Avoid allocation latency spikes
# ============================================
CONFIG_TRANSPARENT_HUGEPAGE=n  # Disable THP (compaction latency)
CONFIG_COMPACTION=n            # Or carefully tune if enabled
CONFIG_SLUB=y                  # SLUB allocator (vs SLAB)
CONFIG_SLUB_CPU_PARTIAL=y      # Reduce cross-CPU slab operations

# ============================================
# CPU POWER: Avoid C-state transition latency
# ============================================
# At runtime: processor.max_cstate=1 or =0 boot parameter
# Or use PM QoS to constrain C-states programmatically

# ============================================
# DISABLE: Features that add latency
# ============================================
CONFIG_DEBUG_PREEMPT=n         # Disable in production
CONFIG_DEBUG_SPINLOCK=n        # Disable in production
CONFIG_LOCKDEP=n               # Disable in production
CONFIG_PROVE_LOCKING=n         # Disable in production
CONFIG_DEBUG_MUTEXES=n         # Disable in production

# Note: Keep tracing enabled until the system is validated,
# then consider disabling for absolute minimum latency:
# CONFIG_FTRACE=n              # Last resort for lowest latency
```

Timer Configuration Deep Dive:
Timer configuration profoundly affects scheduling granularity and latency:
| Option | Effect | Latency Impact | When to Use |
|---|---|---|---|
| CONFIG_HZ_100 | 100Hz tick, 10ms resolution | Poor granularity, higher jitter | Servers, throughput focus |
| CONFIG_HZ_250 | 250Hz tick, 4ms resolution | Moderate granularity | Desktop, general use |
| CONFIG_HZ_1000 | 1000Hz tick, 1ms resolution | Good granularity, slight overhead | Soft real-time, gaming |
| CONFIG_NO_HZ_IDLE | Tickless when idle | Reduces idle wakeups | General, with RT tasks |
| CONFIG_NO_HZ_FULL | Full tickless on isolated CPUs | Eliminates tick interrupt entirely | Dedicated RT CPUs |
For the lowest latency on dedicated RT CPUs, use CONFIG_NO_HZ_FULL with the nohz_full= boot parameter. This eliminates the periodic timer tick entirely on specified CPUs, removing a source of jitter for RT tasks pinned to those CPUs.
Modern hardware includes features designed to improve average performance or power efficiency that wreak havoc on real-time determinism. Disabling or constraining these features is often necessary.
BIOS/UEFI Configuration:
| Setting | Recommended | Reason |
|---|---|---|
| C-States | Disable C3 and deeper | Eliminates 100μs+ wakeup latency |
| Intel SpeedStep/AMD Cool'n'Quiet | Disable or lock to max | Prevents frequency scaling latency |
| Intel Turbo Boost | Disable for consistency | Eliminates frequency uncertainty |
| Hyper-Threading | Consider disabling | Reduces contention on shared resources |
| NUMA Interleaving | Disable for RT | Predictable memory access latency |
| USB Legacy/PS2 Emulation | Disable | Reduces SMI frequency |
| Power Management (ACPI) | Minimal or disable | Fewer power transitions |
Linux Boot Parameters for Hardware Control:
```
# Add to kernel command line (e.g., /etc/default/grub GRUB_CMDLINE_LINUX)

# CPU isolation for RT tasks on CPUs 2,3
isolcpus=2,3               # Remove CPUs from general scheduler
nohz_full=2,3              # Disable timer tick on these CPUs
rcu_nocbs=2,3              # Move RCU callbacks off these CPUs
irqaffinity=0,1            # Bind IRQs to CPUs 0,1 only

# C-state control
processor.max_cstate=1     # Limit to C1 (shallowest idle)
intel_idle.max_cstate=0    # Disable intel_idle driver deep states
# Or use idle=poll for no idle at all (extreme, high power)

# CPU frequency control
intel_pstate=disable       # Use acpi-cpufreq for manual control
                           # Then set governor to performance

# Memory
transparent_hugepage=never # Disable THP from boot

# Example complete RT command line:
# GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 \
#   irqaffinity=0,1 processor.max_cstate=1 intel_pstate=disable \
#   transparent_hugepage=never"
```

Runtime Hardware Control:
```bash
#!/bin/bash
# Runtime configuration for RT hardware optimization

# Set CPU frequency governor to performance (all CPUs)
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$gov"
done

# Lock CPU frequency to maximum (min = max)
for dir in /sys/devices/system/cpu/cpu*/cpufreq; do
    cat "$dir/scaling_max_freq" > "$dir/scaling_min_freq"
done

# Disable CPU frequency boost (Intel Turbo / AMD Boost)
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null
echo 0 > /sys/devices/system/cpu/cpufreq/boost 2>/dev/null

# Move IRQs off RT CPUs (assuming CPUs 2,3 are RT)
for irq in /proc/irq/*/smp_affinity; do
    echo 3 > "$irq" 2>/dev/null   # CPUs 0,1 only (bitmask 0011)
done

# PM QoS: Prevent deep C-states from the application
# In C code:
#   int fd = open("/dev/cpu_dma_latency", O_RDWR);
#   int32_t latency = 0;  /* Microseconds: 0 = no deep idle */
#   write(fd, &latency, sizeof(latency));
#   /* Keep fd open while running RT tasks */

# Verify isolation
echo "Isolated CPUs: $(cat /sys/devices/system/cpu/isolated)"
echo "NO_HZ Full CPUs: $(cat /sys/devices/system/cpu/nohz_full 2>/dev/null)"
```

SMIs (System Management Interrupts) are the most insidious latency source: invisible to the OS, non-maskable, and capable of taking milliseconds. Common causes include thermal monitoring, hardware error logging, memory scrubbing, and USB legacy emulation. Some can be disabled in the BIOS; others require specialized hardware or acceptance of occasional long latencies.
Dedicating specific CPUs to real-time tasks—CPU isolation—is one of the most effective latency reduction techniques. Isolated CPUs run only your RT tasks, free from kernel housekeeping, other processes, and most interrupts.
```
CPU Isolation Strategy: System with 4 CPUs (0-3)

┌─────────────────────────────────────────────────────────┐
│ CPUs 0,1: Housekeeping                                  │
│  • General kernel threads (kworker, ksoftirqd, etc.)    │
│  • Non-RT applications                                  │
│  • Most device interrupts                               │
│  • RCU callback processing                              │
│  • Timer tick (for these CPUs)                          │
│  • Network stack, block I/O                             │
├─────────────────────────────────────────────────────────┤
│ CPUs 2,3: Isolated for RT                               │
│  • RT application threads ONLY                          │
│  • No timer tick (nohz_full)                            │
│  • No RCU callbacks (rcu_nocbs)                         │
│  • Minimal or no interrupts                             │
│  • No kernel housekeeping                               │
└─────────────────────────────────────────────────────────┘

Result: RT tasks run with minimal interference
```

Implementing CPU Isolation:
```
# Step 1: Kernel boot parameters
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub:
#   isolcpus=2,3      # Remove from general scheduling
#   nohz_full=2,3     # Disable timer tick
#   rcu_nocbs=2,3     # Offload RCU callbacks
#   irqaffinity=0,1   # Default IRQ affinity to non-isolated CPUs
# After editing: update-grub && reboot

# Step 2: Verify isolation after boot
cat /sys/devices/system/cpu/isolated   # Should show: 2-3
cat /sys/devices/system/cpu/nohz_full  # Should show: 2-3

# Step 3: Check for remaining kernel threads on isolated CPUs
# Most should already be off, but verify:
ps -eo pid,psr,comm | awk '$2 ~ /^[23]$/'
# Moving stragglers (e.g., kworker) may require cgroups or
# other special handling

# Step 4: Pin the RT application to an isolated CPU
taskset -c 2 chrt -f 90 ./my_rt_application
```

Or programmatically with CPU affinity:
```c
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>   /* mlockall */

/*
 * Pin the current thread to a specific isolated CPU.
 */
int pin_to_cpu(int cpu) {
    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);

    if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
        perror("sched_setaffinity");
        return -1;
    }

    /* Verify affinity */
    CPU_ZERO(&cpuset);
    sched_getaffinity(0, sizeof(cpuset), &cpuset);
    if (!CPU_ISSET(cpu, &cpuset)) {
        fprintf(stderr, "Failed to pin to CPU %d\n", cpu);
        return -1;
    }

    printf("Pinned to CPU %d\n", cpu);
    return 0;
}

/*
 * Complete RT setup: pin CPU + set scheduling.
 */
int setup_rt_thread(int cpu, int priority) {
    struct sched_param param;

    /* Pin to the isolated CPU first */
    if (pin_to_cpu(cpu) != 0)
        return -1;

    /* Set SCHED_FIFO with the given priority */
    param.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return -1;
    }

    printf("Configured: CPU %d, SCHED_FIFO priority %d\n", cpu, priority);
    return 0;
}

/* Application-supplied work functions */
extern void do_rt_work(void);
extern void wait_for_next_period(void);

/* Example usage in an RT thread */
void* rt_worker(void* arg) {
    int cpu = *(int *)arg;

    /* Setup on thread start */
    if (setup_rt_thread(cpu, 90) != 0)
        return NULL;

    /* Lock memory to avoid page faults */
    mlockall(MCL_CURRENT | MCL_FUTURE);

    /* RT work loop */
    while (1) {
        do_rt_work();            /* Do RT work */
        wait_for_next_period();  /* Wait for next period */
    }
    return NULL;
}
```

On NUMA systems, also pin memory allocation to the same NUMA node as the CPU. Use `numactl --cpunodebind=N --membind=N` or, programmatically, `set_mempolicy()`. Cross-node memory access adds significant latency and variability.
Even with perfect kernel and hardware configuration, application code can introduce latency and jitter. Following RT application best practices is essential for achieving deterministic behavior.
First, lock all memory with `mlockall(MCL_CURRENT | MCL_FUTURE)` to prevent page faults during RT execution:
```c
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#define STACK_PREFAULT_SIZE (512 * 1024)  /* 512KB stack prefault */

/*
 * Prepare memory for real-time execution.
 * Call BEFORE entering the RT critical section.
 */
int prepare_rt_memory(void) {
    /* 1. Lock all current and future memory */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall failed - running as root?");
        return -1;
    }

    /* 2. Pre-fault the stack by touching its pages */
    volatile char stack_prefault[STACK_PREFAULT_SIZE];
    memset((void*)stack_prefault, 0, sizeof(stack_prefault));

    /* 3. Pre-fault heap allocations */
    /* Any malloc'd memory should be touched here */

    return 0;
}

/*
 * Pre-allocated buffer pool for RT-safe allocation.
 */
#define POOL_MAX 100

struct buffer_pool {
    void*  buffers[POOL_MAX];
    int    free_mask[POOL_MAX];
    size_t buf_size;
};

struct buffer_pool* create_buffer_pool(size_t buf_size, int count) {
    /* calloc zeroes free_mask, so unused slots are never handed out */
    struct buffer_pool* pool = calloc(1, sizeof(*pool));

    if (count > POOL_MAX)
        count = POOL_MAX;

    for (int i = 0; i < count; i++) {
        pool->buffers[i] = aligned_alloc(64, buf_size);
        memset(pool->buffers[i], 0, buf_size);  /* Pre-fault */
        pool->free_mask[i] = 1;                 /* Available */
    }
    pool->buf_size = buf_size;

    /* Lock pool memory */
    mlock(pool, sizeof(*pool));
    for (int i = 0; i < count; i++)
        mlock(pool->buffers[i], buf_size);

    return pool;
}

/* O(n) but deterministic - no system calls */
void* pool_alloc(struct buffer_pool* pool) {
    for (int i = 0; i < POOL_MAX; i++) {
        if (pool->free_mask[i]) {
            pool->free_mask[i] = 0;
            return pool->buffers[i];
        }
    }
    return NULL;  /* Pool exhausted */
}

void pool_free(struct buffer_pool* pool, void* ptr) {
    for (int i = 0; i < POOL_MAX; i++) {
        if (pool->buffers[i] == ptr) {
            pool->free_mask[i] = 1;
            return;
        }
    }
}
```
```c
#include <time.h>
#include <stdint.h>

/* Application-supplied work function */
extern void do_real_time_work(void);

/*
 * Precise periodic timing using absolute clock_nanosleep.
 *
 * This approach doesn't accumulate drift because each
 * sleep targets an absolute time, not a relative delay.
 */
void periodic_rt_loop(uint64_t period_ns) {
    struct timespec next_wake;

    /* Get initial time */
    clock_gettime(CLOCK_MONOTONIC, &next_wake);

    while (1) {
        /* Calculate the next wake time */
        next_wake.tv_nsec += (long)period_ns;
        while (next_wake.tv_nsec >= 1000000000L) {
            next_wake.tv_nsec -= 1000000000L;
            next_wake.tv_sec++;
        }

        /* ===== RT WORK SECTION ===== */
        do_real_time_work();
        /* ===== END RT WORK ===== */

        /* Sleep until the absolute time (no drift!) */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next_wake, NULL);

        /*
         * Compare to relative sleep (BAD - accumulates drift):
         *   nanosleep(&period_duration, NULL);
         *
         * With relative sleep, if the work takes longer than
         * expected, the error accumulates each period.
         */
    }
}

/*
 * Measure execution time for WCET estimation.
 */
uint64_t measure_execution_time(void (*func)(void)) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    func();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) * 1000000000ULL +
           (end.tv_nsec - start.tv_nsec);
}
```

Many innocent-looking operations invoke system calls: `printf()` (`write()`), `malloc()` (potentially `brk()` or `mmap()`), `std::cout`, even some math library functions. Profile your RT code to identify hidden syscalls and eliminate them or move them outside the RT path.
You cannot improve what you cannot measure. Latency measurement is essential for validating RT system behavior and identifying optimization targets.
cyclictest: The Standard RT Benchmark
cyclictest is the de facto standard tool for measuring scheduling latency on Linux. It creates RT threads that sleep for a precise interval and measures the difference between expected and actual wake time.
```
# Install the rt-tests package
# Ubuntu/Debian: apt install rt-tests
# Fedora/RHEL:   dnf install rt-tests

# Basic cyclictest run
sudo cyclictest --mlockall --priority=90 --interval=1000 --loops=100000
# -m / --mlockall: Lock memory
# -p / --priority: RT priority (1-99)
# -i / --interval: Sleep interval in microseconds (1000 = 1ms)
# -l / --loops:    Number of iterations (or 0 for infinite)

# Per-CPU thread test (recommended)
sudo cyclictest -t -p 90 -m -n -i 1000 -l 1000000
# -t: One thread per CPU
# -n: Use clock_nanosleep instead of nanosleep

# Test on isolated CPUs only
sudo cyclictest -t2 -a 2,3 -p 90 -m -n -i 1000 -l 1000000
# -t2:    Two threads
# -a 2,3: Pin to CPUs 2 and 3

# Generate histogram output
sudo cyclictest -t -p 90 -m -n -i 1000 -l 100000 -h 100 > histogram.txt
# -h 100: Histogram with 100 buckets

# Example output:
# T: 0 (12345) P:90 I:1000 C: 100000 Min: 5 Act: 11 Avg: 10 Max: 42
# T: 1 (12346) P:90 I:1000 C: 100000 Min: 4 Act: 10 Avg:  9 Max: 38
#
# Key metrics:
#   Min: Minimum latency (best case)
#   Avg: Average latency
#   Max: Maximum latency (WORST CASE - most important!)
```

Testing Under Load:
RT latency must be measured under realistic load. A system that achieves 20μs latency when idle may show 500μs under load. Use stress tools to simulate production conditions:
```
# Run stress load in the background while measuring with cyclictest

# CPU stress (run on non-isolated CPUs)
stress-ng --cpu 2 --cpu-load 100 --taskset 0,1 &

# Memory stress
stress-ng --vm 2 --vm-bytes 1G --taskset 0,1 &

# I/O stress
stress-ng --io 4 --taskset 0,1 &

# Network stress (if applicable)
iperf3 -s &                      # Server on one machine
iperf3 -c <server_ip> -t 300 &   # Client generates traffic

# Disk stress
fio --name=randwrite --ioengine=libaio --iodepth=32 --rw=randwrite \
    --bs=4k --size=1G --numjobs=4 --runtime=300 --time_based &

# Now run cyclictest on the isolated CPUs
sudo cyclictest -t2 -a 2,3 -p 90 -m -n -i 1000 -l 1000000

# Compare Max latency with and without load!
# A good PREEMPT_RT system should show similar worst-case
# latency regardless of load on the housekeeping CPUs.
```

Kernel Tracing for Latency Analysis:
```
# Enable the IRQs-off latency tracer
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo irqsoff > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on

# Let the system run, then check the max latency
cat /sys/kernel/debug/tracing/tracing_max_latency
# Shows maximum time IRQs were disabled (in microseconds)

# View the trace of the maximum-latency event
cat /sys/kernel/debug/tracing/trace
# Shows the call stack during the longest IRQs-off period

# Reset and continue monitoring
echo 0 > /sys/kernel/debug/tracing/tracing_max_latency

# Similarly for the preemptoff tracer (tracks preempt-disabled time)
echo preemptoff > /sys/kernel/debug/tracing/current_tracer

# Or use the wakeup tracer (tracks task wakeup-to-run latency)
echo wakeup > /sys/kernel/debug/tracing/current_tracer
```

Real-time latency issues often manifest rarely: once per hour or per day. Test for extended periods (24+ hours) under load to catch rare worst-case events. A 10-minute test may miss the 1-in-a-million event that causes a deadline miss.
Latency reduction involves trade-offs. Understanding these helps you make informed decisions for your specific requirements.
| Optimization | Latency Benefit | Cost/Trade-off |
|---|---|---|
| PREEMPT_RT kernel | 10-100x lower worst-case | ~5-10% throughput reduction |
| CPU isolation | Eliminates interference | Fewer CPUs for general work |
| Disable C-states | No wake latency | Higher power consumption |
| Disable frequency scaling | Consistent timing | Higher power, potential thermal issues |
| mlockall() | No page faults | Increased memory usage |
| Disable THP | No compaction latency | Potentially higher TLB misses |
| nohz_full | No tick interrupt | Slightly complex kernel behavior |
Diminishing Returns:
Latency optimization follows diminishing returns. The first optimizations (PREEMPT_RT, memory locking) provide enormous benefits. Later optimizations provide smaller improvements at increasing complexity or cost.
```
Recommended Optimization Order (impact vs effort):

1. PREEMPT_RT kernel          ████████████████████  (Massive impact)
2. mlockall() + prefault      ██████████████        (Major impact)
3. Correct RT scheduling      █████████████         (Major impact)
4. CPU isolation              ███████████           (Significant)
5. IRQ affinity               █████████             (Moderate)
6. C-state/P-state tuning     ████████              (Moderate)
7. nohz_full                  ██████                (Incremental)
8. BIOS optimizations         █████                 (Environment-specific)
9. Disable debugging/tracing  ████                  (Final polish)

Don't over-optimize: If your requirement is 100μs worst-case and
you're achieving 50μs, you're done! Extra optimization is wasted
effort and may introduce unnecessary complexity or power cost.
```

Always measure before and after each optimization. Some "optimizations" may have no effect or even a negative impact in your specific environment. Let measurements guide your efforts, not assumptions.
Latency reduction is a systematic discipline requiring attention at the kernel, hardware, and application levels.
What's Next:
With latency reduction techniques mastered, we'll next explore RT-Linux variants and history—examining the evolution of real-time Linux approaches and understanding how PREEMPT_RT fits into the broader RT Linux ecosystem.
You now possess a comprehensive toolkit for reducing and measuring latency in real-time Linux systems. These techniques enable you to achieve microsecond-level determinism on commodity hardware.