In production machine learning systems, monitoring inference performance is mission-critical for maintaining service quality, detecting model degradation, and ensuring that service-level agreements (SLAs) are met. Operations teams rely on real-time dashboards that display key performance indicators derived from inference latency measurements.
Given a collection of inference latency measurements (in milliseconds) from a deployed ML model, your task is to compute the following essential monitoring statistics:
1. Throughput (requests per second): The theoretical maximum number of inference requests that can be processed per second, assuming single-threaded sequential processing. This is calculated as:
$$\text{Throughput} = \frac{1000}{\text{Average Latency (ms)}}$$
2. Average Latency: The arithmetic mean of all latency measurements, providing a general sense of typical response time:
$$\text{Average Latency} = \frac{1}{n} \sum_{i=1}^{n} \text{latency}_i$$
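As a quick sketch of these two formulas on sample data (the values are taken from the first example below; variable names are illustrative):

```python
latencies_ms = [10, 20, 30, 40, 50]  # sample measurements in milliseconds

avg_latency_ms = sum(latencies_ms) / len(latencies_ms)  # 150 / 5 = 30.0
throughput_per_sec = 1000 / avg_latency_ms              # 1000 / 30.0 ≈ 33.33
```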
3. Percentile Latencies (p50, p95, p99): Percentiles are crucial for understanding the tail latency distribution—the experience of the slowest requests that often impacts user satisfaction the most.
For percentile calculations, use linear interpolation between adjacent values when the percentile falls between two data points.
Percentile Calculation with Linear Interpolation:
For a sorted array of n latencies and a target percentile p (expressed as a decimal, e.g., 0.95 for p95):

$$\text{rank} = p \times (n - 1), \qquad i = \lfloor \text{rank} \rfloor, \qquad f = \text{rank} - i$$

$$\text{percentile} = \text{data}[i] + f \times (\text{data}[i+1] - \text{data}[i])$$

If f = 0, the rank lands exactly on an index and the percentile is simply data[i] with no interpolation.
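One concrete reading of this procedure as a small Python helper (the name `percentile_interp` is illustrative, not mandated by the problem):

```python
def percentile_interp(sorted_data, p):
    """Linear-interpolation percentile over an already-sorted list.

    p is a decimal (0.95 for p95), following the rank = p * (n - 1)
    convention used throughout this problem.
    """
    rank = p * (len(sorted_data) - 1)
    i = int(rank)   # index of the lower neighbour
    f = rank - i    # fractional distance toward the upper neighbour
    if f == 0 or i + 1 == len(sorted_data):
        return float(sorted_data[i])
    return sorted_data[i] + f * (sorted_data[i + 1] - sorted_data[i])
```

For example, `round(percentile_interp([10, 20, 30, 40, 50], 0.95), 2)` gives `48.0`, matching the worked example below.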
Your Task: Implement a function that takes a list of latency measurements and returns a dictionary containing all computed statistics. If the input list is empty, return an empty dictionary.
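The full task can be sketched as follows. This is a minimal sketch, not a reference solution: the function name is illustrative, and the two-decimal rounding is inferred from the sample outputs (e.g., 33.33 for a raw 33.333…).

```python
def compute_inference_stats(latencies_ms):
    """Return monitoring statistics for a list of latencies in ms.

    Assumptions from the problem statement: throughput = 1000 / average
    latency, percentiles use linear interpolation over rank = p * (n - 1),
    and an empty input yields an empty dictionary.
    """
    if not latencies_ms:
        return {}

    data = sorted(latencies_ms)
    n = len(data)
    avg = sum(data) / n

    def pct(p):
        rank = p * (n - 1)
        i = int(rank)       # lower index
        f = rank - i        # fractional part
        if i + 1 < n:
            return data[i] + f * (data[i + 1] - data[i])
        return float(data[i])

    return {
        "throughput_per_sec": round(1000 / avg, 2),
        "avg_latency_ms": round(avg, 2),
        "p50_ms": round(pct(0.50), 2),
        "p95_ms": round(pct(0.95), 2),
        "p99_ms": round(pct(0.99), 2),
    }
```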
Important Notes:
• Return a dictionary with the keys "throughput_per_sec", "avg_latency_ms", "p50_ms", "p95_ms", and "p99_ms".
• Round each reported value to two decimal places (e.g., a raw throughput of 33.333… is reported as 33.33).
• If the input list is empty, return an empty dictionary.

Example 1:
Input: latencies_ms = [10, 20, 30, 40, 50]
Output: {"throughput_per_sec": 33.33, "avg_latency_ms": 30.0, "p50_ms": 30.0, "p95_ms": 48.0, "p99_ms": 49.6}
Explanation: With 5 latency measurements [10, 20, 30, 40, 50]:
Average Latency: (10 + 20 + 30 + 40 + 50) / 5 = 150 / 5 = 30.0 ms
Throughput: 1000 / 30.0 = 33.33 requests/second
Percentile Calculations (n = 5, so n - 1 = 4):
• p50: rank = 0.50 × 4 = 2.0 → data[2] = 30.0 ms (exact index, no interpolation)
• p95: rank = 0.95 × 4 = 3.8 → i = 3, f = 0.8
Interpolation: data[3] + 0.8 × (data[4] - data[3]) = 40 + 0.8 × 10 = 48.0 ms
• p99: rank = 0.99 × 4 = 3.96 → i = 3, f = 0.96
Interpolation: 40 + 0.96 × (50 - 40) = 40 + 9.6 = 49.6 ms
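These hand calculations can be cross-checked with Python's standard library: `statistics.quantiles` with `method="inclusive"` applies the same rank = p × (n − 1) linear interpolation.

```python
from statistics import quantiles

data = [10, 20, 30, 40, 50]

# n=100 yields the 99 percentile cut points p1..p99;
# method="inclusive" matches the rank = p * (n - 1) convention.
cuts = quantiles(data, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]  # 30.0, 48.0, 49.6
```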
Example 2:
Input: latencies_ms = [5.0, 10.0, 15.0]
Output: {"throughput_per_sec": 100.0, "avg_latency_ms": 10.0, "p50_ms": 10.0, "p95_ms": 14.5, "p99_ms": 14.9}
Explanation: With 3 latency measurements [5.0, 10.0, 15.0]:
Average Latency: (5.0 + 10.0 + 15.0) / 3 = 30.0 / 3 = 10.0 ms
Throughput: 1000 / 10.0 = 100.0 requests/second
Percentile Calculations (n = 3, so n - 1 = 2):
• p50: rank = 0.50 × 2 = 1.0 → data[1] = 10.0 ms (exact index, no interpolation)
• p95: rank = 0.95 × 2 = 1.9 → i = 1, f = 0.9
Interpolation: 10.0 + 0.9 × (15.0 - 10.0) = 10.0 + 4.5 = 14.5 ms
• p99: rank = 0.99 × 2 = 1.98 → i = 1, f = 0.98
Interpolation: 10.0 + 0.98 × (15.0 - 10.0) = 10.0 + 4.9 = 14.9 ms
Example 3:
Input: latencies_ms = [25.0, 25.0, 25.0, 25.0]
Output: {"throughput_per_sec": 40.0, "avg_latency_ms": 25.0, "p50_ms": 25.0, "p95_ms": 25.0, "p99_ms": 25.0}
Explanation: With 4 identical latency measurements [25.0, 25.0, 25.0, 25.0]:
Average Latency: Since all values are the same, the average is 25.0 ms
Throughput: 1000 / 25.0 = 40.0 requests/second
Percentile Calculations: When all values are identical, every percentile equals that value (25.0 ms). The interpolation formula still works correctly, since any interpolation between 25.0 and 25.0 yields 25.0.
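A quick numerical check of the constant-value case (a sketch; variable names are illustrative):

```python
data = [25.0, 25.0, 25.0, 25.0]

# p99: rank = 0.99 * 3 = 2.97 → i = 2, f = 0.97, but the neighbours are
# equal, so f * (data[3] - data[2]) is 0.0 and the result is 25.0.
rank = 0.99 * (len(data) - 1)
i = int(rank)
f = rank - i
p99 = data[i] + f * (data[i + 1] - data[i])  # 25.0
```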
This scenario represents highly consistent model performance—a desirable characteristic in production systems where predictable latency is valued.
Constraints