In production machine learning systems, processing individual inference requests one at a time is highly inefficient. Request batching is a critical optimization technique that groups multiple incoming requests together, enabling hardware accelerators (GPUs, TPUs) to leverage their parallel processing capabilities and dramatically improve throughput.
Your task is to implement an intelligent batching mechanism for ML model inference that balances two competing constraints:
Batch Size Constraint: A batch should not exceed a specified maximum number of requests, ensuring memory constraints are respected and batch processing remains efficient.
Latency Constraint: The elapsed time since the first request in the current batch should not exceed a specified wait time threshold, ensuring no request experiences excessive queuing delays.
Batching Logic: When a new request arrives, it should be added to the current batch if both constraints remain satisfied. If adding the request would violate either constraint, the current batch must be finalized and processed, and a new batch should begin with the incoming request.
Important Implementation Details:
• Requests may arrive out of chronological order; sort them by timestamp before applying the batching constraints.
• The wait-time clock for a batch starts at the timestamp of its first request.
• Each finalized batch is a tuple of (request IDs, feature vectors, timestamp of the batch's last request).
• After all requests have been consumed, any partially filled batch is finalized as well.
This batching strategy is fundamental to building high-performance ML serving infrastructure and relates directly to reducing per-request latency while maximizing hardware utilization.
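The logic described above can be sketched in Python as follows. This is a minimal reference sketch; the function name `batch_requests` and the internal helper are illustrative, not a required API.

```python
def batch_requests(requests, max_batch_size, max_wait_time):
    """Group requests into batches, respecting size and wait-time limits.

    Returns a list of (ids, features, last_timestamp) tuples, matching
    the expected-output format shown in the examples.
    """
    # Establish temporal ordering first: requests may arrive out of order.
    ordered = sorted(requests, key=lambda r: r['timestamp'])

    batches = []
    current = []  # requests in the batch being built

    def finalize():
        batches.append((
            [r['id'] for r in current],
            [r['features'] for r in current],
            current[-1]['timestamp'],  # timestamp of the batch's last request
        ))

    for req in ordered:
        if current:
            too_big = len(current) + 1 > max_batch_size
            too_late = req['timestamp'] - current[0]['timestamp'] > max_wait_time
            if too_big or too_late:
                finalize()       # adding req would violate a constraint,
                current = []     # so close the batch and start a new one
        current.append(req)

    if current:
        finalize()  # flush the final, possibly partial, batch
    return batches
```

Note that the batch is finalized lazily, when the next request arrives and would violate a constraint; this yields the same output as eager finalization because a batch's recorded timestamp is that of its last member.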
requests = [{'id': 1, 'timestamp': 0.0, 'features': [1.0, 2.0]}, {'id': 2, 'timestamp': 0.1, 'features': [3.0, 4.0]}, {'id': 3, 'timestamp': 0.2, 'features': [5.0, 6.0]}, {'id': 4, 'timestamp': 0.3, 'features': [7.0, 8.0]}]
max_batch_size = 2
max_wait_time = 10.0

Expected output:
[([1, 2], [[1.0, 2.0], [3.0, 4.0]], 0.1), ([3, 4], [[5.0, 6.0], [7.0, 8.0]], 0.3)]

With max_batch_size=2 and max_wait_time=10.0 seconds:
• Batch 1: Request 1 (t=0.0) starts the batch. Request 2 (t=0.1) arrives within the wait window and doesn't exceed batch size, so it joins. Batch reaches max size (2), so it's processed at t=0.1.
• Batch 2: Request 3 (t=0.2) starts a new batch. Request 4 (t=0.3) joins without violating constraints. Batch reaches max size, processed at t=0.3.
The wait time constraint (10.0s) is never triggered since all inter-arrival gaps are ≤0.3 seconds.
requests = [{'id': 1, 'timestamp': 0.0, 'features': [1.0, 2.0]}, {'id': 2, 'timestamp': 0.05, 'features': [3.0, 4.0]}, {'id': 3, 'timestamp': 0.15, 'features': [5.0, 6.0]}, {'id': 4, 'timestamp': 0.2, 'features': [7.0, 8.0]}]
max_batch_size = 10
max_wait_time = 0.1

Expected output:
[([1, 2], [[1.0, 2.0], [3.0, 4.0]], 0.05), ([3, 4], [[5.0, 6.0], [7.0, 8.0]], 0.2)]

With max_batch_size=10 (high) and max_wait_time=0.1 seconds (tight):
• Batch 1: Request 1 (t=0.0) starts the batch. Request 2 (t=0.05) arrives; elapsed time is 0.05s ≤ 0.1s, so it joins. Request 3 (t=0.15) would push the elapsed time to 0.15s > 0.1s, so the batch is finalized with requests 1 and 2. Its recorded timestamp is 0.05, the timestamp of its last request.
• Batch 2: Request 3 (t=0.15) starts a new batch. Request 4 (t=0.2) arrives with elapsed time 0.05s ≤ 0.1s, so it joins. With no further requests, this batch is finalized with timestamp 0.2.
Here, the wait time constraint drives batch formation, not batch size.
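The admission decision in both examples reduces to a single predicate over the two constraints. A sketch, where the helper name `would_violate` is illustrative:

```python
def would_violate(current_batch, request, max_batch_size, max_wait_time):
    """True if adding `request` to `current_batch` would break either constraint."""
    if len(current_batch) + 1 > max_batch_size:
        return True  # size constraint: batch would exceed the maximum
    # Latency constraint: elapsed time is measured from the batch's first request.
    elapsed = request['timestamp'] - current_batch[0]['timestamp']
    return elapsed > max_wait_time
```

When this predicate returns True, the current batch is finalized and the incoming request seeds a new one.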
requests = [{'id': 3, 'timestamp': 0.2, 'features': [5.0, 6.0]}, {'id': 1, 'timestamp': 0.0, 'features': [1.0, 2.0]}, {'id': 4, 'timestamp': 0.3, 'features': [7.0, 8.0]}, {'id': 2, 'timestamp': 0.1, 'features': [3.0, 4.0]}]
max_batch_size = 2
max_wait_time = 10.0

Expected output:
[([1, 2], [[1.0, 2.0], [3.0, 4.0]], 0.1), ([3, 4], [[5.0, 6.0], [7.0, 8.0]], 0.3)]

The requests arrive in non-chronological order (IDs: 3, 1, 4, 2), but they must be processed by timestamp:
• First, sort by timestamp: Request 1 (t=0.0), Request 2 (t=0.1), Request 3 (t=0.2), Request 4 (t=0.3)
• Then apply the same batching logic: Requests 1 & 2 form Batch 1, Requests 3 & 4 form Batch 2.
This demonstrates that the batching algorithm must first establish temporal ordering before applying batch constraints.
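The ordering step above is a simple key-based sort. A sketch using this example's input (field names follow the request dictionaries shown above):

```python
requests = [
    {'id': 3, 'timestamp': 0.2, 'features': [5.0, 6.0]},
    {'id': 1, 'timestamp': 0.0, 'features': [1.0, 2.0]},
    {'id': 4, 'timestamp': 0.3, 'features': [7.0, 8.0]},
    {'id': 2, 'timestamp': 0.1, 'features': [3.0, 4.0]},
]
# Sort by arrival timestamp before applying the batching constraints.
ordered = sorted(requests, key=lambda r: r['timestamp'])
print([r['id'] for r in ordered])  # → [1, 2, 3, 4]
```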
Constraints