High Performance Video Inference at Scale: Addressing System Overhead with C++ Coroutines and io_uring
Abstract
Multi-stream video inference systems typically build on a thread-heavy architecture. At scale, this model suffers from CPU migrations, context-switch storms, synchronization overhead, cache thrashing, TLB pollution, and unpredictable latency. This thesis presents a thread-per-core alternative that replaces OS-managed thread scheduling with user-space coroutine scheduling. Our system combines C++20 stackless coroutines for cooperative multitasking, Linux io_uring for asynchronous I/O, and non-blocking GPU completions for asynchronous inference requests. We benchmark across four scale factors and four execution modes on an AMD EPYC / NVIDIA A2 platform, profiling with perf stat and NVIDIA Nsight Systems. Compared to a properly optimized threaded baseline, our architecture achieves up to 13.6% higher throughput, 365x fewer CPU migrations, 2.48x fewer context switches, 2.7x fewer page faults, and a 16.9% reduction in total CPU work. GPU utilization for the two architectures is within 1% of each other.