High Performance Video Inference at Scale: Addressing System Overhead with C++ Coroutines and io_uring

Date

2026-06-01

Publisher

Virginia Tech

Abstract

Multi-stream video inference systems are typically built on a thread-heavy architecture. At scale, this model suffers from CPU migrations, context-switch storms, synchronization overhead, cache thrashing, TLB pollution, and unpredictable latency. This thesis presents a thread-per-core alternative that replaces OS-managed thread scheduling with user-space coroutine scheduling. Our system combines C++20 stackless coroutines for cooperative multitasking, Linux io_uring for asynchronous I/O, and non-blocking GPU completions for asynchronous inference requests. In benchmarks across four scale factors and four execution modes on an AMD EPYC / NVIDIA A2 platform, profiled with perf stat and NVIDIA Nsight Systems, our architecture achieves up to 13.6% higher throughput, 365x fewer CPU migrations, 2.48x fewer context switches, 2.7x fewer page faults, and a 16.9% reduction in total CPU work compared to a properly optimized threaded baseline. GPU utilization of the two architectures converges to within 1%.

Keywords

Video Inference, C++ Coroutines, User-Space Scheduling, io_uring, AI Infrastructure
