Craun, Milo
Williams, Dan
2025-10-01
2025-10-01
2025-09-08
https://hdl.handle.net/10919/137882

BPF programs are extensively used for tracing and observability in production systems where performance overheads matter. Many individual BPF programs do not incur serious performance-degrading overhead on their own, but increasingly more than one BPF program is used to understand production system performance. BPF deployments have begun to look more like distributed applications; however, this is a mismatch with the underlying Linux kernel, potentially leading to high overhead cost. In particular, we identify that many BPF programs follow a pattern based on pairwise program deployment, where entry and exit probes are attached to measure a single quantity. We find that the pairwise BPF program pattern results in unnecessary overheads. We identify three optimizations (BPF program inlining, context-aware optimization, and intermediate state internalization) that apply to pairwise BPF programs. We show that applying these optimizations to an example pairwise BPF program can reduce overhead on random read throughput from 28.13% to 8.98% and on random write throughput from 26.97% to 8.60%. We then examine some key design questions that arise when seeking to integrate optimizations with the existing BPF system.

application/pdf
en
Creative Commons Attribution 4.0 International
Pairwise BPF Programs Should Be Optimized Together
Article - Refereed
2025-10-01
The author(s)
https://doi.org/10.1145/3748355.3748362