Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation
Historically, CPUs have run programs ever faster without significant intervention on the part of the programmer. This "free lunch" has largely ended: core frequencies no longer increase exponentially and instruction-level parallelism now improves only slowly, although cache sizes continue to grow to a degree. Because Moore's law still largely holds, "lunch", i.e. program performance, can still be bought at the price of rewriting code for multiple cores, which the trend Moore's law describes makes available. However, multicore architectures cannot improve performance for problems whose solutions are necessarily sequential, and writing efficient and correct concurrent programs is not always easy with synchronization methods such as fine-grained mutex locks.
Transactional memory, and its implementation in hardware as hardware transactional memory, allows programmers to write concurrent applications without the attendant complexity of programming with mutex locks, leaving them free to focus on optimizing the application for performance. Transactions can run in parallel two segments of code that a mutex lock would force to run sequentially, but they can also abort, causing a program to do the same work more than once. Whether transactions perform better or worse than mutex locks therefore depends on the program's execution profile and on the granularity at which mutex locks are used.
This thesis compares the GNU C Library's futex implementation of mutex locks with Intel's Restricted Transactional Memory (RTM) and analyzes the behavior of the resulting transactions. The analysis includes a pathological behavior permitted by the GNU C Library's hardware transactional memory implementation of mutex locks. The tradeoffs between fine-grained and global locking implementations are discussed, compared, and applied in the context of fallback locks for hardware transactions. The thesis provides evidence that fine-grained locking is not critical for program performance and that in many cases global locking and hardware transactions provide nearly equivalent performance without the programming difficulties. Across the 23 applications examined, and relative to each application's original locking implementation, a global locking scheme without elision achieves a 0.96x speedup, RTM with the application's original locks as a fallback achieves a 1.01x speedup, and RTM with a global lock as a fallback achieves a 0.97x speedup.
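To make the distinction between global and fine-grained locking concrete, the following is a minimal illustrative sketch using POSIX threads. It is not taken from the applications evaluated in this thesis. The bucketed counter array, the names `run_counters`, `worker`, and `worker_args`, and all constants are assumptions chosen only for illustration. Each thread increments pseudo-randomly chosen counters, guarded either by a single global mutex or by one mutex per counter.

```c
#include <pthread.h>
#include <stddef.h>

/* All names and sizes here are hypothetical, chosen for illustration. */
#define NUM_BUCKETS 8
#define NUM_THREADS 4
#define INCREMENTS_PER_THREAD 100000

static long counters[NUM_BUCKETS];
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t bucket_locks[NUM_BUCKETS];

struct worker_args {
    int use_global_lock; /* nonzero: one lock for everything */
    unsigned seed;       /* per-thread pseudo-random state */
};

static void *worker(void *p)
{
    struct worker_args *args = p;
    unsigned state = args->seed;

    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        /* Simple linear congruential step to pick a bucket. */
        state = state * 1103515245u + 12345u;
        size_t b = (state >> 16) % NUM_BUCKETS;

        /* Global locking serializes every update; fine-grained locking
         * serializes only updates that hit the same bucket. */
        pthread_mutex_t *lock =
            args->use_global_lock ? &global_lock : &bucket_locks[b];

        pthread_mutex_lock(lock);
        counters[b]++;
        pthread_mutex_unlock(lock);
    }
    return NULL;
}

/* Runs the workload and returns the sum of all counters,
 * or -1 if a thread could not be created. */
long run_counters(int use_global_lock)
{
    pthread_t threads[NUM_THREADS];
    struct worker_args args[NUM_THREADS];
    int created = 0;
    long total = 0;

    for (size_t b = 0; b < NUM_BUCKETS; b++) {
        counters[b] = 0;
        pthread_mutex_init(&bucket_locks[b], NULL);
    }

    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].use_global_lock = use_global_lock;
        args[t].seed = (unsigned)t + 1u;
        if (pthread_create(&threads[t], NULL, worker, &args[t]) != 0)
            break;
        created++;
    }

    for (int t = 0; t < created; t++)
        pthread_join(threads[t], NULL);

    for (size_t b = 0; b < NUM_BUCKETS; b++) {
        total += counters[b];
        pthread_mutex_destroy(&bucket_locks[b]);
    }

    return created == NUM_THREADS ? total : -1;
}
```

Calling `run_counters(1)` exercises the global lock and `run_counters(0)` the per-bucket locks; both should return the same total, since either scheme protects every increment. The sketch shows only the difference in programming structure and says nothing about relative performance, which depends on the workload.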
This work is supported in part by NAVSEA/NEEC under grant 3003279297. Any opinions, findings, and conclusions or recommendations expressed in this thesis are those of the author and do not necessarily reflect the views of NAVSEA.