Toward Bridging the Semantic Gap Between x86-64 Software Binaries and Abstract Languages for Formal Verification and Security

TR Number

Date

2025-01-13

Journal Title

Journal ISSN

Volume Title

Publisher

Virginia Tech

Abstract

Formal verification of software systems has historically focused on verifying properties at the source code (abstract language) level (e.g., C++, Java, Python), as the language semantics are more amenable to use by mathematical libraries. However, this approach comes with limitations. Firstly, not all software has source code. Secondly, the source code may be unavailable for licensing reasons. Thirdly, even if the source code is available, formal verification at the source code level generally assumes that the compiler is trustworthy and mathematically sound. This dissertation addresses challenges with lifting the abstraction of software binaries to support formal verification and reasoning toward bridging the semantic gap. The semantic gap is the loss and obfuscation of program meaning that occurs as a result of compiling source code into software binaries. This work's primary application is verifying safety-critical systems where source code is unavailable. The x86-64 Instruction Set Architecture (ISA) was chosen due to its applicability to a wide range of relevant deployed software. As a necessary dependency for the formal verification of software binaries, a machine model of the ISA is required. The machine model is analogous to the language semantics necessary when reasoning over source code. The x86-64 ISA was also chosen because, in addition to being complex, it lacked a formal specification, making it a more challenging target. Historically, much research in the area of reasoning over x86 software binaries has focused on small fragments of code, utilizing a small subset of the ISA that could be handwritten formally, limiting research to a small subset of real-world software. The first contribution of this dissertation addresses these limitations, presenting a robust hardware-derived formal executable machine model of the x86-64 ISA, supporting 1,625 instruction variants expressed as small-step operational semantics embedded within an Interactive Theorem Prover (ITP) (a 4-fold increase over prior work). The second contribution of this dissertation presents a largely automated methodology for generating formally proven equivalence theorems between decompiled x86-64 machine code and big-step operational semantics. These formally proven abstract representations of software binaries allow one to construct correctness proofs per block. To support this, we developed improved Decompilation-into-logic and formal symbolic execution techniques. We then applied this methodology to several case studies, including examples heavily relying on floating-point instructions and real-world examples. This dissertation's third contribution presents efforts to bridge the semantic gap further and tie the results to formal verification techniques commonly utilized over source code. It presents a first-of-the-art approach to leveraging a hardware-derived model derived from machine learning in software binary formal verification efforts. It introduces a library of 628 proven simplification lemmas designed to lift the abstraction of big-step operational semantics to a form more useful to proof engineers. A methodology for performing axiomatic reasoning across blocks is presented, with case studies demonstrating, for software binaries, standard formal verification techniques typically applicable to source code. Lastly, this dissertation explores a methodology for sound software binary diversity to employ a moving target defense strategy with associated security benefits. Toward bridging the semantic gap between software binaries and source code, this work demonstrates large-scale black-box software binary formal verification in settings where source code is unavailable, on a hardware-derived machine model, as well as a sound software binary diversity approach for moving target defense.

Description

Keywords

x86-64 Assembly, Code Abstraction, Symbolic Execution, Formal Verification, Binary Analysis, Binary Diversity, Moving Target Defense, Cyber Resilience

Citation