FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems
In a software life-cycle Source code repositories serve as vast storage areas for program code, ensuring its maintenance and version control throughout the development process. It is not uncommon for these repositories to house programs with hidden errors, which only manifest under specific input conditions, causing the program to deviate from its intended functionality. The growing intricacy of software design has amplified the time and resources required to pinpoint and rectify these issues. These errors, often unintended by developers, can be challenging to identify and correct. While there are techniques to auto-correct faulty code, the expansive realm of potential solutions for a single bug means there's a scarcity of tools and datasets for effective evaluation of the corrected code. This study presents FIXEVAL, a benchmark that includes flawed code entries from competitive coding challenges and their corresponding corrections. FIXEVAL offers an extensive test suite that not only gauges the accuracy of fixes generated by models but also allows for the assessment of a program's functional correctness. This suite further sheds light on time, memory limits, and acceptance based on specific outcomes. We utilize cutting-edge language models, trained on coding languages, as our reference point and juxtapose them using match-based (essentially token similarity) and execution-based (focusing on functional assessment) criteria. Our research indicates that while match-based criteria might not truly represent the functional precision of fixes generated by models, execution-based approaches offer a comprehensive evaluation tailored to the solution. Consequently, we posit that FIXEVAL paves the way for practical automated error correction and assessment of code generated by models. Dataset and models for all of our experiments are made publicly available at https://github.com/mahimanzum/FixEval.