Evaluating AI Models for Autograding Explain in Plain English Questions: Challenges and Considerations


Date

2025-12

Publisher

ACM

Abstract

Code reading ability has traditionally been under-emphasized in assessments because it is difficult to assess at scale. Prior research has shown that code reading and code writing are intimately related skills; thus, being able to assess and train code reading skills may be necessary for student learning. One way to assess code reading ability is with Explain in Plain English (EiPE) questions, which ask students to describe in natural language what a piece of code does. Previous research deployed a binary (correct/incorrect) autograder using bigram models that performed comparably to human teaching assistants on student responses. With a data set of 3,064 student responses to 17 EiPE questions, we investigated multiple autograders for EiPE questions, ranging from logistic regression trained on bigram features, to support vector machines (SVMs) trained on embeddings from large language models (LLMs), to GPT-4. We found multiple useful autograders, most with accuracies in the 86-88% range, each with different advantages: SVMs trained on LLM embeddings had the highest accuracy; few-shot chat completion with GPT-4 required minimal human effort; pipelines with multiple autograders for specific dimensions (what we call 3D autograders) can provide fine-grained feedback; and code generation with GPT-4 leverages automatic code testing as a grading mechanism in exchange for slightly more lenient grading standards. While piloting these autograders in a non-major introductory Python course, we found that students had largely similar views of all autograders, although they more often rated the GPT-based and code generation graders as helpful and liked the code generation grader the most.
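The highest-accuracy approach mentioned in the abstract, an SVM trained on LLM embeddings for binary correct/incorrect grading, can be sketched as follows. This is a minimal illustration only: the embedding dimension, synthetic data, and RBF-kernel hyperparameters are stand-in assumptions, since the abstract does not specify the paper's actual embedding model or training setup.

```python
# Hypothetical sketch: a binary (correct/incorrect) EiPE autograder as an
# SVM over fixed-size response embeddings. In practice the vectors would
# come from an LLM embedding endpoint; here we substitute synthetic
# embeddings so the sketch is self-contained and runnable.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
DIM = 64  # stand-in for the LLM embedding dimension

# Synthetic "embeddings": correct responses cluster around one centroid,
# incorrect responses around another (a stand-in for real student data).
correct = rng.normal(loc=0.5, scale=1.0, size=(200, DIM))
incorrect = rng.normal(loc=-0.5, scale=1.0, size=(200, DIM))
X = np.vstack([correct, incorrect])
y = np.array([1] * 200 + [0] * 200)  # 1 = correct, 0 = incorrect

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# RBF-kernel SVM fit on the embedding vectors; score() reports held-out accuracy.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With real student responses, `X` would be replaced by embeddings of each response and `y` by instructor-assigned correct/incorrect labels; the reported 86-88% accuracies in the abstract refer to the paper's actual data, not this synthetic demonstration.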
