How Well Do ChatGPT Models Maintain Software?

Date

2025-06-23

Publisher

ACM

Abstract

Since the launch of ChatGPT in 2022, researchers have conducted various studies to investigate its capabilities in code generation, bug fixing, test generation, and program comprehension. While ChatGPT has demonstrated strong capabilities in several aspects of software engineering, its effectiveness in maintaining software remains under-explored. Motivated by this gap, we conducted an empirical study to systematically evaluate the performance of ChatGPT in software maintenance. Specifically, we distilled 58 software maintenance tasks from 58 GitHub projects. For each task, we prompted two ChatGPT models (ChatGPT-3.5 and ChatGPT-4o) to separately revise a given Java file in response to a prescribed maintenance request. Once the models returned results, we assessed each model's capability by comparing those revisions with developers' modifications recorded in the version history.
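The abstract does not spell out how revisions were compared with developers' recorded modifications; a minimal sketch of one plausible comparison, assuming a line-based diff similarity (the `similarity` and `is_correct` helpers and the threshold are illustrative assumptions, not the study's actual criteria), could look like:

```python
import difflib

def similarity(model_revision: str, developer_revision: str) -> float:
    """Line-based similarity between a model's revised Java file and the
    developer's ground-truth revision from the project's version history."""
    return difflib.SequenceMatcher(
        None,
        model_revision.splitlines(),
        developer_revision.splitlines(),
    ).ratio()

def is_correct(model_revision: str, developer_revision: str,
               threshold: float = 1.0) -> bool:
    """Judge a revision correct when it matches the ground truth closely
    enough; this threshold is a placeholder, not taken from the paper."""
    return similarity(model_revision, developer_revision) >= threshold
```

In practice such a check would likely normalize whitespace and comments before diffing, since semantically equivalent revisions can differ textually.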

We found that ChatGPT-3.5 correctly revised code for 30 of the 58 tasks, while ChatGPT-4o correctly fulfilled 31 tasks. Neither model fulfilled all tasks successfully, mainly because they truncated Java files unnecessarily, missed project-specific logic, or failed to cover all corner cases. This implies that ChatGPT can help developers with software maintenance but is unlikely to replace them completely. Our study characterizes ChatGPT's capabilities in software maintenance and its progression across model versions. It also sheds light on ChatGPT's potential roles in future software-maintenance practices.
