Trust at Your Own Peril: A Mixed Methods Exploration of the Ability of Large Language Models to Generate Expert-Like Systems Engineering Artifacts and a Characterization of Failure Modes

dc.contributor.author: Topcu, Taylan G.
dc.contributor.author: Husain, Mohammed
dc.contributor.author: Ofsa, Max
dc.contributor.author: Wach, Paul
dc.date.accessioned: 2025-03-26T11:59:45Z
dc.date.available: 2025-03-26T11:59:45Z
dc.date.issued: 2025-02-21
dc.description.abstract: Multi-purpose large language models (LLMs), a subset of generative artificial intelligence (AI), have recently made significant progress. While expectations for LLMs to assist systems engineering (SE) tasks are high, the interdisciplinary and complex nature of systems, along with the need to synthesize deep-domain knowledge and operational context, raises questions regarding the efficacy of LLMs to generate SE artifacts, particularly given that they are trained using data that is broadly available on the internet. To that end, we present results from an empirical exploration in which a human expert-generated SE artifact was taken as a benchmark, parsed, and fed into various LLMs through prompt engineering to generate segments of typical SE artifacts. This procedure was applied without any fine-tuning or calibration to document baseline LLM performance. We then adopted a two-fold mixed-methods approach to compare the AI-generated artifacts against the benchmark. First, we quantitatively compare the artifacts using natural language processing algorithms and find that, when prompted carefully, state-of-the-art algorithms cannot differentiate AI-generated artifacts from the human-expert benchmark. Second, we conduct a qualitative deep dive to investigate how they differ in terms of quality. We document that while the two materials appear very similar, AI-generated artifacts exhibit serious failure modes that could be difficult to detect. We characterize these as: premature requirements definition, unsubstantiated numerical estimates, and a propensity to overspecify. We contend that this study tells a cautionary tale about why the SE community must be more cautious in adopting AI-suggested feedback, at least when it is generated by multi-purpose LLMs.
dc.description.version: Accepted version
dc.format.extent: 22 page(s)
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1002/sys.21810
dc.identifier.eissn: 1520-6858
dc.identifier.issn: 1098-1241
dc.identifier.orcid: Topcu, Taylan [0000-0002-0110-312X]
dc.identifier.uri: https://hdl.handle.net/10919/125082
dc.language.iso: en
dc.publisher: Wiley
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: generative artificial intelligence (AI)
dc.subject: human-AI collaboration
dc.subject: large language models (LLMs)
dc.subject: problem formulation
dc.subject: systems engineering
dc.title: Trust at Your Own Peril: A Mixed Methods Exploration of the Ability of Large Language Models to Generate Expert-Like Systems Engineering Artifacts and a Characterization of Failure Modes
dc.title.serial: Systems Engineering
dc.type: Article - Refereed
dc.type.dcmitype: Text
dc.type.other: Article
dc.type.other: Early Access
dc.type.other: Journal
pubs.organisational-group: Virginia Tech
pubs.organisational-group: Virginia Tech/Engineering
pubs.organisational-group: Virginia Tech/Engineering/Industrial and Systems Engineering
pubs.organisational-group: Virginia Tech/Library
pubs.organisational-group: Virginia Tech/All T&R Faculty
pubs.organisational-group: Virginia Tech/Engineering/COE T&R Faculty
pubs.organisational-group: Virginia Tech/Graduate students
pubs.organisational-group: Virginia Tech/Graduate students/Doctoral students
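
Note: the abstract describes the quantitative comparison only at a high level and does not name the natural language processing algorithms used. The following is a minimal, hypothetical Python sketch of one generic way such a comparison could be run, using TF-IDF vectors and cosine similarity between a benchmark passage and an LLM-generated passage. The example texts and the choice of TF-IDF are assumptions for illustration, not the study's actual method.

# Illustrative sketch only: the record does not name the NLP algorithms the study used.
# This compares a human-written benchmark passage with an LLM-generated passage using
# TF-IDF vectors and cosine similarity, one generic way to quantify textual similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for parsed SE artifact segments (not taken from the paper).
benchmark_text = "The system shall provide telemetry data to the ground station at 1 Hz."
generated_text = "The system shall downlink telemetry to the ground segment once per second."

# Build TF-IDF vectors over the two passages and compute their cosine similarity
# (0 = no shared term weighting, 1 = identical term weighting).
vectors = TfidfVectorizer().fit_transform([benchmark_text, generated_text])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {score:.3f}")

Run on matched artifact segments, this yields a score between 0 and 1; the abstract's claim that state-of-the-art algorithms could not differentiate the two materials corresponds to consistently high similarity under whichever metrics the authors actually applied.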

Files

Original bundle
Name: 2502.09690v1.pdf
Size: 1014.04 KB
Format: Adobe Portable Document Format
Description: Accepted version

License bundle
Name: license.txt
Size: 1.5 KB
Format: Plain Text
Description: