The performance of large language models on quantitative and verbal ability tests: Initial evidence and implications for unproctored high-stakes testing

dc.contributor.author: Hickman, Louis [en]
dc.contributor.author: Dunlop, Patrick D. [en]
dc.contributor.author: Wolf, Jasper Leo [en]
dc.date.accessioned: 2025-11-21T18:11:18Z [en]
dc.date.available: 2025-11-21T18:11:18Z [en]
dc.date.issued: 2024-12-01 [en]
dc.description.abstract: Unproctored assessments are widely used in pre-employment assessment. However, widely accessible large language models (LLMs) pose challenges for unproctored personnel assessments, given that applicants may use them to artificially inflate their scores beyond their true abilities. This may be particularly concerning in cognitive ability tests, which are widely used and traditionally considered to be less fakeable by humans than personality tests. Thus, this study compares the performance of LLMs on two common types of cognitive tests: quantitative ability (number series completion) and verbal ability (use a passage of text to determine whether a statement is true). The tests investigated are used in real-world, high-stakes selection. We also examine the performance of the LLMs across different test formats (i.e., open-ended vs. multiple choice). Further, we contrast the performance of two LLMs (Generative Pretrained Transformers, GPT-3.5 and GPT-4) across multiple prompt approaches and "temperature" settings (i.e., a parameter that determines the amount of randomness in the model's output). We found that the LLMs performed well on the verbal ability test but extremely poorly on the quantitative ability test, even when accounting for the test format. GPT-4 outperformed GPT-3.5 across both types of tests. Notably, although prompt approaches and temperature settings did affect LLM test performance, those effects were mostly minor relative to differences across tests and language models. We provide recommendations for securing pre-employment testing against LLM influences. Additionally, we call for rigorous research investigating the prevalence of LLM usage in pre-employment testing as well as on how LLM usage affects selection test validity.
Job candidates may use large language models like ChatGPT to complete ability tests on their behalf, but we currently know little about how well these models perform on commercial cognitive ability tests. OpenAI's (free) Generative Pretrained Transformers (GPT)-3.5 and (paid subscription) GPT-4 models both performed very poorly on a quantitative ability test. GPT-4 achieved results above the 90th percentile on the verbal ability test, and GPT-3.5 scored at approximately the 60th percentile. Temperature settings did not substantially affect the performance of the large language models and different prompt approaches tended not to, with few exceptions. [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.doi: https://doi.org/10.1111/ijsa.12479 [en]
dc.identifier.eissn: 1468-2389 [en]
dc.identifier.issn: 0965-075X [en]
dc.identifier.issue: 4 [en]
dc.identifier.uri: https://hdl.handle.net/10919/139715 [en]
dc.identifier.volume: 32 [en]
dc.language.iso: en [en]
dc.publisher: Wiley [en]
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International [en]
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/ [en]
dc.subject: artificial intelligence [en]
dc.subject: chatbots [en]
dc.subject: cognitive ability testing [en]
dc.subject: generative pretrained transformer [en]
dc.subject: large language models [en]
dc.title: The performance of large language models on quantitative and verbal ability tests: Initial evidence and implications for unproctored high-stakes testing [en]
dc.title.serial: International Journal of Selection and Assessment [en]
dc.type: Article - Refereed [en]
dc.type.dcmitype: Text [en]

Files

Original bundle
Name: HickmanPerformance.pdf
Size: 820.36 KB
Format: Adobe Portable Document Format
Description: Published version