Can an LLM find its way around a Spreadsheet?

Lee, Cho Ting

dc.contributor.author	Lee, Cho Ting	en
dc.contributor.committeechair	Ramakrishnan, Narendran	en
dc.contributor.committeemember	Simeone, John	en
dc.contributor.committeemember	Lu, Chang Tien	en
dc.contributor.department	Computer Science & Applications	en
dc.date.accessioned	2024-06-06T08:00:51Z	en
dc.date.available	2024-06-06T08:00:51Z	en
dc.date.issued	2024-06-05	en
dc.description.abstract	Spreadsheets are routinely used in business and scientific contexts, and one of the most vexing challenges data analysts face is performing data cleaning prior to analysis and evaluation. The ad-hoc and arbitrary nature of data cleaning problems, such as typos, inconsistent formatting, missing values, and a lack of standardization, often creates the need for highly specialized pipelines. We ask whether an LLM can find its way around a spreadsheet and how to support end-users in taking their free-form data processing requests to fruition. Just as RAG retrieves context to answer users' queries, we demonstrate how we can retrieve elements from a code library to compose data processing pipelines. Through comprehensive experiments, we demonstrate the quality of our system and how it is able to continuously augment its vocabulary by saving new code and pipelines back to the code library for future retrieval.	en
dc.description.abstractgeneral	Spreadsheets are frequently utilized in both business and scientific settings, and one of the most challenging tasks that must be accomplished before analysis and evaluation can take place is the cleansing of the data. The ad-hoc and arbitrary nature of issues in data quality, such as typos, inconsistent formatting, missing values, and lack of standardization, often creates the need for highly specialized data cleaning pipelines. Within the scope of this thesis, we investigate whether a large language model (LLM) can navigate its way around a spreadsheet, as well as how to assist end-users in bringing their free-form data processing requests to fruition. Just as Retrieval-Augmented Generation (RAG) retrieves context to answer user queries, we demonstrate how we can retrieve elements from a Python code reference to compose data processing pipelines. Through comprehensive experiments, we showcase the quality of our system and how it is capable of continuously improving its code-writing ability by saving new code and pipelines back to the code library for future retrieval.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:40725	en
dc.identifier.uri	https://hdl.handle.net/10919/119304	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.subject	LLMs	en
dc.subject	data cleaning	en
dc.subject	end-user programming	en
dc.title	Can an LLM find its way around a Spreadsheet?	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science & Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Lee_C_T_2024.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format

Collections

Masters Theses