LLM-Assisted Detecting and Redacting Confidential Information for Government Information Disclosure

Hasegawa, Masaki

dc.contributor.author	Hasegawa, Masaki	en
dc.contributor.committeechair	MATSUO, SHINICHIRO	en
dc.contributor.committeechair	Lou, Wenjing	en
dc.contributor.committeemember	Cameron, Melissa	en
dc.contributor.department	Computer Science & Applications	en
dc.date.accessioned	2025-05-31T08:04:31Z	en
dc.date.available	2025-05-31T08:04:31Z	en
dc.date.issued	2025-05-30	en
dc.description.abstract	Generative AI, especially large language models (LLMs), has advanced rapidly, and its real-world applications are growing steadily. However, the use of generative AI in the public sector has lagged behind the private sector. This paper focuses on the "Governmental Information Disclosure Process," which is vital to the administrative systems of democratic countries. Many developed nations require government agencies to disclose information to citizens, excluding confidential data such as personal information. Although agencies must check for confidential information and redact or mask it before release, this process is still manual, leaving significant room for improvement. Additionally, since the information to be masked is defined in natural language, such as legal text, interpreting each document's context to determine what qualifies as confidential is resource-intensive. In this context, LLMs, which can infer context and draw on general knowledge, could efficiently identify the parts of documents that require masking. This paper first reviews the existing literature on detecting sensitive or confidential information with LLMs, clarifying the use cases and the categories of information identified in both the private and public sectors. Then, as a case study, we create sample documents modeled after Japanese administrative texts and compare the detection and masking results produced by testers with administrative experience, who follow legal requirements, with those generated by an LLM. This study contributes an end-to-end approach in which LLMs directly generate masked text with dynamically determined granularity. This resolves the fundamental trade-off in previous methods by allowing the model to decide appropriate masking units (characters, words, or phrases) based on contextual requirements rather than predetermined structural units.	en
dc.description.abstractgeneral	The rapid growth of generative AI, driven by large language models (LLMs), has led to significant exploration of real-world applications. However, these efforts have largely been led by the private sector, while the adoption of generative AI in public sector organizations, especially in administrative processes, remains limited. This paper explores the "Governmental Information Disclosure Process" as a potential LLM use case for public sector organizations in democratic countries. In many democracies, government agencies are required to disclose information and documents upon request from citizens, excluding confidential data such as personal or sensitive information. Typically, administrative bodies must verify and mask confidential content before releasing documents. However, this verification and masking is still done manually, leaving room for efficiency improvements. Moreover, the confidential information to be masked is often defined in natural language, such as legal texts, and determining what qualifies as confidential requires interpreting context, which is resource-intensive. LLMs, which can draw on context and general knowledge, could provide an effective solution. This study evaluates how well LLMs detect and mask confidential information in Japanese administrative documents by comparing results from experienced testers, who follow legal guidelines, with those generated by LLMs. This study contributes an end-to-end approach in which LLMs directly generate masked text with dynamically determined granularity. This resolves the fundamental trade-off in previous methods by allowing the model to decide appropriate masking units (characters, words, or phrases) based on contextual requirements rather than predetermined structural units.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:43183	en
dc.identifier.uri	https://hdl.handle.net/10919/134960	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Large Language Model	en
dc.subject	Public Sector	en
dc.subject	Government Process Optimization	en
dc.title	LLM-Assisted Detecting and Redacting Confidential Information for Government Information Disclosure	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science & Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Hasegawa_M_T_2025.pdf
Size: 4.86 MB
Format: Adobe Portable Document Format

Collections

Masters Theses