LLM-Assisted Detecting and Redacting Confidential Information for Government Information Disclosure
dc.contributor.author | Hasegawa, Masaki | en |
dc.contributor.committeechair | MATSUO, SHINICHIRO | en |
dc.contributor.committeechair | Lou, Wenjing | en |
dc.contributor.committeemember | Cameron, Melissa | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2025-05-31T08:04:31Z | en |
dc.date.available | 2025-05-31T08:04:31Z | en |
dc.date.issued | 2025-05-30 | en |
dc.description.abstract | Generative AI, especially large language models (LLMs), has advanced rapidly, with real-world applications growing steadily. However, the use of generative AI in the public sector has lagged behind the private sector. This paper focuses on the "Governmental Information Disclosure Process," which is vital in democratic countries' administrative systems. Many developed nations require government agencies to disclose information to citizens, excluding confidential data such as personal information. Although agencies must confirm the presence of confidential information and redact or mask it before release, this process is still manual, leaving significant room for improvement. Additionally, since the information to be masked is defined in natural language, such as legal text, interpreting a document's context to determine what qualifies as confidential is resource-intensive. In this context, LLMs, which can infer context and draw on general knowledge, could efficiently identify the parts of documents that require masking. This paper first reviews the existing literature on detecting sensitive or confidential information with LLMs, clarifying the use cases and the categories of information identified in both the private and public sectors. Then, as a case study, we create sample documents modeled after Japanese administrative texts and compare the detection and masking results produced by testers with administrative experience, who follow legal requirements, against those generated by an LLM. This study contributes by proposing an end-to-end approach in which LLMs directly generate masked text with dynamically determined granularity. This resolves the fundamental trade-off in previous methods by allowing the model to decide appropriate masking units (characters, words, or phrases) based on contextual requirements rather than predetermined structural units. | en |
dc.description.abstractgeneral | The rapid growth of generative AI, driven by large language models (LLMs), has led to significant exploration of real-world applications. However, these efforts have largely been led by the private sector, while the adoption of generative AI in public sector organizations, especially in administrative processes, remains limited. This paper explores the "Governmental Information Disclosure Process" as a potential LLM use case for public sector organizations in democratic countries. In many democracies, government agencies are required to disclose information and documents, excluding confidential data such as personal or sensitive information, upon request from citizens. Typically, administrative bodies must verify and mask confidential content before releasing documents. However, this verification and masking is still done manually, leaving room for efficiency improvements. Moreover, the confidential information to be masked is often defined in natural language, such as legal texts, and determining what qualifies as confidential requires interpreting context, which is resource-intensive. LLMs, leveraging context and general knowledge, could provide an effective solution. This study evaluates how well LLMs detect and mask confidential information in Japanese administrative documents by comparing results from experienced testers, who follow legal guidelines, with those generated by LLMs. This study contributes by proposing an end-to-end approach in which LLMs directly generate masked text with dynamically determined granularity. This resolves the fundamental trade-off in previous methods by allowing the model to decide appropriate masking units (characters, words, or phrases) based on contextual requirements rather than predetermined structural units. | en |
dc.description.degree | Master of Science | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:43183 | en |
dc.identifier.uri | https://hdl.handle.net/10919/134960 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | Large Language Model | en |
dc.subject | Public Sector | en |
dc.subject | Government Process Optimization | en |
dc.title | LLM-Assisted Detecting and Redacting Confidential Information for Government Information Disclosure | en |
dc.type | Thesis | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | masters | en |
thesis.degree.name | Master of Science | en |