In Pursuit of Optimal Training Data: Towards a Unified Framework for Curation and Synthesis

Just, Hoang Anh

dc.contributor.author	Just, Hoang Anh	en
dc.contributor.committeechair	Jia, Ruoxi	en
dc.contributor.committeemember	Abbott, Amos L.	en
dc.contributor.committeemember	Jones, Creed Farris	en
dc.contributor.committeemember	Xing, Xin	en
dc.contributor.committeemember	Ramakrishnan, Narendran	en
dc.contributor.department	Electrical and Computer Engineering	en
dc.date.accessioned	2026-06-23T08:02:02Z	en
dc.date.available	2026-06-23T08:02:02Z	en
dc.date.issued	2026-06-22	en
dc.description.abstract	The field of artificial intelligence has increasingly shifted from model-centric to data-centric approaches. As Large Language Models (LLMs) scale, the quality, distribution, and informational density of training data have become major bottlenecks shaping performance and alignment. However, data quality is often deceptive; data that appears high-quality on the surface may lack the precise instructional signals required for effective learning, or worse, introduce latent biases and degrade reasoning capabilities. This dissertation studies training data optimization through three complementary stages: valuation, selection, and data synthesis and modification. Across these stages, we examine how training data can be diagnosed, selected, and enriched under different practical constraints. Stage 1 (Valuation) moves beyond superficial heuristics by using Optimal Transport distances (LAVA) and 2D-Shapley values to estimate the utility of individual training samples and fragmented data components. Stage 2 (Selection) studies how data distributions can be improved through Projektor, which composes data from multiple sources under partial observability. Finally, Stage 3 (Synthesis and Modification) studies how data signals can be augmented, restructured, or curated to encourage more robust and generalizable behaviors in LLMs. Together, these contributions show how training data can be valued, selected, and enriched to improve data efficiency and model capability.	en
dc.description.abstractgeneral	information they study. Traditionally, AI researchers focused on building larger and more complex "brains" (the models). Today, many important improvements in AI increasingly depend on curating better "textbooks" (the data). However, identifying what makes a piece of data truly educational for an AI is incredibly difficult. This dissertation tackles this challenge by studying three complementary ways to improve the data used to train AI systems. First, we develop "Valuation" tools that act as diagnostic tests, digging beneath the surface to estimate which data points are genuinely helpful and which may quietly limit the AI's learning. Second, we develop "Selection" tools for choosing more useful combinations of information from different sources, helping the AI focus on information that complements what it already knows. Finally, we study "Synthesis" and curation methods that create, modify, or select data to fill important learning gaps. For example, these methods can teach the AI why humans prefer certain answers, expose it to multiple perspectives, or select reasoning examples that are easier for a smaller model to learn. Together, these contributions show how better training data can make AI systems more capable, more reliable, less prone to bias, and better aligned with human preferences.	en
dc.description.degree	Doctor of Philosophy	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:46989	en
dc.identifier.uri	https://hdl.handle.net/10919/143484	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution-ShareAlike 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by-sa/4.0/	en
dc.subject	data valuation	en
dc.subject	machine learning	en
dc.subject	data-centric AI	en
dc.title	In Pursuit of Optimal Training Data: Towards a Unified Framework for Curation and Synthesis	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Engineering	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Doctor of Philosophy	en

Files

Original bundle

Name: Just_H_D_2026.pdf
Size: 7.66 MB
Format: Adobe Portable Document Format

Name: Just_H_D_2026_support_1.pdf
Size: 30.91 KB
Format: Adobe Portable Document Format
Description: Supporting documents

Collections

Doctoral Dissertations