In Pursuit of Optimal Training Data: Towards a Unified Framework for Curation and Synthesis
| dc.contributor.author | Just, Hoang Anh | en |
| dc.contributor.committeechair | Jia, Ruoxi | en |
| dc.contributor.committeemember | Abbott, Amos L. | en |
| dc.contributor.committeemember | Jones, Creed Farris | en |
| dc.contributor.committeemember | Xing, Xin | en |
| dc.contributor.committeemember | Ramakrishnan, Narendran | en |
| dc.contributor.department | Electrical and Computer Engineering | en |
| dc.date.accessioned | 2026-06-23T08:02:02Z | en |
| dc.date.available | 2026-06-23T08:02:02Z | en |
| dc.date.issued | 2026-06-22 | en |
| dc.description.abstract | The field of artificial intelligence has increasingly shifted from model-centric to data-centric approaches. As Large Language Models (LLMs) scale, the quality, distribution, and informational density of training data have become major bottlenecks shaping performance and alignment. However, data quality is often deceptive; data that appears high-quality on the surface may lack the precise instructional signals required for effective learning, or worse, introduce latent biases and degrade reasoning capabilities. This dissertation studies training data optimization through three complementary stages: valuation, selection, and data synthesis and modification. Across these stages, we examine how training data can be diagnosed, selected, and enriched under different practical constraints. Stage 1 (Valuation) moves beyond superficial heuristics by using Optimal Transport distances (LAVA) and 2D-Shapley values to estimate the utility of individual training samples and fragmented data components. Stage 2 (Selection) studies how data distributions can be improved through Projektor, which composes data from multiple sources under partial observability. Finally, Stage 3 (Synthesis and Modification) studies how data signals can be augmented, restructured, or curated to encourage more robust and generalizable behaviors in LLMs. Together, these contributions show how training data can be valued, selected, and enriched to improve data efficiency and model capability. | en |
| dc.description.abstractgeneral | information they study. Traditionally, AI researchers focused on building larger and more complex "brains" (the models). Today, many important improvements in AI increasingly depend on curating better "textbooks" (the data). However, identifying what makes a piece of data truly educational for an AI is incredibly difficult. This dissertation tackles this challenge by studying three complementary ways to improve the data used to train AI systems. First, we develop "Valuation" tools that act as diagnostic tests, digging beneath the surface to estimate which data points are genuinely helpful and which may quietly limit the AI's learning. Second, we develop "Selection" tools for choosing more useful combinations of information from different sources, helping the AI focus on information that complements what it already knows. Finally, we study "Synthesis" and curation methods that create, modify, or select data to fill important learning gaps. For example, these methods can teach the AI why humans prefer certain answers, expose it to multiple perspectives, or select reasoning examples that are easier for a smaller model to learn. Together, these contributions show how better training data can make AI systems more capable, more reliable, less prone to bias, and better aligned with human preferences. | en |
| dc.description.degree | Doctor of Philosophy | en |
| dc.format.medium | ETD | en |
| dc.identifier.other | vt_gsexam:46989 | en |
| dc.identifier.uri | https://hdl.handle.net/10919/143484 | en |
| dc.language.iso | en | en |
| dc.publisher | Virginia Tech | en |
| dc.rights | Creative Commons Attribution-ShareAlike 4.0 International | en |
| dc.rights.uri | http://creativecommons.org/licenses/by-sa/4.0/ | en |
| dc.subject | data valuation | en |
| dc.subject | machine learning | en |
| dc.subject | data-centric AI | en |
| dc.title | In Pursuit of Optimal Training Data: Towards a Unified Framework for Curation and Synthesis | en |
| dc.type | Dissertation | en |
| thesis.degree.discipline | Computer Engineering | en |
| thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
| thesis.degree.level | doctoral | en |
| thesis.degree.name | Doctor of Philosophy | en |