In Pursuit of Optimal Training Data: Towards a Unified Framework for Curation and Synthesis

dc.contributor.author: Just, Hoang Anh (en)
dc.contributor.committeechair: Jia, Ruoxi (en)
dc.contributor.committeemember: Abbott, Amos L. (en)
dc.contributor.committeemember: Jones, Creed Farris (en)
dc.contributor.committeemember: Xing, Xin (en)
dc.contributor.committeemember: Ramakrishnan, Narendran (en)
dc.contributor.department: Electrical and Computer Engineering (en)
dc.date.accessioned: 2026-06-23T08:02:02Z (en)
dc.date.available: 2026-06-23T08:02:02Z (en)
dc.date.issued: 2026-06-22 (en)
dc.description.abstract: The field of artificial intelligence has increasingly shifted from model-centric to data-centric approaches. As Large Language Models (LLMs) scale, the quality, distribution, and informational density of training data have become major bottlenecks shaping performance and alignment. However, data quality is often deceptive; data that appears high-quality on the surface may lack the precise instructional signals required for effective learning, or worse, introduce latent biases and degrade reasoning capabilities. This dissertation studies training data optimization through three complementary stages: valuation, selection, and data synthesis and modification. Across these stages, we examine how training data can be diagnosed, selected, and enriched under different practical constraints. Stage 1 (Valuation) moves beyond superficial heuristics by using Optimal Transport distances (LAVA) and 2D-Shapley values to estimate the utility of individual training samples and fragmented data components. Stage 2 (Selection) studies how data distributions can be improved through Projektor, which composes data from multiple sources under partial observability. Finally, Stage 3 (Synthesis and Modification) studies how data signals can be augmented, restructured, or curated to encourage more robust and generalizable behaviors in LLMs. Together, these contributions show how training data can be valued, selected, and enriched to improve data efficiency and model capability. (en)
dc.description.abstractgeneral: information they study. Traditionally, AI researchers focused on building larger and more complex "brains" (the models). Today, many important improvements in AI increasingly depend on curating better "textbooks" (the data). However, identifying what makes a piece of data truly educational for an AI is incredibly difficult. This dissertation tackles this challenge by studying three complementary ways to improve the data used to train AI systems. First, we develop "Valuation" tools that act as diagnostic tests, digging beneath the surface to estimate which data points are genuinely helpful and which may quietly limit the AI's learning. Second, we develop "Selection" tools for choosing more useful combinations of information from different sources, helping the AI focus on information that complements what it already knows. Finally, we study "Synthesis" and curation methods that create, modify, or select data to fill important learning gaps. For example, these methods can teach the AI why humans prefer certain answers, expose it to multiple perspectives, or select reasoning examples that are easier for a smaller model to learn. Together, these contributions show how better training data can make AI systems more capable, more reliable, less prone to bias, and better aligned with human preferences. (en)
dc.description.degree: Doctor of Philosophy (en)
dc.format.medium: ETD (en)
dc.identifier.other: vt_gsexam:46989 (en)
dc.identifier.uri: https://hdl.handle.net/10919/143484 (en)
dc.language.iso: en (en)
dc.publisher: Virginia Tech (en)
dc.rights: Creative Commons Attribution-ShareAlike 4.0 International (en)
dc.rights.uri: http://creativecommons.org/licenses/by-sa/4.0/ (en)
dc.subject: data valuation (en)
dc.subject: machine learning (en)
dc.subject: data-centric AI (en)
dc.title: In Pursuit of Optimal Training Data: Towards a Unified Framework for Curation and Synthesis (en)
dc.type: Dissertation (en)
thesis.degree.discipline: Computer Engineering (en)
thesis.degree.grantor: Virginia Polytechnic Institute and State University (en)
thesis.degree.level: doctoral (en)
thesis.degree.name: Doctor of Philosophy (en)

Files

Original bundle
Name: Just_H_D_2026.pdf
Size: 7.66 MB
Format: Adobe Portable Document Format

Name: Just_H_D_2026_support_1.pdf
Size: 30.91 KB
Format: Adobe Portable Document Format
Description: Supporting documents