In Pursuit of Optimal Training Data: Towards a Unified Framework for Curation and Synthesis

Just, Hoang Anh

Files

Just_H_D_2026.pdf (7.66 MB)

Supporting documents (30.91 KB)

Date

2026-06-22

Authors

Just, Hoang Anh

Publisher

Virginia Tech

Abstract

The field of artificial intelligence has increasingly shifted from model-centric to data-centric approaches. As Large Language Models (LLMs) scale, the quality, distribution, and informational density of training data have become major bottlenecks shaping performance and alignment. However, data quality is often deceptive; data that appears high-quality on the surface may lack the precise instructional signals required for effective learning, or worse, introduce latent biases and degrade reasoning capabilities. This dissertation studies training data optimization through three complementary stages: valuation, selection, and data synthesis and modification. Across these stages, we examine how training data can be diagnosed, selected, and enriched under different practical constraints. Stage 1 (Valuation) moves beyond superficial heuristics by using Optimal Transport distances (LAVA) and 2D-Shapley values to estimate the utility of individual training samples and fragmented data components. Stage 2 (Selection) studies how data distributions can be improved through Projektor, which composes data from multiple sources under partial observability. Finally, Stage 3 (Synthesis and Modification) studies how data signals can be augmented, restructured, or curated to encourage more robust and generalizable behaviors in LLMs. Together, these contributions show how training data can be valued, selected, and enriched to improve data efficiency and model capability.

Keywords

data valuation, machine learning, data-centric AI

Persistent link

https://hdl.handle.net/10919/143484

Collections

Doctoral Dissertations
