In Pursuit of Optimal Training Data: Towards a Unified Framework for Curation and Synthesis

Date

2026-06-22

Publisher

Virginia Tech

Abstract

The field of artificial intelligence has increasingly shifted from model-centric to data-centric approaches. As Large Language Models (LLMs) scale, the quality, distribution, and informational density of training data have become major bottlenecks shaping performance and alignment. However, data quality is often deceptive; data that appears high-quality on the surface may lack the precise instructional signals required for effective learning, or worse, introduce latent biases and degrade reasoning capabilities. This dissertation studies training data optimization through three complementary stages: valuation, selection, and data synthesis and modification. Across these stages, we examine how training data can be diagnosed, selected, and enriched under different practical constraints. Stage 1 (Valuation) moves beyond superficial heuristics by using Optimal Transport distances (LAVA) and 2D-Shapley values to estimate the utility of individual training samples and fragmented data components. Stage 2 (Selection) studies how data distributions can be improved through Projektor, which composes data from multiple sources under partial observability. Finally, Stage 3 (Synthesis and Modification) studies how data signals can be augmented, restructured, or curated to encourage more robust and generalizable behaviors in LLMs. Together, these contributions show how training data can be valued, selected, and enriched to improve data efficiency and model capability.

Keywords

data valuation, machine learning, data-centric AI
