New Approaches to Synthetic Tabular Data Generation
Abstract
Synthetic data generation, although now a well-known part of Generative AI (GenAI), has focused primarily on images, voice, and text, which mostly have homogeneous data formats. This dissertation focuses on modeling and generating synthetic tables, which present a wider range of characteristics: numerous variables, diverse attribute types, functional dependencies across columns, and temporal dependencies across rows. We explore how to generate higher-quality synthetic tabular data through five subproblems: (1) auto-regressive deep neural networks (DNNs) for synthetic table generation (STG), (2) large language models (LLMs) for adaptive STG with higher fidelity, (3) reducing the in-context learning burden in STG via LLM priors, (4) embedding isotropy as a trust indicator for STG with LLMs, and (5) STG for next-generation wireless as a telecom application. Problems 1 and 2 aim to improve the quality of generated synthetic tables; Problem 3 reduces computational cost while maintaining quality; Problem 4 proposes a trust indicator that evaluates synthetic data quality by analyzing the isotropy of the model's internal embeddings; and Problem 5 demonstrates an application scenario in wireless telecommunications.