New Approaches to Synthetic Tabular Data Generation

dc.contributor.author: Xu, Shengzhe (en)
dc.contributor.committeechair: Ramakrishnan, Narendran (en)
dc.contributor.committeemember: Jia, Ruoxi (en)
dc.contributor.committeemember: Yao, Danfeng (en)
dc.contributor.committeemember: Marwah, Manish (en)
dc.contributor.committeemember: Lu, Chang Tien (en)
dc.contributor.department: Computer Science & Applications (en)
dc.date.accessioned: 2025-07-30T08:00:48Z (en)
dc.date.available: 2025-07-30T08:00:48Z (en)
dc.date.issued: 2025-07-29 (en)
dc.description.abstract: Synthetic data generation, although already well known as part of Generative AI (GenAI), has focused primarily on images, voice, and text, which mostly have homogeneous data formats. This dissertation focuses on the modeling and generation of synthetic tables, which involve a range of characteristics: numerous variables, diverse attribute types, functional dependencies across columns, and temporal dependencies across rows. We explore how to generate higher-quality synthetic tabular data through five subproblems: (1) auto-regressive DNNs for synthetic table generation (STG), (2) large language models (LLMs) for adaptive STG with higher fidelity, (3) reducing the in-context learning burden in STG via LLM priors, (4) embedding isotropy as a trust indicator for STG with LLMs, and (5) STG for next-generation wireless as a telecom application. Problems 1 and 2 aim to improve the quality of generated synthetic tables; Problem 3 reduces the computational cost while maintaining quality; Problem 4 proposes a trust indicator that evaluates synthetic data quality by analyzing the isotropy of the model's internal embeddings; and Problem 5 demonstrates an application scenario in wireless telecommunications. (en)
dc.description.degree: Doctor of Philosophy (en)
dc.format.medium: ETD (en)
dc.identifier.other: vt_gsexam:44339 (en)
dc.identifier.uri: https://hdl.handle.net/10919/136928 (en)
dc.language.iso: en (en)
dc.publisher: Virginia Tech (en)
dc.rights: Creative Commons Attribution-NonCommercial 4.0 International (en)
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/ (en)
dc.subject: Generative Neural Models (en)
dc.subject: Tabular Data (en)
dc.subject: Large Language Models (en)
dc.title: New Approaches to Synthetic Tabular Data Generation (en)
dc.type: Dissertation (en)
thesis.degree.discipline: Computer Science & Applications (en)
thesis.degree.grantor: Virginia Polytechnic Institute and State University (en)
thesis.degree.level: doctoral (en)
thesis.degree.name: Doctor of Philosophy (en)

Files

Original bundle
Name: Xu_S_D_2025.pdf
Size: 13.84 MB
Format: Adobe Portable Document Format