The success of large language models (LLMs), in particular their generative capabilities, has made them attractive for applications beyond classical natural language processing (NLP) tasks. One such application is the generation of synthetic tabular data, where LLM-based methods have been reported to outperform the state of the art. However, it is unclear how these methods perform with respect to privacy metrics. Recent literature has shown that Diffusion Models, a different class of large generative models most famously used in image generators such as Stable Diffusion or DALL-E 2, are considerably less private than earlier generative models such as Generative Adversarial Networks (GANs).
Since tabular data is often used in domains where privacy is crucial (e.g., medicine, finance, or IoT sensor data), it is critical to fill this gap and develop an understanding of how well LLM-based generators preserve the privacy of their training data.
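A concrete example of such a privacy metric is the distance to closest record (DCR), which is commonly used in the synthetic-tabular-data literature to check whether a generator merely copies training rows. The following is a minimal sketch, assuming numerically encoded data and Euclidean distance; the function name and the comparison against a holdout set are illustrative assumptions, not a fixed method of this work:

```python
import numpy as np
from scipy.spatial.distance import cdist

def distance_to_closest_record(synthetic: np.ndarray, train: np.ndarray) -> np.ndarray:
    """For each synthetic row, return the Euclidean distance to its
    nearest training row. Values near zero flag (near-)copies of
    training records, i.e. potential privacy leaks."""
    return cdist(synthetic, train).min(axis=1)

# Toy usage: compare synthetic-to-train distances against those of an
# independent holdout set drawn from the same source distribution.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 8))
holdout = rng.normal(size=(200, 8))
synthetic = rng.normal(size=(200, 8))

dcr_synth = distance_to_closest_record(synthetic, train)
dcr_ref = distance_to_closest_record(holdout, train)
# If dcr_synth is systematically smaller than dcr_ref, the generator
# is memorizing training records rather than modeling the distribution.
print(np.median(dcr_synth), np.median(dcr_ref))
```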
The goal of this work is to generate synthetic tabular data using LLM-based methods and to evaluate the resulting data with respect to the privacy of the training data. First experiments should use standard benchmark datasets (e.g., the Diabetes or Adult dataset); later experiments should target more challenging IoT sensor data.
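To make the intended pipeline concrete, the sketch below illustrates the textual-encoding idea common to LLM-based tabular generators such as GReaT: each table row is serialized into a sentence, a small causal language model is fine-tuned on these sentences, and new rows are sampled as text. The file name, model choice, and hyperparameters are illustrative assumptions, not the final experimental setup:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def row_to_text(row: pd.Series) -> str:
    # Serialize one table row as a sentence, e.g.
    # "age is 39, workclass is State-gov, ..., income is <=50K".
    return ", ".join(f"{col} is {val}" for col, val in row.items())

df = pd.read_csv("adult.csv")  # hypothetical local copy of the Adult dataset
ds = Dataset.from_dict({"text": [row_to_text(r) for _, r in df.iterrows()]})

tok = AutoTokenizer.from_pretrained("distilgpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
ds = ds.map(lambda batch: tok(batch["text"], truncation=True),
            batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tabular-llm", num_train_epochs=3),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()

# Sample a new row: prompt with the first column name and parse the
# generated "column is value" pairs back into a table row.
inputs = tok("age is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=True, max_new_tokens=64,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```

The same fine-tuned generator can then be evaluated with privacy metrics such as the DCR sketch above, which is exactly the gap this work aims to study.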