We generate synthetic tabular datasets designed to preserve statistical structure and improve downstream ML utility. Below are reproducible LightGBM benchmarks comparing models trained on real data, synthetic data, and real+synth, evaluated on held-out test sets.
Utility preservation across datasets (synthetic vs real baseline). Values above 100% indicate synthetic data improved downstream ML utility in this setup.
View dataset-by-dataset notebooks rendered as static HTML, with direct downloads for real/synthetic/train/test and notebooks.
Dig In →