[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data


Source: MarkTechPost

In this tutorial, we build a complete, production-grade synthetic data pipeline using CTGAN and the SDV ecosystem. We start from raw mixed-type tabular data and progressively move toward constrained generation, conditional sampling, statistical validation, and downstream utility testing. Rather than stopping at sample generation, we focus on understanding how well synthetic data preserves structure, distributions, and predictive signal. This tutorial demonstrates how CTGAN can be used responsibly and rigorously in real-world data science workflows.

!pip -q install "ctgan" "sdv" "sdmetrics" "scikit-learn" "pandas" "numpy" "matplotlib"

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

import ctgan, sdv, sdmetrics
from ctgan import load_demo, CTGAN

from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

from sdv.cag import Inequality, FixedCombinations
from sdv.sampling import Condition

from sdmetrics.reports.single_table import DiagnosticReport, QualityReport

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt

print("Versions:")
print("ctgan:", ctgan.__version__)
print("sdv:", sdv.__version__)
print("sdmetrics:", sdmetrics.__version__)

We set up the environment by installing all required libraries and importing the full dependency stack. We explicitly load CTGAN, SDV, SDMetrics, and downstream ML tooling to ensure compatibility across the pipeline. We also surface library versions to make the experiment reproducible and debuggable.
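One caveat: CTGAN trains a PyTorch GAN under the hood, so matching library versions alone will not make runs bit-identical. As a minimal sketch of tightening reproducibility (the seed value 42 is an arbitrary assumption, and GPU kernels may still introduce small nondeterminism), one might also pin the relevant random seeds before training:

import random
import torch  # installed as a CTGAN dependency

SEED = 42  # arbitrary fixed value, purely for illustration

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)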

real = load_demo().copy()
real.columns = [c.strip().replace(" ", "_") for c in real.columns]

target_col = "income"
real[target_col] = real[target_col].astype(str)

categorical_cols = real.select_dtypes(include=["object"]).columns.tolist()
numerical_cols = [c for c in real.columns if c not in categorical_cols]

print("Rows:", len(real), "Cols:", len(real.columns))
print("Categorical:", len(categorical_cols), "Numerical:", len(numerical_cols))
display(real.head())

ctgan_model = CTGAN(
    epochs=30,
    batch_size=500,
    verbose=True
)
ctgan_model.fit(real, discrete_columns=categorical_cols)
synthetic_ctgan = ctgan_model.sample(5000)
print("Standalone CTGAN sample:")
display(synthetic_ctgan.head())

We load the CTGAN Adult demo dataset and perform minimal normalization on column names and data types. We explicitly identify categorical and numerical columns, which is critical for both CTGAN training and evaluation. We then train a baseline standalone CTGAN model and generate synthetic samples for comparison.
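Before moving to the SDV wrapper, a quick informal sanity check is useful. As a sketch (the column choice here is arbitrary), we can compare the marginal frequencies of one categorical column between the real table and the standalone CTGAN sample:

# Quick sanity check: compare marginal frequencies of one categorical
# column between the real data and the standalone CTGAN sample.
check_col = categorical_cols[0]  # arbitrary choice for illustration

real_freq = real[check_col].value_counts(normalize=True)
syn_freq = synthetic_ctgan[check_col].value_counts(normalize=True)

comparison = pd.DataFrame({"real": real_freq, "synthetic": syn_freq}).fillna(0.0)
display(comparison.head(10))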

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real)
metadata.update_column(column_name=target_col, sdtype="categorical")

constraints = []

if len(numerical_cols) >= 2:
    col_lo, col_hi = numerical_cols[0], numerical_cols[1]
    constraints.append(Inequality(low_column_name=col_lo, high_column_name=col_hi))
    print(f"Added Inequality constraint: {col_hi} > {col_lo}")

if len(categorical_cols) >= 2:
    c1, c2 = categorical_cols[0], categorical_cols[1]
    constraints.append(FixedCombinations(column_names=[c1, c2]))
    print(f"Added FixedCombinations constraint on: [{c1}, {c2}]")

synth = CTGANSynthesizer(
    metadata=metadata,
    epochs=30,
    batch_size=500
)

if constraints:
    synth.add_constraints(constraints)

synth.fit(real)

synthetic_sdv = synth.sample(num_rows=5000)
print("SDV CTGANSynthesizer sample:")
display(synthetic_sdv.head())

We construct a formal metadata object and attach explicit semantic types to the dataset. We introduce structural constraints using SDV’s constraint graph system, enforcing numeric inequalities and validity of categorical combinations. We then train a CTGAN-based SDV synthesizer that respects these constraints during generation.
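Since constraints are the point of this step, it is worth confirming they actually hold in the sampled output. A minimal check, reusing the col_lo/col_hi and c1/c2 variables defined above (note that the default, non-strict Inequality permits equality, so >= is the guarantee):

# Check the Inequality constraint (non-strict by default, so >= is enforced)
if len(numerical_cols) >= 2:
    holds = (synthetic_sdv[col_hi] >= synthetic_sdv[col_lo]).all()
    print(f"Inequality {col_hi} >= {col_lo} holds:", holds)

# Check FixedCombinations: every synthetic (c1, c2) pair should occur in the real data
if len(categorical_cols) >= 2:
    real_pairs = set(map(tuple, real[[c1, c2]].drop_duplicates().to_numpy()))
    syn_pairs = set(map(tuple, synthetic_sdv[[c1, c2]].drop_duplicates().to_numpy()))
    print("All synthetic combinations observed in real data:", syn_pairs <= real_pairs)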

loss_df = synth.get_loss_values()
display(loss_df.tail())

# Candidate column names across library versions; recent SDV releases use
# "Epoch", "Generator Loss", and "Discriminator Loss".
x_candidates = ["epoch", "step", "steps", "iteration", "iter", "batch", "update", "Epoch"]
xcol = next((c for c in x_candidates if c in loss_df.columns), None)

g_candidates = ["generator_loss", "gen_loss", "g_loss", "Generator Loss"]
d_candidates = ["discriminator_loss", "disc_loss", "d_loss", "Discriminator Loss"]
gcol = next((c for c in g_candidates if c in loss_df.columns), None)
dcol = next((c for c in d_candidates if c in loss_df.columns), None)

plt.figure(figsize=(10, 4))

if xcol is None:
    x = np.arange(len(loss_df))
else:
    x = loss_df[xcol].to_numpy()

if gcol is not None:
    plt.plot(x, loss_df[gcol].to_numpy(), label=gcol)
if dcol is not None:
    plt.plot(x, loss_df[dcol].to_numpy(), label=dcol)

plt.xlabel(xcol if xcol is not None else "index")
plt.ylabel("loss")
plt.legend()
plt.title("CTGAN training losses (SDV wrapper)")
plt.show()

cond_col = categorical_cols[0]
common_value = real[cond_col].value_counts().index[0]
conditions = [Condition({cond_col: common_value}, num_rows=2000)]

synthetic_cond = synth.sample_from_conditions(
    conditions=conditions,
    max_tries_per_batch=200,
    batch_size=5000
)

print("Conditional sampling requested:", 2000, "got:", len(synthetic_cond))
print("Conditional sample distribution (top 5):")
print(synthetic_cond[cond_col].value_counts().head(5))
display(synthetic_cond.head())

We extract and visualize the dynamics of generator and discriminator losses using a version-robust plotting strategy. We perform conditional sampling to generate data under specific attribute constraints and verify that the conditions are satisfied. This demonstrates how CTGAN behaves under guided generation scenarios.
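sample_from_conditions accepts a list of Condition objects, so a single call can mix several conditioning targets. As a sketch (values are reused from above; the second request conditions on the most frequent observed (c1, c2) pair so that it stays consistent with the FixedCombinations constraint added earlier, since rare or unseen combinations may fail to sample):

# Sketch: mixing several Condition objects in one call
if len(categorical_cols) >= 2:
    top_pair = real[[c1, c2]].value_counts().index[0]
    multi_conditions = [
        Condition({c1: real[c1].value_counts().index[0]}, num_rows=500),
        Condition({c1: top_pair[0], c2: top_pair[1]}, num_rows=500),
    ]
    sampled_multi = synth.sample_from_conditions(
        conditions=multi_conditions,
        max_tries_per_batch=200,
    )
    print("Multi-condition sample size:", len(sampled_multi))
    display(sampled_multi[[c1, c2]].value_counts().head())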

metadata_dict = metadata.to_dict()

diagnostic = DiagnosticReport()
diagnostic.generate(real_data=real, synthetic_data=synthetic_sdv, metadata=metadata_dict, verbose=True)
print("Diagnostic score:", diagnostic.get_score())

quality = QualityReport()
quality.generate(real_data=real, synthetic_data=synthetic_sdv, metadata=metadata_dict, verbose=True)
print("Quality score:", quality.get_score())

def show_report_details(report, title):
    print(f"\n===== {title} details =====")
    props = report.get_properties()
    # Recent sdmetrics versions return a DataFrame here; older ones return a dict
    if hasattr(props, "columns") and "Property" in props.columns:
        props = props["Property"].tolist()
    for p in props:
        print(f"\n--- {p} ---")
        details = report.get_details(property_name=p)
        try:
            display(details.head(10))
        except Exception:
            display(details)

show_report_details(diagnostic, "DiagnosticReport")
show_report_details(quality, "QualityReport")

train_real, test_real = train_test_split(
    real, test_size=0.25, random_state=42, stratify=real[target_col]
)

def make_pipeline(cat_cols, num_cols):
    pre = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
            ("num", "passthrough", num_cols),
        ],
        remainder="drop"
    )
    clf = LogisticRegression(max_iter=200)
    return Pipeline([("pre", pre), ("clf", clf)])

pipe_syn = make_pipeline(categorical_cols, numerical_cols)
pipe_syn.fit(synthetic_sdv.drop(columns=[target_col]), synthetic_sdv[target_col])

proba_syn = pipe_syn.predict_proba(test_real.drop(columns=[target_col]))[:, 1]
y_true = (test_real[target_col].astype(str).str.contains(">")).astype(int)
auc_syn = roc_auc_score(y_true, proba_syn)
print("Synthetic-train -> Real-test AUC:", auc_syn)

pipe_real = make_pipeline(categorical_cols, numerical_cols)
pipe_real.fit(train_real.drop(columns=[target_col]), train_real[target_col])

proba_real = pipe_real.predict_proba(test_real.drop(columns=[target_col]))[:, 1]
auc_real = roc_auc_score(y_true, proba_real)
print("Real-train -> Real-test AUC:", auc_real)

model_path = "ctgan_sdv_synth.pkl"
synth.save(model_path)
print("Saved synthesizer to:", model_path)

# Reload via the synthesizer's own load() classmethod
synth_loaded = CTGANSynthesizer.load(model_path)

synthetic_loaded = synth_loaded.sample(1000)
print("Loaded synthesizer sample:")
display(synthetic_loaded.head())

We evaluate synthetic data using SDMetrics diagnostic and quality reports and a property-level inspection. We validate downstream usefulness by training a classifier on synthetic data and testing it on real data. Finally, we serialize the trained synthesizer and confirm that it can be reloaded and sampled reliably.
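The aggregate quality score can mask weak individual columns. As a complementary sketch using SDMetrics' single-column metrics (KSComplement for numeric marginals, TVComplement for categorical ones; both score 1.0 when distributions match exactly), one can rank columns by marginal fidelity and inspect the weakest ones first:

from sdmetrics.single_column import KSComplement, TVComplement

# Per-column fidelity: KS complement for numeric columns,
# Total Variation complement for categorical columns.
scores = {}
for col in numerical_cols:
    scores[col] = KSComplement.compute(real_data=real[col], synthetic_data=synthetic_sdv[col])
for col in categorical_cols:
    scores[col] = TVComplement.compute(real_data=real[col], synthetic_data=synthetic_sdv[col])

# Sort ascending so the weakest columns appear first
display(pd.Series(scores).sort_values().to_frame("fidelity_score").head(10))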

In conclusion, we demonstrated that synthetic data generation with CTGAN becomes significantly more powerful when paired with metadata, constraints, and rigorous evaluation. By validating both statistical similarity and downstream task performance, we ensured that the synthetic data is not only realistic but also useful. This pipeline serves as a strong foundation for privacy-preserving analytics, data sharing, and simulation workflows. With careful configuration and evaluation, CTGAN can be safely deployed in real-world data science systems.



Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.