Source: MarkTechPost
In this tutorial, we walk through a complete, end-to-end workflow for correcting bias in survey data using the balance library. We simulate a realistic population, deliberately introduce sampling bias, and then apply multiple re-weighting techniques to recover unbiased estimates. We focus on four widely used methods, Inverse Probability Weighting (IPW), Covariate Balancing Propensity Scores (CBPS), raking, and post-stratification, and evaluate how effectively each restores balance between the sample and the target population. Throughout the process, we analyze diagnostics such as ASMD, outcome estimates, and design effects to build a strong intuitive and practical understanding of survey weighting.
import subprocess, sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "balance"])

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

from balance import Sample

np.random.seed(2024)
sns.set_theme(style="whitegrid", context="notebook")
We begin by installing the balance package and importing all the required libraries for data manipulation and visualization. We set a random seed to ensure reproducibility and configure plotting aesthetics for clearer diagnostics. This setup prepares a clean, consistent environment for running the full reweighting workflow.
def simulate_population(n=50_000):
    age = np.clip(np.random.normal(45, 17, n), 18, 90).astype(int)
    gender = np.random.choice(["M", "F"], size=n, p=[0.49, 0.51])
    education = np.random.choice(
        ["HS", "SomeCollege", "Bachelor", "Graduate"],
        size=n,
        p=[0.35, 0.25, 0.25, 0.15],
    )
    income = np.exp(np.random.normal(10.5, 0.5, n))
    region = np.random.choice(
        ["Urban", "Suburban", "Rural"], size=n, p=[0.40, 0.35, 0.25]
    )
    happiness = (
        50
        + 0.20 * (age - 45)
        + (education == "Graduate") * 8
        + (education == "Bachelor") * 4
        + (region == "Urban") * 3
        + np.log(income) * 2
        + np.random.normal(0, 5, n)
    )
    return pd.DataFrame({
        "id": np.arange(n).astype(str),
        "age": age,
        "gender": gender,
        "education": education,
        "income": income.round(2),
        "region": region,
        "happiness": happiness.round(2),
    })

def biased_sample(pop, n=2_000):
    score = (
        -0.04 * (pop["age"] - 30)
        + (pop["education"] == "Graduate") * 1.0
        + (pop["education"] == "Bachelor") * 0.6
        + (pop["region"] == "Urban") * 0.7
        - (pop["region"] == "Rural") * 0.5
    )
    p = 1 / (1 + np.exp(-score))
    p = p / p.sum()
    idx = np.random.choice(pop.index, size=n, replace=False, p=p)
    return pop.loc[idx].reset_index(drop=True)

target_df = simulate_population(50_000)
sample_df = biased_sample(target_df, 2_000)
target_for_balance = target_df.drop(columns=["happiness"])

print(f"Sample size : {len(sample_df):,}")
print(f"Target size : {len(target_for_balance):,}")
print(f"\nTRUE population mean happiness : {target_df['happiness'].mean():.2f}")
print(f"Naive sample mean happiness : {sample_df['happiness'].mean():.2f} <-- biased!")
We simulate a realistic population dataset with demographic and socioeconomic features along with an outcome variable. We then introduce sampling bias by preferentially selecting younger, more educated, and urban individuals to mimic real-world survey bias. Finally, we compare the naive sample mean to the true population mean to highlight bias.
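To build intuition for why this kind of selection skews the naive mean, here is a minimal, self-contained sketch with toy data (an assumed setup, separate from the tutorial's simulation): when the true inclusion probabilities are known, weighting each selected unit by 1/p recovers the population mean, which is exactly what IPW later estimates from covariates.

```python
import numpy as np

# Toy illustration (hypothetical data, not the tutorial's simulation):
# units with high x are more likely to be selected, so the naive mean of
# y among selected units is biased upward. Weighting by 1/p corrects it.
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 100_000)
y = 10 + 2 * x                      # true population mean of y is 10
p = 1 / (1 + np.exp(-x))            # selection probability rises with x
keep = rng.random(x.size) < p       # biased inclusion

naive = y[keep].mean()                          # biased estimate
ipw = np.average(y[keep], weights=1 / p[keep])  # inverse-probability weighted
print(f"truth=10.00  naive={naive:.2f}  ipw={ipw:.2f}")
```

In practice the inclusion probabilities are unknown, so IPW first fits a propensity model to the covariates to estimate them, which is what the balance adjustment below does.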
sample = Sample.from_frame(
    sample_df, id_column="id", outcome_columns=["happiness"]
)
target = Sample.from_frame(target_for_balance, id_column="id")
sample_with_target = sample.set_target(target)

print("\n--- Sample object ---")
print(sample_with_target)

print("\n" + "=" * 60)
print(" PRE-ADJUSTMENT DIAGNOSTICS")
print("=" * 60)

asmd_before = sample_with_target.covars().asmd()
print("\nASMD (Absolute Standardized Mean Difference) — lower = better balance")
print("Rule of thumb: |ASMD| > 0.10 indicates meaningful imbalance.")
print(asmd_before.T.round(3))

print("\nMean of covariates (sample vs target):")
print(sample_with_target.covars().mean().T.round(3))
We convert both the biased sample and the target population into structured Sample objects for processing. We compute pre-adjustment diagnostics, such as ASMD and covariate means, to quantify imbalance between the sample and the target. This step helps us clearly understand how far the sample deviates before applying any correction.
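For intuition, the ASMD for a single covariate can be computed by hand. This is a minimal sketch with toy numbers, using the common convention of scaling the mean difference by the target standard deviation (balance's exact scaling choices may differ in detail):

```python
import numpy as np

# Hand-rolled ASMD for one covariate (toy data, not the tutorial's sample):
# absolute difference in means, scaled by the target's standard deviation.
rng = np.random.default_rng(0)
target_age = rng.normal(45, 17, 10_000)   # population-like ages
sample_age = rng.normal(38, 15, 1_000)    # skewed-young biased sample

asmd = abs(sample_age.mean() - target_age.mean()) / target_age.std()
print(f"ASMD(age) = {asmd:.3f}")          # well above the 0.10 threshold
```

A gap of about seven years against a standard deviation of about seventeen lands the ASMD near 0.4, which by the 0.10 rule of thumb is severe imbalance.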
print("\n" + "=" * 60)
print(" FITTING WEIGHTS — 4 METHODS")
print("=" * 60)

print("\n>>> [1/4] IPW with LASSO logistic regression")
adjusted_ipw = sample_with_target.adjust(method="ipw")
print(adjusted_ipw.summary())

print("\n>>> [2/4] CBPS — Covariate Balancing Propensity Score")
try:
    adjusted_cbps = sample_with_target.adjust(method="cbps")
    print(adjusted_cbps.summary())
except Exception as e:
    print("CBPS failed (skipping):", e)
    adjusted_cbps = None

print("\n>>> [3/4] Raking (iterative proportional fitting)")
adjusted_rake = sample_with_target.adjust(method="rake")
print(adjusted_rake.summary())

print("\n>>> [4/4] Post-stratification (categoricals only)")
cat_cols = ["id", "gender", "education", "region"]
sample_cat = Sample.from_frame(
    sample_df[cat_cols + ["happiness"]],
    id_column="id",
    outcome_columns=["happiness"],
)
target_cat = Sample.from_frame(target_for_balance[cat_cols], id_column="id")
adjusted_post = sample_cat.set_target(target_cat).adjust(method="poststratify")
print(adjusted_post.summary())

print("\n" + "=" * 60)
print(" METHOD COMPARISON")
print("=" * 60)

methods = {
    "IPW": adjusted_ipw,
    "CBPS": adjusted_cbps,
    "Rake": adjusted_rake,
    "PostStrat": adjusted_post,
}

def safe_mean_asmd(asmd_df, prefer="self"):
    """Mean ASMD across covariates from a balance asmd DataFrame."""
    row = prefer if prefer in asmd_df.index else asmd_df.index[0]
    if "mean(asmd)" in asmd_df.columns:
        return float(asmd_df.loc[row, "mean(asmd)"])
    return float(asmd_df.loc[row].mean())

asmd_means = {"Unadjusted": safe_mean_asmd(asmd_before)}
outcome_means = {"Naive sample": float(sample_df["happiness"].mean())}
deff_vals = {}

for name, m in methods.items():
    if m is None:
        continue
    asmd_means[name] = safe_mean_asmd(m.covars().asmd(), prefer="self")
    outcome_means[name] = float(m.outcomes().mean()["happiness"].iloc[0])
    w = m.to_df()["weight"].values
    deff_vals[name] = (w.sum() ** 2) / (len(w) * np.sum(w ** 2))

outcome_means["TRUE pop"] = float(target_df["happiness"].mean())

print("\nMean ASMD across covariates (lower = better balance):")
for k, v in asmd_means.items():
    print(f" {k:14s}: {v:.4f}")

print("\nWeighted estimate of mean happiness:")
for k, v in outcome_means.items():
    print(f" {k:14s}: {v:.3f}")

print("\nKish's effective sample-size ratio (1.0 = no info loss):")
for k, v in deff_vals.items():
    print(f" {k:14s}: {v:.3f} (n_eff ≈ {int(v * len(sample_df))})")
We apply four different weighting methods, IPW, CBPS, raking, and post-stratification, to adjust the biased sample. We evaluate each method using balance metrics, outcome estimates, and effective sample-size calculations. This comparison allows us to understand how the different techniques trade off bias reduction against variance.
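The effective sample-size ratio used in the comparison follows Kish's formula, n_eff / n = (Σw)² / (n · Σw²). A tiny standalone sketch with hypothetical weights:

```python
import numpy as np

def kish_ratio(w):
    """Kish's effective-sample-size ratio n_eff / n (1.0 = no information loss)."""
    w = np.asarray(w, dtype=float)
    return (w.sum() ** 2) / (len(w) * np.sum(w ** 2))

print(kish_ratio([1, 1, 1, 1]))            # equal weights -> 1.0
print(round(kish_ratio([1, 1, 2, 4]), 3))  # unequal weights -> 0.727
```

Multiplying this ratio by n gives the effective sample size: the more variable the weights, the smaller n_eff, which is exactly the variance cost that the design effect measures.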
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

colors_a = ["gray", "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"][: len(asmd_means)]
axes[0, 0].bar(list(asmd_means.keys()), list(asmd_means.values()), color=colors_a)
axes[0, 0].axhline(0.1, ls="--", color="red", label="0.10 imbalance threshold")
axes[0, 0].set_title("Mean ASMD across covariates")
axes[0, 0].set_ylabel("Mean ASMD")
axes[0, 0].legend()
axes[0, 0].tick_params(axis="x", rotation=20)

truth = target_df["happiness"].mean()
colors_b = ["#888"] + ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"][: len(methods)] + ["black"]
axes[0, 1].bar(list(outcome_means.keys()), list(outcome_means.values()),
               color=colors_b[: len(outcome_means)])
axes[0, 1].axhline(truth, ls="--", color="black", label=f"truth = {truth:.2f}")
axes[0, 1].set_title("Estimated mean happiness vs ground truth")
axes[0, 1].set_ylabel("Mean happiness")
axes[0, 1].legend()
axes[0, 1].tick_params(axis="x", rotation=20)

w_ipw = adjusted_ipw.to_df()["weight"].values
axes[1, 0].hist(w_ipw, bins=40, color="steelblue", edgecolor="white")
axes[1, 0].set_title(
    f"IPW weight distribution\n"
    f"min={w_ipw.min():.2f} median={np.median(w_ipw):.2f} max={w_ipw.max():.2f}"
)
axes[1, 0].set_xlabel("weight")
axes[1, 0].set_ylabel("count")

ages = sample_df["age"].values
bins = np.linspace(18, 90, 31)
axes[1, 1].hist(target_df["age"], bins=bins, density=True, alpha=0.45,
                color="green", label="Target (truth)")
axes[1, 1].hist(ages, bins=bins, density=True, alpha=0.45,
                color="red", label="Sample (biased)")
axes[1, 1].hist(ages, bins=bins, density=True, alpha=0.45,
                color="blue", weights=w_ipw, label="Sample (IPW-weighted)")
axes[1, 1].set_title("Age distribution: bias correction by IPW")
axes[1, 1].set_xlabel("Age")
axes[1, 1].set_ylabel("density")
axes[1, 1].legend()

plt.tight_layout()
plt.savefig("balance_diagnostics.png", dpi=110, bbox_inches="tight")
plt.show()

print("\n" + "=" * 60)
print(" ADVANCED — controlling variance with max_de")
print("=" * 60)
print("max_de=1.5 trims extreme weights so the design effect stays ≤ 1.5,")
print("trading a little bias for tighter confidence intervals.\n")

adjusted_trim = sample_with_target.adjust(method="ipw", max_de=1.5)
print(adjusted_trim.summary())

out = adjusted_ipw.to_df()
out.to_csv("balance_weighted_sample.csv", index=False)
print("\nSaved weighted sample → balance_weighted_sample.csv")
print("Saved diagnostics plot → balance_diagnostics.png")
print("\nFirst 5 rows of weighted output:")
print(out.head())

err_naive = abs(sample_df["happiness"].mean() - truth)
err_ipw = abs(outcome_means["IPW"] - truth)
print("\n" + "=" * 60)
print(" BIAS REDUCTION SUMMARY")
print("=" * 60)
print(f"Naive estimator error : {err_naive:.3f}")
print(f"IPW estimator error : {err_ipw:.3f}")
print(f"Bias reduction : {(1 - err_ipw / max(err_naive, 1e-9)) * 100:.1f}%")
We visualize the results using plots of ASMD, outcome estimates, weight distributions, and covariate alignment. We then explore variance control using trimmed weights and save the final weighted dataset for downstream use. Finally, we compute bias-reduction metrics to confirm how much the adjustment improves estimation accuracy.
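The idea behind max_de can be sketched directly. The following is an illustration of weight trimming on hypothetical weights, not balance's exact algorithm (which searches for a trimming level that keeps the design effect under the cap): clip extreme weights, renormalize, and the design effect drops.

```python
import numpy as np

# Weight-trimming sketch (hypothetical weights, not balance's internals):
# cap weights at the 99th percentile, renormalize to sum to n, and compare
# the design effect deff = n * sum(w^2) / (sum(w))^2 before and after.
rng = np.random.default_rng(2)
w = rng.lognormal(0, 1, 2_000)            # heavy-tailed weights

def deff(w):
    return len(w) * np.sum(w ** 2) / (w.sum() ** 2)

cap = np.quantile(w, 0.99)
w_trim = np.minimum(w, cap)
w_trim *= len(w_trim) / w_trim.sum()      # keep sum(w) = n

print(f"design effect: before={deff(w):.2f}  after={deff(w_trim):.2f}")
```

Trimming deliberately accepts a little bias, since the heaviest-weighted units end up under-represented again, in exchange for a smaller design effect and tighter confidence intervals.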
In conclusion, we saw how re-weighting techniques can substantially reduce bias and bring sample estimates much closer to the true population values. We compared multiple adjustment methods and examined the trade-offs between bias reduction and variance, particularly when handling extreme weights. Using the balance framework, we built a reproducible pipeline that not only corrects for selection bias but also provides clear diagnostics and interpretability. This workflow equips us with practical tools to handle real-world biased datasets, enabling more reliable inference and decision-making in survey analysis and observational studies.
Sana Hassan
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


