Source: MarkTechPost
In this tutorial, we explore Datashader, a high-performance visualization library for rendering massive datasets that quickly overwhelm traditional plotting tools. We work through its full rendering pipeline in Google Colab, moving from dense point clouds and reduction-based aggregations to categorical rendering, line visualizations, raster data, quadmesh grids, compositing, and dashboard-style analytical views. In each section, we focus on how Datashader transforms raw large-scale data into meaningful visual structure with speed, flexibility, and clarity, while keeping Matplotlib as the final presentation layer.
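Before diving into the full notebook, it helps to see the shape of every Datashader render: project data onto a fixed pixel grid (a Canvas), aggregate per pixel (a reduction such as count), then colorize the aggregate into an image (shade). Below is a minimal sketch of that three-stage pipeline, assuming only that datashader and pandas are installed; every later section elaborates on these same steps.

import numpy as np, pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

df = pd.DataFrame({"x": np.random.randn(100_000),
                   "y": np.random.randn(100_000)})
cvs = ds.Canvas(plot_width=300, plot_height=300)   # 1) fixed pixel grid
agg = cvs.points(df, "x", "y", agg=ds.count())     # 2) per-pixel aggregate (an xarray.DataArray)
img = tf.shade(agg, how="eq_hist")                 # 3) colorized RGBA image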
import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                       "datashader", "colorcet", "numba", "scipy"])

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
from datashader import reductions as rd
import colorcet as cc
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
from scipy.stats import multivariate_normal
import time, warnings

warnings.filterwarnings("ignore")
print("Datashader version:", ds.__version__)

def show(img, title="", ax=None, figsize=(6, 5)):
    standalone = ax is None
    if standalone:
        fig, ax = plt.subplots(figsize=figsize)
    rgba = img.to_pil()
    ax.imshow(rgba, origin="upper", aspect="auto")
    ax.set_title(title, fontsize=11, fontweight="bold")
    ax.axis("off")
    if standalone:
        plt.tight_layout()
        plt.show()

print("\n=== SECTION 1: Core Pipeline ===")
rng = np.random.default_rng(42)
N = 2_000_000
x = np.concatenate([rng.normal(-1, 0.5, N//3),
                    rng.normal( 1, 0.5, N//3),
                    rng.normal( 0, 1.5, N//3)])
y = np.concatenate([rng.normal(-1, 0.5, N//3),
                    rng.normal( 1, 0.5, N//3),
                    rng.normal( 0, 0.5, N//3)])
df_base = pd.DataFrame({"x": x, "y": y})

canvas = ds.Canvas(plot_width=600, plot_height=500,
                   x_range=(-4, 4), y_range=(-4, 4))
agg = canvas.points(df_base, "x", "y", agg=rd.count())

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
combos = [
    ("Linear / blues", tf.shade(agg, cmap=cc.blues, how="linear")),
    ("Log / fire",     tf.shade(agg, cmap=cc.fire,  how="log")),
    ("Eq-hist / bmy",  tf.shade(agg, cmap=cc.bmy,   how="eq_hist")),
]
for ax, (title, img) in zip(axes, combos):
    show(img, title, ax=ax)
plt.suptitle("Section 1 – 2 M points: Linear vs Log vs Eq-Hist normalisation",
             fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()

print("\n=== SECTION 2: Reduction Types ===")
n_actual = len(df_base)
df_base["value"] = rng.exponential(scale=2, size=n_actual)
df_base["label"] = pd.Categorical(
    rng.choice(["A", "B", "C"], size=n_actual),
    categories=["A", "B", "C"]
)
canvas2 = ds.Canvas(plot_width=400, plot_height=350,
                    x_range=(-4, 4), y_range=(-4, 4))
reductions_cfg = [
    ("count()",          rd.count(),            cc.kbc),
    ("sum(value)",       rd.sum("value"),       cc.CET_L3),
    ("mean(value)",      rd.mean("value"),      cc.CET_D4),
    ("std(value)",       rd.std("value"),       cc.CET_L16),
    ("min(value)",       rd.min("value"),       cc.CET_L17),
    ("max(value)",       rd.max("value"),       cc.bgyw),
    ("var(value)",       rd.var("value"),       cc.CET_L18),
    ("count_cat(label)", rd.count_cat("label"), None),
]
fig, axes = plt.subplots(2, 4, figsize=(18, 9))
axes = axes.flat
for ax, (name, agg_fn, cmap) in zip(axes, reductions_cfg):
    agg_r = canvas2.points(df_base, "x", "y", agg=agg_fn)
    if cmap is None:
        img = tf.shade(agg_r, color_key={"A": "#e41a1c", "B": "#377eb8", "C": "#4daf4a"})
    else:
        img = tf.shade(agg_r, cmap=cmap, how="eq_hist")
    show(img, name, ax=ax)
plt.suptitle("Section 2 – All Reduction Types on 2 M points",
             fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

print("\n=== SECTION 3: Categorical Visualisation ===")
N_cat = 500_000
categories = ["Cluster A", "Cluster B", "Cluster C", "Cluster D"]
centers = [(-2, -2), (-2, 2), (2, -2), (2, 2)]
colors = {"Cluster A": "#e41a1c", "Cluster B": "#377eb8",
          "Cluster C": "#4daf4a", "Cluster D": "#ff7f00"}
frames = []
for cat, (cx, cy) in zip(categories, centers):
    n = N_cat // len(categories)
    frames.append(pd.DataFrame({
        "x": rng.normal(cx, 0.8, n),
        "y": rng.normal(cy, 0.8, n),
        "cat": pd.Categorical([cat]*n, categories=categories),
    }))
df_cat = pd.concat(frames, ignore_index=True)

canvas3 = ds.Canvas(plot_width=500, plot_height=500,
                    x_range=(-5, 5), y_range=(-5, 5))
agg_cat = canvas3.points(df_cat, "x", "y", agg=rd.count_cat("cat"))

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
img_raw = tf.shade(agg_cat, color_key=colors)
show(img_raw, "Raw (no spread)", ax=axes[0])
img_sp1 = tf.spread(tf.shade(agg_cat, color_key=colors), px=1)
show(img_sp1, "Spread px=1", ax=axes[1])
img_bg = tf.set_background(tf.shade(agg_cat, color_key=colors), color="black")
show(img_bg, "Black background", ax=axes[2])
for cat, col in colors.items():
    axes[2].plot([], [], "o", color=col, label=cat, markersize=8)
axes[2].legend(loc="lower right", fontsize=8, framealpha=0.6)
plt.suptitle("Section 3 – Categorical Rendering (500 k points)",
             fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
We install the required libraries and import everything needed to build a complete Datashader workflow in Google Colab. We define a helper function to display Datashader images with Matplotlib, which keeps the rendering pipeline simple and visually consistent. We then begin with the core Datashader pipeline, explore multiple reduction types, and show how categorical data can be rendered clearly using color keys, spreading, and background adjustments.
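Because canvas.points() returns an ordinary xarray.DataArray rather than finished pixels, we can also post-process an aggregate before shading it. As a small illustrative sketch, reusing the agg and show() defined above (the threshold of 10 points is an arbitrary choice for demonstration), masking with where() keeps only the dense pixels:

# Keep only pixels that received at least 10 points; masked pixels
# become NaN, which tf.shade() renders as transparent.
agg_dense = agg.where(agg >= 10)
show(tf.shade(agg_dense, cmap=cc.fire, how="eq_hist"),
     "Pixels with at least 10 points")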
print("n=== SECTION 4: Line Rendering ===") n_series, n_steps = 5_000, 500 t = np.linspace(0, 1, n_steps) xs = np.tile(t, n_series) walks = np.cumsum(rng.normal(0, 0.05, (n_series, n_steps)), axis=1) ys = walks.ravel() series_id = np.repeat(np.arange(n_series), n_steps) df_lines = pd.DataFrame({"x": xs, "y": ys, "id": series_id}) canvas4 = ds.Canvas(plot_width=700, plot_height=450, x_range=(0, 1), y_range=(-6, 6)) agg_lines = canvas4.line(df_lines, "x", "y", agg=rd.count(), line_width=1) fig, axes = plt.subplots(1, 2, figsize=(14, 5)) show(tf.shade(agg_lines, cmap=cc.fire, how="eq_hist"), "5 000 random walks – eq_hist / fire", ax=axes[0]) show(tf.shade(agg_lines, cmap=cc.blues, how="log"), "5 000 random walks – log / blues", ax=axes[1]) plt.suptitle("Section 4 – Line / Time-Series Rendering", fontsize=13, fontweight="bold") plt.tight_layout() plt.show() print("n=== SECTION 5: Raster / Grid Data ===") import xarray as xr res = 1000 lon = np.linspace(-180, 180, res) lat = np.linspace(-90, 90, res) LON, LAT = np.meshgrid(lon, lat) z = ( multivariate_normal.pdf(np.stack([LON, LAT], -1), mean=[30, 30], cov=[[800,0],[0,500]]) + multivariate_normal.pdf(np.stack([LON, LAT], -1), mean=[-60, -20], cov=[[600,0],[0,400]]) + 0.02 * rng.standard_normal((res, res))) da = xr.DataArray(z, dims=["y", "x"], coords={"x": lon, "y": lat}) canvas5 = ds.Canvas(plot_width=700, plot_height=400, x_range=(-180, 180), y_range=(-90, 90)) agg_raster = canvas5.raster(da) fig, axes = plt.subplots(1, 2, figsize=(14, 4)) show(tf.shade(agg_raster, cmap=cc.CET_L18, how="eq_hist"), "Synthetic elevation – eq_hist", ax=axes[0]) show(tf.shade(agg_raster, cmap=cc.rainbow, how="linear"), "Synthetic elevation – linear", ax=axes[1]) plt.suptitle("Section 5 – Raster / Grid (xarray DataArray)", fontsize=13, fontweight="bold") plt.tight_layout() plt.show() print("n=== SECTION 6: QuadMesh / 2-D Grid Glyph ===") lon6 = np.concatenate([np.linspace(-180, -60, 80), np.linspace(-60, 60, 30), np.linspace( 60, 180, 80)]) lat6 = np.concatenate([np.linspace(-90, -30, 40), np.linspace(-30, 30, 20), np.linspace( 30, 90, 40)]) LON6, LAT6 = np.meshgrid(lon6, lat6) def vortex(lon0, lat0, amp=1.0): return amp * np.exp(-((LON6-lon0)**2/1200 + (LAT6-lat0)**2/600)) field6 = vortex(-40, 30, 1.2) + vortex(120, -20, 0.9) + 0.05 * rng.standard_normal(LON6.shape) da6 = xr.DataArray(field6.astype(np.float32), dims=["y", "x"], coords={"x": lon6, "y": lat6}, name="intensity") canvas6 = ds.Canvas(plot_width=700, plot_height=380, x_range=(-180, 180), y_range=(-90, 90)) agg6 = canvas6.quadmesh(da6) canvas6z = ds.Canvas(plot_width=500, plot_height=400, x_range=(-80, 0), y_range=(0, 60)) agg6z = canvas6z.quadmesh(da6) field6_smooth = vortex(-40, 30, 1.0) + vortex(120, -20, 0.8) da6_diff = xr.DataArray((field6 - field6_smooth).astype(np.float32), dims=["y","x"], coords={"x": lon6, "y": lat6}, name="anomaly") agg6d = canvas6.quadmesh(da6_diff) fig, axes = plt.subplots(1, 3, figsize=(18, 5)) show(tf.shade(agg6, cmap=cc.fire, how="eq_hist"), "Global field – eq_hist", ax=axes[0]) show(tf.shade(agg6z, cmap=cc.CET_L3, how="linear"), "N. Atlantic zoom – linear", ax=axes[1]) show(tf.shade(agg6d, cmap=cc.CET_D4, how="eq_hist"), "Residual (anomaly) – eq_hist",ax=axes[2]) plt.suptitle("Section 6 – canvas.quadmesh(): non-uniform 2-D grids", fontsize=13, fontweight="bold") plt.tight_layout() plt.show()
We move beyond point clouds and use Datashader to render thousands of overlapping random-walk lines efficiently and without visual clutter. We then work with raster and grid-based data through xarray, showing how Datashader handles continuous spatial fields and non-uniform quadmesh structures. We close this part by exploring global fields, local zoom views, and anomaly-style visual comparisons to understand Datashader's strength on structured 2-D data.
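As an aside, building one long NaN-separated dataframe is not the only way to render many series: Datashader's line glyph also accepts wide-format input, where axis=1 treats each row as its own line and a shared x array can be passed directly. A brief sketch reusing walks, t, and canvas4 from Section 4 — we believe this axis=1 form is available in recent Datashader releases, so treat the exact signature as an assumption to verify:

# Wide format: one row per random walk, one column per time step.
df_wide = pd.DataFrame(walks, columns=[f"t{i}" for i in range(n_steps)])

# axis=1 draws one line per row; the shared time axis is passed as an
# array so every row reuses the same x coordinates.
agg_wide = canvas4.line(df_wide, x=t, y=list(df_wide.columns),
                        axis=1, agg=rd.count())
show(tf.shade(agg_wide, cmap=cc.fire, how="eq_hist"),
     "Wide-format lines (axis=1)")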
print("n=== SECTION 7: Spreading, Stack & Composite ===") N7 = 300_000 x7 = rng.normal(0, 3, N7) y7 = rng.normal(0, 3, N7) df7 = pd.DataFrame({"x": x7, "y": y7}) canvas7 = ds.Canvas(plot_width=500, plot_height=500, x_range=(-10, 10), y_range=(-10, 10)) agg7 = canvas7.points(df7, "x", "y", agg=rd.count()) fig, axes = plt.subplots(1, 4, figsize=(20, 5)) for ax, (label, px) in zip(axes, [("No spread", 0), ("spread px=1", 1), ("spread px=2", 2), ("spread px=4", 4)]): img = tf.shade(agg7, cmap=cc.fire, how="eq_hist") if px > 0: img = tf.spread(img, px=px) show(img, label, ax=ax) plt.suptitle("Section 7 – Effect of spread() on sparse data", fontsize=13, fontweight="bold") plt.tight_layout() plt.show() N_fg = 50_000 x_fg = rng.normal(0, 0.5, N_fg) y_fg = rng.normal(0, 0.5, N_fg) df_fg = pd.DataFrame({"x": x_fg, "y": y_fg}) agg_bg = canvas7.points(df7, "x", "y", agg=rd.count()) agg_fg = canvas7.points(df_fg,"x", "y", agg=rd.count()) img_bg_shade = tf.shade(agg_bg, cmap=cc.blues, how="log", alpha=200) img_fg_shade = tf.shade(agg_fg, cmap=cc.fire, how="eq_hist") stacked = tf.stack(img_bg_shade, img_fg_shade) fig, axes = plt.subplots(1, 3, figsize=(15, 5)) show(img_bg_shade, "Background (blue / log)", ax=axes[0]) show(img_fg_shade, "Foreground (fire / eq_hist)", ax=axes[1]) show(stacked, "Stacked composite", ax=axes[2]) plt.suptitle("Section 7 – Stack / Composite two layers", fontsize=13, fontweight="bold") plt.tight_layout() plt.show() print("n=== SECTION 8: Performance Benchmark ===") sizes = [10_000, 100_000, 1_000_000, 5_000_000, 20_000_000] timings = [] for n in sizes: xb = rng.normal(0, 1, n).astype(np.float32) yb = rng.normal(0, 1, n).astype(np.float32) dfb = pd.DataFrame({"x": xb, "y": yb}) cvb = ds.Canvas(plot_width=800, plot_height=700) cvb.points(dfb, "x", "y", agg=rd.count()) t0 = time.perf_counter() cvb.points(dfb, "x", "y", agg=rd.count()) elapsed = time.perf_counter() - t0 timings.append(elapsed) print(f" {n:>12,} points → {elapsed*1000:6.1f} ms") fig, ax = plt.subplots(figsize=(8, 4)) ax.loglog([s/1e6 for s in sizes], [t*1000 for t in timings], "o-", linewidth=2, markersize=8, color="#e41a1c") ax.set_xlabel("Dataset size (millions of points)", fontsize=12) ax.set_ylabel("Render time (ms)", fontsize=12) ax.set_title("Section 8 – Datashader Render Performance (800×700 canvas)", fontsize=12, fontweight="bold") ax.grid(True, which="both", alpha=0.4) plt.tight_layout() plt.show() print("n=== SECTION 9: Custom Matplotlib Colourmap Pipeline ===") N9 = 3_000_000 angle = rng.uniform(0, 2*np.pi, N9) radius = rng.exponential(1.5, N9) x9 = radius * np.cos(angle) y9 = radius * np.sin(angle) df9 = pd.DataFrame({"x": x9.astype(np.float32), "y": y9.astype(np.float32)}) canvas9 = ds.Canvas(plot_width=600, plot_height=600, x_range=(-8, 8), y_range=(-8, 8)) agg9 = canvas9.points(df9, "x", "y", agg=rd.count()) mpl_cmaps = ["inferno", "plasma", "viridis", "cividis"] fig, axes = plt.subplots(1, 4, figsize=(18, 5)) for ax, cmap_name in zip(axes, mpl_cmaps): cmap = plt.get_cmap(cmap_name) colours = [mcolors.to_hex(cmap(i/255)) for i in range(256)] img = tf.shade(agg9, cmap=colours, how="eq_hist") show(img, f"mpl: {cmap_name}", ax=ax) plt.suptitle("Section 9 – Any Matplotlib colourmap with Datashader", fontsize=13, fontweight="bold") plt.tight_layout() plt.show()
We focus on spreading, stacking, and compositing to see how Datashader improves the visibility of sparse data and combines multiple rendered layers into a single visual output. We then benchmark performance across datasets ranging from thousands to tens of millions of points, which helps us observe how rendering time scales with data size. Finally, we build a custom color pipeline using Matplotlib colormaps, showing how we can connect Datashader’s aggregation engine with familiar visualization palettes.
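Related to the fixed-radius spread() used above, Datashader also provides tf.dynspread(), which increases the spread radius only until a target fraction of pixels would start to merge, adapting to local density rather than applying one radius everywhere. A short sketch reusing agg7 from Section 7:

# dynspread() grows isolated points adaptively: it stops enlarging once
# the fraction of adjacent non-empty pixels exceeds `threshold`.
img_dyn = tf.dynspread(tf.shade(agg7, cmap=cc.fire, how="eq_hist"),
                       threshold=0.5, max_px=4)
show(img_dyn, "dynspread – density-adaptive spreading")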
print("n=== SECTION 10: Multi-Panel Dashboard ===") N10 = 1_500_000 price = np.cumsum(rng.normal(0, 0.01, N10)) + 100 vol = np.abs(rng.normal(0, 1, N10)) * (1 + 0.5 * rng.exponential(1, N10)) ret = np.diff(price, prepend=price[0]) hour = (np.arange(N10) % 390) / 390 df10 = pd.DataFrame({ "price": price.astype(np.float32), "vol": vol.astype(np.float32), "ret": ret.astype(np.float32), "hour": hour.astype(np.float32), }) fig = plt.figure(figsize=(16, 12)) gs = GridSpec(2, 3, figure=fig, hspace=0.35, wspace=0.3) panels = [ (gs[0,0], "price", "vol", "Price vs Volume", cc.fire), (gs[0,1], "ret", "vol", "Return vs Volume", cc.bmy), (gs[0,2], "hour", "price","Intraday Price Distribution",cc.CET_L3), (gs[1,0], "ret", "price","Return vs Price", cc.CET_D4), (gs[1,1], "hour", "ret", "Intraday Return Profile", cc.CET_L16), (gs[1,2], "hour", "vol", "Intraday Volume Profile", cc.CET_L17), ] for spec, xcol, ycol, title, cmap in panels: ax = fig.add_subplot(spec) xr_ = (float(df10[xcol].quantile(0.001)), float(df10[xcol].quantile(0.999))) yr_ = (float(df10[ycol].quantile(0.001)), float(df10[ycol].quantile(0.999))) cv = ds.Canvas(plot_width=300, plot_height=250, x_range=xr_, y_range=yr_) ag = cv.points(df10, xcol, ycol, agg=rd.count()) img = tf.shade(ag, cmap=cmap, how="eq_hist") show(img, title, ax=ax) ax.set_axis_off() fig.suptitle("Section 10 – Multi-Panel Dashboard: 1.5 M Synthetic Trades", fontsize=15, fontweight="bold", y=1.01) plt.show() print("n=== SECTION 11: Zoom / Sub-Region Magnification ===") zoom_ranges = [ ((-5, 5), (-5, 5), "Full extent"), ((-3, 0), (-3, 0), "Quadrant III"), ((-1.5, 0.5),(-1.5, 0.5),"Zoomed into cluster A"), ] fig, axes = plt.subplots(1, 3, figsize=(15, 5)) for ax, (xr, yr, title) in zip(axes, zoom_ranges): cv = ds.Canvas(plot_width=400, plot_height=400, x_range=xr, y_range=yr) ag = cv.points(df_cat, "x", "y", agg=rd.count_cat("cat")) img = tf.shade(ag, color_key=colors) show(img, title, ax=ax) plt.suptitle("Section 11 – Zoom: No Data Loss at Any Scale", fontsize=13, fontweight="bold") plt.tight_layout() plt.show()
We create a multi-panel dashboard that uses Datashader repeatedly across several variable pairs, allowing us to inspect large synthetic trading-style data from multiple analytical perspectives at once. We calculate robust plotting ranges using quantiles, ensuring each panel focuses on the most informative region of the data. We then demonstrate zoom-based magnification, showing that Datashader preserves detail at every scale and allows us to inspect subregions without losing data fidelity.
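When panels like these need to leave the notebook, Datashader images can also be written straight to disk. A small sketch using datashader.utils.export_image, reusing the categorical aggregate and colour key from Section 3 (the filename and export path here are illustrative):

from datashader.utils import export_image

# Shade agg_cat with its colour key and write clusters_panel.png into
# the current directory; the background fill is applied on export.
panel = tf.shade(agg_cat, color_key=colors)
export_image(panel, "clusters_panel", background="black", export_path=".")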
print("n=== SECTION 12: Overlay with Matplotlib ===") canvas12 = ds.Canvas(plot_width=600, plot_height=600, x_range=(-4, 4), y_range=(-4, 4)) agg12 = canvas12.points(df_base, "x", "y", agg=rd.count()) img12 = tf.shade(agg12, cmap=cc.fire, how="eq_hist") from scipy.stats import gaussian_kde sample_idx = rng.integers(0, len(df_base), 20_000) kde = gaussian_kde(df_base.iloc[sample_idx][["x","y"]].values.T, bw_method=0.15) gx = np.linspace(-4, 4, 80) gy = np.linspace(-4, 4, 80) GX, GY = np.meshgrid(gx, gy) Z = kde(np.vstack([GX.ravel(), GY.ravel()])).reshape(GX.shape) fig, ax = plt.subplots(figsize=(7, 6)) ax.imshow(img12.to_pil(), origin="upper", aspect="auto", extent=[-4, 4, -4, 4]) ax.contour(GX, GY, Z, levels=8, colors="white", linewidths=0.8, alpha=0.7) ax.set_title("Section 12 – Datashader + Matplotlib Contour Overlay", fontsize=12, fontweight="bold") ax.set_xlabel("x"); ax.set_ylabel("y") plt.tight_layout() plt.show() print("n✅ All sections complete!")
We combine Datashader with Matplotlib overlays by rendering a dense, aggregated image and then placing contour lines on top using a KDE computed from a sampled subset of points. This shows how Datashader can serve as the high-performance visual foundation, while Matplotlib adds additional analytical annotations. We finish the tutorial by completing the full workflow and demonstrating how large-scale rendering and traditional plotting can work together effectively.
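For tighter integration than a static to_pil() image, recent Datashader versions also ship a native Matplotlib artist, datashader.mpl_ext.dsshow, which re-aggregates whenever the axes are zoomed and supports a real colorbar. A hedged sketch of that route, reusing df_base (the extension is available in Datashader 0.12 and later; the parameters shown are illustrative):

from datashader.mpl_ext import dsshow

fig, ax = plt.subplots(figsize=(7, 6))
# dsshow returns a Matplotlib artist backed by the Datashader pipeline,
# so standard colorbars and interactive pan/zoom both work.
artist = dsshow(df_base, ds.Point("x", "y"), ds.count(),
                norm="log", cmap="viridis", ax=ax)
fig.colorbar(artist, ax=ax, label="points per pixel")
ax.set_title("dsshow: Datashader as a native Matplotlib artist")
plt.show()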
In conclusion, we built a strong practical understanding of how Datashader handles millions of points, multiple glyph types, grid-based data, layered composites, and custom color workflows efficiently and at scale. We saw how its aggregation-first approach preserves detail, avoids overplotting, and supports zooming into dense regions without losing fidelity. Through these examples, we learned how to use Datashader's key features in practice and why it is such a valuable tool for large-scale data visualization workflows in Python.
Sana Hassan
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


