[Tutorial] Building a Visual Document Retrieval Pipeline with ColPali and Late Interaction Scoring


Source: MarkTechPost

In this tutorial, we build an end-to-end visual document retrieval pipeline using ColPali. We focus on making the setup robust by resolving common dependency conflicts and ensuring the environment stays stable. We render PDF pages as images, embed them using ColPali’s multi-vector representations, and rely on late-interaction scoring to retrieve the most relevant pages for a natural-language query. By treating each page visually rather than as plain text, we preserve layout, tables, and figures that are often lost in traditional text-only retrieval.
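Before building anything, it helps to see what late-interaction scoring actually computes. The sketch below is a minimal, illustrative version of the MaxSim operation: every query token is matched against its best document token, and the matches are summed. In the tutorial itself, the batched version of this computation is handled by the processor's score_multi_vector call; this snippet is for intuition only.

import torch

def maxsim_score(q_emb, d_emb):
    # q_emb: (num_query_tokens, dim); d_emb: (num_doc_tokens, dim)
    sim = q_emb @ d_emb.T               # token-to-token similarity matrix
    return sim.max(dim=1).values.sum()  # best doc token per query token, summed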

import subprocess, sys, os, json, hashlib

def pip(cmd):
    subprocess.check_call([sys.executable, "-m", "pip"] + cmd)

# Remove potentially conflicting packages first, then pin known-good versions.
pip(["uninstall", "-y", "pillow", "PIL", "torchaudio", "colpali-engine"])
pip(["install", "-q", "--upgrade", "pip"])
pip(["install", "-q", "pillow<12", "torchaudio==2.8.0"])
pip(["install", "-q", "colpali-engine", "pypdfium2", "matplotlib", "tqdm", "requests"])

We prepare a clean and stable execution environment by uninstalling conflicting packages and upgrading pip. We explicitly pin compatible versions of Pillow and torchaudio to avoid runtime import errors. We then install ColPali and its required dependencies so the rest of the tutorial runs without interruptions.
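As an optional, purely illustrative sanity check, you can confirm that the pinned packages import cleanly before moving on:

import PIL, torchaudio

print("Pillow:", PIL.__version__)             # expect an 11.x release (pinned below 12)
print("torchaudio:", torchaudio.__version__)  # expect 2.8.0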

import torch
import requests
import pypdfium2 as pdfium
from PIL import Image
from tqdm import tqdm
import matplotlib.pyplot as plt
from transformers.utils.import_utils import is_flash_attn_2_available
from colpali_engine.models import ColPali, ColPaliProcessor

# Use the GPU (and half precision) when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

MODEL_NAME = "vidore/colpali-v1.3"

model = ColPali.from_pretrained(
    MODEL_NAME,
    torch_dtype=dtype,
    device_map=device,
    attn_implementation="flash_attention_2" if device == "cuda" and is_flash_attn_2_available() else None,
).eval()

processor = ColPaliProcessor.from_pretrained(MODEL_NAME)

We import all required libraries and detect whether a GPU is available for acceleration. We load the ColPali model and processor with the appropriate precision and attention implementation based on the runtime. We ensure the model is ready for inference by switching it to evaluation mode.
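If you want to verify what was actually loaded, a short illustrative check over the model's parameters makes the runtime configuration explicit:

# Optional: report the runtime configuration (illustrative check).
n_params = sum(p.numel() for p in model.parameters())
print(f"device={device}, dtype={dtype}, params={n_params/1e9:.2f}B")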

PDF_URL = "https://arxiv.org/pdf/2407.01449.pdf"
pdf_bytes = requests.get(PDF_URL).content

pdf = pdfium.PdfDocument(pdf_bytes)
pages = []
MAX_PAGES = 15

# Render each page at 2x scale so small text, tables, and figures stay legible.
for i in range(min(len(pdf), MAX_PAGES)):
    page = pdf[i]
    img = page.render(scale=2).to_pil().convert("RGB")
    pages.append(img)

We download a sample PDF and render its pages as high-resolution RGB images. We limit the number of pages to keep the tutorial lightweight and fast on Colab. We store the rendered pages in memory for direct visual embedding.
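To confirm the rendering step worked as expected, a quick illustrative check prints the pixel dimensions of the first few pages:

# Optional: inspect the rendered pages (illustrative).
print(f"rendered {len(pages)} pages")
for i, img in enumerate(pages[:3]):
    print(f"page {i+1}: {img.size[0]}x{img.size[1]} px")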

page_embeddings = []
batch_size = 2 if device == "cuda" else 1

for i in tqdm(range(0, len(pages), batch_size)):
    batch_imgs = pages[i:i+batch_size]
    batch = processor.process_images(batch_imgs)
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.no_grad():
        emb = model(**batch)
    page_embeddings.extend(list(emb.cpu()))

# Stack into a single (num_pages, num_tokens, dim) tensor for scoring.
page_embeddings = torch.stack(page_embeddings)

We generate multi-vector embeddings for each rendered page using ColPali’s image encoder. We process pages in small batches to stay within GPU memory limits. We then stack all page embeddings into a single tensor that supports efficient late-interaction scoring.
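It is worth inspecting the resulting tensor to see the multi-vector layout. The exact token count and embedding dimension depend on the model checkpoint, so treat the shape below as something to read off rather than a number to expect:

# Optional: inspect the multi-vector layout (illustrative).
# Expect (num_pages, num_tokens, dim) rather than one vector per page,
# since ColPali keeps one embedding per image patch token.
print(page_embeddings.shape)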

def retrieve(query, top_k=3):
    q = processor.process_queries([query])
    q = {k: v.to(model.device) for k, v in q.items()}
    with torch.no_grad():
        q_emb = model(**q).cpu()
    # Late-interaction (MaxSim) scoring of the query against every page.
    scores = processor.score_multi_vector(q_emb, page_embeddings)[0]
    vals, idxs = torch.topk(scores, top_k)
    return [(int(i), float(v)) for i, v in zip(idxs, vals)]

def show(img, title):
    plt.figure(figsize=(6, 6))
    plt.imshow(img)
    plt.axis("off")
    plt.title(title)
    plt.show()

query = "What is ColPali and what problem does it solve?"
results = retrieve(query, top_k=3)

for rank, (idx, score) in enumerate(results, 1):
    show(pages[idx], f"Rank {rank} — Page {idx+1}")

def search(query, k=5):
    return [{"page": i+1, "score": s} for i, s in retrieve(query, k)]

print(json.dumps(search("late interaction retrieval"), indent=2))

We define the retrieval logic that scores queries against page embeddings using late interaction. We visualize the top-ranked pages to qualitatively inspect retrieval quality. We also expose a small search helper that returns structured results, making the pipeline easy to extend or integrate further.
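Because encoding pages is the expensive step, a natural extension is to persist the embeddings between sessions. The snippet below is a minimal sketch, not part of the tutorial proper; the file name colpali_pages.pt is a hypothetical choice of ours.

# Sketch: cache page embeddings so reruns skip re-encoding (file name is hypothetical).
torch.save({"embeddings": page_embeddings, "num_pages": len(pages)}, "colpali_pages.pt")

cached = torch.load("colpali_pages.pt")
page_embeddings = cached["embeddings"]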

In conclusion, we have a compact yet powerful visual search system that demonstrates how ColPali enables layout-aware document retrieval in practice. We embed pages once, reuse those embeddings across queries, and retrieve results with interpretable relevance scores. This workflow gives us a strong foundation for scaling to larger document collections, adding indexing for speed, or layering generation on top of retrieved pages, while keeping the core pipeline simple, reproducible, and Colab-friendly.


Check out the FULL CODES here.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.