Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export

Source: MarkTechPost

In this tutorial, we build a full Crawlee-for-Python workflow that covers environment setup, local website generation, static crawling, dynamic crawling, structured extraction, and downstream data processing. We begin by configuring a compatible Crawlee runtime with pinned Pydantic support, Playwright browser installation, persistent storage directories, and Colab-safe execution handling. We then generate a realistic local demo website containing product pages, documentation pages, blog content, internal links, robots.txt rules, JSON-LD metadata, and JavaScript-rendered catalog items. Using BeautifulSoupCrawler, we perform fast recursive HTML crawling and extract page titles, metadata, text previews, outgoing links, product attributes, documentation headings, code blocks, and blog tags. With ParselCrawler, we run precise CSS- and XPath-based extraction on product detail pages. With PlaywrightCrawler, we render JavaScript content in a headless Chromium browser, wait for dynamic DOM elements to appear, extract client-side data, and capture full-page screenshots.

Setting Up the Crawlee Python Runtime and Helpers

import os
import sys
import re
import csv
import json
import time
import math
import shutil
import socket
import hashlib
import asyncio
import textwrap
import subprocess
import threading
from pathlib import Path
from functools import partial
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler
from importlib.metadata import version, PackageNotFoundError

# Marker file written after the one-time dependency install (Colab's /content directory).
SETUP_SENTINEL = "/content/.crawlee_python_tutorial_setup_done_v2"


def sh(command, check=True, quiet=False):
    """Run a shell command, print the tail of its output, and return True on success."""
    print(f"\n$ {command}")
    result = subprocess.run(
        command,
        shell=True,
        text=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    if not quiet and result.stdout:
        print(result.stdout[-5000:])
    if check and result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}: {command}")
    return result.returncode == 0


def package_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def is_good_pydantic_version(v):
    """Accept only the Pydantic 2.11.x line pinned for this tutorial."""
    if not v:
        return False
    m = re.match(r"^(\d+)\.(\d+)", v)
    if not m:
        return False
    major, minor = int(m.group(1)), int(m.group(2))
    return major == 2 and minor == 11


current_crawlee = package_version("crawlee")
current_pydantic = package_version("pydantic")

needs_setup = (
    not os.path.exists(SETUP_SENTINEL)
    or current_crawlee is None
    or not is_good_pydantic_version(current_pydantic)
)

if needs_setup:
    print("PHASE 1: Installing compatible Crawlee + Pydantic + Playwright dependencies.")
    print("After this finishes, Colab will restart automatically. Then run this same cell again.")
    sh(f'{sys.executable} -m pip uninstall -y crawlee pydantic pydantic-core', check=False)
    sh(
        f'{sys.executable} -m pip install -q -U '
        f'"pydantic>=2.11,<2.12" '
        f'"crawlee[all]" '
        f'pandas matplotlib networkx nest_asyncio beautifulsoup4 parsel'
    )
    sh(f'{sys.executable} -m playwright install --with-deps chromium', check=False)
    Path(SETUP_SENTINEL).write_text("done", encoding="utf-8")
    print("\nInstalled versions:")
    sh(f'{sys.executable} -m pip show crawlee pydantic pydantic-core', check=False)
    try:
        import google.colab
        print("\nRestarting Colab runtime now. After it reconnects, run this same cell again.")
        os.kill(os.getpid(), 9)
    except Exception:
        raise SystemExit("Setup complete. Restart the runtime/kernel manually, then run this cell again.")

print("PHASE 2: Dependencies are ready. Running the Crawlee tutorial.")

import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import nest_asyncio

# Allow nested event loops so asyncio code can run inside the notebook's own loop.
nest_asyncio.apply()

TUTORIAL_ROOT = Path("/content/crawlee_python_advanced_tutorial")
SITE_DIR = TUTORIAL_ROOT / "demo_site"
OUTPUT_DIR = TUTORIAL_ROOT / "outputs"
STORAGE_DIR = TUTORIAL_ROOT / "crawlee_storage"
SCREENSHOT_DIR = OUTPUT_DIR / "screenshots"

# Start from a clean slate on every run.
for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR]:
    if path.exists():
        shutil.rmtree(path)

for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR, SCREENSHOT_DIR]:
    path.mkdir(parents=True, exist_ok=True)

os.environ["CRAWLEE_STORAGE_DIR"] = str(STORAGE_DIR)
os.environ["CRAWLEE_LOG_LEVEL"] = "INFO"
os.environ["CRAWLEE_PURGE_ON_START"] = "true"

from crawlee import Glob, ConcurrencySettings
from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
    ParselCrawler,
    ParselCrawlingContext,
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
)

try:
    import crawlee
    print("Crawlee version:", crawlee.__version__)
except Exception:
    print("Crawlee imported successfully.")

print("Pydantic version:", package_version("pydantic"))


def safe_slug(value):
    """Lowercase a value and collapse non-alphanumeric runs into single hyphens."""
    value = re.sub(r"[^a-zA-Z0-9]+", "-", str(value)).strip("-").lower()
    return value or "item"


def money_to_float(value):
    """Strip currency symbols and separators, returning a float or None."""
    if value is None:
        return None
    cleaned = re.sub(r"[^0-9.]", "", str(value))
    return float(cleaned) if cleaned else None


def normalize_text(value, max_len=None):
    """Collapse whitespace and optionally truncate to max_len characters."""
    value = re.sub(r"\s+", " ", value or "").strip()
    return value[:max_len] if max_len else value


def write_file(path, content):
    """Write dedented text content to path, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(textwrap.dedent(content).strip() + "\n", encoding="utf-8")

We begin by preparing the complete Colab runtime for the Crawlee tutorial. We install compatible versions of Crawlee, Pydantic, Playwright, and the required analysis libraries, and handle the automatic restart required after setup. We then configure storage folders, environment variables, crawler imports, and helper functions to ensure the rest of the workflow runs smoothly.
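The version gate used during setup can be illustrated in isolation. The following is a minimal sketch, assuming the tutorial's pin of the Pydantic 2.11.x line; `parse_major_minor` and `is_pinned` are hypothetical helper names introduced here for illustration and are not part of Crawlee or Pydantic.

```python
import re
from typing import Optional, Tuple


def parse_major_minor(version_string: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (major, minor) parsed from a version string, or None if it does not parse."""
    if not version_string:
        return None
    match = re.match(r"^(\d+)\.(\d+)", version_string)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))


def is_pinned(version_string: Optional[str], major: int = 2, minor: int = 11) -> bool:
    """True only when the version falls on the pinned major.minor line (assumed 2.11)."""
    return parse_major_minor(version_string) == (major, minor)


if __name__ == "__main__":
    for candidate in ["2.11.7", "2.12.0", "1.10.13", None]:
        print(candidate, "->", is_pinned(candidate))
```

Comparing a parsed `(major, minor)` tuple avoids the string-comparison trap where `"2.9"` sorts after `"2.11"`.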

Generating the Demo Website and Product Catalog

PRODUCTS = [
    {
        "sku": "CRW-101",
        "name": "Crawler Reliability Kit",
        "category": "automation",
        "price": 149.0,
        "rating": 4.8,
        "stock": 18,
        "features": ["retry policy", "queue replay", "structured logs"],
        "related": ["CRW-202", "CRW-303"],
    },
    {
        "sku": "CRW-202",
        "name": "Playwright Rendering Pack",
        "category": "browser",
        "price": 249.0,
        "rating": 4.7,
        "stock": 9,
        "features": ["headless chromium", "screenshots", "dynamic DOM extraction"],
        "related": ["CRW-101", "CRW-404"],
    },
    {
        "sku": "CRW-303",
        "name": "RAG Extraction Bundle",
        "category": "ai-data",
        "price": 199.0,
        "rating": 4.9,
        "stock": 13,
        "features": ["clean text chunks", "metadata capture", "JSONL export"],
        "related": ["CRW-101", "CRW-505"],
    },
    {
        "sku": "CRW-404",
        "name": "Anti-Fragile Session Toolkit",
        "category": "resilience",
        "price": 299.0,
        "rating": 4.6,
        "stock": 5,
        "features": ["session rotation", "state recovery", "graceful failures"],
        "related": ["CRW-202", "CRW-505"],
    },
    {
        "sku": "CRW-505",
        "name": "Data Export Control Plane",
        "category": "storage",
        "price": 179.0,
        "rating": 4.5,
        "stock": 21,
        "features": ["datasets", "key-value store", "CSV and JSON export"],
        "related": ["CRW-303", "CRW-404"],
    },
]


def layout(title, body, extra_head="", extra_script=""):
    css = """
    """
    return f"""
    {title}
    {css}
    {extra_head}

{title}

{body}
Local demo website generated for Crawlee Python advanced tutorial.
{extra_script}
"""


def build_demo_site():
    write_file(
        SITE_DIR / "robots.txt",
        """
        User-agent: *
        Disallow: /admin/
        Allow: /
        """,
    )

    product_cards = []
    for product in PRODUCTS:
        product_cards.append(
            f"""

{product['name']}

{product['category']} crawler module with rating {product['rating']}.

${product['price']:.2f}

Stock: {product['stock']}

""" ) write_file( SITE_DIR / "index.html", layout( "Crawlee Demo Commerce + Docs Hub", f"""

Why this site exists

This local website gives us predictable pages for testing Crawlee without scraping a third-party website. We include static HTML pages, documentation pages, product detail pages, a blog article, robots.txt, and a JavaScript-rendered page.

Featured crawler modules

{''.join(product_cards)}

Internal links for recursive crawling

        """,
        ),
    )

    for product in PRODUCTS:
        # Assumed markup: the list-item tags were stripped from this listing. The <li>/<a>
        # tags and the href are reconstructed from the product file names used below.
        related_links = "\n".join(
            f'<li><a href="/products/product-{safe_slug(sku)}.html">{sku}</a></li>'
            for sku in product["related"]
        )
        feature_list = "\n".join(f"<li>{feature}</li>" for feature in product["features"])

        json_ld = json.dumps(
            {
                "@context": "https://schema.org",
                "@type": "Product",
                "sku": product["sku"],
                "name": product["name"],
                "category": product["category"],
                "offers": {
                    "@type": "Offer",
                    "price": product["price"],
                    "priceCurrency": "USD",
                },
                "aggregateRating": {
                    "@type": "AggregateRating",
                    "ratingValue": product["rating"],
                },
            },
            indent=2,
        )

        write_file(
            SITE_DIR / "products" / f"product-{safe_slug(product['sku'])}.html",
            layout(
                f"{product['name']} | Product Detail",
                f"""

    {product['name']}

    SKU: {product['sku']}

    Category: {product['category']}

    ${product['price']:.2f}

    Rating: {product['rating']} / 5

    Stock: {product['stock']}

    Features

      {feature_list}

    Related modules

      {related_links}
    """, ), )

We create a realistic product catalog that serves as the structured data source for our demo website. We define reusable HTML layout logic, styling, navigation, and page templates so that the local website looks and behaves like a small commerce and documentation portal. We then generate the homepage and product detail pages, including prices, ratings, stock levels, product features, related links, and JSON-LD metadata.
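To make the JSON-LD step concrete, here is a minimal standard-library sketch of embedding a product record in a `<script type="application/ld+json">` tag and reading it back. It is an illustration only: `product_json_ld`, `render_json_ld_tag`, `JsonLdCollector`, and `extract_json_ld_blocks` are hypothetical names, and the tutorial itself performs the extraction with BeautifulSoup rather than `html.parser`.

```python
import json
from html.parser import HTMLParser


def product_json_ld(product):
    """Build a schema.org Product payload from a catalog entry."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "sku": product["sku"],
        "name": product["name"],
        "offers": {"@type": "Offer", "price": product["price"], "priceCurrency": "USD"},
    }


def render_json_ld_tag(payload):
    """Serialize the payload into a JSON-LD script tag."""
    return f'<script type="application/ld+json">{json.dumps(payload)}</script>'


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON from every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            raw = "".join(self._buffer).strip()
            if raw:
                self.blocks.append(json.loads(raw))


def extract_json_ld_blocks(html):
    """Return all JSON-LD payloads found in an HTML string."""
    collector = JsonLdCollector()
    collector.feed(html)
    return collector.blocks


if __name__ == "__main__":
    sample = {"sku": "CRW-101", "name": "Crawler Reliability Kit", "price": 149.0}
    page = f"<html><head>{render_json_ld_tag(product_json_ld(sample))}</head></html>"
    print(extract_json_ld_blocks(page))
```

Keeping the structured record in JSON-LD means a crawler can recover typed fields such as the price without depending on the visible page layout.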

Adding Docs, Blog, Dynamic, and Admin Pages

       write_file(        SITE_DIR / "docs" / "getting-started.html",        layout(            "Getting Started with Reliable Crawlers",            """            

    HTTP-first crawling strategy

    We start with HTTP crawlers because they are lightweight and efficient. Browser crawling is reserved for pages that need JavaScript rendering.

    Core extraction fields

    Each crawler extracts URL, title, page type, text summary, outgoing links, and page-specific metadata.

    crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)

    Next: advanced routing

    """, ), ) write_file( SITE_DIR / "docs" / "advanced-routing.html", layout( "Advanced Routing and Storage", """

    Queue filtering

    We filter links to keep the crawl focused on the same local domain and skip admin pages.

    Storage design

    Structured rows go to datasets. Binary screenshots and snapshots go to a key-value store.

    await context.enqueue_links(include=[Glob("https://example.com/**")])

    Read the scaling article

    """, ), ) write_file( SITE_DIR / "blog" / "crawling-at-scale.html", layout( "Crawling at Scale", """

    Scaling crawler jobs without losing reliability

    Production crawlers need controlled concurrency, retry behavior, stable request queues, structured exports, and monitoring-ready output.

    For AI data workflows, we also normalize text, preserve source URLs, create chunks, and record extraction provenance.

    queues datasets rag playwright

    """, ), ) dynamic_items = json.dumps( [ { "sku": "JS-900", "name": "Dynamic Inventory Scanner", "price": 329.0, "stock": 4, "desc": "Rendered only after JavaScript executes.", }, { "sku": "JS-901", "name": "Client-Side Review Miner", "price": 279.0, "stock": 11, "desc": "Created by browser-side DOM manipulation.", }, { "sku": "JS-902", "name": "Async Catalog Watcher", "price": 389.0, "stock": 7, "desc": "Useful for testing PlaywrightCrawler extraction.", }, ], indent=2, ) dynamic_script = f""" """ write_file( SITE_DIR / "dynamic.html", layout( "JavaScript Rendered Catalog", """

    Dynamic content test

    A plain HTTP crawler can download this page, but it will not see the cards below until JavaScript runs. PlaywrightCrawler opens a real browser and extracts the rendered DOM.

    Waiting for JavaScript rendering...

    """, extra_script=dynamic_script, ), ) write_file( SITE_DIR / "admin" / "hidden.html", layout( "Hidden Admin Page", """

    This page should be skipped

    The crawler excludes this admin path to demonstrate control over the crawl scope.

            """,
        ),
    )


build_demo_site()
print(f"Demo site generated at: {SITE_DIR}")


class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that suppresses per-request log lines."""

    def log_message(self, format, *args):
        pass


def start_local_server(directory):
    """Serve `directory` on a free localhost port in a background thread."""
    # Bind to port 0 so the OS picks a free port, then release it for the real server.
    probe = socket.socket()
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()

    handler = partial(QuietHandler, directory=str(directory))
    httpd = ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()

    base_url = f"http://127.0.0.1:{port}"
    time.sleep(0.5)
    return httpd, base_url


def extract_json_ld(soup):
    """Parse every JSON-LD script block; keep the raw text when it is not valid JSON."""
    blocks = []
    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.string or script.get_text()
        if not raw:
            continue
        try:
            blocks.append(json.loads(raw))
        except Exception:
            blocks.append({"raw": raw})
    return blocks


def write_json(path, rows):
    path = Path(path)
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")


def write_csv(path, rows):
    """Write rows to CSV, JSON-encoding nested lists and dicts into single cells."""
    path = Path(path)
    if not rows:
        path.write_text("", encoding="utf-8")
        return

    flattened = []
    for row in rows:
        flat = {}
        for key, value in row.items():
            if isinstance(value, (list, dict)):
                flat[key] = json.dumps(value, ensure_ascii=False)
            else:
                flat[key] = value
        flattened.append(flat)

    # Rows may have different keys, so use the sorted union as the header.
    fieldnames = sorted({key for row in flattened for key in row.keys()})
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(flattened)

We expand the demo website by adding documentation pages, a blog article, a JavaScript-rendered catalog page, and an admin page intended to be excluded from crawling. We use these pages to test different crawling scenarios, including static HTML extraction, documentation parsing, blog metadata extraction, dynamic browser rendering, and crawl filtering. We also define a helper that serves the site from a local HTTP server on a free port, along with utilities that extract JSON-LD content and export crawl results to JSON and CSV.
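The effect of the demo site's robots.txt rules can be checked independently of any crawler. The sketch below uses Python's standard `urllib.robotparser` purely as an illustration; it is not how Crawlee evaluates robots.txt internally, and `build_robots_parser`, `is_allowed`, and the port in the sample URLs are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the demo site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def build_robots_parser(robots_text):
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True when the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_robots_parser(ROBOTS_TXT)
    for url in [
        "http://127.0.0.1:8000/index.html",
        "http://127.0.0.1:8000/docs/getting-started.html",
        "http://127.0.0.1:8000/admin/hidden.html",
    ]:
        print(url, "->", is_allowed(parser, url))
```

The admin page is therefore excluded twice in this tutorial: once by robots.txt and once by an explicit exclude pattern when links are enqueued.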

Static Crawling with BeautifulSoupCrawler and ParselCrawler

async def run_beautifulsoup_crawl(base_url):
    print("\n=== 1) BeautifulSoupCrawler: fast recursive HTTP crawl ===")
    rows = []

    crawler = BeautifulSoupCrawler(
        parser="html.parser",
        max_requests_per_crawl=30,
        max_request_retries=1,
        respect_robots_txt_file=True,
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=4,
            max_concurrency=6,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup
        url = context.request.url
        title = normalize_text(soup.title.get_text(" ", strip=True) if soup.title else "")

        meta_description = ""
        meta_tag = soup.find("meta", attrs={"name": "description"})
        if meta_tag:
            meta_description = normalize_text(meta_tag.get("content", ""))

        out_links = []
        for a in soup.select("a[href]"):
            href = a.get("href")
            label = normalize_text(a.get_text(" ", strip=True), 120)
            out_links.append({"href": href, "label": label})

        page_text = normalize_text(soup.get_text(" ", strip=True), 1000)

        # Classify the page by its URL path.
        if "/products/" in url:
            page_type = "product"
        elif "/docs/" in url:
            page_type = "documentation"
        elif "/blog/" in url:
            page_type = "blog"
        elif "/dynamic" in url:
            page_type = "dynamic-shell"
        else:
            page_type = "index"

        row = {
            "source": "beautifulsoup-http",
            "url": url,
            "title": title,
            "page_type": page_type,
            "meta_description": meta_description,
            "text_preview": page_text,
            "out_links": out_links,
            "json_ld": extract_json_ld(soup),
            "extracted_at_unix": time.time(),
        }

        if page_type == "product":
            article = soup.select_one("article.product")
            if article:
                title_node = soup.select_one(".product-title")
                price_node = soup.select_one(".price")
                row["product"] = {
                    "sku": article.get("data-sku"),
                    "category": article.get("data-category"),
                    "name": normalize_text(
                        title_node.get_text(" ", strip=True) if title_node else ""
                    ),
                    "price": money_to_float(price_node.get("data-price") if price_node else None),
                    "rating": float(article.get("data-rating")) if article.get("data-rating") else None,
                    "stock": int(article.get("data-stock")) if article.get("data-stock") else None,
                    "features": [
                        normalize_text(li.get_text(" ", strip=True))
                        for li in soup.select(".features li")
                    ],
                }

        if page_type == "documentation":
            row["doc"] = {
                "headings": [
                    normalize_text(h.get_text(" ", strip=True))
                    for h in soup.select("h2, h3")
                ],
                "code_blocks": [
                    normalize_text(code.get_text(" ", strip=True))
                    for code in soup.select("pre code")
                ],
            }

        if page_type == "blog":
            blog_post = soup.select_one(".blog-post")
            row["blog"] = {
                "author": blog_post.get("data-author") if blog_post else None,
                "reading_time": blog_post.get("data-reading-time") if blog_post else None,
                "tags": [
                    normalize_text(tag.get_text(" ", strip=True))
                    for tag in soup.select(".tag")
                ],
            }

        rows.append(row)
        await context.push_data(row)

        # Stay on the local site; skip the admin area and the JavaScript-only page.
        await context.enqueue_links(
            include=[Glob(f"{base_url}/**")],
            exclude=[
                Glob(f"{base_url}/admin/**"),
                Glob(f"{base_url}/dynamic.html"),
            ],
        )

    await crawler.run([f"{base_url}/index.html"])

    write_json(OUTPUT_DIR / "beautifulsoup_crawl.json", rows)
    write_csv(OUTPUT_DIR / "beautifulsoup_crawl.csv", rows)
    print(f"BeautifulSoup rows extracted: {len(rows)}")
    return rows


async def run_parsel_precision_crawl(base_url):
    print("\n=== 2) ParselCrawler: precise CSS/XPath extraction from product pages ===")
    rows = []

    product_urls = [
        f"{base_url}/products/product-{safe_slug(product['sku'])}.html"
        for product in PRODUCTS
    ]

    crawler = ParselCrawler(
        max_requests_per_crawl=len(product_urls),
        max_request_retries=1,
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=5,
            max_concurrency=8,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        selector = context.selector

        title = selector.css("title::text").get()
        sku = selector.css("article.product::attr(data-sku)").get()
        category = selector.css("article.product::attr(data-category)").get()
        rating = selector.css("article.product::attr(data-rating)").get()
        stock = selector.css("article.product::attr(data-stock)").get()
        name = selector.css(".product-title::text").get()
        price = selector.css(".price::attr(data-price)").get()
        features = [
            normalize_text(feature)
            for feature in selector.css(".features li::text").getall()
        ]

        row = {
            "source": "parsel-precision",
            "url": context.request.url,
            "title": normalize_text(title),
            "sku": sku,
            "name": normalize_text(name),
            "category": category,
            "price": money_to_float(price),
            "rating": float(rating) if rating else None,
            "stock": int(stock) if stock else None,
            "features": features,
            # The same title again via XPath, to show both selector styles side by side.
            "xpath_title": normalize_text(selector.xpath("//title/text()").get()),
        }

        rows.append(row)
        await context.push_data(row)

    await crawler.run(product_urls)

    write_json(OUTPUT_DIR / "parsel_products.json", rows)
    write_csv(OUTPUT_DIR / "parsel_products.csv", rows)
    print(f"Parsel product rows extracted: {len(rows)}")
    return rows

We implement the static crawling part of the workflow using BeautifulSoupCrawler and ParselCrawler. With BeautifulSoupCrawler, we recursively crawl the local website and extract page titles, metadata, text previews, outgoing links, product details, documentation headings, code blocks, and blog tags. With ParselCrawler, we perform more targeted CSS and XPath extraction from product pages to collect clean product-level fields, including SKU, category, price, rating, stock, and features.
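The page-type routing inside the static crawler can be sketched as a small standalone function. This is a hedged variation rather than the tutorial's exact logic: it matches on the parsed URL path prefix instead of a substring of the whole URL, and `PAGE_TYPE_PREFIXES` and `classify_page` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlparse

# Ordered (path prefix, page type) pairs; the first match wins.
PAGE_TYPE_PREFIXES = [
    ("/products/", "product"),
    ("/docs/", "documentation"),
    ("/blog/", "blog"),
    ("/dynamic", "dynamic-shell"),
]


def classify_page(url):
    """Map a URL to a page type based on its path, defaulting to 'index'."""
    path = urlparse(url).path
    for prefix, page_type in PAGE_TYPE_PREFIXES:
        if path.startswith(prefix):
            return page_type
    return "index"


if __name__ == "__main__":
    base = "http://127.0.0.1:8000"  # assumed port, for illustration only
    for suffix in [
        "/index.html",
        "/products/product-crw-101.html",
        "/docs/getting-started.html",
        "/blog/crawling-at-scale.html",
        "/dynamic.html",
    ]:
        print(suffix, "->", classify_page(base + suffix))
```

Matching on the path alone avoids misclassifying a page whose query string or host name happens to contain one of the markers.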

Dynamic Rendering with PlaywrightCrawler and Link Graphs

async def run_playwright_dynamic_crawl(base_url):
    print("\n=== 3) PlaywrightCrawler: browser-rendered JavaScript crawl ===")
    rows = []

    crawler = PlaywrightCrawler(
        max_requests_per_crawl=2,
        max_request_retries=1,
        headless=True,
        browser_type="chromium",
        browser_launch_options={
            "args": ["--no-sandbox", "--disable-dev-shm-usage"],
        },
        goto_options={
            "wait_until": "domcontentloaded",
        },
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=1,
            max_concurrency=2,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Block until the client-side script has rendered at least one card.
        await context.page.wait_for_selector(".js-card", timeout=10000)

        cards = await context.page.locator(".js-card").evaluate_all(
            """
            (cards) => cards.map((card) => {
              const h3 = card.querySelector("h3");
              const desc = card.querySelector(".desc");
              const price = card.querySelector(".price");
              return {
                sku: card.dataset.sku,
                name: h3 ? h3.textContent.trim() : null,
                description: desc ? desc.textContent.trim() : null,
                price_text: price ? price.textContent.trim() : null,
                price: Number(card.dataset.price),
                stock: Number(card.dataset.stock),
                rendered_text: card.innerText.trim()
              };
            })
            """
        )

        screenshot_bytes = await context.page.screenshot(full_page=True)
        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        screenshot_path.write_bytes(screenshot_bytes)

        try:
            kvs = await context.get_key_value_store()
            await kvs.set_value(
                key="dynamic-catalog-full-page",
                value=screenshot_bytes,
                content_type="image/png",
            )
        except Exception as exc:
            print("Key-value store screenshot save skipped:", repr(exc))

        page_rows = [
            {
                **card,
                "source": "playwright-rendered-js",
                "url": context.request.url,
                "screenshot_path": str(screenshot_path),
                "extracted_at_unix": time.time(),
            }
            for card in cards
        ]
        rows.extend(page_rows)

        # Push only this page's rows; pushing the shared list would re-send
        # earlier rows if the handler ran for more than one page.
        await context.push_data(page_rows)

    try:
        await crawler.run([f"{base_url}/dynamic.html"])
    except Exception as exc:
        print("Playwright section failed gracefully.")
        print("Reason:", repr(exc))

    write_json(OUTPUT_DIR / "playwright_dynamic.json", rows)
    write_csv(OUTPUT_DIR / "playwright_dynamic.csv", rows)
    print(f"Playwright dynamic rows extracted: {len(rows)}")
    return rows


def flatten_products(rows):
    """Normalize product records from all three crawlers into one flat schema."""
    products = []
    for row in rows:
        if row.get("page_type") == "product" and isinstance(row.get("product"), dict):
            product = row["product"]
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": product.get("sku"),
                    "name": product.get("name"),
                    "category": product.get("category"),
                    "price": product.get("price"),
                    "rating": product.get("rating"),
                    "stock": product.get("stock"),
                    "features": "; ".join(product.get("features", [])),
                }
            )
        elif row.get("source") == "parsel-precision":
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": row.get("sku"),
                    "name": row.get("name"),
                    "category": row.get("category"),
                    "price": row.get("price"),
                    "rating": row.get("rating"),
                    "stock": row.get("stock"),
                    "features": "; ".join(row.get("features", [])),
                }
            )
        elif row.get("source") == "playwright-rendered-js":
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": row.get("sku"),
                    "name": row.get("name"),
                    "category": "dynamic-js",
                    "price": row.get("price") or money_to_float(row.get("price_text")),
                    "rating": None,
                    "stock": row.get("stock"),
                    "features": row.get("description"),
                }
            )
    return products


def absolute_url(base_url, href):
    """Resolve an href against the site root (not against the linking page)."""
    if not href:
        return None
    if href.startswith("http://") or href.startswith("https://"):
        return href
    if href.startswith("/"):
        return base_url + href
    return base_url + "/" + href


def build_link_graph(base_url, rows):
    """Build a directed graph of page-to-page links, leaving out admin targets."""
    graph = nx.DiGraph()
    for row in rows:
        src = row.get("url")
        if not src:
            continue
        graph.add_node(
            src,
            title=row.get("title", ""),
            page_type=row.get("page_type", ""),
        )
        for link in row.get("out_links", []) or []:
            dst = absolute_url(base_url, link.get("href"))
            if not dst:
                continue
            if "/admin/" in dst:
                continue
            graph.add_node(dst)
            graph.add_edge(src, dst, label=link.get("label", ""))
    return graph

We handle dynamic content using PlaywrightCrawler, which opens the JavaScript-rendered page in a headless Chromium browser. We wait for client-side product cards to appear, extract their rendered fields, capture a full-page screenshot, and save the browser-based results for later analysis. We then define helper functions to normalize product records and build a directed link graph from the internal links discovered during crawling.
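The link-graph idea can be tried without networkx. The following is a minimal sketch that builds a plain adjacency mapping from crawl rows; unlike the tutorial's helper, it resolves each href against the page that contains it using `urllib.parse.urljoin`. The function name `build_adjacency`, the sample rows, and the port number are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse


def build_adjacency(rows, skip_prefix="/admin/"):
    """Map each crawled URL to the sorted list of URLs it links to.

    Links whose resolved path starts with skip_prefix are left out.
    """
    graph = {}
    for row in rows:
        source = row["url"]
        targets = graph.setdefault(source, set())
        for link in row.get("out_links", []):
            # Resolve relative hrefs against the linking page, not the site root.
            target = urljoin(source, link["href"])
            if urlparse(target).path.startswith(skip_prefix):
                continue
            targets.add(target)
    return {node: sorted(targets) for node, targets in graph.items()}


if __name__ == "__main__":
    base = "http://127.0.0.1:8000"  # assumed port
    sample_rows = [
        {
            "url": f"{base}/index.html",
            "out_links": [
                {"href": "/docs/getting-started.html"},
                {"href": "products/product-crw-101.html"},
                {"href": "/admin/hidden.html"},
            ],
        },
        {
            "url": f"{base}/docs/getting-started.html",
            "out_links": [{"href": "advanced-routing.html"}],
        },
    ]
    for node, targets in build_adjacency(sample_rows).items():
        print(node, "->", targets)
```

Resolving against the linking page matters for links such as `advanced-routing.html` inside `/docs/`, which would otherwise be attached to the site root.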

Building AI-Ready Outputs and Running the Pipeline

def make_rag_chunks(rows, max_chars=700):
    """Split each row's text into sentence-aligned chunks of at most ~max_chars."""
    chunks = []

    def make_chunk(row, text):
        return {
            "chunk_id": hashlib.sha1((row.get("url", "") + text).encode()).hexdigest()[:12],
            "url": row.get("url"),
            "source": row.get("source"),
            "page_type": row.get("page_type"),
            "title": row.get("title") or row.get("name"),
            "text": text,
        }

    for row in rows:
        text = (
            row.get("text_preview")
            or row.get("rendered_text")
            or row.get("description")
            or ""
        )
        text = normalize_text(text)
        if not text:
            continue

        # Split after sentence-ending punctuation, then pack sentences greedily.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        current = ""
        for sentence in sentences:
            if len(current) + len(sentence) + 1 <= max_chars:
                current = (current + " " + sentence).strip()
            else:
                if current:
                    chunks.append(make_chunk(row, current))
                current = sentence
        if current:
            chunks.append(make_chunk(row, current))

    return chunks


def analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows):
    all_rows = bs4_rows + parsel_rows + playwright_rows
    products = flatten_products(all_rows)

    crawl_df = pd.DataFrame(all_rows)
    product_df = pd.DataFrame(products)

    if not product_df.empty:
        product_df["price"] = pd.to_numeric(product_df["price"], errors="coerce")
        product_df["stock"] = pd.to_numeric(product_df["stock"], errors="coerce")
        product_df["rating"] = pd.to_numeric(product_df["rating"], errors="coerce")
        product_df["inventory_value"] = product_df["price"] * product_df["stock"]

    # Only the BeautifulSoup rows carry out_links, so the graph is built from them.
    graph = build_link_graph(base_url, bs4_rows)
    graph_path = OUTPUT_DIR / "site_link_graph.graphml"
    if graph.number_of_nodes() > 0:
        nx.write_graphml(graph, graph_path)

    chunks = make_rag_chunks(all_rows)
    rag_path = OUTPUT_DIR / "rag_chunks.jsonl"
    with rag_path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")

    crawl_json_path = OUTPUT_DIR / "combined_crawl_results.json"
    crawl_json_path.write_text(
        json.dumps(all_rows, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )

    product_csv_path = OUTPUT_DIR / "normalized_product_catalog.csv"
    if not product_df.empty:
        product_df.to_csv(product_csv_path, index=False)

    price_plot_path = OUTPUT_DIR / "product_price_chart.png"
    if not product_df.empty and product_df["price"].notna().any():
        plot_df = product_df.dropna(subset=["price"]).copy()
        plot_df["label"] = plot_df["sku"].fillna("unknown") + "\n" + plot_df["source"].fillna("")
        ax = plot_df.plot(
            kind="bar",
            x="label",
            y="price",
            legend=False,
            figsize=(11, 5),
            title="Extracted Product Prices by Source",
        )
        ax.set_xlabel("Product / extraction source")
        ax.set_ylabel("Price")
        plt.xticks(rotation=35, ha="right")
        plt.tight_layout()
        plt.savefig(price_plot_path, dpi=160)
        plt.show()

    graph_stats = {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "weakly_connected_components": (
            nx.number_weakly_connected_components(graph)
            if graph.number_of_nodes()
            else 0
        ),
    }
    if graph.number_of_nodes() > 0:
        in_degrees = dict(graph.in_degree())
        out_degrees = dict(graph.out_degree())
        graph_stats["top_in_degree"] = sorted(
            in_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )[:5]
        graph_stats["top_out_degree"] = sorted(
            out_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )[:5]

    summary = {
        "base_url": base_url,
        "rows_total": len(all_rows),
        "beautifulsoup_rows": len(bs4_rows),
        "parsel_rows": len(parsel_rows),
        "playwright_rows": len(playwright_rows),
        "products_total": len(product_df),
        "rag_chunks_total": len(chunks),
        "graph": graph_stats,
        "outputs": {
            "beautifulsoup_json": str(OUTPUT_DIR / "beautifulsoup_crawl.json"),
            "beautifulsoup_csv": str(OUTPUT_DIR / "beautifulsoup_crawl.csv"),
            "parsel_json": str(OUTPUT_DIR / "parsel_products.json"),
            "parsel_csv": str(OUTPUT_DIR / "parsel_products.csv"),
            "playwright_json": str(OUTPUT_DIR / "playwright_dynamic.json"),
            "playwright_csv": str(OUTPUT_DIR / "playwright_dynamic.csv"),
            "combined_json": str(crawl_json_path),
            "product_csv": str(product_csv_path) if product_csv_path.exists() else None,
            "rag_jsonl": str(rag_path),
            "graphml": str(graph_path) if graph_path.exists() else None,
            "price_plot": str(price_plot_path) if price_plot_path.exists() else None,
            "screenshots_dir": str(SCREENSHOT_DIR),
        },
    }

    summary_path = OUTPUT_DIR / "run_summary.md"
    summary_path.write_text(
        "# Crawlee Python Advanced Tutorial Run Summary\n\n"
        f"- Local demo site: `{base_url}`\n"
        f"- Total extracted rows: `{summary['rows_total']}`\n"
        f"- BeautifulSoup rows: `{summary['beautifulsoup_rows']}`\n"
        f"- Parsel rows: `{summary['parsel_rows']}`\n"
        f"- Playwright rows: `{summary['playwright_rows']}`\n"
        f"- Normalized products: `{summary['products_total']}`\n"
        f"- RAG chunks: `{summary['rag_chunks_total']}`\n"
        f"- Link graph nodes: `{graph_stats['nodes']}`\n"
        f"- Link graph edges: `{graph_stats['edges']}`\n\n"
        "## Output files\n\n"
        + "\n".join(f"- `{k}`: `{v}`" for k, v in summary["outputs"].items())
        + "\n",
        encoding="utf-8",
    )

    print("\n=== 4) Analysis summary ===")
    print(json.dumps(summary, indent=2, ensure_ascii=False))

    # Rich previews are optional; skip them outside a notebook.
    try:
        from IPython.display import display, Markdown, Image as IPImage

        display(Markdown("## Crawlee crawl preview"))
        if not crawl_df.empty:
            preview_cols = [
                col for col in ["source", "page_type", "title", "url"]
                if col in crawl_df.columns
            ]
            display(crawl_df[preview_cols].head(12))

        display(Markdown("## Normalized product catalog"))
        if not product_df.empty:
            display(product_df.head(20))

        if price_plot_path.exists():
            display(Markdown("## Product price chart"))
            display(IPImage(filename=str(price_plot_path)))

        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        if screenshot_path.exists():
            display(Markdown("## Playwright screenshot of JavaScript-rendered page"))
            display(IPImage(filename=str(screenshot_path)))

        display(Markdown(f"## Output directory\n`{OUTPUT_DIR}`"))
    except Exception as exc:
        print("Notebook display skipped:", repr(exc))

    return summary


async def main():
    httpd, base_url = start_local_server(SITE_DIR)
    print(f"\nLocal demo website is running at: {base_url}/index.html")
    try:
        bs4_rows = await run_beautifulsoup_crawl(base_url)
        parsel_rows = await run_parsel_precision_crawl(base_url)
        playwright_rows = await run_playwright_dynamic_crawl(base_url)
        summary = analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows)
        return summary
    finally:
        httpd.shutdown()
        print("\nLocal demo server shut down.")


# nest_asyncio (applied during setup) lets this run inside the notebook's running loop.
loop = asyncio.get_event_loop()
summary = loop.run_until_complete(main())

print("\nTutorial complete.")
print(f"All outputs are in: {OUTPUT_DIR}")
print("Key files:")
for file_path in sorted(OUTPUT_DIR.rglob("*")):
    if file_path.is_file():
        print(" -", file_path)

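    In this final step, we turn the extracted crawl data into analysis-ready and AI-ready outputs. We coerce the product price, stock, and rating fields to numeric values, derive an inventory value per product, and write the normalized catalog to CSV. We also export the link graph as GraphML, write the RAG chunks as JSONL (one JSON object per line), save the combined crawl results as JSON, and plot the product prices with Matplotlib. Finally, we run the full pipeline end to end against the local demo server, display previews in the notebook, write a Markdown run summary, and print the paths of all generated files.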
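To make the JSONL chunk export more concrete, here is a minimal sketch of how crawl rows could be split into overlapping text chunks and serialized one JSON object per line. It is not the tutorial's `make_rag_chunks` implementation: the helpers `chunk_text`, `rows_to_chunk_records`, and `to_jsonl`, the `max_chars` and `overlap` parameters, and the `url` / `title` / `text` row fields are all assumptions made for illustration.

```python
import hashlib
import json


def chunk_text(text, max_chars=200, overlap=40):
    """Split text into overlapping character windows (hypothetical helper)."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    text = " ".join(text.split())  # collapse runs of whitespace
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += max_chars - overlap
    return chunks


def rows_to_chunk_records(rows, max_chars=200, overlap=40):
    """Turn crawl rows into JSON-serializable chunk records."""
    records = []
    for row in rows:
        url = row.get("url", "")  # assumed field name
        pieces = chunk_text(row.get("text", ""), max_chars, overlap)
        for index, piece in enumerate(pieces):
            # Derive a stable id from the URL and chunk position.
            digest = hashlib.sha1(f"{url}#{index}".encode("utf-8")).hexdigest()
            records.append({
                "id": digest[:16],
                "url": url,
                "title": row.get("title"),
                "chunk_index": index,
                "text": piece,
            })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Illustrative input only; real rows would come from the crawlers.
sample_rows = [
    {"url": "http://127.0.0.1:8000/docs.html", "title": "Docs", "text": "word " * 100},
]
records = rows_to_chunk_records(sample_rows)
jsonl_text = to_jsonl(records)
```

Character windows keep the sketch short. A real pipeline might instead split on sentences or tokens, depending on the embedding model used downstream.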

    Conclusion

    We now have a complete Crawlee-based crawling and data engineering pipeline that converts a small website into structured, reusable datasets. We used crawl scoping, robots.txt handling, concurrency settings, link enqueuing, browser rendering, key-value storage, and dataset exports to simulate patterns used in production web crawling systems. We normalized the extracted product data, saved the crawl outputs as JSON and CSV, exported the site's link graph as GraphML with NetworkX, generated JSONL chunks for retrieval-augmented generation workflows, and visualized the extracted product prices with Matplotlib.
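As a small illustration of what robots.txt handling means in practice, the sketch below uses Python's standard-library `urllib.robotparser` to filter candidate URLs before they would be fetched. It does not show how Crawlee applies robots rules internally. The rules in `ROBOTS_TXT`, the helper `build_robots_checker`, and the example URLs are hypothetical and may differ from the demo site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_text, user_agent="*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    def is_allowed(url):
        return parser.can_fetch(user_agent, url)

    return is_allowed


is_allowed = build_robots_checker(ROBOTS_TXT)

candidate_urls = [
    "http://127.0.0.1:8000/index.html",
    "http://127.0.0.1:8000/private/secret.html",
]
# Keep only the URLs that the rules permit for this user agent.
allowed = [url for url in candidate_urls if is_allowed(url)]
```

Parsing the rules once and reusing the returned checker avoids re-reading robots.txt for every URL in the crawl queue.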



    Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.