A Coding Implementation of Crawl4AI for Web Crawling, Markdown Generation, JavaScript Execution, and LLM-Based Structured Extraction


Source: MarkTechPost

In this tutorial, we build a complete and practical Crawl4AI workflow and explore how modern web crawling goes far beyond simply downloading page HTML. We set up the full environment, configure browser behavior, and work through essential capabilities such as basic crawling, markdown generation, structured CSS-based extraction, JavaScript execution, session handling, screenshots, link analysis, concurrent crawling, and deep multi-page exploration. We also examine how Crawl4AI can be extended with LLM-based extraction to transform raw web content into structured, usable data. Throughout the tutorial, we focus on hands-on implementation to understand the major features of Crawl4AI v0.8.x and learn how to apply them to realistic data extraction and web automation tasks.

import subprocess
import sys

print("πŸ“¦ Installing system dependencies...")
subprocess.run(['apt-get', 'update', '-qq'], capture_output=True)
subprocess.run(['apt-get', 'install', '-y', '-qq',
               'libnss3', 'libnspr4', 'libatk1.0-0', 'libatk-bridge2.0-0',
               'libcups2', 'libdrm2', 'libxkbcommon0', 'libxcomposite1',
               'libxdamage1', 'libxfixes3', 'libxrandr2', 'libgbm1',
               'libasound2', 'libpango-1.0-0', 'libcairo2'], capture_output=True)
print("βœ… System dependencies installed!")

print("\nπŸ“¦ Installing Python packages...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '-U', 'crawl4ai', 'nest_asyncio', 'pydantic', '-q'])
print("βœ… Python packages installed!")

print("\nπŸ“¦ Installing Playwright browsers (this may take a minute)...")
subprocess.run([sys.executable, '-m', 'playwright', 'install', 'chromium'], capture_output=True)
subprocess.run([sys.executable, '-m', 'playwright', 'install-deps', 'chromium'], capture_output=True)
print("βœ… Playwright browsers installed!")

import nest_asyncio
nest_asyncio.apply()

import asyncio
import json
from typing import List, Optional
from pydantic import BaseModel, Field

print("\n" + "="*60)
print("βœ… INSTALLATION COMPLETE! Ready to crawl!")
print("="*60)

print("\n" + "="*60)
print("πŸ“– PART 2: BASIC CRAWLING")
print("="*60)

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def basic_crawl():
    """The simplest possible crawl - fetch a webpage and get markdown."""
    print("\nπŸ” Running basic crawl on example.com...")

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")

        print(f"\nβœ… Crawl successful: {result.success}")
        print(f"πŸ“„ Title: {result.metadata.get('title', 'N/A')}")
        print(f"πŸ“ Markdown length: {len(result.markdown.raw_markdown)} characters")
        print("\n--- First 500 chars of markdown ---")
        print(result.markdown.raw_markdown[:500])

    return result

result = asyncio.run(basic_crawl())

print("\n" + "="*60)
print("βš™οΈ PART 3: CONFIGURED CRAWLING")
print("="*60)

async def configured_crawl():
    """Crawling with custom browser and crawler configurations."""
    print("\nπŸ”§ Running configured crawl with custom settings...")

    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        viewport_width=1920,
        viewport_height=1080,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    )

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10,
        page_timeout=30000,
        wait_until="networkidle",
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://httpbin.org/html",
            config=run_config
        )

        print(f"\nβœ… Success: {result.success}")
        print(f"πŸ“Š Status code: {result.status_code}")
        print("\n--- Content Preview ---")
        print(result.markdown.raw_markdown[:400])

    return result

result = asyncio.run(configured_crawl())

print("\n" + "="*60)
print("πŸ“ PART 4: MARKDOWN GENERATION")
print("="*60)

from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def markdown_generation_demo():
    """Demonstrates raw vs fit markdown with content filtering."""
    print("\n🎯 Demonstrating markdown generation strategies...")

    browser_config = BrowserConfig(headless=True, verbose=False)

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.4,
                threshold_type="fixed",
                min_word_threshold=20
            )
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )

        raw_len = len(result.markdown.raw_markdown)
        fit_len = len(result.markdown.fit_markdown) if result.markdown.fit_markdown else 0

        print("\nπŸ“Š Markdown Comparison:")
        print(f"   Raw Markdown:  {raw_len:,} characters")
        print(f"   Fit Markdown:  {fit_len:,} characters")
        print(f"   Reduction:     {((raw_len - fit_len) / raw_len * 100):.1f}%")

        print("\n--- Fit Markdown Preview (first 600 chars) ---")
        print(result.markdown.fit_markdown[:600] if result.markdown.fit_markdown else "N/A")

    return result

result = asyncio.run(markdown_generation_demo())

We prepare the complete Google Colab environment required to run Crawl4AI smoothly, including system packages, Python dependencies, and the Playwright browser setup. We initialize the async-friendly notebook workflow with nest_asyncio, import the core libraries, and confirm that the environment is ready for crawling tasks. We then begin with foundational examples: a simple crawl, followed by a more configurable crawl that demonstrates how browser settings and runtime options affect page retrieval.

print("\n" + "="*60)
print("πŸ”Ž PART 5: BM25 QUERY-BASED FILTERING")
print("="*60)

async def bm25_filtering_demo():
    """Using BM25 algorithm to extract content relevant to a specific query."""
    print("\n🎯 Extracting content relevant to a specific query...")

    query = "legal aspects privacy data protection"

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=BM25ContentFilter(
                user_query=query,
                bm25_threshold=1.2
            )
        )
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )

        print(f"\nπŸ“ Query: '{query}'")
        print(f"πŸ“Š Fit markdown length: {len(result.markdown.fit_markdown or '')} chars")
        print("\n--- Query-Relevant Content Preview ---")
        print(result.markdown.fit_markdown[:800] if result.markdown.fit_markdown else "No relevant content found")

    return result

result = asyncio.run(bm25_filtering_demo())

print("\n" + "="*60)
print("πŸ—οΈ PART 6: CSS-BASED EXTRACTION (No LLM)")
print("="*60)

from crawl4ai import JsonCssExtractionStrategy

async def css_extraction_demo():
    """Extract structured data using CSS selectors - fast and reliable."""
    print("\nπŸ”§ Extracting data using CSS selectors...")

    schema = {
        "name": "Wikipedia Headings",
        "baseSelector": "div.mw-parser-output h2",
        "fields": [
            {
                "name": "heading_text",
                "selector": "span.mw-headline",
                "type": "text"
            },
            {
                "name": "heading_id",
                "selector": "span.mw-headline",
                "type": "attribute",
                "attribute": "id"
            }
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Python_(programming_language)",
            config=run_config
        )

        if result.extracted_content:
            data = json.loads(result.extracted_content)
            print(f"\nβœ… Extracted {len(data)} section headings")
            print("\n--- Extracted Headings ---")
            for item in data[:10]:
                heading = item.get('heading_text', 'N/A')
                heading_id = item.get('heading_id', 'N/A')
                if heading:
                    print(f"  β€’ {heading} (#{heading_id})")
        else:
            print("❌ No data extracted")

    return result

result = asyncio.run(css_extraction_demo())

print("\n" + "="*60)
print("πŸ›’ PART 7: ADVANCED CSS EXTRACTION - Hacker News")
print("="*60)

async def advanced_css_extraction():
    """Extract stories from Hacker News with nested selectors."""
    print("\nπŸ›οΈ Extracting stories from Hacker News...")

    schema = {
        "name": "Hacker News Stories",
        "baseSelector": "tr.athing",
        "fields": [
            {
                "name": "rank",
                "selector": "span.rank",
                "type": "text"
            },
            {
                "name": "title",
                "selector": "span.titleline > a",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "span.titleline > a",
                "type": "attribute",
                "attribute": "href"
            },
            {
                "name": "site",
                "selector": "span.sitestr",
                "type": "text"
            }
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema)

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )

        if result.extracted_content:
            stories = json.loads(result.extracted_content)
            print(f"\nβœ… Extracted {len(stories)} stories from Hacker News")
            print("\n--- Top 10 Stories ---")
            for story in stories[:10]:
                rank = story.get('rank', '?').strip('.') if story.get('rank') else '?'
                title = story.get('title', 'N/A')[:55]
                site = story.get('site', 'N/A')
                print(f"  #{rank:<3} {title:<55} ({site})")

    return result

result = asyncio.run(advanced_css_extraction())

We focus on improving the quality and relevance of extracted content by exploring markdown generation and query-aware filtering. We compare raw markdown with fit markdown to see how pruning reduces noise, and we use BM25-based filtering to keep only the parts of a page that align with a specific query. We then move into CSS-based extraction, where we define a structured schema and use selectors to pull clean heading data from a Wikipedia page without relying on an LLM.
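To make the "base selector plus fields" idea concrete without launching a browser, here is a rough stdlib-only stand-in (using Python's `html.parser` rather than Crawl4AI's `JsonCssExtractionStrategy`) that pulls the same two fields from static HTML. The `HeadlineParser` class and the sample HTML are illustrative inventions, not part of the tutorial's code:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collects {'heading_text', 'heading_id'} records from
    <span class="mw-headline" id="..."> elements, mirroring the two
    fields in the CSS extraction schema above."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._in_headline = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "mw-headline" in (attrs.get("class") or ""):
            self._in_headline = True
            self.records.append({"heading_text": "", "heading_id": attrs.get("id", "")})

    def handle_data(self, data):
        if self._in_headline:
            self.records[-1]["heading_text"] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_headline = False

# Illustrative snippet of Wikipedia-style markup:
html = ('<h2><span class="mw-headline" id="History">History</span></h2>'
        '<h2><span class="mw-headline" id="Techniques">Techniques</span></h2>')
parser = HeadlineParser()
parser.feed(html)
print(parser.records)
# β†’ [{'heading_text': 'History', 'heading_id': 'History'},
#    {'heading_text': 'Techniques', 'heading_id': 'Techniques'}]
```

The strategy in the tutorial does this declaratively from a JSON schema and handles rendered pages; the sketch only shows the shape of the output.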

print("\n" + "="*60)
print("⚑ PART 8: JAVASCRIPT EXECUTION")
print("="*60)

async def javascript_execution_demo():
    """Execute JavaScript on pages before extraction."""
    print("\n🎭 Executing JavaScript before crawling...")

    js_code = """
    // Scroll down to trigger lazy loading
    window.scrollTo(0, document.body.scrollHeight);

    // Wait for content to load
    await new Promise(r => setTimeout(r, 1000));

    // Scroll back up
    window.scrollTo(0, 0);

    // Add a marker to verify JS ran
    document.body.setAttribute('data-crawl4ai', 'executed');
    """

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_code],
        wait_for="css:body",
        delay_before_return_html=1.0
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://httpbin.org/html",
            config=run_config
        )

        print("\nβœ… Page crawled with JS execution")
        print(f"πŸ“Š Status: {result.status_code}")
        print(f"πŸ“ Content length: {len(result.markdown.raw_markdown)} chars")

    return result

result = asyncio.run(javascript_execution_demo())

print("\n" + "="*60)
print("πŸ€– PART 9: LLM-BASED EXTRACTION")
print("="*60)

from crawl4ai import LLMExtractionStrategy, LLMConfig

class Article(BaseModel):
    title: str = Field(description="The article title")
    summary: str = Field(description="A brief summary")
    topics: List[str] = Field(description="Main topics covered")

async def llm_extraction_demo():
    """Use LLM to intelligently extract and structure data."""
    print("\nπŸ€– LLM-based extraction setup...")

    import os
    api_key = os.getenv('OPENAI_API_KEY')

    if not api_key:
        print("\n⚠️ No OPENAI_API_KEY found. Showing setup code only.")
        print("\nTo enable LLM extraction, run:")
        print("   import os")
        print("   os.environ['OPENAI_API_KEY'] = 'sk-your-key-here'")
        print("\n--- Example Code ---")
        example_code = '''
from crawl4ai import LLMExtractionStrategy, LLMConfig
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",  # or "ollama/llama3"
        api_token=os.getenv('OPENAI_API_KEY')
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction="Extract all products with prices."
)

run_config = CrawlerRunConfig(
    extraction_strategy=llm_strategy,
    cache_mode=CacheMode.BYPASS
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=run_config)
    products = json.loads(result.extracted_content)
'''
        print(example_code)
        return None

    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token=api_key
        ),
        schema=Article.model_json_schema(),
        extraction_type="schema",
        instruction="Extract article titles and summaries."
    )

    run_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )

        if result.extracted_content:
            data = json.loads(result.extracted_content)
            print("\nβœ… LLM extracted:")
            print(json.dumps(data, indent=2)[:1000])

    return result

result = asyncio.run(llm_extraction_demo())

We continue structured extraction by applying nested CSS selectors to collect ranked story information from Hacker News in a clean JSON-like format. We then demonstrate JavaScript execution before extraction, which helps us interact with dynamic pages by scrolling, waiting for content, and modifying the DOM before processing. Finally, we introduce LLM-based extraction, define a schema with Pydantic, and show how Crawl4AI can convert unstructured web content into structured outputs using a language model.
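Because LLM output can drift from the requested schema, it is worth normalizing `result.extracted_content` before trusting it. Here is a minimal stdlib-only sketch; the `parse_articles` helper and the sample payload are illustrative, and the required keys are simply the fields of the `Article` model defined above:

```python
import json

REQUIRED_KEYS = {"title", "summary", "topics"}  # fields of the Article model above

def parse_articles(extracted_content: str) -> list:
    """Normalize extracted JSON into a list of dicts and drop records
    that are missing any of the schema's required fields."""
    data = json.loads(extracted_content)
    if isinstance(data, dict):  # a single object instead of a list
        data = [data]
    return [rec for rec in data if REQUIRED_KEYS <= rec.keys()]

# Illustrative payload of the shape the strategy is asked to return:
payload = json.dumps([
    {"title": "Show HN: Crawl4AI", "summary": "An open-source crawler.",
     "topics": ["crawling", "LLMs"]},
    {"title": "Incomplete record"},  # filtered out: no summary/topics
])
print(parse_articles(payload))  # keeps only the complete first record
```

A validation pass like this (or `Article.model_validate` with Pydantic) keeps malformed LLM responses from propagating into downstream processing.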

print("\n" + "="*60)
print("πŸ•ΈοΈ PART 10: DEEP CRAWLING")
print("="*60)

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter, DomainFilter

async def deep_crawl_demo():
    """Crawl multiple pages starting from a seed URL using BFS."""
    print("\nπŸ•·οΈ Starting deep crawl with BFS strategy...")

    filter_chain = FilterChain([
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=[]
        ),
        URLPatternFilter(
            patterns=["*quickstart*", "*installation*", "*examples*"]
        )
    ])

    deep_crawl_strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=5,
        filter_chain=filter_chain,
        include_external=False
    )

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=deep_crawl_strategy
    )

    pages_crawled = []

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=run_config
        )

        if isinstance(results, list):
            for result in results:
                pages_crawled.append(result.url)
                print(f"  βœ… Crawled: {result.url}")
                print(f"     πŸ“„ Content: {len(result.markdown.raw_markdown)} chars")
        else:
            pages_crawled.append(results.url)
            print(f"  βœ… Crawled: {results.url}")
            print(f"     πŸ“„ Content: {len(results.markdown.raw_markdown)} chars")

    print(f"\nπŸ“Š Total pages crawled: {len(pages_crawled)}")
    return pages_crawled

pages = asyncio.run(deep_crawl_demo())

print("\n" + "="*60)
print("πŸš€ PART 11: MULTI-URL CONCURRENT CRAWLING")
print("="*60)

async def multi_url_crawl():
    """Crawl multiple URLs concurrently for maximum efficiency."""
    print("\n⚑ Crawling multiple URLs concurrently...")

    urls = [
        "https://httpbin.org/html",
        "https://httpbin.org/robots.txt",
        "https://httpbin.org/json",
        "https://example.com",
        "https://httpbin.org/headers"
    ]

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        verbose=False
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=run_config
        )

        print("\nπŸ“Š Results Summary:")
        print(f"{'URL':<40} {'Status':<10} {'Content':<15}")
        print("-" * 65)

        for result in results:
            url_short = result.url[:38] + ".." if len(result.url) > 40 else result.url
            status = "βœ…" if result.success else "❌"
            content_len = f"{len(result.markdown.raw_markdown):,} chars" if result.success else "N/A"
            print(f"{url_short:<40} {status:<10} {content_len:<15}")

    return results

results = asyncio.run(multi_url_crawl())

print("\n" + "="*60)
print("πŸ“Έ PART 12: SCREENSHOTS & MEDIA")
print("="*60)

async def screenshot_demo():
    """Capture screenshots and extract media from pages."""
    print("\nπŸ“· Capturing screenshot and extracting media...")

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=False,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )

        print("\nβœ… Crawl complete!")
        print(f"πŸ“Έ Screenshot captured: {result.screenshot is not None}")

        if result.screenshot:
            print(f"   Screenshot size: {len(result.screenshot)} bytes (base64)")

        if result.media and 'images' in result.media:
            images = result.media['images']
            print(f"\nπŸ–ΌοΈ Found {len(images)} images:")
            for img in images[:5]:
                print(f"   β€’ {img.get('src', 'N/A')[:60]}...")

    return result

result = asyncio.run(screenshot_demo())

We expand from single-page crawling to deeper and broader workflows by introducing BFS-based deep crawling across multiple related pages. We configure a filter chain to control which domains and URL patterns are allowed, making the crawl targeted and efficient rather than uncontrolled. We also demonstrate concurrent multi-URL crawling and screenshot/media extraction, showing how Crawl4AI can scale across several pages while also collecting visual and embedded content.

print("\n" + "="*60)
print("πŸ”— PART 13: LINK EXTRACTION")
print("="*60)

async def link_extraction_demo():
    """Extract and analyze all links from a page."""
    print("\nπŸ”— Extracting and analyzing links...")

    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=run_config
        )

        internal_links = result.links.get('internal', [])
        external_links = result.links.get('external', [])

        print("\nπŸ“Š Link Analysis:")
        print(f"   Internal links: {len(internal_links)}")
        print(f"   External links: {len(external_links)}")

        print("\n--- Sample Internal Links (first 5) ---")
        for link in internal_links[:5]:
            print(f"   β€’ {link.get('href', 'N/A')[:60]}")

        print("\n--- Sample External Links (first 5) ---")
        for link in external_links[:5]:
            print(f"   β€’ {link.get('href', 'N/A')[:60]}")

    return result

result = asyncio.run(link_extraction_demo())

print("\n" + "="*60)
print("🎯 PART 14: CONTENT SELECTION")
print("="*60)

async def content_selection_demo():
    """Target specific content using CSS selectors."""
    print("\n🎯 Targeting specific content with CSS selectors...")

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        css_selector="article, main, .content, #content, #mw-content-text",
        excluded_tags=["nav", "footer", "header", "aside"],
        remove_overlay_elements=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )

        print("\nβœ… Content extracted with targeting")
        print(f"πŸ“ Markdown length: {len(result.markdown.raw_markdown):,} chars")
        print("\n--- Preview (first 500 chars) ---")
        print(result.markdown.raw_markdown[:500])

    return result

result = asyncio.run(content_selection_demo())

print("\n" + "="*60)
print("πŸ” PART 15: SESSION MANAGEMENT")
print("="*60)

async def session_management_demo():
    """Maintain browser sessions across multiple requests."""
    print("\nπŸ” Demonstrating session management...")

    browser_config = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        session_id = "my_session"

        result1 = await crawler.arun(
            url="https://httpbin.org/cookies/set?session=demo123",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id
            )
        )
        print(f"  Step 1: Set cookies - Success: {result1.success}")

        result2 = await crawler.arun(
            url="https://httpbin.org/cookies",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id
            )
        )
        print(f"  Step 2: Read cookies - Success: {result2.success}")
        print("\nπŸ“ Cookie Response:")
        print(result2.markdown.raw_markdown[:300])

    return result2

result = asyncio.run(session_management_demo())

We analyze the structure and navigability of a site by extracting both internal and external links from a page and summarizing them for inspection. We then demonstrate content targeting with CSS selectors and excluded tags, focusing extraction on the most meaningful sections of a page while avoiding navigation or layout noise. After that, we show session management, where we preserve browser state across requests and verify that cookies persist between sequential crawls.
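One natural next step after link extraction is aggregating the links by domain to see where a page points. The sketch below assumes the `{'internal': [...], 'external': [...]}` dict shape shown in the link-extraction demo, where each entry carries an `href` key; the `domain_counts` helper and the sample data are illustrative additions, not part of the tutorial:

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(links: dict) -> Counter:
    """Count how many extracted links point at each domain, given a
    links dict of the {'internal': [...], 'external': [...]} shape
    where each entry has an 'href' key."""
    counts = Counter()
    for group in links.values():
        for link in group:
            host = urlparse(link.get("href", "")).netloc
            if host:
                counts[host] += 1
    return counts

# Illustrative links dict (a real one would come from result.links):
sample = {
    "internal": [{"href": "https://docs.crawl4ai.com/quickstart"}],
    "external": [{"href": "https://github.com/unclecode/crawl4ai"},
                 {"href": "https://github.com/unclecode"}],
}
print(domain_counts(sample).most_common())
# β†’ [('github.com', 2), ('docs.crawl4ai.com', 1)]
```

Sorting domains by frequency like this is a quick way to spot which external sites a crawl target depends on most.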

print("\n" + "="*60)
print("🌟 PART 16: COMPLETE REAL-WORLD EXAMPLE")
print("="*60)

async def complete_example():
    """Complete example combining CSS extraction with content filtering."""
    print("\n🌟 Running complete example: Hacker News scraper with filtering")

    schema = {
        "name": "HN Stories",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "rank", "selector": "span.rank", "type": "text"},
            {"name": "title", "selector": "span.titleline > a", "type": "text"},
            {"name": "url", "selector": "span.titleline > a", "type": "attribute", "attribute": "href"},
            {"name": "site", "selector": "span.sitestr", "type": "text"}
        ]
    }

    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080
    )

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.4)
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )

        if result.extracted_content:
            stories = json.loads(result.extracted_content)

            print(f"\nβœ… Successfully extracted {len(stories)} stories!")
            print(f"\n{'='*70}")
            print("πŸ“° TOP HACKER NEWS STORIES")
            print("="*70)

            for story in stories[:15]:
                rank = story.get('rank', '?').strip('.') if story.get('rank') else '?'
                title = story.get('title', 'No title')[:50]
                site = story.get('site', 'N/A')
                print(f"  #{rank:<3} {title:<50} ({site})")

            print("="*70)

            return stories

    return []

stories = asyncio.run(complete_example())

print("\n" + "="*60)
print("πŸ’Ύ BONUS: SAVING RESULTS")
print("="*60)

if stories:
    with open('hacker_news_stories.json', 'w') as f:
        json.dump(stories, f, indent=2)
    print(f"βœ… Saved {len(stories)} stories to 'hacker_news_stories.json'")
    print("\nTo download in Colab:")
    print("   from google.colab import files")
    print("   files.download('hacker_news_stories.json')")

print("\n" + "="*60)
print("πŸ“š TUTORIAL COMPLETE!")
print("="*60)

print("""
βœ… What you learned:

1. Basic crawling with AsyncWebCrawler
2. Browser & crawler configuration
3. Markdown generation (raw vs fit)
4. BM25 query-based content filtering
5. CSS-based structured data extraction
6. Advanced CSS extraction (Hacker News)
7. JavaScript execution for dynamic content
8. LLM-based extraction setup
9. Deep crawling with BFS strategy
10. Multi-URL concurrent crawling
11. Screenshots & media extraction
12. Link extraction & analysis
13. Content targeting with CSS selectors
14. Session management
15. Complete real-world scraping example

πŸ“– RESOURCES:
 β€’ Docs: https://docs.crawl4ai.com/
 β€’ GitHub: https://github.com/unclecode/crawl4ai
 β€’ Discord: https://discord.gg/jP8KfhDhyN

πŸš€ Happy Crawling with Crawl4AI!
""")

We combine several ideas from the tutorial into a complete real-world example that extracts and filters Hacker News stories using structured CSS extraction and Markdown pruning. We format the results into a readable output, demonstrating how Crawl4AI can support a practical scraping workflow from collection to presentation. Finally, we save the extracted stories to a JSON file and close the tutorial with a clear summary of the major concepts and capabilities we have implemented throughout the notebook.
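If JSON is not the final destination, the same extracted records can be rendered as CSV with the standard library alone. This is a minimal sketch; the `stories_to_csv` helper and sample row are illustrative, and the field names are simply the keys defined in the Hacker News schema above:

```python
import csv
import io

FIELDS = ["rank", "title", "url", "site"]  # keys from the HN schema above

def stories_to_csv(stories: list) -> str:
    """Render extracted story dicts as CSV text; extra keys are
    ignored and missing ones are left blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for story in stories:
        writer.writerow({k: story.get(k, "") for k in FIELDS})
    return buf.getvalue()

# Illustrative record of the shape the extraction produces:
sample = [{"rank": "1.", "title": "Example story",
           "url": "https://example.com", "site": "example.com"}]
print(stories_to_csv(sample))
```

`csv.DictWriter` also handles quoting automatically, so titles containing commas survive the round trip, which naive string joining would not guarantee.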

In conclusion, we developed a strong end-to-end understanding of how to use Crawl4AI for both simple and advanced crawling tasks. We moved from straightforward page extraction to more refined workflows involving content filtering, targeted element selection, structured data extraction, dynamic-page interaction, multi-URL concurrency, and deep crawling across linked pages. We also saw how the framework supports richer automation through media capture, persistent sessions, and optional LLM-powered schema extraction. As a result, we finished with a practical foundation for building reliable, efficient, and flexible scraping and crawling pipelines that are ready to support real-world research, monitoring, and intelligent data processing workflows.


Check out the full implementation code for this tutorial.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.