How to Design Production-Grade Mock Data Pipelines Using Polyfactory with Dataclasses, Pydantic, Attrs, and Nested Models

Source: MarkTechPost

In this tutorial, we walk through an advanced, end-to-end exploration of Polyfactory, focusing on how we can generate rich, realistic mock data directly from Python type hints. We start by setting up the environment and progressively build factories for dataclasses, Pydantic models, and attrs-based classes, while demonstrating customization, overrides, calculated fields, and the generation of nested objects. As we move through each snippet, we show how we can control randomness, enforce constraints, and model real-world structures, making this tutorial directly applicable to testing, prototyping, and data-driven development workflows.

import subprocess
import sys


def install_package(package):
    # Install into the interpreter that is running this script.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])


packages = [
    "polyfactory",
    "pydantic",
    "email-validator",
    "faker",
    "msgspec",
    "attrs",
]

for package in packages:
    try:
        install_package(package)
        print(f"✓ Installed {package}")
    except Exception as e:
        print(f"✗ Failed to install {package}: {e}")

print("\n")

print("=" * 80)
print("SECTION 2: Basic Dataclass Factories")
print("=" * 80)

from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, date
from uuid import UUID
from polyfactory.factories import DataclassFactory


@dataclass
class Address:
    street: str
    city: str
    country: str
    zip_code: str


@dataclass
class Person:
    id: UUID
    name: str
    email: str
    age: int
    birth_date: date
    is_active: bool
    address: Address
    phone_numbers: List[str]
    bio: Optional[str] = None


class PersonFactory(DataclassFactory[Person]):
    # No overrides: every field is generated from its type hint.
    pass


person = PersonFactory.build()
print("Generated Person:")
print(f"  ID: {person.id}")
print(f"  Name: {person.name}")
print(f"  Email: {person.email}")
print(f"  Age: {person.age}")
print(f"  Address: {person.address.city}, {person.address.country}")
print(f"  Phone Numbers: {person.phone_numbers[:2]}")
print()

people = PersonFactory.batch(5)
print(f"Generated {len(people)} people:")
for i, p in enumerate(people, 1):
    print(f"  {i}. {p.name} - {p.email}")
print("\n")

We set up the environment and ensure all required dependencies are installed. We also introduce the core idea of using Polyfactory to generate mock data from type hints. By initializing the basic dataclass factories, we establish the foundation for all subsequent examples.
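To make the idea of generating data from type hints concrete, here is a toy sketch that uses only the standard library. It is not how Polyfactory is implemented; the `toy_build` helper and the trimmed-down `Person` and `Address` classes are hypothetical and exist only for illustration. It shows the basic mechanism: inspect each field's hint, pick a generator for that type, and recurse into nested dataclasses.

```python
import random
from dataclasses import dataclass, fields, is_dataclass
from typing import get_type_hints


@dataclass
class Address:
    city: str
    zip_code: str


@dataclass
class Person:
    name: str
    age: int
    is_active: bool
    address: Address


def toy_build(model, rng):
    """Fill every field of a dataclass by looking only at its type hint."""
    hints = get_type_hints(model)
    values = {}
    for f in fields(model):
        hint = hints[f.name]
        if is_dataclass(hint):
            # Nested dataclass: build it recursively.
            values[f.name] = toy_build(hint, rng)
        elif hint is bool:
            values[f.name] = rng.random() < 0.5
        elif hint is int:
            values[f.name] = rng.randint(0, 100)
        elif hint is str:
            values[f.name] = "".join(rng.choices("abcdefghijklmnopqrstuvwxyz", k=8))
        else:
            raise TypeError(f"unsupported hint: {hint!r}")
    return model(**values)


person = toy_build(Person, random.Random(0))
print(person)
```

A real library also has to handle optionals, collections, unions, constraints, and many more types, which is the part Polyfactory takes off our hands.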

print("=" * 80)
print("SECTION 3: Customizing Factory Behavior")
print("=" * 80)

from faker import Faker
from polyfactory.fields import Use, Ignore


@dataclass
class Employee:
    employee_id: str
    full_name: str
    department: str
    salary: float
    hire_date: date
    is_manager: bool
    email: str
    internal_notes: Optional[str] = None


class EmployeeFactory(DataclassFactory[Employee]):
    __faker__ = Faker(locale="en_US")
    __random_seed__ = 42

    @classmethod
    def employee_id(cls) -> str:
        return f"EMP-{cls.__random__.randint(10000, 99999)}"

    @classmethod
    def full_name(cls) -> str:
        return cls.__faker__.name()

    @classmethod
    def department(cls) -> str:
        departments = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
        return cls.__random__.choice(departments)

    @classmethod
    def salary(cls) -> float:
        return round(cls.__random__.uniform(50000, 150000), 2)

    @classmethod
    def email(cls) -> str:
        return cls.__faker__.company_email()


employees = EmployeeFactory.batch(3)
print("Generated Employees:")
for emp in employees:
    print(f"  {emp.employee_id}: {emp.full_name}")
    print(f"    Department: {emp.department}")
    print(f"    Salary: ${emp.salary:,.2f}")
    print(f"    Email: {emp.email}")
    print()
print()

print("=" * 80)
print("SECTION 4: Field Constraints and Calculated Fields")
print("=" * 80)


@dataclass
class Product:
    product_id: str
    name: str
    description: str
    price: float
    discount_percentage: float
    stock_quantity: int
    final_price: Optional[float] = None
    sku: Optional[str] = None


class ProductFactory(DataclassFactory[Product]):
    @classmethod
    def product_id(cls) -> str:
        return f"PROD-{cls.__random__.randint(1000, 9999)}"

    @classmethod
    def name(cls) -> str:
        adjectives = ["Premium", "Deluxe", "Classic", "Modern", "Eco"]
        nouns = ["Widget", "Gadget", "Device", "Tool", "Appliance"]
        return f"{cls.__random__.choice(adjectives)} {cls.__random__.choice(nouns)}"

    @classmethod
    def price(cls) -> float:
        return round(cls.__random__.uniform(10.0, 1000.0), 2)

    @classmethod
    def discount_percentage(cls) -> float:
        return round(cls.__random__.uniform(0, 30), 2)

    @classmethod
    def stock_quantity(cls) -> int:
        return cls.__random__.randint(0, 500)

    @classmethod
    def build(cls, **kwargs):
        # Derive dependent fields once the base instance exists.
        instance = super().build(**kwargs)
        if instance.final_price is None:
            instance.final_price = round(
                instance.price * (1 - instance.discount_percentage / 100), 2
            )
        if instance.sku is None:
            name_part = instance.name.replace(" ", "-").upper()[:10]
            instance.sku = f"{instance.product_id}-{name_part}"
        return instance


products = ProductFactory.batch(3)
print("Generated Products:")
for prod in products:
    print(f"  {prod.sku}")
    print(f"    Name: {prod.name}")
    print(f"    Price: ${prod.price:.2f}")
    print(f"    Discount: {prod.discount_percentage}%")
    print(f"    Final Price: ${prod.final_price:.2f}")
    print(f"    Stock: {prod.stock_quantity} units")
    print()
print()

We focus on generating simple but realistic mock data using dataclasses and default Polyfactory behavior. We show how to quickly create single instances and batches without writing any custom logic. This helps us validate how Polyfactory automatically interprets type hints to populate nested structures.

print("=" * 80)
print("SECTION 6: Complex Nested Structures")
print("=" * 80)

from enum import Enum


class OrderStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"


@dataclass
class OrderItem:
    product_name: str
    quantity: int
    unit_price: float
    total_price: Optional[float] = None


@dataclass
class ShippingInfo:
    carrier: str
    tracking_number: str
    estimated_delivery: date


@dataclass
class Order:
    order_id: str
    customer_name: str
    customer_email: str
    status: OrderStatus
    items: List[OrderItem]
    order_date: datetime
    shipping_info: Optional[ShippingInfo] = None
    total_amount: Optional[float] = None
    notes: Optional[str] = None


class OrderItemFactory(DataclassFactory[OrderItem]):
    @classmethod
    def product_name(cls) -> str:
        products = ["Laptop", "Mouse", "Keyboard", "Monitor", "Headphones",
                    "Webcam", "USB Cable", "Phone Case", "Charger", "Tablet"]
        return cls.__random__.choice(products)

    @classmethod
    def quantity(cls) -> int:
        return cls.__random__.randint(1, 5)

    @classmethod
    def unit_price(cls) -> float:
        return round(cls.__random__.uniform(5.0, 500.0), 2)

    @classmethod
    def build(cls, **kwargs):
        # Line total depends on quantity and unit price, so compute it afterwards.
        instance = super().build(**kwargs)
        if instance.total_price is None:
            instance.total_price = round(instance.quantity * instance.unit_price, 2)
        return instance


class ShippingInfoFactory(DataclassFactory[ShippingInfo]):
    @classmethod
    def carrier(cls) -> str:
        carriers = ["FedEx", "UPS", "DHL", "USPS"]
        return cls.__random__.choice(carriers)

    @classmethod
    def tracking_number(cls) -> str:
        return ''.join(cls.__random__.choices('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=12))


class OrderFactory(DataclassFactory[Order]):
    @classmethod
    def order_id(cls) -> str:
        return f"ORD-{datetime.now().year}-{cls.__random__.randint(100000, 999999)}"

    @classmethod
    def items(cls) -> List[OrderItem]:
        return OrderItemFactory.batch(cls.__random__.randint(1, 5))

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        if instance.total_amount is None:
            instance.total_amount = round(sum(item.total_price for item in instance.items), 2)
        # Shipped or delivered orders should always carry shipping details.
        if instance.shipping_info is None and instance.status in [OrderStatus.SHIPPED, OrderStatus.DELIVERED]:
            instance.shipping_info = ShippingInfoFactory.build()
        return instance


orders = OrderFactory.batch(2)
print("Generated Orders:")
for order in orders:
    print(f"\n  Order {order.order_id}")
    print(f"    Customer: {order.customer_name} ({order.customer_email})")
    print(f"    Status: {order.status.value}")
    print(f"    Items ({len(order.items)}):")
    for item in order.items:
        print(f"      - {item.quantity}x {item.product_name} @ ${item.unit_price:.2f} = ${item.total_price:.2f}")
    print(f"    Total: ${order.total_amount:.2f}")
    if order.shipping_info:
        print(f"    Shipping: {order.shipping_info.carrier} - {order.shipping_info.tracking_number}")
print("\n")

We build more complex domain logic by introducing calculated and dependent fields within factories. We show how we can derive values such as final prices, totals, and shipping details after object creation. This allows us to model realistic business rules directly inside our test data generators.
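As a quick check of the arithmetic behind these derived fields, the sketch below reproduces the two formulas used in the factories (discounted price and order total) as plain functions. The function names and the sample numbers are made up for illustration and are not part of the tutorial code.

```python
from typing import List, Tuple


def final_price(price: float, discount_percentage: float) -> float:
    # Same formula as the Product factory: price reduced by a percentage.
    return round(price * (1 - discount_percentage / 100), 2)


def order_total(items: List[Tuple[int, float]]) -> float:
    # Each item is (quantity, unit_price); line totals are rounded, then summed.
    line_totals = [round(quantity * unit_price, 2) for quantity, unit_price in items]
    return round(sum(line_totals), 2)


print(final_price(200.0, 15.0))                 # 170.0
print(order_total([(2, 19.99), (1, 249.50)]))   # 289.48
```

Keeping the formulas this small makes it easy to assert in tests that a generated `final_price` or `total_amount` matches the fields it was derived from.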

print("=" * 80)
print("SECTION 7: Attrs Integration")
print("=" * 80)

import attrs
from polyfactory.factories.attrs_factory import AttrsFactory


@attrs.define
class BlogPost:
    title: str
    author: str
    content: str
    views: int = 0
    likes: int = 0
    published: bool = False
    published_at: Optional[datetime] = None
    tags: List[str] = attrs.field(factory=list)


class BlogPostFactory(AttrsFactory[BlogPost]):
    @classmethod
    def title(cls) -> str:
        templates = [
            "10 Tips for {}",
            "Understanding {}",
            "The Complete Guide to {}",
            "Why {} Matters",
            "Getting Started with {}",
        ]
        topics = ["Python", "Data Science", "Machine Learning", "Web Development", "DevOps"]
        template = cls.__random__.choice(templates)
        topic = cls.__random__.choice(topics)
        return template.format(topic)

    @classmethod
    def content(cls) -> str:
        return " ".join(Faker().sentences(nb=cls.__random__.randint(3, 8)))

    @classmethod
    def views(cls) -> int:
        return cls.__random__.randint(0, 10000)

    @classmethod
    def likes(cls) -> int:
        return cls.__random__.randint(0, 1000)

    @classmethod
    def tags(cls) -> List[str]:
        all_tags = ["python", "tutorial", "beginner", "advanced", "guide",
                    "tips", "best-practices", "2024"]
        return cls.__random__.sample(all_tags, k=cls.__random__.randint(2, 5))


posts = BlogPostFactory.batch(3)
print("Generated Blog Posts:")
for post in posts:
    print(f"\n  '{post.title}'")
    print(f"    Author: {post.author}")
    print(f"    Views: {post.views:,} | Likes: {post.likes:,}")
    print(f"    Published: {post.published}")
    print(f"    Tags: {', '.join(post.tags)}")
    print(f"    Preview: {post.content[:100]}...")
print("\n")

print("=" * 80)
print("SECTION 8: Building with Specific Overrides")
print("=" * 80)

# Keyword arguments pin specific fields; everything else stays generated.
custom_person = PersonFactory.build(
    name="Alice Johnson",
    age=30,
    email="[email protected]",  # address appears redacted in the source
)
print("Custom Person:")
print(f"  Name: {custom_person.name}")
print(f"  Age: {custom_person.age}")
print(f"  Email: {custom_person.email}")
print(f"  ID (auto-generated): {custom_person.id}")
print()

vip_customers = PersonFactory.batch(
    3,
    bio="VIP Customer",
)
print("VIP Customers:")
for customer in vip_customers:
    print(f"  {customer.name}: {customer.bio}")
print("\n")

We extend Polyfactory usage to validated Pydantic models and attrs-based classes. We demonstrate how we can respect field constraints, validators, and default behaviors while still generating valid data at scale. This ensures our mock data remains compatible with real application schemas.

print("=" * 80)
print("SECTION 9: Field-Level Control with Use and Ignore")
print("=" * 80)

from polyfactory.fields import Use, Ignore


@dataclass
class Configuration:
    app_name: str
    version: str
    debug: bool
    created_at: datetime
    api_key: str
    secret_key: str


class ConfigFactory(DataclassFactory[Configuration]):
    # Use wraps a callable whose return value becomes the field value.
    app_name = Use(lambda: "MyAwesomeApp")
    version = Use(lambda: "1.0.0")
    debug = Use(lambda: False)

    @classmethod
    def api_key(cls) -> str:
        return f"api_key_{''.join(cls.__random__.choices('0123456789abcdef', k=32))}"

    @classmethod
    def secret_key(cls) -> str:
        return f"secret_{''.join(cls.__random__.choices('0123456789abcdef', k=64))}"


configs = ConfigFactory.batch(2)
print("Generated Configurations:")
for config in configs:
    print(f"  App: {config.app_name} v{config.version}")
    print(f"    Debug: {config.debug}")
    print(f"    API Key: {config.api_key[:20]}...")
    print(f"    Created: {config.created_at}")
    print()
print()

print("=" * 80)
print("SECTION 10: Model Coverage Testing")
print("=" * 80)

from pydantic import BaseModel, ConfigDict
from typing import Union
# ModelFactory is not imported in any snippet shown here, so import it explicitly.
from polyfactory.factories.pydantic_factory import ModelFactory


class PaymentMethod(BaseModel):
    model_config = ConfigDict(use_enum_values=True)
    type: str
    card_number: Optional[str] = None
    bank_name: Optional[str] = None
    verified: bool = False


class PaymentMethodFactory(ModelFactory[PaymentMethod]):
    __model__ = PaymentMethod


payment_methods = [
    PaymentMethodFactory.build(type="card", card_number="4111111111111111"),
    PaymentMethodFactory.build(type="bank", bank_name="Chase Bank"),
    PaymentMethodFactory.build(verified=True),
]

print("Payment Method Coverage:")
for i, pm in enumerate(payment_methods, 1):
    print(f"  {i}. Type: {pm.type}")
    if pm.card_number:
        print(f"     Card: {pm.card_number}")
    if pm.bank_name:
        print(f"     Bank: {pm.bank_name}")
    print(f"     Verified: {pm.verified}")
print("\n")

print("=" * 80)
print("TUTORIAL SUMMARY")
print("=" * 80)
print("""
This tutorial covered:

1. ✓ Basic Dataclass Factories - Simple mock data generation
2. ✓ Custom Field Generators - Controlling individual field values
3. ✓ Field Constraints - Using PostGenerated for calculated fields
4. ✓ Pydantic Integration - Working with validated models
5. ✓ Complex Nested Structures - Building related objects
6. ✓ Attrs Support - Alternative to dataclasses
7. ✓ Build Overrides - Customizing specific instances
8. ✓ Use and Ignore - Explicit field control
9. ✓ Coverage Testing - Ensuring comprehensive test data

Key Takeaways:
- Polyfactory automatically generates mock data from type hints
- Customize generation with classmethods and decorators
- Supports multiple libraries: dataclasses, Pydantic, attrs, msgspec
- Use PostGenerated for calculated/dependent fields
- Override specific values while keeping others random
- Perfect for testing, development, and prototyping

For more information:
- Documentation: https://polyfactory.litestar.dev/
- GitHub: https://github.com/litestar-org/polyfactory
""")
print("=" * 80)

We cover advanced usage patterns such as explicit overrides, constant field values, and coverage testing scenarios. We show how we can intentionally construct edge cases and variant instances for robust testing. This final step ties everything together by demonstrating how Polyfactory supports comprehensive and production-grade test data strategies.

In conclusion, we demonstrated how Polyfactory enables us to create comprehensive, flexible test data with minimal boilerplate while still retaining fine-grained control over every field. We showed how to handle simple entities, complex nested structures, and Pydantic model validation, as well as explicit field overrides, within a single, consistent factory-based approach. Overall, we found that Polyfactory enables us to move faster and test more confidently, as it reliably generates realistic datasets that closely mirror production-like scenarios without sacrificing clarity or maintainability.

