PDF Processing Pipeline - Complete Technical Guide

A 14-stage intelligent pipeline that transforms material catalogs into searchable knowledge.

📚 Related Documentation:


🎯 Pipeline Overview - Product-Centric Architecture

Key Concept: After Stage 0 discovers products, Stages 1-5 process EACH product individually, extracting and linking all related data (chunks, images, tables) before moving to the next product.

┌────────────────────────────────────────────────────────────────┐
│ STAGE 0A: Product Discovery (0-10%)                            │
│ AI Model: Claude Sonnet 4.5 / GPT-4o                           │
│ Purpose: Extract products with ALL metadata (inseparable)      │
│ Output: Products with metadata JSONB (factory, specs, etc.)    │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 0B: Document Entity Discovery (10-15%) - OPTIONAL        │
│ AI Model: Claude Sonnet 4.5 / GPT-4o                           │
│ Purpose: Extract certificates, logos, specifications           │
│ Output: Document entities stored separately with relationships │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ FOR EACH PRODUCT (Product-Centric Loop)                        │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 1: Extract Product Pages (15-25%)                        │
│ Tool: PyMuPDF                                                  │
│ Process: Extract pages for THIS product only                   │
│ Output: Product pages ready for processing                     │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 2: Product-Centric Text Extraction (25-35%)              │
│ Tool: PyMuPDF4LLM + UnifiedChunkingService                     │
│ Process: Extract text for THIS product only                    │
│ Output: Text chunks with product_id                            │
│                                                                │
│ 📝 PRODUCT-AWARE CHUNKING:                                     │
│ - Only process pages in product's page range                   │
│ - Add product_id and product_name to each chunk                │
│ - Respect semantic boundaries (paragraphs, sentences)          │
│                                                                │
│ Database: chunks (with product_id foreign key)                 │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 3: Product-Centric Image Extraction (35-45%)             │
│ Tool: VisionGuidedImageExtractor                               │
│ Process: Extract images for THIS product only                  │
│ Output: Images with product_id                                 │
│                                                                │
│ 🖼️ IMAGE EXTRACTION:                                           │
│ - Only process pages in product's page range                   │
│ - Upload to Supabase Storage immediately                       │
│ - Link to product via product_id                               │
│                                                                │
│ Database: product_images (with product_id foreign key)         │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 4: Product Creation (45-50%)                             │
│ Service: ProductService + Database Queries                     │
│ Process: Create product record in database                     │
│ Output: Product with UUID (product_id)                         │
│                                                                │
│ 🏭 PRODUCT CREATION:                                           │
│ - Create product record in database                            │
│ - Generate UUID (product_id)                                   │
│ - Store metadata JSONB (factory, specs, etc.)                  │
│ - Consolidate visual metadata from associated images           │
│                                                                │
│ Database: products                                             │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ END OF PRODUCT LOOP - ALL PRODUCTS                             │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 4.5: Cross-Product Field Propagation (68-72%)            │
│ File: app/api/pdf_processing/stage_4_products.py               │
│ Process: Share common catalog-level fields across siblings     │
│ Progress: 70 → 72%   Monitor stage: "field_propagation"        │
│                                                                │
│ 🔄 FIELDS PROPAGATED (first non-empty sibling wins):           │
│ Top-level:                                                     │
│ - factory_name / factory_group_name                            │
│ - country_of_origin / origin                                   │
│ - material_category (upload override always wins)              │
│ - manufacturing_location / process / country                   │
│ - available_sizes ← shared across catalog siblings             │
│ Nested (material_properties):                                  │
│ - thickness, body_type, composition                            │
│                                                                │
│ ⚠️ Only fills EMPTY fields - existing values never overwritten │
│ DB sync: tracker._sync_to_database("field_propagation")        │
│ Timeout: 2 min (DB reads/writes only, no AI calls)             │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 4.6: Dimension Extraction from Text Chunks (72-76%)      │
│ File: app/api/pdf_processing/stage_4_products.py               │
│ Process: Regex scan of extracted text for sizes/thickness      │
│ Progress: 74 → 76%   Monitor stage: "dimension_extraction"     │
│                                                                │
│ 📏 WHAT IT DOES (no AI calls - pure regex):                    │
│ 1. Merges all document_chunks.content for this document        │
│ 2. Extracts size patterns: WxH cm (5–300 cm sanity check)      │
│ 3. Extracts thickness near keywords (thickness/spessore/       │
│    épaisseur/Stärke) or bare "X.Ymm" fallback                  │
│ 4. Fills products still missing these fields after 4.5         │
│    - available_sizes: list of found sizes                      │
│    - material_properties.thickness: {value, confidence:        │
│      0.65, source: "document_text"}                            │
│                                                                │
│ Timeout: 2 min                                                 │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 5: Entity Linking (65-70%)                               │
│ Service: EntityLinkingService                                  │
│ Process: Link all entities to product                          │
│ Output: Complete product with all relationships                │
│                                                                │
│ 🔗 ENTITY LINKING:                                             │
│ - Link chunks via product_id foreign key                       │
│ - Link images via product_id foreign key                       │
│ - Link tables via product_id foreign key                       │
│ - Link layout regions via product_id foreign key               │
│                                                                │
│ Database: products, chunks, product_images, product_tables,    │
│   product_layout_regions                                       │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ REPEAT FOR NEXT PRODUCT                                        │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 6: AI Classification (70-75%) - URL-BASED PROCESSING     │
│ Model: Qwen3-VL 17B Vision                                     │
│ Process: Download from Supabase URLs → Classify → Delete       │
│ Output: Material vs non-material classification                │
│                                                                │
│ 🚀 URL-BASED ARCHITECTURE:                                     │
│ 1. Download image from Supabase URL to RAM                     │
│ 2. Convert to base64 on-the-fly                                │
│ 3. Classify with Qwen Vision (material/non-material)           │
│ 4. Delete from RAM immediately                                 │
│ 5. Delete non-material images from Supabase                    │
│                                                                │
│ Memory: ~1-2MB per image (temporary download)                  │
│ Time: ~2-3 seconds per image                                   │
│ Disk: 0 images (everything in RAM)                             │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 7: SLIG Embeddings (75-85%) - URL-BASED PROCESSING       │
│ Models: SigLIP2 via SLIG cloud endpoint (768D, 5 types)        │
│ Process: Use Supabase URLs directly (NO download!)             │
│ Output: 5 SLIG 768D embeddings per material image              │
│                                                                │
│ 🚀 ZERO-DOWNLOAD ARCHITECTURE:                                 │
│ 1. Pass Supabase URL to SLIG cloud endpoint                    │
│ 2. SLIG fetches internally                                     │
│ 3. Generate 5 embeddings (visual, color, texture, style, mat)  │
│ 4. Save directly to VECS collections (updated 2026-04)         │
│ 5. Auto-cleanup (no manual deletion needed)                    │
│                                                                │
│ Memory: ~100MB per batch                                       │
│ Time: ~2-3 seconds per image                                   │
│ Disk: 0 images (URL-based)                                     │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 8: Qwen Vision Analysis (85-90%) - URL-BASED PROCESSING  │
│ Model: Qwen3-VL 17B Vision                                     │
│ Process: Download from Supabase URLs → Analyze → Delete        │
│ Output: Quality scores, material properties, confidence        │
│                                                                │
│ 🚀 ON-DEMAND DOWNLOAD ARCHITECTURE:                            │
│ 1. Download image from Supabase URL to RAM                     │
│ 2. Convert to base64 on-the-fly                                │
│ 3. Analyze with Qwen Vision (quality, properties)              │
│ 4. Delete from RAM immediately                                 │
│ 5. Batch cleanup after every 10 images                         │
│                                                                │
│ Memory: ~1-2MB per image (temporary download)                  │
│ Time: ~3-5 seconds per image                                   │
│ Disk: 0 images (everything in RAM)                             │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 9: Product Creation (90-95%)                             │
│ Models: Claude Haiku 4.5 → Claude Sonnet 4.5                   │
│ Output: Product records with relationships                     │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 10: Entity Linking (95-98%)                              │
│ Process: Link products, chunks, images, document entities      │
│ Output: Relationships with relevance scores                    │
│                                                                │
│ Relationships Created:                                         │
│ - Product → Image (relevance scores)                           │
│ - Chunk → Image (relevance scores)                             │
│ - Chunk → Product (relevance scores)                           │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ STAGE 11: Completion (98-100%)                                 │
│ Process: Final validation and cleanup                          │
│ Output: Complete processed document                            │
│ Note: All images stored in Supabase, 0 local files             │
└────────────────────────────────────────────────────────────────┘


๐Ÿ—๏ธ Product-Centric Architecture

Why Product-Centric?

Traditional Approach (Document-Centric):

  1. Extract ALL text from entire document
  2. Extract ALL images from entire document
  3. Extract ALL tables from entire document
  4. Try to link everything together at the end
  5. Problem: relationships are hard to maintain, and data ends up scattered

Product-Centric Approach (Current):

  1. Discover products first (Stage 0)
  2. For each product individually:
    • Extract ONLY its pages (Stage 1)
    • Extract ONLY its text chunks (Stage 2)
    • Extract ONLY its images (Stage 3)
    • Create product record (Stage 4)
    • Validate all relationships (Stage 5)
  3. Benefit: All data linked via product_id foreign key from the start

Data Flow Example

Product: "NOVA" (Pages 12-14)

Stage 0 discovers: name="NOVA", page_range=[12,13,14]

Stage 1 (Layout + Tables, FOR NOVA ONLY): YOLO detects 15 regions on pages 12-14; Camelot extracts 3 tables from page 13; tables stored with product_id = NOVA's ID.

Stage 2 (Text Chunking, FOR NOVA ONLY): Extracts text from pages 12-14 only, creates 45 chunks, each with product_id = NOVA's ID.

Stage 3 (Image Extraction, FOR NOVA ONLY): Extracts images from pages 12-14 only, uploads 12 images to Supabase, each with product_id = NOVA's ID.

Stage 4 (Product Creation): Creates product record for NOVA with metadata stored in JSONB.

Stage 5 (Validation): Counts 45 chunks, 12 images, 3 tables, all linked via the product_id foreign key. Logs: "Product 'NOVA' entities linked: 45 chunks, 12 images, 3 tables".

Database Relationships

Foreign Key Architecture: All entities link to products via the product_id foreign key.

No Separate Relationship Tables Needed:

Benefits

  1. Data Integrity: All entities linked from creation
  2. Simple Queries: SELECT * FROM chunks WHERE product_id = ?
  3. Easy Validation: Count entities per product
  4. Clean Architecture: No orphaned data
  5. Efficient Processing: Process one product at a time
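Because every entity carries the same product_id foreign key, pulling a product's full context is three straightforward lookups. A minimal sketch using the supabase-py client (the client setup and credentials are placeholders; this is not the pipeline's actual code):

```python
from supabase import create_client

# Placeholder project URL and key -- supply your own.
supabase = create_client("https://<project>.supabase.co", "<service-role-key>")

def fetch_product_entities(product_id: str) -> dict:
    """Collect every entity linked to one product via its product_id foreign key."""
    chunks = supabase.table("chunks").select("*").eq("product_id", product_id).execute()
    images = supabase.table("product_images").select("*").eq("product_id", product_id).execute()
    tables = supabase.table("product_tables").select("*").eq("product_id", product_id).execute()
    return {"chunks": chunks.data, "images": images.data, "tables": tables.data}
```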

📋 Detailed Stage Breakdown

Stage 0A: Product Discovery (0-10%)

Purpose: Extract products with ALL metadata (Products + Metadata = Inseparable)

AI Model: Claude Sonnet 4.5 or GPT-4o

Process:

  1. Extract full PDF text
  2. Analyze content structure
  3. Identify product boundaries
  4. Extract products WITH all metadata in one pass
  5. Store in products table with metadata JSONB

Output: A JSON structure with a products array, each entry containing fields like name, description, page_range, metadata (with designer, studio, category, dimensions, variants, factory, factory_group, manufacturer, country_of_origin, slip_resistance, fire_rating, thickness, water_absorption, finish, material), image_indices, and confidence. Also includes total_products and confidence_score at the top level.
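As a concrete illustration, one discovery payload might look like the sketch below (field names follow the description above; every value is invented):

```python
# Illustrative shape only -- values are invented, not taken from a real catalog.
stage_0a_output = {
    "products": [
        {
            "name": "NOVA",
            "description": "Glazed porcelain tile collection",
            "page_range": [12, 13, 14],
            "metadata": {
                "category": "porcelain",
                "dimensions": "60x120 cm",
                "thickness": "9mm",
                "finish": "matte",
                "country_of_origin": "Italy",
            },
            "image_indices": [3, 4, 7],
            "confidence": 0.92,
        },
    ],
    "total_products": 1,
    "confidence_score": 0.92,
}
```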

Database Storage:

Example (Harmony PDF):


Stage 0B: Document Entity Discovery (10-15%) - OPTIONAL

Purpose: Extract certificates, logos, specifications as separate knowledge base

AI Model: Claude Sonnet 4.5 or GPT-4o

Process:

  1. Analyze PDF for document entities
  2. Extract certificates (ISO, CE, quality certifications)
  3. Extract logos (company, brand, certification marks)
  4. Extract specifications (technical docs, installation guides)
  5. Identify factory/group for each entity
  6. Store in document_entities table

Output: A JSON structure with certificates, logos, and specifications arrays. Each certificate includes name, certificate_type, issuer, issue_date, expiry_date, standards, page_range, factory_name, factory_group, and confidence. Logos include name, logo_type, description, page_range, and confidence. Specifications include name, spec_type, description, page_range, and confidence.
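A hedged sketch of that shape (every value invented for illustration):

```python
# Illustrative shape only -- every value here is invented.
stage_0b_output = {
    "certificates": [
        {
            "name": "ISO 9001",
            "certificate_type": "quality",
            "issuer": "ISO",
            "issue_date": "2024-01-01",
            "expiry_date": "2027-01-01",
            "standards": ["ISO 9001:2015"],
            "page_range": [2],
            "factory_name": "Example Factory",
            "factory_group": "Example Group",
            "confidence": 0.90,
        },
    ],
    "logos": [
        {"name": "CE mark", "logo_type": "certification",
         "description": "CE conformity mark", "page_range": [2], "confidence": 0.88},
    ],
    "specifications": [
        {"name": "Installation guide", "spec_type": "installation",
         "description": "Wall installation instructions", "page_range": [60], "confidence": 0.85},
    ],
}
```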

Database Storage:

Agentic Query Examples:


Stage 1: Focused Extraction + YOLO Layout Detection (15-30%)

File: app/api/pdf_processing/stage_1_focused_extraction.py

Purpose: Extract product pages, detect layout regions, and extract tables

Process:

  1. Page Mapping: Map catalog pages to physical PDF pages
  2. YOLO Layout Detection: Detect TEXT, IMAGE, TABLE, TITLE, CAPTION regions
  3. Table Extraction: Extract structured tables using Camelot (NEW!)
  4. Layout Region Storage: Store detected regions for intelligent chunking

Services Used:

Data Extracted:

  1. Page Mapping

    • Catalog page numbers → PDF page indices
    • Handles 2-page spreads and standard layouts
  2. YOLO Layout Regions

    • TEXT regions - Body text, paragraphs
    • IMAGE regions - Product images, diagrams
    • TABLE regions - Specification tables, data grids
    • TITLE regions - Headers, section titles
    • CAPTION regions - Image captions, labels
  3. Tables (NEW!)

    • Structured table data with headers and rows
    • Table type classification (specifications, dimensions, etc.)
    • Confidence scores for extraction quality
    • Page numbers for linking to products

Database Storage:

Returns: A dict with product_pages (set of physical PDF page indices), layout_regions (list of YOLO-detected LayoutRegion objects), layout_stats (total_regions, text_regions, image_regions, table_regions, title_regions counts), and tables_extracted (integer count).
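A sketch of that return shape (the counts are invented and the LayoutRegion objects are elided):

```python
# Illustrative return value for Stage 1 -- numbers are invented.
stage_1_result = {
    "product_pages": {11, 12, 13},   # physical PDF page indices
    "layout_regions": [],            # list of YOLO-detected LayoutRegion objects
    "layout_stats": {
        "total_regions": 15,
        "text_regions": 6,
        "image_regions": 5,
        "table_regions": 3,
        "title_regions": 1,
    },
    "tables_extracted": 3,
}
```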

Benefits:

Output:


Stage 2: Product-Centric Text Extraction (30-40%)

File: app/api/pdf_processing/stage_2_chunking.py

Tool: PyMuPDF4LLM + Product-Aware Chunking

Process:

  1. Per-Product Processing: Extract text for EACH product individually
  2. Page Range Filtering: Only process pages in product's page range
  3. Layout-Aware Chunking: Use YOLO regions to guide chunking
  4. Semantic Boundaries: Respect paragraph/sentence structure
  5. Product Context: Add product_id and product_name to each chunk

Services Used:

Data Created:

  1. Text Chunks

    • Content: Extracted text segments
    • Product ID: Links chunk to specific product
    • Product Name: For context and filtering
    • Page Numbers: Source pages for chunk
    • Quality Score: Semantic completeness score
  2. Chunk Metadata

    • Layout regions used (TEXT, TITLE, CAPTION)
    • Semantic boundaries (paragraph, sentence)
    • Product context (name, ID)

Database Storage:

Returns: A dict with chunks_created (count), total_characters, avg_chunk_size, and quality_scores (avg, min, max).

Output:


Stage 3: Product-Centric Image Extraction (40-50%)

File: app/api/pdf_processing/stage_3_images.py

Tool: PyMuPDF + YOLO-Guided Extraction

Process:

  1. Per-Product Processing: Extract images for EACH product individually
  2. YOLO Region Filtering: Only extract IMAGE regions detected by YOLO
  3. Page Range Filtering: Only process pages in product's page range
  4. Immediate Upload: Upload to Supabase Storage immediately
  5. Product Linking: Link images to product via product_id

Services Used:

Data Created:

  1. Product Images

    • Image file (PNG/JPEG)
    • Product ID: Links image to specific product
    • Page Number: Source page for image
    • Bounding Box: YOLO-detected region coordinates
    • Image Type: Product image, diagram, detail shot
    • Supabase URL: Public URL for image access
  2. Image Metadata

    • Dimensions (width, height)
    • File size
    • Format (PNG, JPEG)
    • Extraction confidence score

Database Storage:

Returns: A dict with images_extracted, images_uploaded, total_size_mb, avg_confidence, and image_types (product, detail, diagram counts).

Output:


Stage 4: Product Creation & Entity Linking (50-60%)

File: app/api/pdf_processing/stage_4_products.py

Purpose: Create product records and link all extracted entities

Process:

  1. Product Record Creation: Create product in database
  2. Chunk Linking: Link all chunks to product
  3. Image Linking: Link all images to product
  4. Table Linking: Link all tables to product (NEW!)
  5. Metadata Storage: Store product metadata (specs, factory, etc.)

Services Used:

Data Created:

  1. Product Record

    • Product name, description
    • Page range (start, end)
    • Metadata JSONB (factory, specs, dimensions, etc.)
    • Document ID (parent document)
  2. Entity Relationships

    • Chunks → Product (via product_id foreign key)
    • Images → Product (via product_id foreign key)
    • Tables → Product (via product_id foreign key) (NEW!)

Database Storage:

Returns: A dict with product_id, product_name, chunks_linked, images_linked, tables_linked, and metadata_fields counts.

Output:


Stage 4.5: Cross-Product Field Propagation (68-72%)

File: app/api/pdf_processing/stage_4_products.py
Function: propagate_common_fields_to_products()
Monitor stage: field_propagation
Timeout: 2 min (DB reads/writes only; no AI calls)

Purpose: After all products are created, fill empty metadata fields by borrowing values from sibling products in the same document. Catalog-level attributes (factory, origin, available sizes, etc.) are typically the same for every product in a catalog; this stage enforces that uniformity without overwriting any values that were already extracted.

Process:

  1. Fetch all products for the document
  2. For each propagatable field, find the first sibling that has a non-empty value
  3. Write that value to every sibling that still has it empty
  4. Update progress_monitor and tracker before and after

Fields Propagated (first non-empty sibling wins):

Top-level metadata fields:

    • factory_name / factory_group_name
    • country_of_origin / origin
    • material_category (upload override always wins)
    • manufacturing_location / process / country
    • available_sizes (shared across catalog siblings)

Nested fields (under material_properties):

    • thickness, body_type, composition

Safety rule: Only empty/null/empty-list/empty-dict fields are touched. Existing values are never overwritten.

Returns: A dict with products_updated, total_products, fields_propagated (list of field names), and source: "stage_4_5_propagation".
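A minimal sketch of the propagation rule, assuming each product's metadata is a plain dict (the helper names are hypothetical, not the pipeline's actual code):

```python
def is_empty(value) -> bool:
    """Empty means None, empty string, empty list, or empty dict."""
    return value is None or value == "" or value == [] or value == {}

def propagate_fields(products: list[dict], fields: list[str]) -> int:
    """Fill empty metadata fields from siblings; first non-empty sibling wins."""
    updated = 0
    for field in fields:
        # Find the first sibling that already has a value for this field.
        source = next(
            (p["metadata"][field] for p in products
             if not is_empty(p["metadata"].get(field))),
            None,
        )
        if source is None:
            continue  # no sibling has it; nothing to propagate
        for product in products:
            # Safety rule: only empty fields are ever written.
            if is_empty(product["metadata"].get(field)):
                product["metadata"][field] = source
                updated += 1
    return updated
```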


Stage 4.6: Dimension Extraction from Text Chunks (72-76%)

File: app/api/pdf_processing/stage_4_products.py
Function: extract_dimensions_from_document_chunks()
Monitor stage: dimension_extraction
Timeout: 2 min (pure regex; no AI calls)

Purpose: After Stage 4.5 sibling propagation, some products may still have empty available_sizes or material_properties.thickness. This stage merges all text chunks for the document and runs regex patterns to extract dimensions and thickness values, filling the remaining gaps.

Process:

  1. Fetch all document_chunks.content for the document, merge into one text blob
  2. Run size regex: (\d+)[xX×](\d+)\s*(?:cm|CM) with sanity check (5–300 cm per axis)
  3. Run thickness regex: near keywords (thickness|spessore|épaisseur|Stärke) or bare X.Ymm pattern
  4. For each product still missing available_sizes → insert found sizes
  5. For each product still missing material_properties.thickness → insert {value, confidence: 0.65, source: "document_text"}
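The two passes can be sketched as follows; the size pattern is quoted from the step above, while the thickness pattern is an assumed equivalent of the keyword-proximity rule:

```python
import re

# Size pattern as documented above; the thickness pattern is an assumption.
SIZE_RE = re.compile(r"(\d+)[xX×](\d+)\s*(?:cm|CM)")
THICKNESS_RE = re.compile(
    r"(?:thickness|spessore|épaisseur|stärke)\D{0,15}(\d+(?:\.\d+)?)\s*mm"
    r"|(\d+\.\d+)\s*mm",  # bare "X.Ymm" fallback
    re.IGNORECASE,
)

def extract_sizes(text: str) -> list[str]:
    """Return WxH sizes, keeping only values that pass the 5-300 cm sanity check."""
    return [
        f"{w}x{h} cm"
        for w, h in SIZE_RE.findall(text)
        if 5 <= int(w) <= 300 and 5 <= int(h) <= 300
    ]

def extract_thickness(text: str) -> dict | None:
    """Return the thickness payload written to material_properties."""
    match = THICKNESS_RE.search(text)
    if not match:
        return None
    value = match.group(1) or match.group(2)
    return {"value": f"{value}mm", "confidence": 0.65, "source": "document_text"}
```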

Coverage: This stage acts as a safety net: even if Stage 0 AI extraction and Stage 4.5 sibling propagation both missed a dimension, the raw text almost always contains it somewhere.

Returns: A dict with products_updated, sizes_found (list), thickness_found, and source: "document_text".


Image OCR Dimension Extraction (inline, updated 2026-04)

Note (2026-04): The former asynchronous "Phase 2 background image processor" (app/services/images/background_image_processor.py) was deleted; it called a non-existent generate_material_embeddings method and produced no output. Image-text dimension extraction is now handled inline during Phase 1 image processing by _enrich_product_metadata_from_spec_image() in the image processing service, using the same regex patterns against Qwen vision output.

Logic:

Regex patterns applied (same as Stage 4.6 but on Qwen's OCR output):

Confidence: 0.70, source: "image_text"

Three-Layer Coverage Summary:

Layer                 Stage             Source                  Confidence   AI?
Sibling propagation   4.5               Sibling product DB      0.75         No
Text chunk regex      4.6               document_chunks text    0.65         No
Image OCR regex       Phase 1 (inline)  Qwen raw_qwen_output    0.70         Yes (Qwen)

Stage 5: Entity Linking & Relationship Mapping (60-70%)

File: app/services/discovery/entity_linking_service.py

Purpose: Link all extracted entities to products and create relationships

Process:

  1. Query All Entities: Fetch chunks, images, tables by product_id
  2. Count Statistics: Count linked entities for each product
  3. Validate Relationships: Ensure all entities are properly linked
  4. Update Product Stats: Store entity counts in product metadata

Services Used:

Data Validated:

  1. Chunks → Product: Query counts chunks by product_id, validates foreign key linkage.
  2. Images → Product: Query counts product_images by product_id, validates foreign key linkage.
  3. Tables → Product (NEW!): Query counts product_tables by product_id, validates foreign key linkage.

Database Storage:

Returns: A dict with product_id, chunks_linked, images_linked, tables_linked, total_entities, and validation_passed.

Logging Output: Produces a structured log entry like "Product 'NOVA' entities linked: Chunks: 45, Images: 12, Tables: 3, Total: 60 entities".

Output:


Stage 6: AI Classification (70-75%) - URL-Based Processing

🚀 ON-DEMAND DOWNLOAD ARCHITECTURE

Model: Qwen3-VL 17B Vision

Process (Per Image):

  1. Download image from Supabase URL to RAM
  2. Convert to base64 on-the-fly (no disk I/O)
  3. Classify with Qwen Vision (material vs non-material)
  4. Delete from RAM immediately
  5. Delete non-material images from Supabase Storage
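A minimal sketch of that per-image loop; classify_material() stands in for the Qwen3-VL call, whose endpoint details are not documented here:

```python
import base64
import requests

def classify_material(image_b64: str) -> bool:
    """Stand-in for the Qwen3-VL material/non-material call (endpoint not documented here)."""
    raise NotImplementedError

def classify_image_from_url(image_url: str) -> bool:
    # 1. Download from the Supabase URL straight into RAM (no disk I/O).
    image_bytes = requests.get(image_url, timeout=30).content
    # 2. Convert to base64 on-the-fly for the vision model.
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    # 3. Classify with Qwen Vision.
    is_material = classify_material(image_b64)
    # 4. Drop references so the bytes are reclaimed immediately.
    del image_bytes, image_b64
    return is_material
```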

Why URL-Based?

Output: A JSON result with total_images_classified, material_images, non_material_images, classification_errors, non_material_deleted_from_supabase, memory_usage, and processing_time.

Performance Metrics:


Stage 7: SLIG Embeddings (75-85%) - URL-Based Processing (updated 2026-04)

🚀 ZERO-DOWNLOAD ARCHITECTURE

Model: SigLIP2 via SLIG cloud endpoint (768D). Legacy SigLIP ViT-SO400M (1152D) was retired in 2026-04; its collections were 100% orphans.

Process (Per Image):

  1. Pass Supabase URL to SLIG service (no manual download!)
  2. SLIG fetches internally
  3. Generate 5 embedding types (all 768D)
  4. Save directly to VECS collections (image_slig_embeddings, image_color_embeddings, image_texture_embeddings, image_style_embeddings, image_material_embeddings)
  5. Auto-cleanup

Why Zero-Download?

5 SLIG Embedding Types Generated Per Image (SigLIP2 via SLIG cloud endpoint, 768D each):

  1. Visual Embeddings (768D) - Overall visual appearance, enables visual similarity search. Collection: image_slig_embeddings. Producer key: visual_768.
  2. Color Embeddings (768D) - Text-guided color similarity. Collection: image_color_embeddings. Producer key: color_slig_768.
  3. Texture Embeddings (768D) - Text-guided texture similarity. Collection: image_texture_embeddings. Producer key: texture_slig_768.
  4. Style Embeddings (768D) - Text-guided style similarity. Collection: image_style_embeddings. Producer key: style_slig_768.
  5. Material Embeddings (768D) - Text-guided material similarity. Collection: image_material_embeddings. Producer key: material_slig_768.
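Writing the five vectors with the vecs client could look like the sketch below; the collection names, producer keys, and 768 dimension come from the list above, while the connection string and record layout are assumptions:

```python
import vecs

# Placeholder connection string -- supply your own Postgres/Supabase DSN.
vx = vecs.create_client("postgresql://user:password@host:5432/postgres")

COLLECTIONS = {
    "visual_768": "image_slig_embeddings",
    "color_slig_768": "image_color_embeddings",
    "texture_slig_768": "image_texture_embeddings",
    "style_slig_768": "image_style_embeddings",
    "material_slig_768": "image_material_embeddings",
}

def save_embeddings(image_id: str, vectors: dict[str, list[float]]) -> None:
    """Upsert each 768D vector into its VECS collection, keyed by producer key."""
    for producer_key, collection_name in COLLECTIONS.items():
        collection = vx.get_or_create_collection(name=collection_name, dimension=768)
        collection.upsert(records=[(image_id, vectors[producer_key], {"image_id": image_id})])
```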

Output: A JSON result with material_images_processed, slig_embeddings_generated, total_embeddings, memory_usage, processing_time, and embeddings_by_type (visual, color, texture, style, material counts).

Performance Metrics:


Stage 8: Qwen Vision Analysis (85-90%) - URL-Based Processing

🚀 ON-DEMAND DOWNLOAD ARCHITECTURE

Model: Qwen3-VL 17B Vision

Process (Per Image):

  1. Download image from Supabase URL to RAM
  2. Convert to base64 on-the-fly
  3. Analyze with Qwen Vision (quality, properties)
  4. Delete from RAM immediately
  5. Batch cleanup after every 10 images

Why On-Demand Download?

Output: A JSON result with images_analyzed, quality_scores_generated, material_properties_extracted, memory_usage, and processing_time.

Performance Metrics:


Stage 6 (alternate): Image Analysis (80-85%) - ASYNC JOB

Model: Qwen3-VL 17B Vision

Process:

  1. Runs as background job (non-blocking)
  2. Analyze each image for OCR
  3. Extract material properties
  4. Calculate quality scores

Output: A JSON structure per image with image_id, ocr_text, materials (list), properties (dict with fields like weight and weave), and quality_score.

Quality Scoring:

Note: This stage runs asynchronously and does not block pipeline completion


Stage 7 (alternate): Product Creation (85-92%)

Models: Claude Haiku 4.5 → Claude Sonnet 4.5

Two-Stage Validation:

Stage 1 (Haiku - Fast):

Stage 2 (Sonnet - Deep):

Output: A JSON structure per product with product_id, name, description, metadata (factory, dimensions, material), chunks (list of IDs), images (list of IDs), and confidence_score.


Stage 8 (alternate): Entity Linking (92-97%)

Process:

  1. Link products to images (relevance scores)
  2. Link chunks to images (relevance scores)
  3. Link chunks to products (relevance scores)
  4. Create relationship records

Relevance Algorithm:

Output: A JSON result with product_image_relationships, chunk_image_relationships, chunk_product_relationships, and total_relationships counts.

Database Tables:


Stage 9: Completion (97-100%)

Process:

  1. Final validation
  2. Update job status
  3. Generate completion summary
  4. Trigger async jobs (if any)

Output: Complete processed document with all relationships


🔄 Checkpoint Recovery

9 checkpoints for failure recovery:

  1. INITIALIZED - Job created
  2. PDF_EXTRACTED - PDF analysis complete
  3. CHUNKS_CREATED - Text chunking complete
  4. TEXT_EMBEDDINGS_GENERATED - Text embeddings complete
  5. IMAGES_EXTRACTED - Images uploaded to Supabase Storage ✅ UPDATED
  6. IMAGE_EMBEDDINGS_GENERATED - SLIG 768D embeddings + Qwen Vision complete ✅ UPDATED
  7. PRODUCTS_DETECTED - Products identified
  8. PRODUCTS_CREATED - Product creation complete
  9. COMPLETED - All processing complete

Recovery Process: On startup, if a job has a saved checkpoint_stage, the pipeline resumes from that checkpoint rather than starting from the beginning.
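A minimal sketch of that resume rule, assuming the job record carries checkpoint_stage as a string:

```python
CHECKPOINT_ORDER = [
    "INITIALIZED",
    "PDF_EXTRACTED",
    "CHUNKS_CREATED",
    "TEXT_EMBEDDINGS_GENERATED",
    "IMAGES_EXTRACTED",
    "IMAGE_EMBEDDINGS_GENERATED",
    "PRODUCTS_DETECTED",
    "PRODUCTS_CREATED",
    "COMPLETED",
]

def stages_to_run(job: dict) -> list[str]:
    """Return the checkpoints still ahead of the job's saved checkpoint_stage."""
    checkpoint = job.get("checkpoint_stage")
    if checkpoint not in CHECKPOINT_ORDER:
        return CHECKPOINT_ORDER  # no valid checkpoint: start from the beginning
    return CHECKPOINT_ORDER[CHECKPOINT_ORDER.index(checkpoint) + 1:]
```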

Note:


📊 Performance Metrics

NOVA PDF Example (71 pages, 249 images):

Accuracy Metrics:

URL-Based Architecture Impact:


๐Ÿ—๏ธ Modular Architecture (Refactored)

The pipeline has been refactored from a monolithic 2900+ line function into modular services and API endpoints for better debugging, testing, and retry capabilities.

Service Layer

ImageProcessingService (app/services/image_processing_service.py)

UnifiedChunkingService (app/services/unified_chunking_service.py)

RelevancyService (app/services/relevancy_service.py) (updated 2026-04)

Internal API Endpoints

Each pipeline stage has a dedicated endpoint for independent testing and retry. The available internal endpoints are:

Main Orchestrator Endpoint

The main orchestrator is POST /api/rag/documents/upload accepting multipart/form-data with parameters: file (PDF), workspace_id (UUID), category (default: "products"), and focused_extraction (default: true). It returns a JSON response with job_id, document_id, status: "processing", progress: 0, and current_stage: "INITIALIZED".
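For example, a client call against this endpoint (the host and workspace_id are placeholders):

```python
import requests

with open("catalog.pdf", "rb") as pdf:
    response = requests.post(
        "https://<mivaa-host>/api/rag/documents/upload",
        files={"file": ("catalog.pdf", pdf, "application/pdf")},
        data={
            "workspace_id": "00000000-0000-0000-0000-000000000000",
            "category": "products",        # default
            "focused_extraction": "true",  # default
        },
        timeout=120,
    )

job = response.json()
print(job["job_id"], job["status"], job["progress"])  # e.g. <uuid> processing 0
```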

Orchestrator Flow:

  1. Upload PDF and create job
  2. Call /api/internal/classify-images/{job_id}
  3. Call /api/internal/upload-images/{job_id}
  4. Call /api/internal/save-images-db/{job_id}
  5. Call /api/internal/create-chunks/{job_id}
  6. Call /api/internal/create-relationships/{job_id}
  7. Update job status to COMPLETED

Benefits:


๐Ÿ›ก๏ธ Production Hardening

PDF processing implements complete production hardening for reliability and monitoring:

Source Tracking ✅

Every product, chunk, image, and embedding is tagged with source_type: 'pdf_processing' and source_job_id: job_id at insert time. This applies to the products, document_chunks, document_images, and embeddings tables.

Benefits:


Heartbeat Monitoring ✅

Updates last_heartbeat field every stage to detect stuck jobs. The heartbeat update sets last_heartbeat, current_stage, and progress_percent on the background_jobs record for the active job.

Stuck Job Detection:


Sentry Error Tracking ✅

Comprehensive error tracking and performance monitoring using sentry_sdk.start_transaction with op="pdf_processing" and name="process_stage". Tags include job_id and stage, and data includes total_pages. Breadcrumbs are added at each processing step. On success, the transaction status is set to "ok". On exception, sentry_sdk.capture_exception() is called and status set to "internal_error" before re-raising.
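A sketch of that instrumentation using the public sentry_sdk tracing API; the stage body itself is a stand-in:

```python
import sentry_sdk

def run_stage(job_id: str, stage: str, total_pages: int) -> None:
    transaction = sentry_sdk.start_transaction(op="pdf_processing", name="process_stage")
    transaction.set_tag("job_id", job_id)
    transaction.set_tag("stage", stage)
    transaction.set_data("total_pages", total_pages)
    try:
        sentry_sdk.add_breadcrumb(category="pdf_processing", message=f"starting {stage}")
        # ... stage work happens here ...
        transaction.set_status("ok")
    except Exception as exc:
        sentry_sdk.capture_exception(exc)
        transaction.set_status("internal_error")
        raise
    finally:
        transaction.finish()
```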

Features:


Production Hardening Status

Feature               Status       Details
Source Tracking       ✅ COMPLETE   All tables have source_type and source_job_id
Heartbeat Monitoring  ✅ COMPLETE   Updates every stage, 10-minute stuck threshold
Sentry Tracking       ✅ COMPLETE   Transactions, breadcrumbs, exception capture
Error Handling        ✅ COMPLETE   Comprehensive try-catch with Sentry integration
Progress Tracking     ✅ COMPLETE   Real-time progress updates via job_progress table
Checkpoint Recovery   ✅ COMPLETE   Resume from last successful stage
Auto-Recovery         ✅ COMPLETE   Automatic retry of stuck/failed jobs

📡 Product API Endpoint

GET /api/rag/products

Purpose: Retrieve products with all linked entities (chunks, images, tables)

File: mivaa-pdf-extractor/app/api/rag_routes.py

Query Parameters:

Response Format: A JSON object with a products array and a total count. Each product entry contains id, name, description, page_range, metadata (with factory, dimensions, material), a chunks array (each with id, content, page_number), an images array (each with id, url, page_number), and a tables array (each with id, page_number, table_type, headers, and table_data containing a rows array).

Implementation Details:

  1. Efficient Batch Query: Single query fetches all tables for all products
  2. Grouping: Tables grouped by product_id for efficient lookup
  3. Backward Compatible: Can disable tables with include_tables=false
  4. Consistent Pattern: Follows same pattern as chunks and images

Usage: GET /api/rag/products?document_id=YOUR_DOC_ID (with tables, default) or append &include_tables=false to exclude tables.

Benefits:


🎯 YOLO Layout-Aware Chunking

Overview

The YOLO Layout-Aware Chunking system uses detected layout regions to create intelligent, boundary-respecting chunks that preserve document structure and semantic meaning.

How It Works

Stage 1 (YOLO Detection) → Stage 2 (Layout-Aware Chunking)

  1. YOLO detects layout regions (Stage 1)

    • Stores regions in product_layout_regions table
    • Each region has: type, bbox, confidence, reading_order, text_content
  2. Chunking service reads regions (Stage 2)

    • Fetches regions for current product
    • Sorts by reading_order
    • Creates chunks based on region types

Chunking Strategy by Region Type

1. TABLE Regions 📊

A TABLE region chunk contains the full table text (headers and rows), a region_type of "TABLE", and a reading_order value.

2. TITLE + TEXT Regions 📝

A TITLE+TEXT chunk combines the section heading with its body paragraph, with region_type: "TITLE+TEXT" and a reading_order value.

3. TEXT Regions 📄

A TEXT region chunk contains body text, with region_type: "TEXT" and a reading_order value.

4. IMAGE + CAPTION Regions 🖼️

A CAPTION region chunk contains the caption text, region_type: "CAPTION", a reading_order value, and a linked_image_bbox reference.

Configuration

Layout-aware chunking is enabled by setting strategy=ChunkingStrategy.LAYOUT_AWARE in the ChunkingConfig passed to UnifiedChunkingService, with max_chunk_size=1000 and min_chunk_size=100.
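In code, that configuration would look roughly like this (the class and enum names are taken from this document; the import path and constructor signature are assumptions):

```python
# Assumed import path -- per the Service Layer section above.
from app.services.unified_chunking_service import (
    ChunkingConfig,
    ChunkingStrategy,
    UnifiedChunkingService,
)

config = ChunkingConfig(
    strategy=ChunkingStrategy.LAYOUT_AWARE,
    max_chunk_size=1000,  # upper bound per chunk
    min_chunk_size=100,   # floor below which chunks are merged
)
chunking_service = UnifiedChunkingService(config)
```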

Fallback Behavior:

Benefits

✅ Preserves Document Structure

✅ Improves Search Quality

✅ Reduces Fragmentation

Performance


🚀 Future Enhancements

1. Sophisticated Title-Content Relationships

Current Implementation:

Planned Enhancements:

Multi-Level Title Hierarchy

The system would detect H1, H2, and H3 heading levels and associate body content with its full parent hierarchy (e.g., "Outdoor Furniture → Chairs → Ergonomic Series").

Title Propagation

A chunk would include a hierarchy object with h1, h2, and h3 keys alongside its content.

Smart Boundary Detection

Benefits:


2. Monitoring & Metrics for YOLO Performance

Planned Metrics:

Processing Metrics

Metrics to track per-page: yolo_processing_time_per_page, regions_detected_per_page, confidence_score_avg, confidence_score_min, and table_extraction_success_rate.

Region Distribution

Region counts by type: TEXT, TITLE, TABLE, IMAGE, CAPTION, FORMULA counts per document.

Performance Tracking

Benefits:

Metrics would be stored in the job_progress table with a stage: 'yolo_detection' key.


3. GPU Acceleration

Current Performance:

With GPU Acceleration:

Implementation Plan:

The system would auto-detect GPU availability using torch.cuda.is_available(). With GPU, it would process multiple pages in parallel batches (batch_size=4 for CUDA vs 1 for CPU), clearing GPU cache between batches with torch.cuda.empty_cache(). The device and batch size would be configurable via YOLO_DEVICE and YOLO_BATCH_SIZE environment variables.
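A sketch of that detection and batching logic, using the environment variables named above (the YOLO call itself is elided):

```python
import os
import torch

def yolo_runtime_settings() -> tuple[str, int]:
    """Pick device and batch size from env vars, falling back to auto-detection."""
    device = os.getenv("YOLO_DEVICE") or ("cuda" if torch.cuda.is_available() else "cpu")
    default_batch = 4 if device == "cuda" else 1
    batch_size = int(os.getenv("YOLO_BATCH_SIZE", default_batch))
    return device, batch_size

def process_pages(pages: list) -> None:
    device, batch_size = yolo_runtime_settings()
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        # ... run YOLO layout detection on `batch` using `device` ...
        if device == "cuda":
            torch.cuda.empty_cache()  # clear GPU cache between batches
```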

Benefits:


4. Advanced Chunking Rules

Beyond Title-Content Relationships:

List Detection

Keep bullet and numbered lists together as atomic units with region_type: "LIST" and list_type: "bullet".

Table Context

Include surrounding prose context (introductory text before a table and notes after) in the same chunk as the table data, with region_type: "TABLE_WITH_CONTEXT".

Image-Caption Linking

Link captions to their specific images by including a linked_image_id field in CAPTION chunks.

Cross-Reference Detection

Detect phrases like "see Figure 3" and record cross_references and linked_chunks in the chunk metadata.

Section Boundaries

Never split across major sections. Track section and subsection fields in chunk metadata.

Formula Preservation

Keep mathematical formulas intact with region_type: "FORMULA" and a formula_type descriptor.

Benefits:


Last Updated: February 20, 2026
Pipeline Version: Product-Centric Architecture with YOLO Layout Detection, Table Extraction & Cross-Product Field Propagation
Status: Production

Major Features:

Future Enhancements (Planned):


🩺 Document Health Panel (Admin Observability)

A per-document health view, surfaced as a third tab on completed jobs in the Admin → Async Job Queue Monitor (alongside "Product Extraction Pipeline" and "Technical Logs"). It only appears when the job's status === 'completed'.

Frontend: src/components/Admin/AsyncJobQueueMonitor/DocumentHealthPanel.tsx
Backend: GET /api/internal/document-extraction-status/{document_id} (MIVAA)
Re-run action: POST /api/internal/run-catalog-knowledge/{document_id}?force=true

What it shows

Section                      Detail
Average coverage %           Big number, color-coded by health (green ≥75%, amber 50–75%, red <50%)
Layer 1 - Catalog Layout     Run state + page-type breakdown (legend_pages, product_spec_pages, product_photo_pages, named_products_detected)
Layer 2 - Catalog Legends    Run state + legend_types_found + global_certifications propagated catalog-wide
Coverage bucket bar chart    Distribution of products across 0–25% / 25–50% / 50–75% / 75–100% buckets
Per-product drilldown        Sample of products with their missing_critical fields and a source-breakdown chip set per product
Issues banner                Detected problems + one-click "Re-run Catalog Knowledge" remediation

Source-breakdown chips (extraction tier provenance)

Each chip indicates which tier produced a given field on that product. The same labels are used throughout the admin UI:

Source key                                  Tier label
pymupdf_text_dict                           PyMuPDF Tier A
claude_sonnet_vision / claude_spec_vision   Claude Sonnet Tier B
catalog_legend                              Catalog Legend Tier C
chunk_regex                                 Chunk Regex
vision_rollup                               Image Vision Rollup
ai_text_extraction                          AI Text (Stage 0)

Why this exists

Coverage and source mix are the two best signals for "did this catalog actually parse well?" The bucket chart spots catalogs where average coverage is fine but the long tail is empty; the source chips spot catalogs where one tier silently failed and another is doing all the work. The "Re-run Catalog Knowledge" button is the standard one-click fix when Layer 2 needs to retry.