14-stage intelligent pipeline for transforming material catalogs into searchable knowledge.
Related Documentation:
- Async Processing & Limits - Concurrency limits and async architecture
- Product Discovery Architecture - AI-powered product extraction
- System Architecture - Overall platform architecture
Key Concept: After Stage 0 discovers products, Stages 1-5 process EACH product individually, extracting and linking all related data (chunks, images, tables) before moving to the next product.
```
STAGE 0A: Product Discovery (0-10%)
  AI Model: Claude Sonnet 4.5 / GPT-4o
  Purpose: Extract products with ALL metadata (inseparable)
  Output: Products with metadata JSONB (factory, specs, etc.)
        ↓
STAGE 0B: Document Entity Discovery (10-15%) - OPTIONAL
  AI Model: Claude Sonnet 4.5 / GPT-4o
  Purpose: Extract certificates, logos, specifications
  Output: Document entities stored separately with relationships
        ↓
=== FOR EACH PRODUCT (Product-Centric Loop) ===
        ↓
STAGE 1: Extract Product Pages (15-25%)
  Tool: PyMuPDF
  Process: Extract pages for THIS product only
  Output: Product pages ready for processing
        ↓
STAGE 2: Product-Centric Text Extraction (25-35%)
  Tool: PyMuPDF4LLM + UnifiedChunkingService
  Process: Extract text for THIS product only
  Output: Text chunks with product_id
  PRODUCT-AWARE CHUNKING:
    - Only process pages in product's page range
    - Add product_id and product_name to each chunk
    - Respect semantic boundaries (paragraphs, sentences)
  Database: chunks (with product_id foreign key)
        ↓
STAGE 3: Product-Centric Image Extraction (35-45%)
  Tool: VisionGuidedImageExtractor
  Process: Extract images for THIS product only
  Output: Images with product_id
  IMAGE EXTRACTION:
    - Only process pages in product's page range
    - Upload to Supabase Storage immediately
    - Link to product via product_id
  Database: product_images (with product_id foreign key)
        ↓
STAGE 4: Product Creation (45-50%)
  Service: ProductService + Database Queries
  Process: Create product record in database
  Output: Product with UUID (product_id)
  PRODUCT CREATION:
    - Create product record in database
    - Generate UUID (product_id)
    - Store metadata JSONB (factory, specs, etc.)
    - Consolidate visual metadata from associated images
  Database: products
        ↓
=== END OF PRODUCT LOOP - ALL PRODUCTS ===
        ↓
STAGE 4.5: Cross-Product Field Propagation (68-72%)
  File: app/api/pdf_processing/stage_4_products.py
  Process: Share common catalog-level fields across siblings
  Progress: 70 → 72%   Monitor stage: "field_propagation"
  FIELDS PROPAGATED (first non-empty sibling wins):
    Top-level:
      - factory_name / factory_group_name
      - country_of_origin / origin
      - material_category (upload override always wins)
      - manufacturing_location / process / country
      - available_sizes (shared across catalog siblings)
    Nested (material_properties):
      - thickness, body_type, composition
  Only fills EMPTY fields; existing values never overwritten
  DB sync: tracker._sync_to_database("field_propagation")
  Timeout: 2 min (DB reads/writes only, no AI calls)
        ↓
STAGE 4.6: Dimension Extraction from Text Chunks (72-76%)
  File: app/api/pdf_processing/stage_4_products.py
  Process: Regex scan of extracted text for sizes/thickness
  Progress: 74 → 76%   Monitor stage: "dimension_extraction"
  WHAT IT DOES (no AI calls, pure regex):
    1. Merges all document_chunks.content for this document
    2. Extracts size patterns: WxH cm (5-300 cm sanity check)
    3. Extracts thickness near keywords (thickness/spessore/
       épaisseur/Stärke) or bare "X.Ymm" fallback
    4. Fills products still missing these fields after 4.5
       - available_sizes: list of found sizes
       - material_properties.thickness: {value, confidence:
         0.65, source: "document_text"}
  Timeout: 2 min
        ↓
STAGE 5: Entity Linking (65-70%)
  Service: EntityLinkingService
  Process: Link all entities to product
  Output: Complete product with all relationships
  ENTITY LINKING:
    - Link chunks via product_id foreign key
    - Link images via product_id foreign key
    - Link tables via product_id foreign key
    - Link layout regions via product_id foreign key
  Database: products, chunks, product_images, product_tables,
            product_layout_regions
        ↓
=== REPEAT FOR NEXT PRODUCT ===
        ↓
STAGE 6: AI Classification (70-75%) - URL-BASED PROCESSING
  Model: Qwen3-VL 17B Vision
  Process: Download from Supabase URLs → Classify → Delete
  Output: Material vs non-material classification
  URL-BASED ARCHITECTURE:
    1. Download image from Supabase URL to RAM
    2. Convert to base64 on-the-fly
    3. Classify with Qwen Vision (material/non-material)
    4. Delete from RAM immediately
    5. Delete non-material images from Supabase
  Memory: ~1-2MB per image (temporary download)
  Time: ~2-3 seconds per image
  Disk: 0 images (everything in RAM)
        ↓
STAGE 7: SLIG Embeddings (75-85%) - URL-BASED PROCESSING
  Models: SigLIP2 via SLIG cloud endpoint (768D, 5 types)
  Process: Use Supabase URLs directly (NO download!)
  Output: 5 SLIG 768D embeddings per material image
  ZERO-DOWNLOAD ARCHITECTURE:
    1. Pass Supabase URL to SLIG cloud endpoint
    2. SLIG fetches internally
    3. Generate 5 embeddings (visual, color, texture, style, mat)
    4. Save directly to VECS collections (updated 2026-04)
    5. Auto-cleanup (no manual deletion needed)
  Memory: ~100MB per batch
  Time: ~2-3 seconds per image
  Disk: 0 images (URL-based)
        ↓
STAGE 8: Qwen Vision Analysis (85-90%) - URL-BASED PROCESSING
  Model: Qwen3-VL 17B Vision
  Process: Download from Supabase URLs → Analyze → Delete
  Output: Quality scores, material properties, confidence
  ON-DEMAND DOWNLOAD ARCHITECTURE:
    1. Download image from Supabase URL to RAM
    2. Convert to base64 on-the-fly
    3. Analyze with Qwen Vision (quality, properties)
    4. Delete from RAM immediately
    5. Batch cleanup after every 10 images
  Memory: ~1-2MB per image (temporary download)
  Time: ~3-5 seconds per image
  Disk: 0 images (everything in RAM)
        ↓
STAGE 9: Product Creation (90-95%)
  Models: Claude Haiku 4.5 → Claude Sonnet 4.5
  Output: Product records with relationships
        ↓
STAGE 10: Entity Linking (95-98%)
  Process: Link products, chunks, images, document entities
  Output: Relationships with relevance scores
  Relationships Created:
    - Product ↔ Image (relevance scores)
    - Chunk ↔ Image (relevance scores)
    - Chunk ↔ Product (relevance scores)
        ↓
STAGE 11: Completion (98-100%)
  Process: Final validation and cleanup
  Output: Complete processed document
  Note: All images stored in Supabase, 0 local files
```
Traditional Approach (Document-Centric):
Product-Centric Approach (Current):
All entities carry the product_id foreign key from the start.

Product: "NOVA" (Pages 12-14)
Stage 0 discovers: name="NOVA", page_range=[12,13,14]
Stage 1 (Layout + Tables, FOR NOVA ONLY): YOLO detects 15 regions on pages 12-14; Camelot extracts 3 tables from page 13; tables stored with product_id = NOVA's ID.
Stage 2 (Text Chunking, FOR NOVA ONLY): Extracts text from pages 12-14 only, creates 45 chunks, each with product_id = NOVA's ID.
Stage 3 (Image Extraction, FOR NOVA ONLY): Extracts images from pages 12-14 only, uploads 12 images to Supabase, each with product_id = NOVA's ID.
Stage 4 (Product Creation): Creates product record for NOVA with metadata stored in JSONB.
Stage 5 (Validation): Counts 45 chunks, 12 images, 3 tables, all linked via product_id foreign key. Logs: "Product 'NOVA' entities linked: 45 chunks, 12 images, 3 tables".
Foreign Key Architecture: All entities link to products via product_id. Specifically:
- chunks.product_id → products.id
- product_images.product_id → products.id
- product_tables.product_id → products.id
- product_layout_regions.product_id → products.id

No Separate Relationship Tables Needed:

- product_chunk_relationships table
- product_image_relationships table
- product_table_relationships table

A direct query suffices: SELECT * FROM chunks WHERE product_id = ?

Purpose: Extract products with ALL metadata (Products + Metadata = Inseparable)
AI Model: Claude Sonnet 4.5 or GPT-4o
Process:
Output: A JSON structure with a products array, each entry containing fields like name, description, page_range, metadata (with designer, studio, category, dimensions, variants, factory, factory_group, manufacturer, country_of_origin, slip_resistance, fire_rating, thickness, water_absorption, finish, material), image_indices, and confidence. Also includes total_products and confidence_score at the top level.
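As a concrete illustration, the discovery payload can be sketched as a Python dict. Only the field names follow the schema above; the values (factory, finish, indices, confidence) are invented, reusing the NOVA example from later in this document:

```python
# Illustrative shape of the Stage 0A discovery output.
# Field names follow the documented schema; values are invented.
stage_0a_output = {
    "products": [
        {
            "name": "NOVA",
            "description": "Porcelain stoneware collection",  # hypothetical
            "page_range": [12, 13, 14],
            "metadata": {
                "factory": "Example Ceramics",   # hypothetical value
                "country_of_origin": "Italy",    # hypothetical value
                "thickness": "9mm",
                "finish": "matte",
            },
            "image_indices": [3, 4, 7],          # hypothetical
            "confidence": 0.92,
        }
    ],
    "total_products": 1,
    "confidence_score": 0.92,
}
```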
Database Storage:
- products table
- metadata JSONB column

Example (Harmony PDF):
Purpose: Extract certificates, logos, specifications as separate knowledge base
AI Model: Claude Sonnet 4.5 or GPT-4o
Process:
Output: A JSON structure with certificates, logos, and specifications arrays. Each certificate includes name, certificate_type, issuer, issue_date, expiry_date, standards, page_range, factory_name, factory_group, and confidence. Logos include name, logo_type, description, page_range, and confidence. Specifications include name, spec_type, description, page_range, and confidence.
Database Storage:
- document_entities
- product_document_relationships

Agentic Query Examples:
File: app/api/pdf_processing/stage_1_focused_extraction.py
Purpose: Extract product pages, detect layout regions, and extract tables
Process:
Services Used:
- YOLOLayoutDetector - Layout detection using YOLO model
- TableExtractor - Table extraction using Camelot (guided by YOLO)

Data Extracted:
Page Mapping
YOLO Layout Regions
Tables (NEW!)
Database Storage:
- product_layout_regions table - YOLO-detected regions
- product_tables table - Extracted tables with metadata

Returns: A dict with product_pages (set of physical PDF page indices), layout_regions (list of YOLO-detected LayoutRegion objects), layout_stats (total_regions, text_regions, image_regions, table_regions, title_regions counts), and tables_extracted (integer count).
Benefits:
Output:
File: app/api/pdf_processing/stage_2_chunking.py
Tool: PyMuPDF4LLM + Product-Aware Chunking
Process:
Services Used:
- UnifiedChunkingService - Product-aware semantic chunking
- PyMuPDF4LLM - Text extraction with layout preservation

Data Created:
Text Chunks
Chunk Metadata
Database Storage:
- chunks table - Text chunks with product_id foreign key
- chunk_metadata - Additional metadata and quality scores

Returns: A dict with chunks_created (count), total_characters, avg_chunk_size, and quality_scores (avg, min, max).
Output:
File: app/api/pdf_processing/stage_3_images.py
Tool: PyMuPDF + YOLO-Guided Extraction
Process:
Services Used:
- VisionGuidedImageExtractor - YOLO-guided image extraction
- PyMuPDF - Image extraction from PDF
- Supabase Storage - Cloud storage for images

Data Created:
Product Images
Image Metadata
Database Storage:
- product_images table - Images with product_id foreign key
- image_metadata - Additional metadata and quality scores

Returns: A dict with images_extracted, images_uploaded, total_size_mb, avg_confidence, and image_types (product, detail, diagram counts).
Output:
File: app/api/pdf_processing/stage_4_products.py
Purpose: Create product records and link all extracted entities
Process:
Services Used:
- ProductService - Product CRUD operations

Data Created:
Product Record
Entity Relationships
Database Storage:
- products table - Product records
- chunks table - Chunks with product_id
- product_images table - Images with product_id
- product_tables table - Tables with product_id (NEW!)

Returns: A dict with product_id, product_name, chunks_linked, images_linked, tables_linked, and metadata_fields counts.
Output:
File: app/api/pdf_processing/stage_4_products.py
Function: propagate_common_fields_to_products()
Monitor stage: field_propagation
Timeout: 2 min (DB reads/writes only, no AI calls)
Purpose: After all products are created, fill empty metadata fields by borrowing values from sibling products in the same document. Catalog-level attributes (factory, origin, available sizes, etc.) are typically the same for every product in a catalog; this stage enforces that uniformity without overwriting any values that were already extracted.
Process:
Updates progress_monitor and tracker before and after.

Fields Propagated (first non-empty sibling wins):
Top-level metadata fields:
- factory_name, factory_group_name
- country_of_origin, origin
- material_category (upload category always wins over propagated value)
- manufacturing_location, manufacturing_process, manufacturing_country
- available_sizes - list of sizes shared across the catalog

Nested fields (under material_properties):
- thickness - propagated with {value, confidence: 0.75, source: "sibling_product"}
- body_type - e.g., "porcelain", "ceramic"
- composition - material composition string

Safety rule: Only empty/null/empty-list/empty-dict fields are touched. Existing values are never overwritten.
Returns: A dict with products_updated, total_products, fields_propagated (list of field names), and source: "stage_4_5_propagation".
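The first-non-empty-sibling rule can be sketched in a few lines. This is a simplified stand-in for propagate_common_fields_to_products(), operating on plain dicts rather than database rows:

```python
def propagate_common_fields(products, fields):
    """Fill empty metadata fields from the first non-empty sibling.
    Existing values are never overwritten (the Stage 4.5 safety rule)."""
    def is_empty(value):
        return value in (None, "", [], {})

    updated = 0
    for field in fields:
        # First non-empty sibling wins.
        donor = next(
            (p["metadata"][field] for p in products
             if not is_empty(p["metadata"].get(field))),
            None,
        )
        if donor is None:
            continue  # no sibling has this field; leave the gap
        for product in products:
            if is_empty(product["metadata"].get(field)):
                product["metadata"][field] = donor
                updated += 1
    return updated
```

The real implementation additionally handles the nested material_properties fields and the material_category upload override, which this sketch omits.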
File: app/api/pdf_processing/stage_4_products.py
Function: extract_dimensions_from_document_chunks()
Monitor stage: dimension_extraction
Timeout: 2 min (pure regex, no AI calls)
Purpose: After Stage 4.5 sibling propagation, some products may still have empty available_sizes or material_properties.thickness. This stage merges all text chunks for the document and runs regex patterns to extract dimensions and thickness values, filling the remaining gaps.
Process:
1. Merge all document_chunks.content for the document into one text blob
2. Extract size patterns: (\d+)[xX×](\d+)\s*(?:cm|CM) with sanity check (5-300 cm per axis)
3. Extract thickness near keywords (thickness|spessore|épaisseur|Stärke) or bare X.Ymm pattern
4. Fill products still missing these fields after Stage 4.5:
   - available_sizes - insert found sizes
   - material_properties.thickness - insert {value, confidence: 0.65, source: "document_text"}

Coverage: This stage acts as a safety net: even if Stage 0 AI extraction and Stage 4.5 sibling propagation both missed a dimension, the raw text almost always contains it somewhere.
Returns: A dict with products_updated, sizes_found (list), thickness_found, and source: "document_text".
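A minimal sketch of this regex pass. The size pattern and the bare-mm fallback are the documented ones; the keyword-proximity expression is an approximation of the real one:

```python
import re

# Documented size pattern, with the 5-300 cm per-axis sanity check applied below.
SIZE_RE = re.compile(r"(\d+)[xX×](\d+)\s*(?:cm|CM)")

# Approximation: thickness value within a few characters of a keyword.
THICKNESS_RE = re.compile(
    r"(?:thickness|spessore|épaisseur|Stärke)\D{0,15}?(\d+(?:\.\d+)?)\s*mm",
    re.IGNORECASE,
)

def extract_sizes(text):
    """Return WxH sizes that pass the 5-300 cm per-axis sanity check."""
    sizes = []
    for w, h in SIZE_RE.findall(text):
        w, h = int(w), int(h)
        if 5 <= w <= 300 and 5 <= h <= 300:
            sizes.append(f"{w}x{h} cm")
    return sizes

def extract_thickness(text):
    """Thickness near a keyword, else the bare 'X.Ymm' fallback."""
    match = THICKNESS_RE.search(text) or re.search(r"(\d+\.\d+)\s*mm", text)
    return float(match.group(1)) if match else None
```

For example, a blob containing "60x120 cm" and "Spessore: 9.5 mm" yields one size and a thickness, while an implausible "1x9999 cm" is rejected by the sanity check.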
Note (2026-04): The former asynchronous "Phase 2 background image processor" (app/services/images/background_image_processor.py) was deleted; it called a non-existent generate_material_embeddings method and produced no output. Image-text dimension extraction is now handled inline during Phase 1 image processing by _enrich_product_metadata_from_spec_image() in the image processing service, using the same regex patterns against Qwen vision output.
Logic:
- keyword_hits >= 3 → image is a spec table → handled by _enrich_product_metadata_from_spec_image() as a spec table
- keyword_hits < 3 → image is a product photo with possible text overlay → regex extraction against Qwen raw output

Regex patterns applied (same as Stage 4.6 but on Qwen's OCR output):
- (\d+)[xX×](\d+)\s*(?:cm|CM|mm|MM)?
- bare X.Ymm pattern

Confidence: 0.70, source: "image_text"
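The routing heuristic can be sketched as follows. Only the >= 3 threshold comes from the documentation above; the keyword list itself is hypothetical (the real set lives in the image processing service):

```python
# Hypothetical keyword list for illustration; the real one is
# defined in the image processing service.
SPEC_KEYWORDS = ["thickness", "size", "finish", "weight", "absorption"]

def route_image_text(qwen_text):
    """Route Qwen OCR output: >= 3 keyword hits means the image is a
    spec table; otherwise treat it as a photo and run regex extraction."""
    hits = sum(1 for kw in SPEC_KEYWORDS if kw in qwen_text.lower())
    return "spec_table" if hits >= 3 else "photo_regex"
```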
Three-Layer Coverage Summary:
| Layer | Stage | Source | Confidence | AI? |
|---|---|---|---|---|
| Sibling propagation | 4.5 | Sibling product DB | 0.75 | No |
| Text chunk regex | 4.6 | document_chunks text | 0.65 | No |
| Image OCR regex | Phase 1 (inline) | Qwen raw_qwen_output | 0.70 | Yes (Qwen) |
File: app/services/discovery/entity_linking_service.py
Purpose: Link all extracted entities to products and create relationships
Process:
Services Used:
- EntityLinkingService - Entity relationship management

Data Validated:
Database Storage:
Returns: A dict with product_id, chunks_linked, images_linked, tables_linked, total_entities, and validation_passed.
Logging Output: Produces a structured log entry like "Product 'NOVA' entities linked: Chunks: 45, Images: 12, Tables: 3, Total: 60 entities".
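The validation summary and log line can be sketched as a small helper. This is an illustrative stand-in, not the EntityLinkingService API; only the log format and counts come from the documentation:

```python
def summarize_entity_links(product_name, chunks, images, tables):
    """Build the Stage 5 validation summary and its log line."""
    total = chunks + images + tables
    log_line = (
        f"Product '{product_name}' entities linked: "
        f"Chunks: {chunks}, Images: {images}, Tables: {tables}, "
        f"Total: {total} entities"
    )
    summary = {
        "chunks_linked": chunks,
        "images_linked": images,
        "tables_linked": tables,
        "total_entities": total,
        "validation_passed": total > 0,  # assumed pass criterion
    }
    return summary, log_line
```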
Output:
ON-DEMAND DOWNLOAD ARCHITECTURE
Model: Qwen3-VL 17B Vision
Process (Per Image):
Why URL-Based?
Output: A JSON result with total_images_classified, material_images, non_material_images, classification_errors, non_material_deleted_from_supabase, memory_usage, and processing_time.
Performance Metrics:
ZERO-DOWNLOAD ARCHITECTURE
Model: SigLIP2 via SLIG cloud endpoint (768D). Legacy SigLIP ViT-SO400M (1152D) was retired in 2026-04; its collections were 100% orphans.
Process (Per Image):
VECS collections written: image_slig_embeddings, image_color_embeddings, image_texture_embeddings, image_style_embeddings, image_material_embeddings

Why Zero-Download?
5 SLIG Embedding Types Generated Per Image (SigLIP2 via SLIG cloud endpoint, 768D each):
- Visual: image_slig_embeddings. Producer key: visual_768.
- Color: image_color_embeddings. Producer key: color_slig_768.
- Texture: image_texture_embeddings. Producer key: texture_slig_768.
- Style: image_style_embeddings. Producer key: style_slig_768.
- Material: image_material_embeddings. Producer key: material_slig_768.

Output: A JSON result with material_images_processed, slig_embeddings_generated, total_embeddings, memory_usage, processing_time, and embeddings_by_type (visual, color, texture, style, material counts).
Performance Metrics:
ON-DEMAND DOWNLOAD ARCHITECTURE
Model: Qwen3-VL 17B Vision
Process (Per Image):
Why On-Demand Download?
Output: A JSON result with images_analyzed, quality_scores_generated, material_properties_extracted, memory_usage, and processing_time.
Performance Metrics:
Model: Qwen3-VL 17B Vision
Process:
Output: A JSON structure per image with image_id, ocr_text, materials (list), properties (dict with fields like weight and weave), and quality_score.
Quality Scoring:
Note: This stage runs asynchronously and does not block pipeline completion
Models: Claude Haiku 4.5 → Claude Sonnet 4.5
Two-Stage Validation:
Stage 1 (Haiku - Fast):
Stage 2 (Sonnet - Deep):
Output: A JSON structure per product with product_id, name, description, metadata (factory, dimensions, material), chunks (list of IDs), images (list of IDs), and confidence_score.
Process:
Relevance Algorithm:
Output: A JSON result with product_image_relationships, chunk_image_relationships, chunk_product_relationships, and total_relationships counts.
Database Tables:
- product_image_relationships
- chunk_image_relationships
- chunk_product_relationships

Process:
Output: Complete processed document with all relationships
9 checkpoints for failure recovery:
Recovery Process: On startup, if a job has a saved checkpoint_stage, the pipeline resumes from that checkpoint rather than starting from the beginning.
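The resume logic can be sketched as follows. The stage names here are hypothetical placeholders; the real pipeline has 9 checkpoints whose identifiers are not listed in this document:

```python
# Hypothetical checkpoint ordering for illustration only.
PIPELINE_STAGES = [
    "discovery", "text_extraction", "image_extraction",
    "product_creation", "entity_linking", "completion",
]

def resume_index(job):
    """Return the stage index to resume from: the saved
    checkpoint_stage if one exists, otherwise the beginning."""
    checkpoint = job.get("checkpoint_stage")
    if checkpoint in PIPELINE_STAGES:
        return PIPELINE_STAGES.index(checkpoint)
    return 0  # no checkpoint: start from the first stage
```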
Note:
NOVA PDF Example (71 pages, 249 images):
Accuracy Metrics:
URL-Based Architecture Impact:
The pipeline has been refactored from a monolithic 2900+ line function into modular services and API endpoints for better debugging, testing, and retry capabilities.
ImageProcessingService (app/services/image_processing_service.py)
- classify_images() - Qwen Vision + Claude validation
- upload_images_to_storage() - Upload to Supabase Storage
- save_images_and_generate_clips() - DB save + SLIG 768D embeddings (name retained for backwards compat; writes directly to VECS)

UnifiedChunkingService (app/services/unified_chunking_service.py)
- chunk_text() - Semantic/hybrid/fixed-size/layout-aware chunking

RelevancyService (app/services/relevancy_service.py) (updated 2026-04)
- create_product_image_relationships() - Based on page ranges
- create_all_relationships() - Orchestrate all relationships
- Chunk-image linking is handled by entity_linking_service.link_images_to_chunks using page_proximity. The former create_chunk_image_relationships() method was deleted in 2026-04; it computed cosine similarity between 1024D text and 768D visual vectors, which was mathematically invalid.

Each pipeline stage has a dedicated endpoint for independent testing and retry. The available internal endpoints are:
- POST /api/internal/classify-images/{job_id}
- POST /api/internal/upload-images/{job_id}
- POST /api/internal/save-images-db/{job_id}
- POST /api/internal/create-chunks/{job_id}
- POST /api/internal/create-relationships/{job_id}

The main orchestrator is POST /api/rag/documents/upload, accepting multipart/form-data with parameters: file (PDF), workspace_id (UUID), category (default: "products"), and focused_extraction (default: true). It returns a JSON response with job_id, document_id, status: "processing", progress: 0, and current_stage: "INITIALIZED".
Orchestrator Flow:
1. /api/internal/classify-images/{job_id}
2. /api/internal/upload-images/{job_id}
3. /api/internal/save-images-db/{job_id}
4. /api/internal/create-chunks/{job_id}
5. /api/internal/create-relationships/{job_id}

Benefits:
PDF processing implements complete production hardening for reliability and monitoring:
Every product, chunk, image, and embedding is tagged with source_type: 'pdf_processing' and source_job_id: job_id at insert time. This applies to the products, document_chunks, document_images, and embeddings tables.
Benefits:
Updates last_heartbeat field every stage to detect stuck jobs. The heartbeat update sets last_heartbeat, current_stage, and progress_percent on the background_jobs record for the active job.
Stuck Job Detection:
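A minimal sketch of the detection rule, assuming the 10-minute threshold stated in the production-hardening summary table:

```python
from datetime import datetime, timedelta, timezone

STUCK_THRESHOLD = timedelta(minutes=10)  # per the documented threshold

def is_stuck(last_heartbeat, now=None):
    """A job is considered stuck when its last_heartbeat is older
    than the threshold (heartbeats are written at every stage)."""
    now = now or datetime.now(timezone.utc)
    return (now - last_heartbeat) > STUCK_THRESHOLD
```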
Comprehensive error tracking and performance monitoring using sentry_sdk.start_transaction with op="pdf_processing" and name="process_stage". Tags include job_id and stage, and data includes total_pages. Breadcrumbs are added at each processing step. On success, the transaction status is set to "ok". On exception, sentry_sdk.capture_exception() is called and status set to "internal_error" before re-raising.
Features:
| Feature | Status | Details |
|---|---|---|
| Source Tracking | ✅ COMPLETE | All tables have source_type and source_job_id |
| Heartbeat Monitoring | ✅ COMPLETE | Updates every stage, 10-minute stuck threshold |
| Sentry Tracking | ✅ COMPLETE | Transactions, breadcrumbs, exception capture |
| Error Handling | ✅ COMPLETE | Comprehensive try-catch with Sentry integration |
| Progress Tracking | ✅ COMPLETE | Real-time progress updates via job_progress table |
| Checkpoint Recovery | ✅ COMPLETE | Resume from last successful stage |
| Auto-Recovery | ✅ COMPLETE | Automatic retry of stuck/failed jobs |
Purpose: Retrieve products with all linked entities (chunks, images, tables)
File: mivaa-pdf-extractor/app/api/rag_routes.py
Query Parameters:
- document_id (required): Document ID to filter products
- include_tables (optional, default: true): Include tables in response

Response Format: A JSON object with a products array and a total count. Each product entry contains id, name, description, page_range, metadata (with factory, dimensions, material), a chunks array (each with id, content, page_number), an images array (each with id, url, page_number), and a tables array (each with id, page_number, table_type, headers, and table_data containing a rows array).
Implementation Details:
- Indexed on product_id for efficient lookup
- Tables can be excluded with include_tables=false

Usage: GET /api/rag/products?document_id=YOUR_DOC_ID (with tables, default) or append &include_tables=false to exclude tables.
Benefits:
The YOLO Layout-Aware Chunking system uses detected layout regions to create intelligent, boundary-respecting chunks that preserve document structure and semantic meaning.
Stage 1 (YOLO Detection) → Stage 2 (Layout-Aware Chunking)
YOLO detects layout regions (Stage 4.5)
Regions are stored in the product_layout_regions table.

Chunking service reads regions (Stage 2)
A TABLE region chunk contains the full table text (headers and rows), a region_type of "TABLE", and a reading_order value.
A TITLE+TEXT chunk combines the section heading with its body paragraph, with region_type: "TITLE+TEXT" and a reading_order value.
A TEXT region chunk contains body text, with region_type: "TEXT" and a reading_order value.
A CAPTION region chunk contains the caption text, region_type: "CAPTION", a reading_order value, and a linked_image_bbox reference.
Layout-aware chunking is enabled by setting strategy=ChunkingStrategy.LAYOUT_AWARE in the ChunkingConfig passed to UnifiedChunkingService, with max_chunk_size=1000 and min_chunk_size=100.
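The boundary-respecting behavior can be sketched as a simplified stand-in (not the actual UnifiedChunkingService; region dicts with region_type, reading_order, and text fields are assumed):

```python
def layout_aware_chunks(regions, max_chunk_size=1000):
    """Assemble chunks in reading order: TABLE regions stay atomic,
    TITLE text is merged with the body text that follows it."""
    chunks, buffer = [], ""

    def flush():
        nonlocal buffer
        if buffer.strip():
            chunks.append(buffer.strip())
        buffer = ""

    for region in sorted(regions, key=lambda r: r["reading_order"]):
        kind, text = region["region_type"], region["text"]
        if kind == "TABLE":
            flush()
            chunks.append(text)       # tables are never split
        elif kind == "TITLE":
            flush()
            buffer = text + "\n"      # heading attaches to the next body text
        else:
            if buffer and len(buffer) + len(text) > max_chunk_size:
                flush()               # respect the size ceiling
            buffer += text
    flush()
    return chunks
```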
Fallback Behavior:
- No product_id in metadata → falls back to semantic chunking

✅ Preserves Document Structure

✅ Improves Search Quality

✅ Reduces Fragmentation
Current Implementation:
Planned Enhancements:
The system would detect H1, H2, and H3 heading levels and associate body content with its full parent hierarchy (e.g., "Outdoor Furniture → Chairs → Ergonomic Series").
A chunk would include a hierarchy object with h1, h2, and h3 keys alongside its content.
Benefits:
Planned Metrics:
Metrics to track per-page: yolo_processing_time_per_page, regions_detected_per_page, confidence_score_avg, confidence_score_min, and table_extraction_success_rate.
Region counts by type: TEXT, TITLE, TABLE, IMAGE, CAPTION, FORMULA counts per document.
Benefits:
Metrics would be stored in the job_progress table with a stage: 'yolo_detection' key.
Current Performance:
With GPU Acceleration:
Implementation Plan:
The system would auto-detect GPU availability using torch.cuda.is_available(). With GPU, it would process multiple pages in parallel batches (batch_size=4 for CUDA vs 1 for CPU), clearing GPU cache between batches with torch.cuda.empty_cache(). The device and batch size would be configurable via YOLO_DEVICE and YOLO_BATCH_SIZE environment variables.
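The planned batching behavior can be sketched as follows. In the real plan, torch.cuda.is_available() would supply cuda_available and torch.cuda.empty_cache() would run between batches; both are omitted here so the sketch stays framework-free:

```python
import os

def yolo_batch_plan(pages, cuda_available, env=None):
    """Pick device and batch size (batch_size=4 on CUDA vs 1 on CPU,
    overridable via YOLO_DEVICE / YOLO_BATCH_SIZE), then split pages
    into processing batches."""
    env = os.environ if env is None else env
    device = env.get("YOLO_DEVICE", "cuda" if cuda_available else "cpu")
    batch_size = int(env.get("YOLO_BATCH_SIZE", 4 if device == "cuda" else 1))
    batches = [pages[i:i + batch_size]
               for i in range(0, len(pages), batch_size)]
    return device, batches
```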
Benefits:
Beyond Title-Content Relationships:
Keep bullet and numbered lists together as atomic units with region_type: "LIST" and list_type: "bullet".
Include surrounding prose context (introductory text before a table and notes after) in the same chunk as the table data, with region_type: "TABLE_WITH_CONTEXT".
Link captions to their specific images by including a linked_image_id field in CAPTION chunks.
Detect phrases like "see Figure 3" and record cross_references and linked_chunks in the chunk metadata.
Never split across major sections. Track section and subsection fields in chunk metadata.
Keep mathematical formulas intact with region_type: "FORMULA" and a formula_type descriptor.
Benefits:
Last Updated: February 20, 2026
Pipeline Version: Product-Centric Architecture with YOLO Layout Detection, Table Extraction & Cross-Product Field Propagation
Status: Production
Major Features:
- YOLO layout detection (YOLO_ENABLED=true)
- product_layout_regions table
- product_tables table
- field_propagation stage, 2-min timeout

Future Enhancements (Planned):
A per-document health view, surfaced as a third tab on completed jobs in the Admin → Async Job Queue Monitor (alongside "Product Extraction Pipeline" and "Technical Logs"). It only appears when the job's status === 'completed'.
Frontend: src/components/Admin/AsyncJobQueueMonitor/DocumentHealthPanel.tsx
Backend: GET /api/internal/document-extraction-status/{document_id} (MIVAA)
Re-run action: POST /api/internal/run-catalog-knowledge/{document_id}?force=true
| Section | Detail |
|---|---|
| Average coverage % | Big number, color-coded by health (green ≥75%, amber 50-75%, red <50%) |
| Layer 1 โ Catalog Layout | Run state + page-type breakdown (legend_pages, product_spec_pages, product_photo_pages, named_products_detected) |
| Layer 2 โ Catalog Legends | Run state + legend_types_found + global_certifications propagated catalog-wide |
| Coverage bucket bar chart | Distribution of products across 0-25% / 25-50% / 50-75% / 75-100% buckets |
| Per-product drilldown | Sample of products with their missing_critical fields and a source-breakdown chip set per product |
| Issues banner | Detected problems + one-click "Re-run Catalog Knowledge" remediation |
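The color thresholds and coverage buckets above can be sketched in Python (the panel itself is TypeScript; the handling of exact boundary values like 25/50/75 is an assumption):

```python
def coverage_color(pct):
    """Map average coverage % to the panel's health color."""
    if pct >= 75:
        return "green"
    if pct >= 50:
        return "amber"
    return "red"

def coverage_bucket(pct):
    """Bucket a product's coverage % for the distribution bar chart."""
    for upper, label in [(25, "0-25%"), (50, "25-50%"), (75, "50-75%")]:
        if pct < upper:
            return label
    return "75-100%"
```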
Each chip indicates which tier produced a given field on that product. The same labels are used throughout the admin UI:
| Source key | Tier label |
|---|---|
| pymupdf_text_dict | PyMuPDF Tier A |
| claude_sonnet_vision / claude_spec_vision | Claude Sonnet Tier B |
| catalog_legend | Catalog Legend Tier C |
| chunk_regex | Chunk Regex |
| vision_rollup | Image Vision Rollup |
| ai_text_extraction | AI Text (Stage 0) |
Coverage and source mix are the two best signals for "did this catalog actually parse well?" The bucket chart spots catalogs where average coverage is fine but the long tail is empty; the source chips spot catalogs where one tier silently failed and another is doing all the work. The "Re-run Catalog Knowledge" button is the standard one-click fix when Layer 2 needs to retry.