Unified Product Generation Flow

Complete documentation for all three product generation methods and their unified search architecture.

📋 Table of Contents

  1. Overview
  2. Architecture Diagram
  3. Method Comparison
  4. Production Hardening
  5. Unified Storage
  6. Unified Search
  7. Verification Checklist

Overview

The Material Kai Vision Platform supports three product generation methods, all feeding into a unified storage and search infrastructure:

  1. 📄 PDF Processing - Extract products from PDF catalogs
  2. 🌐 Web Scraping - Scrape products from manufacturer websites
  3. 📋 XML Import - Import products from XML files

Key Principles

✅ Unified Pipeline: All methods use the same AI models and services
✅ Same Quality: Same metadata extraction, chunking, and embeddings
✅ Same Storage: All products stored in the same tables and VECS collections
✅ Same Search: All products searchable via unified multi-vector search
✅ Same Limits: Same concurrency limits and async processing


Architecture Diagram

All three methods converge into a shared storage and search layer. Each method follows a distinct ingestion path but produces the same artifacts:

METHOD 1 (PDF Processing): PDF Upload → PyMuPDF4LLM text extraction → ProductDiscoveryService.discover_products() → Products with metadata → ChunkingService.create_chunks_and_embeddings() → Text Embeddings (Voyage AI 1024D) and Image Extraction → SLIG Embeddings (SigLIP2 cloud 768D × 5 types) + Voyage understanding 1024D. (updated 2026-04)

METHOD 2 (Web Scraping): Firecrawl scraping → Markdown content from scraping_pages → ProductDiscoveryService.discover_products_from_text() → Products with metadata → same ChunkingService for text and the same SLIG image embedding pipeline as PDF processing.

METHOD 3 (XML Import): XML upload → Parse XML / extract products → Direct insert into products table → _queue_text_processing() for async chunking → Text Embeddings and image download from URLs → same SLIG image embedding pipeline as the other methods.

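Unified Storage (VECS Collections): image_slig_embeddings, image_color_embeddings, image_texture_embeddings, image_style_embeddings, image_material_embeddings (768D each), and image_understanding_embeddings (1024D).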

Unified Search: A user query is embedded, then searched in parallel across all 6 collections. Results from PDF, web, and XML sources are merged and ranked together.


Method Comparison

Feature Comparison

| Feature | 📄 PDF Processing | 🌐 Web Scraping | 📋 XML Import |
| --- | --- | --- | --- |
| Product Discovery | ProductDiscoveryService.discover_products() | ProductDiscoveryService.discover_products_from_text() | ✅ Direct creation from XML |
| Text Chunks | ChunkingService.create_chunks_and_embeddings() | ChunkingService.create_chunks_and_embeddings() | _queue_text_processing() (async) |
| Text Embeddings | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) |
| Image Processing | ✅ Extract from PDF pages | ✅ Extract from scraped pages | ✅ Download from URLs |
| SLIG Embeddings | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding |
| VECS Storage | ✅ All 6 collections | ✅ All 6 collections | ✅ All 6 collections |
| Searchable | ✅ Via unified search | ✅ Via unified search | ✅ Via unified search |

Detailed Flow Verification

METHOD 1: PDF Processing

Product discovery uses ProductDiscoveryService.discover_products() with the PDF bytes and markdown text, which creates the products in the database. Text chunks are created via ChunkingService.create_chunks_and_embeddings(), generating 1024D Voyage AI embeddings stored in chunks.text_embedding (updated 2026-04). Visual embeddings are generated by vecs_service.upsert_specialized_embeddings() for each image and written to six VECS collections: image_slig_embeddings, image_color_embeddings, image_texture_embeddings, image_style_embeddings, image_material_embeddings (all 768D halfvec), plus image_understanding_embeddings (1024D Voyage from Qwen3-VL vision_analysis).


METHOD 2: Web Scraping

Product discovery uses ProductDiscoveryService.discover_products_from_text() with the scraped markdown and source_type="web_scraping". Products are created in the database via _create_products_in_database(). Text chunks and embeddings are generated with the same ChunkingService as PDF processing. Image embeddings use the same ImageProcessingService as PDF processing.


METHOD 3: XML Import

Products are created directly from parsed XML fields inserted into the products table. Text processing is queued asynchronously via _queue_text_processing(), which creates chunks in the chunks table and queues embedding generation via async_queue.queue_ai_analysis_jobs(). Image embeddings use the same ImageProcessingService as the other methods.
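
The queue-then-process pattern described for XML import can be sketched with an in-memory asyncio.Queue. This is only an illustration: the helper names (split_into_chunks, fake_embed, queue_text_processing, embedding_worker, import_product) and the fixed-size chunking are assumptions, not the platform's actual _queue_text_processing() or async_queue code.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    product_id: str
    index: int
    text: str
    embedding: Optional[List[float]] = None


def split_into_chunks(product_id: str, text: str, size: int = 40) -> List[Chunk]:
    """Naive fixed-size chunking (assumption; the real chunking logic differs)."""
    return [
        Chunk(product_id, i, text[start:start + size])
        for i, start in enumerate(range(0, len(text), size))
    ]


async def fake_embed(text: str) -> List[float]:
    """Stand-in for the embedding call; returns a dummy 1024D vector."""
    await asyncio.sleep(0)
    return [float(len(text))] * 1024


async def queue_text_processing(queue: asyncio.Queue, product_id: str, text: str) -> List[Chunk]:
    """Create chunks immediately, then enqueue each one for later embedding."""
    chunks = split_into_chunks(product_id, text)
    for chunk in chunks:
        await queue.put(chunk)
    return chunks


async def embedding_worker(queue: asyncio.Queue) -> None:
    """Drain the queue and attach an embedding to each chunk."""
    while True:
        chunk = await queue.get()
        try:
            chunk.embedding = await fake_embed(chunk.text)
        finally:
            queue.task_done()


async def import_product(product_id: str, description: str) -> List[Chunk]:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(embedding_worker(queue))
    chunks = await queue_text_processing(queue, product_id, description)
    await queue.join()  # wait until every queued chunk has been embedded
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return chunks
```

The product row can be inserted before any embedding work starts, which is why the XML path can return quickly while chunk embeddings are filled in by the worker.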


Production Hardening

All three processing methods implement complete production hardening for reliability, monitoring, and debugging.

Source Tracking

Every product, chunk, image, and embedding is tagged with its source for complete data lineage:

| Field | Purpose | Example Values |
| --- | --- | --- |
| source_type | Processing method | 'pdf_processing', 'xml_import', 'web_scraping' |
| source_job_id | Originating job | Job UUID from background_jobs or data_import_jobs |

All three methods insert records into their respective tables (products, document_chunks, document_images, embeddings) with source_type and source_job_id fields populated at write time.
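
As a rough sketch of tagging at write time, a small helper can stamp every row before it is inserted. The helper name with_source and the dict-based rows are hypothetical; only the two field names and the three source_type values come from this document.

```python
from __future__ import annotations

import uuid
from typing import Any, Dict

# The three values listed for source_type in this document.
VALID_SOURCE_TYPES = {"pdf_processing", "xml_import", "web_scraping"}


def with_source(record: Dict[str, Any], source_type: str, source_job_id: str) -> Dict[str, Any]:
    """Return a copy of the record with lineage fields populated."""
    if source_type not in VALID_SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {source_type!r}")
    return {**record, "source_type": source_type, "source_job_id": source_job_id}


# Usage: the same job id is stamped on every row the job writes.
job_id = str(uuid.uuid4())
product_row = with_source({"name": "Blue ceramic tile"}, "xml_import", job_id)
chunk_row = with_source({"content": "Glazed, 20x20 cm"}, "xml_import", job_id)
```

Rejecting unknown source types at the helper keeps lineage values consistent across the three pipelines.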


Heartbeat Monitoring

All processing methods update heartbeat timestamps to detect stuck/crashed jobs:

| Method | Heartbeat Field | Update Frequency | Stuck Threshold |
| --- | --- | --- | --- |
| PDF Processing | last_heartbeat | Every stage | >10 minutes |
| XML Import | last_heartbeat | Every batch (10 products) | >30 minutes |
| Web Scraping | last_heartbeat_at | Every 30 seconds | >5 minutes |

Stuck job detection queries the background_jobs table for records in processing status where last_heartbeat is older than the threshold for that method.
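
The threshold check can be sketched in plain Python over already-fetched job rows. The row shape is an assumption: the sketch assumes each row carries a source_type and normalizes the heartbeat into a single last_heartbeat key, whereas web scraping jobs actually use last_heartbeat_at. The function name find_stuck_jobs is hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Per-method thresholds taken from the heartbeat table above.
STUCK_THRESHOLDS: Dict[str, timedelta] = {
    "pdf_processing": timedelta(minutes=10),
    "xml_import": timedelta(minutes=30),
    "web_scraping": timedelta(minutes=5),
}


def find_stuck_jobs(jobs: List[dict], now: Optional[datetime] = None) -> List[str]:
    """Return ids of jobs still 'processing' whose heartbeat is too old."""
    now = now or datetime.now(timezone.utc)
    stuck = []
    for job in jobs:
        if job["status"] != "processing":
            continue  # finished or failed jobs are never considered stuck
        threshold = STUCK_THRESHOLDS[job["source_type"]]
        if now - job["last_heartbeat"] > threshold:
            stuck.append(job["id"])
    return stuck
```

In a real deployment the same comparison would more likely be pushed into the SQL query against background_jobs; the Python version only makes the per-method thresholds explicit.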


Sentry Error Tracking

All processing methods use Sentry for comprehensive error tracking and performance monitoring:

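| Feature | PDF | XML | Web Scraping |
| --- | --- | --- | --- |
| Transaction Tracking | ✅ | ✅ | ✅ |
| Breadcrumbs | ✅ | ✅ | ✅ |
| Exception Capture | ✅ | ✅ | ✅ |
| Performance Monitoring | ✅ | ✅ | ✅ |
| Error Context | ✅ | ✅ | ✅ |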

Each processing method wraps its main logic in a Sentry transaction (sentry_sdk.start_transaction) tagged with job_id and stage metadata. Breadcrumbs are added for each batch, and exceptions are captured via sentry_sdk.capture_exception() before being re-raised.


Production Hardening Status

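| Feature | PDF | XML | Web Scraping | Status |
| --- | --- | --- | --- | --- |
| Source Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Heartbeat Monitoring | ✅ | ✅ | ✅ | COMPLETE |
| Sentry Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Error Handling | ✅ | ✅ | ✅ | COMPLETE |
| Progress Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Checkpoint Recovery | ✅ | ✅ | ✅ | COMPLETE |
| Auto-Recovery | ✅ | ✅ | ✅ | COMPLETE |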

Unified Storage

All three methods store data in the same tables and VECS collections:

1. PostgreSQL Tables

| Table | Purpose | Used By |
| --- | --- | --- |
| products | Product records | PDF, Web, XML |
| chunks | Text chunks with embeddings | PDF, Web, XML |
| document_images | Image metadata | PDF, Web, XML |
| documents | Source documents | PDF, Web, XML |

2. VECS Collections

All three methods store embeddings in the same VECS collections:

Text Embeddings

| Collection | Dimension | Model | Used By |
| --- | --- | --- | --- |
| chunks.text_embedding | 1024D | Voyage AI (voyage-3.5) | PDF, Web, XML |

Visual Embeddings

| Collection | Dimension | Model | Used By |
| --- | --- | --- | --- |
| image_slig_embeddings | 768D | SigLIP2 via SLIG cloud endpoint | PDF, Web, XML |
| image_color_embeddings | 768D | SLIG (color-focused similarity) | PDF, Web, XML |
| image_texture_embeddings | 768D | SLIG (texture-focused similarity) | PDF, Web, XML |
| image_material_embeddings | 768D | SLIG (material-focused similarity) | PDF, Web, XML |
| image_style_embeddings | 768D | SLIG (style-focused similarity) | PDF, Web, XML |
| image_understanding_embeddings | 1024D | Voyage AI (from Qwen3-VL vision_analysis) | PDF, Web, XML |
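
Because the six visual collections do not all share one dimension, a guard before upsert can catch vectors routed to the wrong collection. The following is a minimal sketch: the collection names and dimensions come from the table above, while validate_embedding is a hypothetical helper and not part of vecs_service.

```python
from __future__ import annotations

from typing import Dict, List

# Expected vector size per visual collection, as listed above.
EXPECTED_DIMENSIONS: Dict[str, int] = {
    "image_slig_embeddings": 768,
    "image_color_embeddings": 768,
    "image_texture_embeddings": 768,
    "image_material_embeddings": 768,
    "image_style_embeddings": 768,
    "image_understanding_embeddings": 1024,
}


def validate_embedding(collection: str, vector: List[float]) -> None:
    """Raise if the collection is unknown or the vector has the wrong size."""
    expected = EXPECTED_DIMENSIONS.get(collection)
    if expected is None:
        raise KeyError(f"unknown collection: {collection}")
    if len(vector) != expected:
        raise ValueError(
            f"{collection} expects {expected} dimensions, got {len(vector)}"
        )
```

A check like this is cheap to run for every image regardless of whether the image came from a PDF page, a scraped page, or an XML image URL.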

Unified Search

All products are searchable via the same unified search service, regardless of source:

Frontend: UnifiedSearchService

The UnifiedSearchService.searchMultiVector() method accepts a query string and workspace ID and returns results spanning all three source types (PDF, web, and XML) transparently.


Backend: unified_search_service.py

The backend search function generates a query embedding, then runs the following searches in parallel:

  1. Semantic text search against the chunks table (covers PDF, Web, and XML sources)
  2. Visual embedding search via vecs_service.search_similar_images() with workspace filter
  3. Color specialized embedding search via vecs_service.search_specialized_embeddings(embedding_type='color', ...)

Additional specialized searches run for texture, material, and style in the same fashion. All results are fused and ranked before being returned as a unified result set containing products from all sources.
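
The fan-out and fusion step can be sketched as follows. The document does not say which fusion method the backend uses, so this example substitutes reciprocal rank fusion (RRF) purely as an assumption; the stub searchers, the function names, and the product ids are all hypothetical stand-ins for the real text and VECS searches.

```python
from __future__ import annotations

import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List

SearchFn = Callable[[str], Awaitable[List[str]]]


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Score each id by the sum of 1 / (k + rank) over all rankings."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, product_id in enumerate(ranking, start=1):
            scores[product_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


async def multi_vector_search(query: str, searchers: Dict[str, SearchFn]) -> List[str]:
    """Run every searcher concurrently, then fuse the ranked lists."""
    rankings = await asyncio.gather(*(fn(query) for fn in searchers.values()))
    return reciprocal_rank_fusion(list(rankings))


def make_stub(results: List[str]) -> SearchFn:
    """Build a fake searcher that returns a fixed ranking."""
    async def search(query: str) -> List[str]:
        await asyncio.sleep(0)
        return results
    return search


searchers = {
    "text": make_stub(["pdf-1", "web-7", "xml-3"]),
    "visual": make_stub(["web-7", "pdf-1"]),
    "color": make_stub(["web-7", "xml-3"]),
}
fused = asyncio.run(multi_vector_search("blue ceramic tiles", searchers))
```

Rank-based fusion sidesteps the problem that similarity scores from 768D and 1024D collections are not directly comparable, which is one reason it is a common choice for multi-vector search.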


Search Flow

User Query: "blue ceramic tiles"
        ↓
Generate Query Embedding (Voyage AI 1024D)
        ↓
┌──────────────────────────────────────┐
│ Multi-Vector Search (Parallel)       │
├──────────────────────────────────────┤
│ 1. Text Search (chunks table)        │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 2. Visual Search (VECS)              │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 3. Color Search (VECS)               │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 4. Texture Search (VECS)             │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 5. Material Search (VECS)            │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 6. Style Search (VECS)               │
│    → Searches: PDF + Web + XML       │
└──────────────────────────────────────┘
        ↓
Combine & Rank Results
        ↓
Return Unified Results (PDF + Web + XML)


Verification Checklist

✅ Product Generation

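| Requirement | PDF | Web | XML | Evidence |
| --- | --- | --- | --- | --- |
| Products Created | ✅ | ✅ | ✅ | products table |
| Chunks Created | ✅ | ✅ | ✅ | chunks table |
| Text Embeddings | ✅ | ✅ | ✅ | chunks.text_embedding (Voyage 1024D) |
| Image Embeddings | ✅ | ✅ | ✅ | VECS collections (SLIG 768D x5 + Voyage understanding 1024D) |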

✅ Unified Storage

| Requirement | Status | Evidence |
| --- | --- | --- |
| Same Products Table | ✅ | All methods insert to products |
| Same Chunks Table | ✅ | All methods insert to chunks |
| Same VECS Collections | ✅ | All methods use same 6 collections |
| Same Embedding Models | ✅ | Voyage AI 1024D text + SLIG 768D visual (updated 2026-04) |

✅ Unified Search

| Requirement | Status | Evidence |
| --- | --- | --- |
| Text Search | ✅ | Searches chunks table (all sources) |
| Visual Search | ✅ | Searches VECS collections (all sources) |
| Multi-Vector Search | ✅ | Combines all search types |
| Cross-Source Results | ✅ | Returns products from PDF + Web + XML |

✅ Async Processing

| Requirement | Status | Evidence |
| --- | --- | --- |
| Fully Async | ✅ | All methods use async/await |
| Same Limits | ✅ | 5 HuggingFace/Qwen, 2 Claude, 10 uploads, 20 SLIG |
| Same Timeouts | ✅ | 300s discovery, 120s AI, 30s downloads |
| Same Services | ✅ | ImageProcessingService, RealEmbeddingsService, AsyncQueueService |
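
One plausible way to enforce shared limits and timeouts is a semaphore per service combined with asyncio.wait_for. The sketch below uses the numbers from the table above; the LimitedRunner class, the dictionary keys, and the fake Claude call are hypothetical and do not reflect how the platform's services are actually wired.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Limits and timeouts quoted in the checklist above.
CONCURRENCY_LIMITS: Dict[str, int] = {"qwen": 5, "claude": 2, "upload": 10, "slig": 20}
TIMEOUTS_SECONDS: Dict[str, int] = {"discovery": 300, "ai": 120, "download": 30}


class LimitedRunner:
    """Caps concurrent calls per service and applies a timeout to each call."""

    def __init__(self, limits: Dict[str, int]) -> None:
        # Create inside a running event loop so semaphores bind to that loop.
        self._semaphores = {name: asyncio.Semaphore(n) for name, n in limits.items()}

    async def run(self, kind: str, call: Callable[[], Awaitable], timeout: float):
        async with self._semaphores[kind]:
            return await asyncio.wait_for(call(), timeout=timeout)


async def demo() -> int:
    """Launch six fake Claude calls and report the peak number in flight."""
    runner = LimitedRunner(CONCURRENCY_LIMITS)
    active = 0
    peak = 0

    async def fake_claude_call() -> None:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1

    await asyncio.gather(*(
        runner.run("claude", fake_claude_call, TIMEOUTS_SECONDS["ai"])
        for _ in range(6)
    ))
    return peak
```

Sharing one runner across the PDF, web, and XML pipelines is what would make the limits global rather than per method.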

Summary

✅ All 3 methods generate products: PDF, Web Scraping, XML Import
✅ All use same AI models: Claude/GPT for discovery, Voyage AI for text, SLIG (SigLIP2 cloud) for images (updated 2026-04)
✅ All create chunks: Text chunks with embeddings
✅ All create embeddings: Text (Voyage 1024D) + Visual (SLIG 768D x5) + Understanding (Voyage 1024D)
✅ All use same storage: PostgreSQL tables + VECS collections
✅ All searchable: Via unified multi-vector search
✅ All fully async: Same concurrency limits and timeout guards

The architecture is unified, consistent, and production-ready! 🚀