Complete documentation for all three product generation methods and their unified search architecture.
📚 Related Documentation:
- Async Processing & Limits - Concurrency limits and async architecture
- PDF Processing Pipeline - PDF processing details
- Web Scraping Integration - Web scraping details
- Data Import System - XML import details
The Material Kai Vision Platform supports three product generation methods, all feeding into a unified storage and search infrastructure:
- ✅ Unified Pipeline: All methods use the same AI models and services
- ✅ Same Quality: Same metadata extraction, chunking, and embeddings
- ✅ Same Storage: All products stored in same tables and VECS collections
- ✅ Same Search: All products searchable via unified multi-vector search
- ✅ Same Limits: Same concurrency limits and async processing
All three methods converge into a shared storage and search layer. Each method follows a distinct ingestion path but produces the same artifacts:
METHOD 1 (PDF Processing): PDF Upload → PyMuPDF4LLM text extraction → ProductDiscoveryService.discover_products() → Products with metadata → ChunkingService.create_chunks_and_embeddings() → Text Embeddings (Voyage AI 1024D) and Image Extraction → SLIG Embeddings (SigLIP2 cloud 768D × 5 types) + Voyage understanding 1024D. (updated 2026-04)
METHOD 2 (Web Scraping): Firecrawl scraping → Markdown content from scraping_pages → ProductDiscoveryService.discover_products_from_text() → Products with metadata → same ChunkingService for text and CLIP pipelines.
METHOD 3 (XML Import): XML upload → Parse XML / extract products → Direct insert into products table → _queue_text_processing() for async chunking → Text Embeddings and image download from URLs → same CLIP pipeline.
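The convergence described above can be sketched as three entry points feeding one shared chunking step. This is a minimal illustrative sketch; the function names (`chunk_and_embed`, `ingest_pdf`, `ingest_web`, `ingest_xml`) are hypothetical stand-ins for the ProductDiscoveryService / ChunkingService pipeline, not the platform's actual API.

```python
# Illustrative sketch of the convergence above: three ingestion entry points,
# one shared chunking step. All names here are hypothetical stand-ins for the
# platform's ProductDiscoveryService / ChunkingService pipeline.

def chunk_and_embed(text: str) -> list[dict]:
    """Shared step: split text into chunks, each with a slot for the
    1024-D Voyage text embedding (stubbed as None here)."""
    return [
        {"content": part.strip(), "text_embedding": None}
        for part in text.split("\n\n")
        if part.strip()
    ]

def ingest_pdf(markdown: str) -> list[dict]:       # METHOD 1: extracted PDF text
    return chunk_and_embed(markdown)

def ingest_web(markdown: str) -> list[dict]:       # METHOD 2: scraped markdown
    return chunk_and_embed(markdown)

def ingest_xml(product_fields: dict) -> list[dict]:  # METHOD 3: parsed XML fields
    return chunk_and_embed("\n\n".join(str(v) for v in product_fields.values()))
```

Whatever the entry point, the output chunks have the same shape, which is what makes the downstream storage and search layers source-agnostic.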
Unified Storage (VECS Collections):
- chunks table (text_embedding 1024D — Voyage AI)
- image_slig_embeddings (768D — primary visual, SLIG)
- image_color_embeddings (768D — text-guided color SLIG)
- image_texture_embeddings (768D — text-guided texture SLIG)
- image_material_embeddings (768D — text-guided material SLIG)
- image_style_embeddings (768D — text-guided style SLIG)
- image_understanding_embeddings (1024D — Voyage AI from Qwen3-VL)

Unified Search: A user query is embedded, then searched in parallel across all 6 collections. Results from PDF, web, and XML sources are merged and ranked together.
| Feature | 📄 PDF Processing | 🌐 Web Scraping | 📋 XML Import |
|---|---|---|---|
| Product Discovery | ✅ ProductDiscoveryService.discover_products() | ✅ ProductDiscoveryService.discover_products_from_text() | ✅ Direct creation from XML |
| Text Chunks | ✅ ChunkingService.create_chunks_and_embeddings() | ✅ ChunkingService.create_chunks_and_embeddings() | ✅ _queue_text_processing() (async) |
| Text Embeddings | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) |
| Image Processing | ✅ Extract from PDF pages | ✅ Extract from scraped pages | ✅ Download from URLs |
| SLIG Embeddings | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding |
| VECS Storage | ✅ All 6 collections | ✅ All 6 collections | ✅ All 6 collections |
| Searchable | ✅ Via unified search | ✅ Via unified search | ✅ Via unified search |
Product discovery uses ProductDiscoveryService.discover_products() with the PDF bytes and markdown text. This creates products in the database. Text chunks are created via ChunkingService.create_chunks_and_embeddings(), generating 1024D Voyage AI embeddings stored in chunks.text_embedding (updated 2026-04). Visual embeddings are generated by vecs_service.upsert_specialized_embeddings() for each image, producing six VECS collections: image_slig_embeddings, image_color_embeddings, image_texture_embeddings, image_style_embeddings, image_material_embeddings (all 768D halfvec), plus image_understanding_embeddings (1024D Voyage from Qwen3-VL vision_analysis).
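The six-collection fan-out of vecs_service.upsert_specialized_embeddings() can be sketched as follows. Only the collection names and dimensions come from this documentation; the record shape and zero vectors are illustrative stand-ins, not the real implementation.

```python
# Illustrative sketch: one record per VECS collection for a single image.
# Collection names and dimensions match the documentation; the record shape
# and zero vectors are stand-ins, not the real vecs_service implementation.

SPECIALIZED_COLLECTIONS = {
    "image_slig_embeddings": 768,           # primary visual (SigLIP2)
    "image_color_embeddings": 768,          # text-guided color
    "image_texture_embeddings": 768,        # text-guided texture
    "image_material_embeddings": 768,       # text-guided material
    "image_style_embeddings": 768,          # text-guided style
    "image_understanding_embeddings": 1024  # Voyage AI from Qwen3-VL analysis
}

def upsert_specialized_embeddings(image_id: str) -> list[dict]:
    """Return one (stubbed) record per collection for the given image."""
    return [
        {"collection": name, "image_id": image_id, "vector": [0.0] * dim}
        for name, dim in SPECIALIZED_COLLECTIONS.items()
    ]
```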
Product discovery uses ProductDiscoveryService.discover_products_from_text() with the scraped markdown and source_type="web_scraping". Products are created in the database via _create_products_in_database(). Text chunks and embeddings are generated with the same ChunkingService as PDF processing. Image embeddings use the same ImageProcessingService as PDF processing.
Products are created directly from parsed XML fields inserted into the products table. Text processing is queued asynchronously via _queue_text_processing(), which creates chunks in the chunks table and queues embedding generation via async_queue.queue_ai_analysis_jobs(). Image embeddings use the same ImageProcessingService as the other methods.
All three processing methods implement complete production hardening for reliability, monitoring, and debugging.
Every product, chunk, image, and embedding is tagged with its source for complete data lineage:
| Field | Purpose | Example Values |
|---|---|---|
| source_type | Processing method | 'pdf_processing', 'xml_import', 'web_scraping' |
| source_job_id | Originating job | Job UUID from background_jobs or data_import_jobs |
All three methods insert records into their respective tables (products, document_chunks, document_images, embeddings) with source_type and source_job_id fields populated at write time.
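A minimal sketch of write-time tagging, assuming plain-dict records: the field names source_type and source_job_id and their allowed values come from the table above, while the record shape and the tag_record helper are hypothetical.

```python
# Minimal sketch of write-time lineage tagging. The field names source_type
# and source_job_id and their allowed values come from the table above; the
# record shape and the tag_record helper are hypothetical.

VALID_SOURCE_TYPES = {"pdf_processing", "web_scraping", "xml_import"}

def tag_record(record: dict, source_type: str, source_job_id: str) -> dict:
    """Return a copy of record with both lineage fields populated."""
    if source_type not in VALID_SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {source_type!r}")
    return {**record, "source_type": source_type, "source_job_id": source_job_id}
```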
All processing methods update heartbeat timestamps to detect stuck/crashed jobs:
| Method | Heartbeat Field | Update Frequency | Stuck Threshold |
|---|---|---|---|
| PDF Processing | last_heartbeat | Every stage | >10 minutes |
| XML Import | last_heartbeat | Every batch (10 products) | >30 minutes |
| Web Scraping | last_heartbeat_at | Every 30 seconds | >5 minutes |
Stuck job detection queries the background_jobs table for records in processing status where last_heartbeat is older than the threshold for that method.
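The detection logic can be sketched with the per-method thresholds from the table above. Job records are plain dicts here; the real check queries the background_jobs table.

```python
from datetime import datetime, timedelta, timezone

# Sketch of stuck-job detection using the thresholds from the table above.
# Jobs are plain dicts here; the real check queries the background_jobs table.

STUCK_THRESHOLDS = {
    "pdf_processing": timedelta(minutes=10),
    "xml_import": timedelta(minutes=30),
    "web_scraping": timedelta(minutes=5),
}

def find_stuck_jobs(jobs: list[dict], now: datetime) -> list[dict]:
    """A job is stuck if it is still 'processing' and its heartbeat is
    older than its method's threshold."""
    return [
        job for job in jobs
        if job["status"] == "processing"
        and now - job["last_heartbeat"] > STUCK_THRESHOLDS[job["method"]]
    ]
```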
All processing methods use Sentry for comprehensive error tracking and performance monitoring:
| Feature | PDF | XML | Web Scraping |
|---|---|---|---|
| Transaction Tracking | ✅ | ✅ | ✅ |
| Breadcrumbs | ✅ | ✅ | ✅ |
| Exception Capture | ✅ | ✅ | ✅ |
| Performance Monitoring | ✅ | ✅ | ✅ |
| Error Context | ✅ | ✅ | ✅ |
Each processing method wraps its main logic in a Sentry transaction (sentry_sdk.start_transaction) tagged with job_id and stage metadata. Breadcrumbs are added for each batch, and exceptions are captured via sentry_sdk.capture_exception() before being re-raised.
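The pattern can be illustrated with a local stand-in so the sketch runs anywhere. SentryStub mimics only the calls named in the text (add_breadcrumb, capture_exception); the real code uses sentry_sdk and wraps the whole job in start_transaction().

```python
# Runnable stand-in for the instrumentation pattern above. SentryStub mimics
# just the calls named in the text (add_breadcrumb, capture_exception); the
# real code uses sentry_sdk and wraps the whole job in start_transaction().

class SentryStub:
    def __init__(self) -> None:
        self.breadcrumbs: list = []
        self.exceptions: list = []

    def add_breadcrumb(self, message: str) -> None:
        self.breadcrumbs.append(message)

    def capture_exception(self, exc: Exception) -> None:
        self.exceptions.append(exc)

def process_batches(batches: list, sentry: SentryStub) -> int:
    """Leave one breadcrumb per batch; capture any failure before re-raising
    so the job still fails loudly (mirrors the pattern described above)."""
    processed = 0
    for i, batch in enumerate(batches):
        sentry.add_breadcrumb(f"processing batch {i} ({len(batch)} items)")
        try:
            processed += len(batch)  # placeholder for the real batch work
        except Exception as exc:
            sentry.capture_exception(exc)
            raise
    return processed
```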
| Feature | PDF | XML | Web Scraping | Status |
|---|---|---|---|---|
| Source Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Heartbeat Monitoring | ✅ | ✅ | ✅ | COMPLETE |
| Sentry Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Error Handling | ✅ | ✅ | ✅ | COMPLETE |
| Progress Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Checkpoint Recovery | ✅ | ✅ | ✅ | COMPLETE |
| Auto-Recovery | ✅ | ✅ | ✅ | COMPLETE |
All three methods store data in the same tables and VECS collections:
| Table | Purpose | Used By |
|---|---|---|
| products | Product records | PDF, Web, XML |
| chunks | Text chunks with embeddings | PDF, Web, XML |
| document_images | Image metadata | PDF, Web, XML |
| documents | Source documents | PDF, Web, XML |
All three methods store embeddings in the same locations: text embeddings in the chunks table, and visual embeddings in the shared VECS collections.
| Storage | Dimension | Model | Used By |
|---|---|---|---|
| chunks.text_embedding | 1024D | Voyage AI (voyage-3.5) | PDF, Web, XML |
| Collection | Dimension | Model | Used By |
|---|---|---|---|
| image_slig_embeddings | 768D | SigLIP2 via SLIG cloud endpoint | PDF, Web, XML |
| image_color_embeddings | 768D | SLIG (color-focused similarity) | PDF, Web, XML |
| image_texture_embeddings | 768D | SLIG (texture-focused similarity) | PDF, Web, XML |
| image_material_embeddings | 768D | SLIG (material-focused similarity) | PDF, Web, XML |
| image_style_embeddings | 768D | SLIG (style-focused similarity) | PDF, Web, XML |
| image_understanding_embeddings | 1024D | Voyage AI (from Qwen3-VL vision_analysis) | PDF, Web, XML |
All products are searchable via the same unified search service, regardless of source:
The UnifiedSearchService.searchMultiVector() method accepts a query string and workspace ID and returns results spanning all three source types (PDF, web, and XML) transparently.
The backend search function generates a query embedding, then runs the following searches in parallel:
- Text search: chunks table (covers PDF, Web, and XML sources)
- Visual search: vecs_service.search_similar_images() with workspace filter
- Color search: vecs_service.search_specialized_embeddings(embedding_type='color', ...)

Additional specialized searches run for texture, material, and style in the same fashion. All results are fused and ranked before being returned as a unified result set containing products from all sources.
```
User Query: "blue ceramic tiles"
        ↓
Generate Query Embedding (Voyage AI 1024D)
        ↓
┌──────────────────────────────────────┐
│   Multi-Vector Search (Parallel)     │
├──────────────────────────────────────┤
│ 1. Text Search (chunks table)        │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 2. Visual Search (VECS)              │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 3. Color Search (VECS)               │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 4. Texture Search (VECS)             │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 5. Material Search (VECS)            │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 6. Style Search (VECS)               │
│    → Searches: PDF + Web + XML       │
└──────────────────────────────────────┘
        ↓
Combine & Rank Results
        ↓
Return Unified Results (PDF + Web + XML)
```
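The parallel fan-out can be sketched with asyncio. The search_one coroutine and the result shape are hypothetical stand-ins for the real chunks-table and vecs_service searches; only the six search types come from the documentation above.

```python
import asyncio

# Sketch of the parallel fan-out shown in the diagram. search_one is a
# hypothetical stand-in for the real chunks-table and vecs_service searches;
# only the six search types come from the documentation above.

SEARCH_TYPES = ["text", "visual", "color", "texture", "material", "style"]

async def search_one(search_type: str, query_embedding: list[float]) -> list[dict]:
    await asyncio.sleep(0)  # placeholder for an I/O-bound vector search
    return [{"search_type": search_type, "score": 0.9}]

async def multi_vector_search(query_embedding: list[float]) -> list[dict]:
    """Run all six searches concurrently, then merge and rank by score."""
    batches = await asyncio.gather(
        *(search_one(t, query_embedding) for t in SEARCH_TYPES)
    )
    merged = [hit for batch in batches for hit in batch]
    return sorted(merged, key=lambda h: h["score"], reverse=True)

hits = asyncio.run(multi_vector_search([0.0] * 1024))
```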
| Requirement | PDF | Web | XML | Evidence |
|---|---|---|---|---|
| Products Created | ✅ | ✅ | ✅ | products table |
| Chunks Created | ✅ | ✅ | ✅ | chunks table |
| Text Embeddings | ✅ | ✅ | ✅ | chunks.text_embedding (Voyage 1024D) |
| Image Embeddings | ✅ | ✅ | ✅ | VECS collections (SLIG 768D x5 + Voyage understanding 1024D) |
| Requirement | Status | Evidence |
|---|---|---|
| Same Products Table | ✅ | All methods insert to products |
| Same Chunks Table | ✅ | All methods insert to chunks |
| Same VECS Collections | ✅ | All methods use same 6 collections |
| Same Embedding Models | ✅ | Voyage AI 1024D text + SLIG 768D visual (updated 2026-04) |
| Requirement | Status | Evidence |
|---|---|---|
| Text Search | ✅ | Searches chunks table (all sources) |
| Visual Search | ✅ | Searches VECS collections (all sources) |
| Multi-Vector Search | ✅ | Combines all search types |
| Cross-Source Results | ✅ | Returns products from PDF + Web + XML |
| Requirement | Status | Evidence |
|---|---|---|
| Fully Async | ✅ | All methods use async/await |
| Same Limits | ✅ | 5 HuggingFace/Qwen, 2 Claude, 10 uploads, 20 SLIG |
| Same Timeouts | ✅ | 300s discovery, 120s AI, 30s downloads |
| Same Services | ✅ | ImageProcessingService, RealEmbeddingsService, AsyncQueueService |
- ✅ All 3 methods generate products: PDF, Web Scraping, XML Import
- ✅ All use same AI models: Claude/GPT for discovery, Voyage AI for text, SLIG (SigLIP2 cloud) for images (updated 2026-04)
- ✅ All create chunks: Text chunks with embeddings
- ✅ All create embeddings: Text (Voyage 1024D) + Visual (SLIG 768D x5) + Understanding (Voyage 1024D)
- ✅ All use same storage: PostgreSQL tables + VECS collections
- ✅ All searchable: Via unified multi-vector search
- ✅ All fully async: Same concurrency limits and timeout guards
The architecture is unified, consistent, and production-ready! 🚀