Complete documentation for all three product generation methods and their unified search architecture.
📚 Related Documentation:
- Async Processing & Limits - Concurrency limits and async architecture
- PDF Processing Pipeline - PDF processing details
- Web Scraping Integration - Web scraping details
- Data Import System - XML import details
The Material Kai Vision Platform supports three product generation methods, all feeding into a unified storage and search infrastructure:
- ✅ Unified Pipeline: All methods use the same AI models and services
- ✅ Same Quality: Same metadata extraction, chunking, and embeddings
- ✅ Same Storage: All products stored in same tables and VECS collections
- ✅ Same Search: All products searchable via unified multi-vector search
- ✅ Same Limits: Same concurrency limits and async processing
All three methods converge into a shared storage and search layer. Each method follows a distinct ingestion path but produces the same artifacts:
METHOD 1 (PDF Processing): PDF Upload → PyMuPDF4LLM text extraction → ProductDiscoveryService.discover_products() → Products with metadata → ChunkingService.create_chunks_and_embeddings() → Text Embeddings (Voyage AI 1024D) and Image Extraction → SLIG Embeddings (SigLIP2 cloud 768D × 5 types) + Voyage understanding 1024D. (updated 2026-04)
METHOD 2 (Web Scraping): Firecrawl scraping → Markdown content from scraping_pages → ProductDiscoveryService.discover_products_from_text() → Products with metadata → same ChunkingService for text and CLIP pipelines.
METHOD 3 (XML Import): XML upload → Parse XML / extract products → Direct insert into products table → _queue_text_processing() for async chunking → Text Embeddings and image download from URLs → same CLIP pipeline.
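The convergence described above can be sketched as three entry points feeding one shared chunking step. This is a minimal illustrative sketch; the function names (`chunk_and_embed`, `ingest_pdf`, `ingest_web`, `ingest_xml`) are hypothetical stand-ins for the ProductDiscoveryService / ChunkingService pipeline, not the platform's actual API.

```python
# Illustrative sketch of the convergence above: three ingestion entry points,
# one shared chunking step. All names here are hypothetical stand-ins for the
# platform's ProductDiscoveryService / ChunkingService pipeline.

def chunk_and_embed(text: str) -> list[dict]:
    """Shared step: split text into chunks, each with a slot for the
    1024-D Voyage text embedding (stubbed as None here)."""
    return [
        {"content": part.strip(), "text_embedding": None}
        for part in text.split("\n\n")
        if part.strip()
    ]

def ingest_pdf(markdown: str) -> list[dict]:       # METHOD 1: extracted PDF text
    return chunk_and_embed(markdown)

def ingest_web(markdown: str) -> list[dict]:       # METHOD 2: scraped markdown
    return chunk_and_embed(markdown)

def ingest_xml(product_fields: dict) -> list[dict]:  # METHOD 3: parsed XML fields
    return chunk_and_embed("\n\n".join(str(v) for v in product_fields.values()))
```

Whatever the entry point, the output chunks have the same shape, which is what makes the downstream storage and search layers source-agnostic.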
Unified Storage (VECS Collections):
- chunks table (text_embedding 1024D — Voyage AI)
- image_slig_embeddings (768D — primary visual, SLIG)
- image_color_embeddings (768D — text-guided color SLIG)
- image_texture_embeddings (768D — text-guided texture SLIG)
- image_material_embeddings (768D — text-guided material SLIG)
- image_style_embeddings (768D — text-guided style SLIG)
- image_understanding_embeddings (1024D — Voyage AI from Qwen3-VL)

Unified Search: A user query is embedded, then searched in parallel across all 6 collections. Results from PDF, web, and XML sources are merged and ranked together.
| Feature | 📄 PDF Processing | 🌐 Web Scraping | 📋 XML Import |
|---|---|---|---|
| Product Discovery | ✅ ProductDiscoveryService.discover_products() | ✅ ProductDiscoveryService.discover_products_from_text() | ✅ Direct creation from XML |
| Text Chunks | ✅ ChunkingService.create_chunks_and_embeddings() | ✅ ChunkingService.create_chunks_and_embeddings() | ✅ _queue_text_processing() (async) |
| Text Embeddings | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) | ✅ Voyage AI 1024D → chunks.text_embedding (key text_1024) |
| Image Processing | ✅ Extract from PDF pages | ✅ Extract from scraped pages | ✅ Download from URLs |
| SLIG Embeddings | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding | ✅ SigLIP2 768D x5 types + Voyage 1024D understanding |
| VECS Storage | ✅ All 6 collections | ✅ All 6 collections | ✅ All 6 collections |
| Searchable | ✅ Via unified search | ✅ Via unified search | ✅ Via unified search |
Product discovery uses ProductDiscoveryService.discover_products() with the PDF bytes and markdown text. This creates products in the database. Text chunks are created via ChunkingService.create_chunks_and_embeddings(), generating 1024D Voyage AI embeddings stored in chunks.text_embedding (updated 2026-04). Visual embeddings are generated by vecs_service.upsert_specialized_embeddings() for each image, producing six VECS collections: image_slig_embeddings, image_color_embeddings, image_texture_embeddings, image_style_embeddings, image_material_embeddings (all 768D halfvec), plus image_understanding_embeddings (1024D Voyage from Qwen3-VL vision_analysis).
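The six-collection fan-out of vecs_service.upsert_specialized_embeddings() can be sketched as follows. Only the collection names and dimensions come from this documentation; the record shape and zero vectors are illustrative stand-ins, not the real implementation.

```python
# Illustrative sketch: one record per VECS collection for a single image.
# Collection names and dimensions match the documentation; the record shape
# and zero vectors are stand-ins, not the real vecs_service implementation.

SPECIALIZED_COLLECTIONS = {
    "image_slig_embeddings": 768,           # primary visual (SigLIP2)
    "image_color_embeddings": 768,          # text-guided color
    "image_texture_embeddings": 768,        # text-guided texture
    "image_material_embeddings": 768,       # text-guided material
    "image_style_embeddings": 768,          # text-guided style
    "image_understanding_embeddings": 1024  # Voyage AI from Qwen3-VL analysis
}

def upsert_specialized_embeddings(image_id: str) -> list[dict]:
    """Return one (stubbed) record per collection for the given image."""
    return [
        {"collection": name, "image_id": image_id, "vector": [0.0] * dim}
        for name, dim in SPECIALIZED_COLLECTIONS.items()
    ]
```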
Product discovery uses ProductDiscoveryService.discover_products_from_text() with the scraped markdown and source_type="web_scraping". Products are created in the database via _create_products_in_database(). Text chunks and embeddings are generated with the same ChunkingService as PDF processing. Image embeddings use the same ImageProcessingService as PDF processing.
Products are created directly from parsed XML fields inserted into the products table. Text processing is queued asynchronously via _queue_text_processing(), which creates chunks in the chunks table and queues embedding generation via async_queue.queue_ai_analysis_jobs(). Image embeddings use the same ImageProcessingService as the other methods.
All three processing methods implement complete production hardening for reliability, monitoring, and debugging.
Every product, chunk, image, and embedding is tagged with its source for complete data lineage:
| Field | Purpose | Example Values |
|---|---|---|
| source_type | Processing method | 'pdf_processing', 'xml_import', 'web_scraping' |
| source_job_id | Originating job | Job UUID from background_jobs or data_import_jobs |
All three methods insert records into their respective tables (products, document_chunks, document_images, embeddings) with source_type and source_job_id fields populated at write time.
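A minimal sketch of write-time tagging, assuming plain-dict records: the field names source_type and source_job_id and their allowed values come from the table above, while the record shape and the tag_record helper are hypothetical.

```python
# Minimal sketch of write-time lineage tagging. The field names source_type
# and source_job_id and their allowed values come from the table above; the
# record shape and the tag_record helper are hypothetical.

VALID_SOURCE_TYPES = {"pdf_processing", "web_scraping", "xml_import"}

def tag_record(record: dict, source_type: str, source_job_id: str) -> dict:
    """Return a copy of record with both lineage fields populated."""
    if source_type not in VALID_SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {source_type!r}")
    return {**record, "source_type": source_type, "source_job_id": source_job_id}
```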
All processing methods update heartbeat timestamps to detect stuck/crashed jobs:
| Method | Heartbeat Field | Update Frequency | Stuck Threshold |
|---|---|---|---|
| PDF Processing | last_heartbeat | Every stage | >10 minutes |
| XML Import | last_heartbeat | Every batch (10 products) | >30 minutes |
| Web Scraping | last_heartbeat_at | Every 30 seconds | >5 minutes |
Stuck job detection queries the background_jobs table for records in processing status where last_heartbeat is older than the threshold for that method.
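The detection logic can be sketched with the per-method thresholds from the table above. Job records are plain dicts here; the real check queries the background_jobs table.

```python
from datetime import datetime, timedelta, timezone

# Sketch of stuck-job detection using the thresholds from the table above.
# Jobs are plain dicts here; the real check queries the background_jobs table.

STUCK_THRESHOLDS = {
    "pdf_processing": timedelta(minutes=10),
    "xml_import": timedelta(minutes=30),
    "web_scraping": timedelta(minutes=5),
}

def find_stuck_jobs(jobs: list[dict], now: datetime) -> list[dict]:
    """A job is stuck if it is still 'processing' and its heartbeat is
    older than its method's threshold."""
    return [
        job for job in jobs
        if job["status"] == "processing"
        and now - job["last_heartbeat"] > STUCK_THRESHOLDS[job["method"]]
    ]
```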
All processing methods use Sentry for comprehensive error tracking and performance monitoring:
| Feature | PDF | XML | Web Scraping |
|---|---|---|---|
| Transaction Tracking | ✅ | ✅ | ✅ |
| Breadcrumbs | ✅ | ✅ | ✅ |
| Exception Capture | ✅ | ✅ | ✅ |
| Performance Monitoring | ✅ | ✅ | ✅ |
| Error Context | ✅ | ✅ | ✅ |
Each processing method wraps its main logic in a Sentry transaction (sentry_sdk.start_transaction) tagged with job_id and stage metadata. Breadcrumbs are added for each batch, and exceptions are captured via sentry_sdk.capture_exception() before being re-raised.
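The pattern can be illustrated with a local stand-in so the sketch runs anywhere. SentryStub mimics only the calls named in the text (add_breadcrumb, capture_exception); the real code uses sentry_sdk and wraps the whole job in start_transaction().

```python
# Runnable stand-in for the instrumentation pattern above. SentryStub mimics
# just the calls named in the text (add_breadcrumb, capture_exception); the
# real code uses sentry_sdk and wraps the whole job in start_transaction().

class SentryStub:
    def __init__(self) -> None:
        self.breadcrumbs: list = []
        self.exceptions: list = []

    def add_breadcrumb(self, message: str) -> None:
        self.breadcrumbs.append(message)

    def capture_exception(self, exc: Exception) -> None:
        self.exceptions.append(exc)

def process_batches(batches: list, sentry: SentryStub) -> int:
    """Leave one breadcrumb per batch; capture any failure before re-raising
    so the job still fails loudly (mirrors the pattern described above)."""
    processed = 0
    for i, batch in enumerate(batches):
        sentry.add_breadcrumb(f"processing batch {i} ({len(batch)} items)")
        try:
            processed += len(batch)  # placeholder for the real batch work
        except Exception as exc:
            sentry.capture_exception(exc)
            raise
    return processed
```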
| Feature | PDF | XML | Web Scraping | Status |
|---|---|---|---|---|
| Source Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Heartbeat Monitoring | ✅ | ✅ | ✅ | COMPLETE |
| Sentry Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Error Handling | ✅ | ✅ | ✅ | COMPLETE |
| Progress Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Checkpoint Recovery | ✅ | ✅ | ✅ | COMPLETE |
| Auto-Recovery | ✅ | ✅ | ✅ | COMPLETE |
All three methods store data in the same tables and VECS collections:
| Table | Purpose | Used By |
|---|---|---|
| products | Product records | PDF, Web, XML |
| chunks | Text chunks with embeddings | PDF, Web, XML |
| document_images | Image metadata | PDF, Web, XML |
| documents | Source documents | PDF, Web, XML |
All three methods store embeddings in the same locations: text embeddings in the chunks table, and visual embeddings in the shared VECS collections.
| Storage | Dimension | Model | Used By |
|---|---|---|---|
| chunks.text_embedding | 1024D | Voyage AI (voyage-3.5) | PDF, Web, XML |
| Collection | Dimension | Model | Used By |
|---|---|---|---|
| image_slig_embeddings | 768D | SigLIP2 via SLIG cloud endpoint | PDF, Web, XML |
| image_color_embeddings | 768D | SLIG (color-focused similarity) | PDF, Web, XML |
| image_texture_embeddings | 768D | SLIG (texture-focused similarity) | PDF, Web, XML |
| image_material_embeddings | 768D | SLIG (material-focused similarity) | PDF, Web, XML |
| image_style_embeddings | 768D | SLIG (style-focused similarity) | PDF, Web, XML |
| image_understanding_embeddings | 1024D | Voyage AI (from Qwen3-VL vision_analysis) | PDF, Web, XML |
All products are searchable via the same unified search service, regardless of source:
The UnifiedSearchService.searchMultiVector() method accepts a query string and workspace ID and returns results spanning all three source types (PDF, web, and XML) transparently.
The backend search function generates a query embedding, then runs the following searches in parallel:
- Text search: chunks table (covers PDF, Web, and XML sources)
- Visual search: vecs_service.search_similar_images() with workspace filter
- Color search: vecs_service.search_specialized_embeddings(embedding_type='color', ...)

Additional specialized searches run for texture, material, and style in the same fashion. All results are fused and ranked before being returned as a unified result set containing products from all sources.
```
User Query: "blue ceramic tiles"
        ↓
Generate Query Embedding (Voyage AI 1024D)
        ↓
┌──────────────────────────────────────┐
│   Multi-Vector Search (Parallel)     │
├──────────────────────────────────────┤
│ 1. Text Search (chunks table)        │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 2. Visual Search (VECS)              │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 3. Color Search (VECS)               │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 4. Texture Search (VECS)             │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 5. Material Search (VECS)            │
│    → Searches: PDF + Web + XML       │
│                                      │
│ 6. Style Search (VECS)               │
│    → Searches: PDF + Web + XML       │
└──────────────────────────────────────┘
        ↓
Combine & Rank Results
        ↓
Return Unified Results (PDF + Web + XML)
```
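The parallel fan-out can be sketched with asyncio. The search_one coroutine and the result shape are hypothetical stand-ins for the real chunks-table and vecs_service searches; only the six search types come from the documentation above.

```python
import asyncio

# Sketch of the parallel fan-out shown in the diagram. search_one is a
# hypothetical stand-in for the real chunks-table and vecs_service searches;
# only the six search types come from the documentation above.

SEARCH_TYPES = ["text", "visual", "color", "texture", "material", "style"]

async def search_one(search_type: str, query_embedding: list[float]) -> list[dict]:
    await asyncio.sleep(0)  # placeholder for an I/O-bound vector search
    return [{"search_type": search_type, "score": 0.9}]

async def multi_vector_search(query_embedding: list[float]) -> list[dict]:
    """Run all six searches concurrently, then merge and rank by score."""
    batches = await asyncio.gather(
        *(search_one(t, query_embedding) for t in SEARCH_TYPES)
    )
    merged = [hit for batch in batches for hit in batch]
    return sorted(merged, key=lambda h: h["score"], reverse=True)

hits = asyncio.run(multi_vector_search([0.0] * 1024))
```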
| Requirement | PDF | Web | XML | Evidence |
|---|---|---|---|---|
| Products Created | ✅ | ✅ | ✅ | products table |
| Chunks Created | ✅ | ✅ | ✅ | chunks table |
| Text Embeddings | ✅ | ✅ | ✅ | chunks.text_embedding (Voyage 1024D) |
| Image Embeddings | ✅ | ✅ | ✅ | VECS collections (SLIG 768D x5 + Voyage understanding 1024D) |
| Requirement | Status | Evidence |
|---|---|---|
| Same Products Table | ✅ | All methods insert to products |
| Same Chunks Table | ✅ | All methods insert to chunks |
| Same VECS Collections | ✅ | All methods use same 6 collections |
| Same Embedding Models | ✅ | Voyage AI 1024D text + SLIG 768D visual (updated 2026-04) |
| Requirement | Status | Evidence |
|---|---|---|
| Text Search | ✅ | Searches chunks table (all sources) |
| Visual Search | ✅ | Searches VECS collections (all sources) |
| Multi-Vector Search | ✅ | Combines all search types |
| Cross-Source Results | ✅ | Returns products from PDF + Web + XML |
| Requirement | Status | Evidence |
|---|---|---|
| Fully Async | ✅ | All methods use async/await |
| Same Limits | ✅ | 5 HuggingFace/Qwen, 2 Claude, 10 uploads, 20 SLIG |
| Same Timeouts | ✅ | 300s discovery, 120s AI, 30s downloads |
| Same Services | ✅ | ImageProcessingService, RealEmbeddingsService, AsyncQueueService |
- ✅ All 3 methods generate products: PDF, Web Scraping, XML Import
- ✅ All use same AI models: Claude/GPT for discovery, Voyage AI for text, SLIG (SigLIP2 cloud) for images (updated 2026-04)
- ✅ All create chunks: Text chunks with embeddings
- ✅ All create embeddings: Text (Voyage 1024D) + Visual (SLIG 768D x5) + Understanding (Voyage 1024D)
- ✅ All use same storage: PostgreSQL tables + VECS collections
- ✅ All searchable: Via unified multi-vector search
- ✅ All fully async: Same concurrency limits and timeout guards
The architecture is unified, consistent, and production-ready! 🚀