MIVAA Platform uses 7 AI models across 5 providers, each serving a different purpose:
| Provider | Models Used | Primary Purpose |
|---|---|---|
| Google (HuggingFace) | SigLIP2 ViT-SO400M (SLIG) | Visual embeddings (768D) - Cloud endpoint |
| Voyage AI | voyage-3.5 | Text embeddings (1024D) - Primary for semantic search |
| OpenAI | GPT-4o, GPT-5 | Chat, product discovery (OpenAI text embeddings retired 2026-04 — Voyage AI is the sole text embedder) |
| Anthropic | Claude Sonnet 4.5, Claude Haiku 4.5 | Vision analysis, validation, agents |
| Qwen (HuggingFace) | Qwen3-VL-32B-Instruct | Image analysis, OCR, material detection - Cloud endpoint |
┌─────────────────────────────────────────────────────────────────────────┐
│                         PDF UPLOAD & PROCESSING                         │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 1: Product Discovery (BEFORE extraction)                          │
│ Model: Claude Sonnet 4.5 OR GPT-5                                       │
│ Purpose: Identify products, count pages, map image-to-product           │
│ Input: PDF pages (images)                                               │
│ Output: Product list with page ranges                                   │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 2: Image Extraction & OCR Filtering                               │
│ Model: SLIG (SigLIP2 cloud endpoint, 768D)                              │
│ Purpose: Filter images - only OCR technical specs, skip lifestyle       │
│ Input: Extracted images                                                 │
│ Output: Filtered images for OCR processing                              │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 3: Image Analysis (Primary)                                       │
│ Model: Qwen3-VL-32B-Instruct (HuggingFace Endpoint)                     │
│ Endpoint: https://gbz6krk3i2is85b0.us-east-1.aws.endpoints.huggingface.cloud │
│ Service: mh-qwen332binstruct (namespace: basiliskan)                    │
│ Purpose: Detailed material analysis, color detection, texture           │
│ Input: Product images                                                   │
│ Output: Material properties, colors, textures, quality scores           │
│ Why: State-of-the-art vision-language model, superior OCR, cloud-based  │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 4: Image Analysis (Validation - Optional)                         │
│ Model: Claude 4.5 Sonnet Vision                                         │
│ Purpose: Validate low-quality Qwen results, enrich metadata             │
│ Input: Images with quality_score < 0.7                                  │
│ Output: Enhanced analysis, validation                                   │
│ Why: Higher accuracy, better reasoning, used only when needed           │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 5: Visual Embeddings (5 types) - 100% CLOUD                       │
│ Model: SLIG (SigLIP2 ViT-SO400M) - HuggingFace Endpoint                 │
│ Endpoint: https://xxxxxxxx.us-east-1.aws.endpoints.huggingface.cloud    │
│ Service: mh-siglip2 (namespace: basiliskan)                             │
│ Purpose: Generate 5 specialized 768D embeddings per image               │
│ Types:                                                                  │
│   1. Visual (general appearance) - image_embedding mode                 │
│   2. Color (color palette) - text_embedding mode                        │
│   3. Texture (surface patterns) - text_embedding mode                   │
│   4. Style (design aesthetic) - text_embedding mode                     │
│   5. Material (material type) - text_embedding mode                     │
│ Why: Cloud-based, auto-pause enabled, 0GB local RAM, superior quality   │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 6: Text Embeddings (updated 2026-04)                              │
│ Model: Voyage AI voyage-3.5 (sole provider)                             │
│ Purpose: Generate 1024D embeddings for text chunks                      │
│ Input: Product descriptions, specifications, chunk text                 │
│ Output: 1024D text embeddings (dict key: text_1024)                     │
│ Input Types: "document" for indexing, "query" for search                │
│ Why: Superior quality, optimized for retrieval, $0.06/1M tokens         │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│                            STORAGE (Supabase)                           │
│ - Products table (metadata)                                             │
│ - VECS Collections (5x 768D visual + 1x 1024D understanding per image)  │
│   • image_slig_embeddings (768D - primary visual, SLIG)                 │
│   • image_color_embeddings (768D - text-guided color SLIG)              │
│   • image_texture_embeddings (768D - text-guided texture SLIG)          │
│   • image_material_embeddings (768D - text-guided material SLIG)        │
│   • image_style_embeddings (768D - text-guided style SLIG)              │
│   • image_understanding_embeddings (1024D - Voyage from Qwen3-VL)       │
│ - Chunks table (1024D text embeddings - Voyage AI)                      │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│                       USER SEARCH & AGENT QUERIES                       │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│ SEARCH: Direct Vector DB RAG (Claude 4.5 + Multi-Vector)                │
│ Models:                                                                 │
│ - Text Embeddings: Voyage AI voyage-3.5 (1024D)                         │
│ - Visual Embeddings: 5x SLIG specialized (768D each)                    │
│   • Visual, Color, Texture, Material, Style                             │
│ - LLM: Claude Sonnet 4.5 (200K context)                                 │
│ Purpose: Multi-vector search + intelligent synthesis                    │
│ Why: Direct vector DB queries, no intermediate indexing layer           │
└─────────────────────────────────────────────────────────────────────────┘
                                     ↓
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENTS: Mastra Framework (Agent Hub)                                    │
│ Models Available:                                                       │
│ - Claude Sonnet 4.5 (default for agents)                                │
│ - Claude Haiku 4.5 (fast responses)                                     │
│ - GPT-5 (advanced reasoning)                                            │
│ - Qwen3-VL 17B (cost-effective)                                         │
│ Purpose: Conversational AI, material search, recommendations            │
│ Why: Mastra provides agent orchestration, tool calling                  │
└─────────────────────────────────────────────────────────────────────────┘
File: mivaa-pdf-extractor/app/services/embeddings/slig_client.py
Cloud-Only Architecture (HuggingFace Inference Endpoint): The SLIG client supports 4 modes — zero_shot, image_embedding, text_embedding, and similarity. For general visual embeddings it uses image_embedding mode to retrieve a 768D vector. For specialized embeddings (color, texture, material, style), it uses the similarity mode: it obtains the base image embedding, scores it against a text prompt, retrieves the text embedding, and blends the two with weighted averaging before normalizing to a unit vector.
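The blending step for specialized embeddings can be sketched as follows. This is a minimal illustration of the weighted-average-then-normalize approach described above, not the production code; the function name and the 0.35 text weight are assumptions for the example.

```python
import numpy as np

def blend_specialized_embedding(image_emb, text_emb, text_weight=0.35):
    """Blend a base SLIG image embedding with a prompt-guided text embedding
    (e.g. for color, texture, material, or style), then L2-normalize so the
    result is a unit vector suitable for cosine-similarity search.

    text_weight is illustrative, not the production setting."""
    image_emb = np.asarray(image_emb, dtype=np.float32)
    text_emb = np.asarray(text_emb, dtype=np.float32)
    blended = (1.0 - text_weight) * image_emb + text_weight * text_emb
    norm = np.linalg.norm(blended)
    return (blended / norm).tolist() if norm > 0 else blended.tolist()
```

Normalizing to unit length keeps distances comparable across the five specialized collections, since each stores 768D vectors queried with the same metric.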
Benefits:
Purpose:
Impact on Flow:
Cost: Free (Hugging Face)
Speed: 150-400ms per image
Output: 768D numpy array → normalized → list
File: mivaa-pdf-extractor/app/services/real_embeddings_service.py
The service calls Voyage AI's embeddings endpoint with the model voyage-3.5. This produces a 1024D text embedding used for chunk indexing, product text, and semantic search. OpenAI text-embedding-3-small (1536D) was retired in 2026-04 — Voyage is now the sole text embedder, and the embedding dict key is text_1024 (previously text_1536).
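A sketch of the request payload, following Voyage AI's public embeddings API. The builder function is ours for illustration; the field names (`model`, `input`, `input_type`) match Voyage's documented schema.

```python
def build_voyage_request(texts, input_type="document"):
    """Build the JSON payload for Voyage AI's /v1/embeddings endpoint.

    input_type must be "document" when indexing chunks and "query" at
    search time -- Voyage encodes the two asymmetrically for better
    retrieval quality."""
    assert input_type in ("document", "query")
    return {
        "model": "voyage-3.5",
        "input": texts,            # list of strings, batched per request
        "input_type": input_type,  # "document" for indexing, "query" for search
    }
```

The response contains one 1024D vector per input string, which the service stores under the `text_1024` key.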
Purpose:
Impact on Flow:
Cost: $0.06 per 1M tokens
Speed: 100-300ms
Output: 1024D list
File: mivaa-pdf-extractor/app/services/real_image_analysis_service.py
Requests are sent to the Together AI chat completions endpoint using the Qwen/Qwen3-VL-8B-Instruct model. The message structure requires the text prompt to appear before the image content block — this ordering is critical for Qwen models. The image is provided as a base64-encoded data URL.
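The text-before-image ordering can be made explicit with a small message builder. This is a hedged sketch in the OpenAI-compatible chat format that Together AI accepts; the function name is ours.

```python
import base64

def build_qwen_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a chat-completions user message for Qwen vision models.

    The text block MUST precede the image block -- as noted above, this
    ordering is critical for Qwen. The image is passed as a base64 data URL."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},                       # prompt first
            {"type": "image_url", "image_url": {"url": data_url}},  # image second
        ],
    }
```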
Purpose:
Impact on Flow:
Why Qwen:
Cost: $0.18 per 1M input tokens, $0.18 per 1M output tokens
Speed: 2-5 seconds per image
Output: JSON with material properties, colors, quality scores
File: mivaa-pdf-extractor/app/services/real_image_analysis_service.py
Uses the Anthropic Python client to call claude-sonnet-4-5 with max_tokens=4096. The message includes an image block (base64 source, JPEG media type) followed by a text validation prompt.
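The shape of that call can be sketched as keyword arguments for `anthropic.Anthropic().messages.create()`. The builder function is ours for illustration; the image-block-then-text-block structure matches Anthropic's Messages API.

```python
import base64

def build_claude_validation_request(image_bytes: bytes, prompt: str) -> dict:
    """Keyword arguments for the Anthropic Messages API call described above:
    an image block (base64 source, JPEG media type) followed by the text
    validation prompt, with max_tokens=4096."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": base64.b64encode(image_bytes).decode()}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```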
Purpose:
Impact on Flow:
Why Claude:
Cost: $3.00 per 1M input tokens, $15.00 per 1M output tokens
Speed: 3-8 seconds per image
Output: JSON with enhanced analysis
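Because Claude validation is roughly an order of magnitude more expensive per image than Qwen analysis, the pipeline only invokes it for low-confidence results. A minimal sketch of that routing rule (the function name is ours; the 0.7 threshold comes from the pipeline description):

```python
QUALITY_THRESHOLD = 0.7  # images scoring below this get Claude validation

def needs_claude_validation(qwen_result: dict) -> bool:
    """Route a Qwen analysis result: only images with quality_score < 0.7
    incur the more expensive Claude Sonnet 4.5 validation pass."""
    return qwen_result.get("quality_score", 0.0) < QUALITY_THRESHOLD
```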
Files:
- mivaa-pdf-extractor/app/services/product_discovery_service.py
- mivaa-pdf-extractor/app/services/rag_service.py (Direct Vector DB)

GPT-5 is used for product discovery via the OpenAI chat completions endpoint. GPT-4o is available as the synthesis LLM for Direct Vector DB RAG.
Purpose:
Impact on Flow:
Cost: GPT-4o: $2.50/$10.00 per 1M tokens (input/output); GPT-5: TBD
Speed: 2-6 seconds
Output: Text responses, JSON
File: src/components/AI/AgentHub.tsx
The frontend invokes the agent-chat Supabase Edge Function with model: 'anthropic/claude-haiku-4-20250514' for fast agent responses.
Purpose:
Impact on Flow:
Cost: $0.25 per 1M input tokens, $1.25 per 1M output tokens
Speed: 1-3 seconds
Output: Text responses
File: mivaa-pdf-extractor/app/services/rag_service.py
The RAG service uses 7 specialized embedding collections for multi-vector search (all halfvec in VECS): image_slig_embeddings (visual, 768D), image_color_embeddings (768D), image_texture_embeddings (768D), image_style_embeddings (768D), image_material_embeddings (768D), image_understanding_embeddings (1024D, Voyage from Qwen3-VL vision_analysis), plus text (Voyage AI 3.5 1024D). These are queried in parallel for maximum retrieval accuracy. Legacy 1152D SigLIP-SO400M and CLIP 512D collections were dropped 2026-04.
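The parallel fan-out can be sketched with `asyncio.gather`. The collection names below come from the list above; `query_collection` is a hypothetical stand-in for the real VECS similarity query, so this shows the orchestration pattern, not the production implementation.

```python
import asyncio

# Collection names from the RAG service's multi-vector search.
VISUAL_COLLECTIONS = [
    "image_slig_embeddings", "image_color_embeddings",
    "image_texture_embeddings", "image_style_embeddings",
    "image_material_embeddings",
]

async def query_collection(name, vector, k=10):
    """Hypothetical stand-in for one VECS similarity query."""
    await asyncio.sleep(0)  # real call: a vecs collection query with limit=k
    return []               # real call returns (id, distance) matches

async def multi_vector_search(visual_vec, text_vec, understanding_vec, k=10):
    """Fan out one query per collection and await them all in parallel,
    so total latency tracks the slowest collection rather than the sum."""
    tasks = [query_collection(c, visual_vec, k) for c in VISUAL_COLLECTIONS]
    tasks.append(query_collection("image_understanding_embeddings",
                                  understanding_vec, k))
    tasks.append(query_collection("text_chunks", text_vec, k))
    return await asyncio.gather(*tasks)
```

Running the seven queries concurrently is what keeps end-to-end search latency in the 300-500ms range quoted below.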
Purpose:
Impact on Flow:
Cost: ~$0.001 per query (Voyage AI)
Speed: 300-500ms (parallel execution)
Output: 6 different embedding types
Status: REMOVED - AI-powered image generation has been removed from the platform
Previously Used For:
Replacement: Manual 3D designer at /designer route using React Three Fiber
Cost: Varies by provider
Speed: 5-30 seconds
Output: Generated images
Product Discovery (Stage 1): the default model is Claude Sonnet 4.5, with GPT-5 as an alternative. OCR Filtering (Stage 2) uses SLIG (SigLIP2 cloud) zero-shot classification. Image Analysis (Stage 3) uses Qwen3-VL-32B-Instruct for all images, with Claude Sonnet 4.5 validating results whose quality_score is below 0.7 (Stage 4). Visual Embeddings (Stage 5) use SLIG (SigLIP2 via HuggingFace cloud endpoint, 768D). Text Embeddings (Stage 6) use Voyage AI voyage-3.5 (1024D) as the only model; OpenAI text-embedding-3-small was retired in 2026-04.
Direct Vector DB RAG (Claude 4.5) uses Voyage AI 3.5 (1024D) for text embeddings, 5x SLIG specialized 768D embeddings + 1x Voyage understanding 1024D embedding for multi-vector visual search, and Claude Sonnet 4.5 (200K context) for synthesis. The Agent Hub (Mastra) supports Claude Sonnet 4.5 as default, Claude Haiku 4.5 for fast responses, GPT-5 for advanced reasoning, and Qwen3-VL 17B as a cost-effective option.
| Model | Usage | Cost |
|---|---|---|
| Claude Sonnet 4.5 | Product discovery (1 call) | ~$0.05 |
| SigLIP | OCR filtering (50 images) | $0.00 (free) |
| Qwen3-VL | Image analysis (50 images) | ~$0.02 |
| Claude Sonnet 4.5 | Validation (10 low-quality) | ~$0.15 |
| SigLIP | Visual embeddings (250 total) | $0.00 (free) |
| Voyage AI voyage-3.5 | Text chunks (500 chunks) | ~$0.01 |
| TOTAL | Per PDF | ~$0.23 |
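The per-PDF total can be sanity-checked with a quick roll-up of the figures above (the dict keys are illustrative; validation assumes 10 low-quality images, as in the table):

```python
# Approximate per-PDF costs from the table above, in USD.
PER_PDF_COSTS = {
    "claude_product_discovery": 0.05,   # 1 call
    "slig_ocr_filtering": 0.00,         # 50 images, free tier
    "qwen_image_analysis": 0.02,        # 50 images
    "claude_validation": 0.15,          # 10 low-quality images
    "slig_visual_embeddings": 0.00,     # 250 embeddings, free tier
    "voyage_text_chunks": 0.01,         # 500 chunks
}

total = round(sum(PER_PDF_COSTS.values()), 2)  # ~$0.23 per PDF
```

Claude validation dominates the bill, which is why it is gated behind the quality_score threshold.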
| Model | Usage | Cost |
|---|---|---|
| Voyage AI 3.5 | Query embedding | ~$0.001 |
| Multi-Vector SLIG | 6-way parallel search | $0.00 |
| Claude 4.5 | Answer synthesis | ~$0.02 |
| TOTAL | Per query | ~$0.02 |
| Metric | Before (CLIP) | After (SigLIP) | Improvement |
|---|---|---|---|
| Visual Search Accuracy | 70-75% | 89-94% | +19-29% |
| Embedding Generation | 100-300ms | 150-400ms | Acceptable |
| Model Size | 350MB | 1.5GB | Larger but worth it |
| Cost | $0.00 | $0.00 | Same (free) |
Last Updated: 2025-01-17 Status: ✅ Production Ready