The Metadata Prototype Validation System is a semantic validation system that standardizes AI-extracted material metadata using CLIP text embeddings. It eliminates inconsistent naming, enables fuzzy search, validates AI outputs, and improves search accuracy by comparing extracted values to prototype embeddings and standardizing them to canonical values.
material_properties table as single source of truth

The Metadata Prototype Validation System enhances MIVAA's existing dynamic metadata extraction by adding semantic validation using CLIP text embeddings. This ensures that Qwen Vision's free-text metadata extractions are standardized to consistent, validated property values.
DynamicMetadataExtractor discovers metadata in 3 tiers:

Critical Metadata (Always extracted):
  material_category, factory_name, factory_group_name
  Stored in products.metadata JSONB

Discovered Metadata (200+ dynamic fields):
  material_properties: composition, texture, finish, pattern, weight, density
  dimensions: length, width, height, thickness, diameter, size
  appearance: color, color_code, gloss_level, sheen, transparency
  performance: water_resistance, fire_rating, slip_resistance, wear_rating
  application: recommended_use, installation_method, room_type, traffic_level
  compliance: certifications, standards, eco_friendly, sustainability_rating
  design: designer, studio, collection, series, aesthetic_style
  manufacturing: factory, manufacturer, country_of_origin
  commercial: pricing, availability, supplier, sku, warranty
  Stored in products.metadata JSONB

Unknown Metadata (Custom fields):
  Stored in products.metadata with _custom_ prefix
  Examples: _custom_installation_time, _custom_warranty_years, _custom_special_coating

Storage Example: Products have a metadata JSONB field containing critical fields (e.g., material_category: "ceramic_tile", factory_name: "Castellón Ceramics"), discovered fields (e.g., finish: "shiny", which is inconsistent and should be "glossy"), and custom fields with the _custom_ prefix. This illustrates the problem the validation system solves.
MetadataPrototypeValidator adds a validation layer WITHOUT changing storage:
Check if property has prototypes: Query the material_properties table. If the property has prototype_descriptions, validate the extracted value against them. Otherwise, store as-is (custom metadata).
Validate against prototypes (if they exist): Generate a CLIP embedding for the extracted value, compare it to the prototype embedding using cosine similarity. If similarity exceeds 0.80, return the standardized prototype value with validated: True and the confidence score. Otherwise, keep the original with validated: False.
Track validation metadata: The metadata._validation dictionary stores per-property validation details including original_value, validated_value, confidence, prototype_matched, and timestamp.
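The three steps above can be sketched as a small helper that decides which value to store and builds the matching `_validation` entry. This is a minimal sketch, not the actual MetadataPrototypeValidator code: the function name `build_validation_entry` is hypothetical, and the similarity is passed in as a number rather than computed from CLIP embeddings. Only the key names and the 0.80 threshold come from the text.

```python
from datetime import datetime, timezone

CONFIDENCE_THRESHOLD = 0.80  # threshold stated in the text


def build_validation_entry(original_value, best_prototype, similarity):
    """Return (stored_value, validation_entry) for one property.

    Hypothetical helper: `similarity` is assumed to be the cosine
    similarity between the extracted value and its best prototype.
    """
    matched = similarity > CONFIDENCE_THRESHOLD
    stored_value = best_prototype if matched else original_value
    entry = {
        "original_value": original_value,
        "validated_value": stored_value,
        "confidence": similarity,
        "prototype_matched": matched,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return stored_value, entry


# Storage shape is unchanged: the value stays under its own key,
# and the validation details go under "_validation".
metadata = {"finish": "shiny", "_validation": {}}
value, entry = build_validation_entry("shiny", "glossy", 0.92)
metadata["finish"] = value
metadata["_validation"]["finish"] = entry
```

A value below the threshold keeps its original text and is recorded with `prototype_matched` set to `False`, so nothing is lost when validation fails.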
Key Benefits:
Question: How do we identify when new prototype values should be added?
Example Scenario:
finish: glossy, matte, satin, textured, brushed

Track all extracted values for each property using a metadata_value_frequency table with columns: property_key, extracted_value, frequency_count, first_seen_at, last_seen_at, workspace_ids (UUID array), product_ids (UUID array), and validation_status (unvalidated/validated/rejected). A unique constraint on (property_key, extracted_value) prevents duplicates. After each metadata extraction, an upsert increments the frequency count and appends the workspace and product IDs.
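The upsert described above can be illustrated with an in-memory SQLite database standing in for the production database. This is a sketch under assumptions: the function name `record_extracted_value` is hypothetical, and the workspace_ids and product_ids array columns are omitted because SQLite has no array type. The table name, remaining columns, and unique constraint follow the text.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute(
    """
    CREATE TABLE metadata_value_frequency (
        property_key      TEXT NOT NULL,
        extracted_value   TEXT NOT NULL,
        frequency_count   INTEGER NOT NULL DEFAULT 1,
        first_seen_at     TEXT NOT NULL,
        last_seen_at      TEXT NOT NULL,
        validation_status TEXT NOT NULL DEFAULT 'unvalidated',
        UNIQUE (property_key, extracted_value)
    )
    """
)


def record_extracted_value(property_key, extracted_value):
    """Insert a new row, or bump the counter if the pair already exists."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """
        INSERT INTO metadata_value_frequency
            (property_key, extracted_value, first_seen_at, last_seen_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (property_key, extracted_value) DO UPDATE SET
            frequency_count = frequency_count + 1,
            last_seen_at = excluded.last_seen_at
        """,
        (property_key, extracted_value, now, now),
    )


for value in ["brushed metal", "brushed metal", "semi-gloss"]:
    record_extracted_value("finish", value)

counts = dict(
    conn.execute(
        "SELECT extracted_value, frequency_count FROM metadata_value_frequency"
    ).fetchall()
)
```

The unique constraint is what makes the `ON CONFLICT` clause fire, so repeated extractions of the same value never create duplicate rows.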
Admin Panel (/admin/metadata-prototypes) shows:
─────────────────────────────────────────────────────────────────
Metadata Prototype Management
─────────────────────────────────────────────────────────────────

Property: finish
Current Prototypes: 9 values (glossy, matte, satin, ...)

Suggested Additions (frequency > 10):

  "brushed metal" (23 occurrences)
    Similarity to existing: brushed (0.78)
    Products: HAR-001, HAR-002, ...
    [Add as New] [Merge with "brushed"] [Ignore]

  "semi-gloss" (18 occurrences)
    Similarity to existing: satin (0.87), glossy (0.72)
    Products: CER-045, CER-046, ...
    [Add as New] [Merge with "satin"] [Ignore]

─────────────────────────────────────────────────────────────────
The API endpoint GET /api/admin/metadata-prototypes/suggestions queries unvalidated values with frequency >= a threshold, calculates CLIP embedding similarity against existing prototypes for each suggestion, and returns enriched suggestions ordered by frequency.
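The selection logic behind such an endpoint might look like the sketch below. The function name `build_suggestions` and the row shape are assumptions, and `difflib.SequenceMatcher` is used purely as a stand-in for CLIP embedding similarity so that the example runs without a model. It is a string-overlap measure, not a semantic one.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    # Stand-in for CLIP cosine similarity (assumption): plain string overlap.
    return SequenceMatcher(None, a, b).ratio()


def build_suggestions(frequency_rows, prototype_names, similarity, min_frequency=10):
    """Return unvalidated values at or above min_frequency, most frequent first."""
    suggestions = []
    for row in frequency_rows:
        if row["validation_status"] != "unvalidated":
            continue
        if row["frequency_count"] < min_frequency:
            continue
        # Rank existing prototypes by similarity to the candidate value.
        ranked = sorted(
            ((name, similarity(row["extracted_value"], name)) for name in prototype_names),
            key=lambda pair: pair[1],
            reverse=True,
        )
        suggestions.append({
            "extracted_value": row["extracted_value"],
            "frequency_count": row["frequency_count"],
            "similar_prototypes": ranked[:2],
        })
    return sorted(suggestions, key=lambda s: s["frequency_count"], reverse=True)


rows = [
    {"extracted_value": "semi-gloss", "frequency_count": 18, "validation_status": "unvalidated"},
    {"extracted_value": "brushed metal", "frequency_count": 23, "validation_status": "unvalidated"},
    {"extracted_value": "rare finish", "frequency_count": 3, "validation_status": "unvalidated"},
    {"extracted_value": "shiny", "frequency_count": 50, "validation_status": "validated"},
]
suggestions = build_suggestions(rows, ["glossy", "matte", "satin", "brushed"], string_similarity)
```

Passing the similarity function as a parameter keeps the filtering and ordering logic testable independently of the embedding model.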
Action 1 (Add as New Prototype): POST /api/admin/metadata-prototypes/add updates the prototype_descriptions JSONB on the material_properties row, regenerates the CLIP embedding for the property, marks the value as validated in the frequency table, and queues a re-validation job for all affected products.
Action 2 (Merge with Existing): POST /api/admin/metadata-prototypes/merge adds the extracted value as a variation of an existing prototype, regenerates the embedding, updates all products that had the extracted value to use the target prototype, and marks the value as validated.
Action 3 (Ignore): POST /api/admin/metadata-prototypes/ignore marks the suggestion as rejected so it won't appear in the dashboard again.
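As an illustration of the merge action, the sketch below applies its three data changes to in-memory structures. Everything here is hypothetical scaffolding (the function name `merge_with_existing` and the dict shapes); the real endpoint works against the database and also regenerates the CLIP embedding, which is omitted.

```python
def merge_with_existing(property_row, products, frequency_rows,
                        property_key, extracted_value, target_prototype):
    """Fold extracted_value into target_prototype; return number of products updated."""
    # 1. Add the extracted value as a variation of the target prototype.
    variations = property_row["prototype_descriptions"].setdefault(target_prototype, [])
    if extracted_value not in variations:
        variations.append(extracted_value)

    # (Embedding regeneration would happen here in the real system.)

    # 2. Rewrite products that still carry the raw extracted value.
    updated = 0
    for product in products:
        if product["metadata"].get(property_key) == extracted_value:
            product["metadata"][property_key] = target_prototype
            updated += 1

    # 3. Mark the value as validated so it leaves the suggestion list.
    frequency_rows[(property_key, extracted_value)]["validation_status"] = "validated"
    return updated


finish_row = {"prototype_descriptions": {"satin": ["Semi-gloss with subtle sheen"]}}
catalog = [
    {"id": "CER-045", "metadata": {"finish": "semi-gloss"}},
    {"id": "CER-046", "metadata": {"finish": "semi-gloss"}},
    {"id": "HAR-001", "metadata": {"finish": "brushed metal"}},
]
freq = {("finish", "semi-gloss"): {"frequency_count": 18, "validation_status": "unvalidated"}}

updated_count = merge_with_existing(finish_row, catalog, freq, "finish", "semi-gloss", "satin")
```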
Answer: Implemented a comprehensive search query tracking system.
Architecture:
A search_query_tracking table records every search with: workspace_id, query_text, query_metadata (JSONB, e.g., {"finish": "shiny"}), search_type, result_count, zero_results flag, searched_terms array, matched_terms array (those that validated), unmatched_terms array (those that didn't), validation_results JSONB, and response_time_ms.
An unmatched_term_frequency table aggregates patterns with: term, property_key, frequency_count, workspace_ids, similar_prototypes JSONB (e.g., [{"prototype": "glossy", "similarity": 0.78}]), and review_status (pending/approved/rejected).
How It Works: The multi_vector_search function automatically tracks each query asynchronously. The tracker validates each filter term against prototypes, identifies unmatched terms, updates frequency counts, and flags zero-result queries.
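A rough sketch of what the tracker computes per query is shown below. The function name `build_tracking_record` and the lookup-based `is_known_value` check are assumptions; the real tracker validates terms with CLIP similarity and runs asynchronously, both of which are left out here. The record keys mirror the search_query_tracking columns named above.

```python
from collections import Counter

# Stand-in prototype registry (assumption): exact membership replaces CLIP validation.
KNOWN_PROTOTYPES = {
    "finish": {"glossy", "matte", "satin"},
    "slip_resistance": {"R9", "R10", "R11", "R12"},
}

# Aggregated counts, keyed by (property_key, term), like unmatched_term_frequency.
unmatched_term_frequency = Counter()


def is_known_value(property_key, term):
    return term in KNOWN_PROTOTYPES.get(property_key, set())


def build_tracking_record(query_text, query_metadata, result_count):
    """Split filter terms into matched and unmatched and flag zero-result queries."""
    searched, matched, unmatched = [], [], []
    for property_key, term in query_metadata.items():
        searched.append(term)
        if is_known_value(property_key, term):
            matched.append(term)
        else:
            unmatched.append(term)
            unmatched_term_frequency[(property_key, term)] += 1
    return {
        "query_text": query_text,
        "query_metadata": query_metadata,
        "result_count": result_count,
        "zero_results": result_count == 0,
        "searched_terms": searched,
        "matched_terms": matched,
        "unmatched_terms": unmatched,
    }


record = build_tracking_record(
    "shiny tiles", {"finish": "shiny", "slip_resistance": "R11"}, result_count=0
)
```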
Admin Dashboard (/admin/prototype-suggestions):
─────────────────────────────────────────────────────────────
Unmatched Terms Requiring Review
─────────────────────────────────────────────────────────────

Property: finish

  "shiny" (47 searches, 0 results)
    Similar to: glossy (0.92), polished (0.85)
    Workspaces: 12 different workspaces
    [Add as "glossy" variation] [Create new prototype]

  "semi-gloss" (23 searches, 0 results)
    Similar to: satin (0.87), glossy (0.72)
    [Add as "satin" variation] [Create new prototype]

─────────────────────────────────────────────────────────────
Benefits:
Answer: YES - Enabled by default with automatic scoring boost.
Implementation: In rag_service.py, the multi_vector_search function automatically applies validation scoring when material_filters and results are both present. It loads the MetadataPrototypeValidator, calculates a metadata boost (up to +20% of the original score) for each result based on how well the product's validated metadata matches the query filters, and re-sorts results by their enhanced scores.
Scoring Formula: For each query filter field and value, the system checks whether the product has validation metadata. If both are validated and match the same prototype, the field scores 1.0. If they match different prototypes, cosine similarity determines a partial score (if > 0.70). Unvalidated product values receive a score of 0.8 for an exact match, or a fuzzy-match score with a 0.8× penalty. Properties without prototypes use exact match only. The per-field scores are averaged and multiplied by 0.2 to produce the final boost factor.
Example:
Query: {"finish": "shiny", "slip_resistance": "R-11"}
Product A (validated): finish: "glossy" (validated from "shiny", confidence 0.92), slip_resistance: "R11" (validated from "R-11", confidence 1.0). Metadata boost: (0.92 + 1.0) / 2 = 0.96. Final score: 0.85 × (1 + 0.96 × 0.2) = 0.85 × 1.192 = 1.013
Product B (unvalidated): finish: "shiny surface", slip_resistance: "R-11". Metadata boost: 0.0. Final score: 0.85 × 1.0 = 0.85
Result: Product A scores about 19% higher (1.013 vs. 0.85).
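The arithmetic in this example can be reproduced with a few lines of Python. The function name `apply_metadata_boost` is hypothetical; the per-field scores are taken directly from the example, and the 0.2 cap is the "+20%" figure given in the text.

```python
def apply_metadata_boost(base_score, field_scores, max_boost=0.2):
    """Average the per-field scores, scale by max_boost, and boost the base score."""
    if not field_scores:
        return base_score
    boost = (sum(field_scores) / len(field_scores)) * max_boost
    return base_score * (1.0 + boost)


# Per-field scores as given in the example above.
product_a = apply_metadata_boost(0.85, [0.92, 1.0])   # validated metadata
product_b = apply_metadata_boost(0.85, [0.0, 0.0])    # no validated matches

relative_gain = product_a / product_b - 1.0  # how much higher A scores than B
```

Because the boost is multiplicative and capped at 20%, validated metadata can reorder results with similar base scores but cannot lift a weak match above a much stronger one.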
Configuration:
Answer: Fully integrated - new properties automatically get validation support.
Flow:
1. AI Discovers New Metadata Field
─────────────────────────────────────────────────────────────
  DynamicMetadataExtractor finds:
    "installation_time": "2 hours"

  This field is NOT in predefined categories
    → Classified as "unknown" metadata
    → Stored with custom prefix

                              ↓

2. Auto-Create material_properties Entry
─────────────────────────────────────────────────────────────
  _ensure_properties_exist() automatically creates:

  INSERT INTO material_properties (
    property_key: "_custom_installation_time",
    name: "Installation Time",
    data_type: "string",
    is_searchable: true,
    is_filterable: true,
    is_ai_extractable: true,
    category: "custom",
    prototype_descriptions: NULL,   ← No prototypes yet
    text_embedding_512: NULL
  )

                              ↓

3. Property Stored Without Validation (Initially)
─────────────────────────────────────────────────────────────
  products.metadata = {
    "_custom_installation_time": "2 hours",
    "_validation": {}   ← No validation (no prototypes)
  }

  Search behavior:
    - Exact match only (no semantic matching)
    - No validation boost
    - Still searchable and filterable

                              ↓

4. Frequency Tracking Identifies Pattern
─────────────────────────────────────────────────────────────
  After 50 products extracted with this field:

  unmatched_term_frequency:
    term: "2 hours" (frequency: 23)
    term: "1 hour" (frequency: 15)
    term: "3-4 hours" (frequency: 12)
    property_key: "_custom_installation_time"

  Admin sees suggestion:
    "Add prototypes for _custom_installation_time?"

                              ↓

5. Admin Adds Prototypes
─────────────────────────────────────────────────────────────
  Admin creates prototypes:
    "quick": ["1 hour", "fast", "quick install"]
    "standard": ["2 hours", "2-3 hours", "normal"]
    "extended": ["3-4 hours", "4+ hours", "complex"]

  System generates 512D CLIP embeddings

                              ↓

6. Future Extractions Get Validated
─────────────────────────────────────────────────────────────
  Next PDF extraction:
    AI extracts: "fast installation"
    Validator matches: "quick" (confidence 0.89)

  products.metadata = {
    "_custom_installation_time": "quick",   ← Standardized!
    "_validation": {
      "_custom_installation_time": {
        "original_value": "fast installation",
        "validated_value": "quick",
        "confidence": 0.89,
        "prototype_matched": true
      }
    }
  }

  Search now supports:
    - Semantic matching ("fast" finds "quick")
    - Validation boost in scoring
    - Consistent terminology
Key Points:
Automatic Property Creation: Every discovered field gets a material_properties entry. No manual intervention needed. Enables future prototype addition.
Graceful Degradation: Properties without prototypes still work (exact match only). No validation errors or failures. System remains functional.
Progressive Enhancement: Start with no prototypes (exact match), add prototypes when patterns emerge, automatically upgrade to semantic matching.
Data-Driven Workflow: Frequency tracking identifies candidates, admin reviews and approves, system learns from real usage.
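The automatic property creation described in the first key point can be sketched as an idempotent "create if missing" helper. This is not the real `_ensure_properties_exist()` implementation: the function name `ensure_property_exists`, the in-memory `registry` dict, and the way the display name is derived from the key are assumptions. The field names and defaults are the ones listed in the flow for a custom property.

```python
CUSTOM_PREFIX = "_custom_"


def ensure_property_exists(registry, property_key):
    """Return the registry entry for a custom property, creating it if needed."""
    if property_key in registry:
        return registry[property_key]

    # Derive a readable name, e.g. "_custom_installation_time" -> "Installation Time".
    base = property_key[len(CUSTOM_PREFIX):] if property_key.startswith(CUSTOM_PREFIX) else property_key
    name = " ".join(word.capitalize() for word in base.split("_") if word)

    entry = {
        "property_key": property_key,
        "name": name,
        "data_type": "string",
        "is_searchable": True,
        "is_filterable": True,
        "is_ai_extractable": True,
        "category": "custom",
        "prototype_descriptions": None,  # no prototypes yet: exact match only
        "text_embedding_512": None,
    }
    registry[property_key] = entry
    return entry


registry = {}
created = ensure_property_exists(registry, "_custom_installation_time")
again = ensure_property_exists(registry, "_custom_installation_time")
```

Leaving `prototype_descriptions` empty is what triggers the graceful-degradation path: the property is stored and searchable, and gains semantic matching only once an admin adds prototypes.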
Example Timeline:
Day 1: AI discovers "_custom_warranty_years"
  → Auto-created in material_properties (no prototypes)
  → Stored as-is: "10 years", "5 years", "lifetime"
Day 30: Frequency analysis shows:
  → "10 years" (45 occurrences)
  → "5 years" (32 occurrences)
  → "lifetime" (18 occurrences)
Day 31: Admin adds prototypes:
  → "standard": ["5 years", "5-year", "five years"]
  → "extended": ["10 years", "10-year", "decade"]
  → "lifetime": ["lifetime", "permanent", "forever"]
Day 32: New extractions get validated:
  → "5-year warranty" → "standard" (confidence 0.95)
  → "decade coverage" → "extended" (confidence 0.88)
EXISTING: PDF Processing Pipeline (UNCHANGED)
──────────────────────────────────────────────────────────────────────
  Stage 0A: Product Discovery
    └─ Discover products with basic metadata

  Stage 0B: Metadata Extraction (DynamicMetadataExtractor)
    └─ Extract 200+ fields across 9 categories
       Returns critical metadata and discovered metadata

NEW: Prototype Validation Layer (ADDED)
──────────────────────────────────────────────────────────────────────
  Stage 0C: Metadata Validation (MetadataPrototypeValidator)
    └─ Validate extracted metadata against prototypes
       Input: {"finish": "glossy", "slip_resistance": "R11"}
       Process:
         1. Generate CLIP embedding for "glossy"
         2. Compare to finish prototypes (glossy, matte, satin)
         3. Return best match with confidence
       Output: validated value, validated flag, confidence score

  Stage 1-8: Continue as normal (UNCHANGED)
    └─ Image extraction, embeddings, chunking, etc.
Existing functionality preserved:
New validation layer added:
The current scoring formula weights: 40% text similarity, 30% visual similarity, 20% metadata match (exact), 10% confidence score.
Problem: Metadata matching is binary (exact match or no match)
For each query filter field, the system checks whether the property has prototype validation. If so, and the product's value is prototype-matched, it scores 1.0 for an exact prototype match or a cosine similarity score (if > 0.70) for a different prototype. Unvalidated product values receive fuzzy match scores with a penalty. Properties without prototypes use exact match only. Scores are averaged across all query fields to produce metadata_prototype_match.
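The per-field rules above can be written out as a small decision function. This is a sketch, not the code in rag_service.py: the names `score_field` and `metadata_prototype_match` are hypothetical, the similarity values in the lookup table are made-up toy numbers, and the 0.8 penalty for unvalidated values is the figure given in the earlier scoring formula.

```python
def score_field(query_prototype, product_value, product_validated,
                has_prototypes, similarity):
    """Score one filter field in [0, 1] following the rules described above."""
    if not has_prototypes:
        # No prototypes for this property: exact match only.
        return 1.0 if product_value == query_prototype else 0.0
    if product_validated:
        if product_value == query_prototype:
            return 1.0  # same prototype
        sim = similarity(query_prototype, product_value)
        return sim if sim > 0.70 else 0.0  # different prototype: partial credit
    # Unvalidated product value: exact match is capped, fuzzy match is penalised.
    if product_value == query_prototype:
        return 0.8
    return 0.8 * similarity(query_prototype, product_value)


def metadata_prototype_match(field_scores):
    """Average the per-field scores (0.0 when there are no filter fields)."""
    return sum(field_scores) / len(field_scores) if field_scores else 0.0


# Toy similarity table standing in for CLIP cosine similarity (invented values).
TOY_SIMILARITY = {
    ("glossy", "satin"): 0.72,
    ("glossy", "matte"): 0.40,
    ("glossy", "shiny surface"): 0.35,
}


def toy_similarity(a, b):
    return TOY_SIMILARITY.get((a, b), 0.0)


same = score_field("glossy", "glossy", True, True, toy_similarity)
near = score_field("glossy", "satin", True, True, toy_similarity)
far = score_field("glossy", "matte", True, True, toy_similarity)
fuzzy = score_field("glossy", "shiny surface", False, True, toy_similarity)
exact_only = score_field("2 hours", "2 hours", False, False, toy_similarity)
```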
The new scoring formula weights: 30% text similarity, 30% visual similarity, 20% metadata_prototype_match (NEW), 10% metadata_match (reduced, exact fallback), 10% confidence score.
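Written as code, the new weighting is a plain weighted sum. The weights below are the ones stated above; the function name `combined_score`, the component key names, and the sample component values are illustrative assumptions.

```python
NEW_WEIGHTS = {
    "text_similarity": 0.30,
    "visual_similarity": 0.30,
    "metadata_prototype_match": 0.20,  # new prototype-aware component
    "metadata_match": 0.10,            # exact-match fallback, reduced
    "confidence": 0.10,
}


def combined_score(components, weights=NEW_WEIGHTS):
    """Weighted sum of score components; missing components count as 0."""
    return sum(weight * components.get(name, 0.0) for name, weight in weights.items())


# Illustrative component values (not taken from real search results).
example = combined_score({
    "text_similarity": 0.8,
    "visual_similarity": 0.7,
    "metadata_prototype_match": 0.96,
    "metadata_match": 0.0,
    "confidence": 0.9,
})
```

Since the weights sum to 1.0, a product that scores 1.0 on every component still has a combined score of 1.0, which keeps results comparable with the previous formula.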
Benefits:
Query: {"finish": "shiny", "slip_resistance": "R-11"}
Product A (validated metadata): finish: "glossy" validated from "shiny" at confidence 0.92, slip_resistance: "R11" validated from "R-11" at confidence 1.0. Metadata Score: (0.92 + 1.0) / 2 = 0.96 (high score)
Product B (unvalidated metadata): finish: "shiny surface" not validated, slip_resistance: "R-11" not validated. Metadata Score: approximately 0.28 (low score)
Result: Product A ranks significantly higher despite not having an exact text match.
1. Multi-Vector Search (rag_service.py): Results are scored using _score_with_metadata_validation with enable_prototype_matching=True (enabled by default).
2. Material Visual Search (material_visual_search_service.py, line 411): The category filter value is validated against prototypes before being used to filter products.
3. Semantic Search (rag_service.py): Each result's score is multiplied by (1.0 + validation_boost * 0.2) where validation_boost is calculated from the product's validation metadata.
Existing Columns (UNCHANGED): id, property_key (unique), name, display_name, description, data_type (string/number/enum/boolean), validation_rules (JSONB), is_searchable, is_filterable, is_ai_extractable.
NEW Columns (ADDED):
prototype_descriptions: JSONB mapping value names to arrays of 3-5 prototype descriptions. For example, the "finish" property maps "glossy" to descriptions like "High gloss reflective surface, Polished shiny appearance, Mirror-like finish".
text_embedding_512: VECTOR(512) containing the averaged CLIP embedding for all prototypes of that property
prototype_updated_at: TIMESTAMP of last update

A vector index using ivfflat with vector_cosine_ops is created on text_embedding_512 for fast similarity search.
1. Qwen Extracts Free Text: The vision model produces structured property values such as finish: "glossy" and pattern: "marble-like veining".
2. Prototype Validator Processes Each Field: The MetadataPrototypeValidator.validate_property(property_key, extracted_value) method is called for each field. It returns the best-matching prototype value, a validated boolean, a confidence score, and the similarity scores against all prototypes.
3. Validation Algorithm: A CLIP embedding is generated for the extracted value. The property's prototypes are retrieved from the database. Cosine similarity is computed between the extracted embedding and each prototype embedding. The best match is selected; if its similarity exceeds 0.80, the prototype value is returned as validated, otherwise the original value is kept.
4. Store Validated Metadata: The product metadata is saved with both validated values and a _validation dictionary tracking which properties were validated and at what confidence.
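The validation algorithm in step 3 comes down to a cosine-similarity comparison followed by a threshold check. The sketch below uses tiny hand-written 3-dimensional vectors in place of 512-dimensional CLIP embeddings, and the function name `match_against_prototypes` is hypothetical; it is not the signature of the real `validate_property` method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_against_prototypes(extracted_value, extracted_embedding,
                             prototype_embeddings, threshold=0.80):
    """Pick the most similar prototype; accept it only above the threshold."""
    scores = {
        name: cosine_similarity(extracted_embedding, embedding)
        for name, embedding in prototype_embeddings.items()
    }
    if not scores:
        return {"value": extracted_value, "validated": False, "confidence": 0.0, "scores": {}}
    best = max(scores, key=scores.get)
    validated = scores[best] > threshold
    return {
        "value": best if validated else extracted_value,
        "validated": validated,
        "confidence": scores[best],
        "scores": scores,
    }


# Toy vectors (assumption): real embeddings would come from the CLIP text encoder.
finish_prototypes = {
    "glossy": [1.0, 0.0, 0.0],
    "matte": [0.0, 1.0, 0.0],
    "satin": [0.6, 0.6, 0.2],
}
hit = match_against_prototypes("shiny", [0.95, 0.1, 0.05], finish_prototypes)
miss = match_against_prototypes("hand-scraped", [0.1, 0.1, 1.0], finish_prototypes)
```

Returning the full `scores` dictionary alongside the best match makes it possible to show near-misses, such as the "Similarity to existing" values in the admin panel.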
1. Exact Metadata Filtering (EXISTING - UNCHANGED): Search products by exact metadata values using POST /api/search/products with a filters object (e.g., {"metadata.finish": "glossy", "metadata.slip_resistance": "R11"}).
2. Semantic Metadata Search (NEW - ADDED): Search using natural language via POST /api/search/products/semantic with use_metadata_prototypes: true. The system generates CLIP embeddings for query terms (e.g., "shiny" → matches "glossy" prototype at 0.89), then boosts products whose validated metadata matches the inferred prototypes.
3. Metadata Similarity Scoring (NEW - ADDED): The combined score formula now includes a 20% weight for metadata_prototype_match alongside text similarity (40%), visual similarity (30%), and confidence score (10%).
Better Fuzzy Matching: "shiny" → "glossy", "non-slip" → "R11"
Standardized Filters: All variations map to the same validated value
Confidence Boosting: High-confidence validated metadata ranks higher
Semantic Understanding: Natural language queries match technical terms
material_type (Primary classification) has prototypes for: "ceramic" (glazed surfaces, interior applications), "marble" (natural stone with veining, polished high gloss), "porcelain" (high density, low water absorption, vitrified), "wood" (natural grain, hardwood flooring, organic texture), and "granite" (speckled appearance, crystalline structure, exceptional hardness). Each prototype value has 3 descriptive sentences for averaging.
finish (Surface treatment) has prototypes for: "glossy" (high gloss reflective surface, mirror-like quality, brilliant shine), "matte" (non-reflective flat surface, no shine, ideal for hiding imperfections), and "satin" (semi-gloss with subtle sheen, between matte and glossy, soft luster).
slip_resistance (Safety rating) has prototypes for R9 (low, dry interior areas), R10 (medium, wet areas/bathrooms), R11 (high, commercial wet areas), and R12 (very high, industrial/outdoor). Each has 3 descriptive sentences.
Three new columns are added to material_properties: prototype_descriptions (JSONB, default {}), text_embedding_512 (VECTOR(512)), and prototype_updated_at (TIMESTAMP). A vector index using ivfflat with vector_cosine_ops is created on text_embedding_512.
The material_properties table is populated with 50+ meta fields organized into categories: Material Properties, Dimensions, Appearance, Performance, Application, Compliance, Design, Manufacturing, and Commercial.
A PROPERTY_PROTOTYPES dictionary maps property keys to value names, each with 3-5 descriptive sentences. For example, material_type → ceramic → list of 3 descriptions. This dictionary is used to populate the database.
For each property/value combination, embeddings are generated for all descriptions and then averaged to produce a single representative embedding that is stored in the database.
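A minimal sketch of the averaging step is shown below. The `fake_embed` function is a deliberately meaningless stand-in for the CLIP text encoder (it just measures length and spaces) so that the example runs without a model, and the function names are hypothetical. The three "glossy" descriptions are the ones quoted earlier in this document.

```python
def average_embeddings(embeddings):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / count for i in range(dim)]


def build_prototype_embeddings(property_prototypes, embed):
    """Map property -> value -> averaged embedding of that value's descriptions."""
    return {
        property_key: {
            value_name: average_embeddings([embed(text) for text in descriptions])
            for value_name, descriptions in values.items()
        }
        for property_key, values in property_prototypes.items()
    }


def fake_embed(text):
    # Stand-in for a CLIP text encoder (assumption): NOT semantically meaningful.
    return [float(len(text)), float(text.count(" "))]


PROPERTY_PROTOTYPES = {
    "finish": {
        "glossy": [
            "High gloss reflective surface",
            "Polished shiny appearance",
            "Mirror-like finish",
        ],
    },
}

prototype_embeddings = build_prototype_embeddings(PROPERTY_PROTOTYPES, fake_embed)
```

Averaging several phrasings of the same value gives one representative vector per prototype, which is less sensitive to the wording of any single description.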
The MetadataPrototypeValidator class provides three main methods: validate_property(property_key, extracted_value) for single-field validation, validate_metadata(metadata) for batch validation, and get_property_prototypes(property_key) for inspection. The validator generates CLIP embeddings for extracted values, compares them against all prototype embeddings for the property, returns the best match with confidence, uses a 0.80 confidence threshold, and falls back to the original value if confidence is below threshold.
After metadata extraction, the MetadataPrototypeValidator is called to validate the extracted values. The validated results are merged with validation tracking metadata before being stored in the database.
The system integrates prototype validation into all search endpoints:
/api/rag/search?strategy=multi_vector (PRIMARY - enabled by default)
/api/rag/search?strategy=material
/api/rag/search?strategy=all
/api/search/multimodal
/api/search/material-visual

In multi-vector search, filter values are validated against prototypes before building SQL conditions, and similar values (similarity > 0.7) are also included in the match. The combined score is recalculated to include a 10% weight for the metadata validation match score.
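The filter expansion described for multi-vector search might be sketched as follows. The function names, the `metadata` column name, and the psycopg-style `%s` placeholders are assumptions, and the similarity values are toy numbers; only the "similarity > 0.7" rule comes from the text.

```python
def expand_filter_values(validated_value, prototype_similarities, min_similarity=0.7):
    """Return the validated value plus other prototypes similar enough to it."""
    values = [validated_value]
    ranked = sorted(prototype_similarities.items(), key=lambda kv: kv[1], reverse=True)
    for name, similarity in ranked:
        if name != validated_value and similarity > min_similarity:
            values.append(name)
    return values


def build_filter_condition(property_key, values):
    """Build a parameterised JSONB condition; values are never interpolated into SQL."""
    placeholders = ", ".join(["%s"] * len(values))
    sql = f"metadata->>%s IN ({placeholders})"
    return sql, [property_key, *values]


# The query term "shiny" is assumed to have been validated to "glossy" already.
accepted = expand_filter_values(
    "glossy",
    {"glossy": 1.0, "polished": 0.85, "satin": 0.72, "matte": 0.30},
)
condition, params = build_filter_condition("finish", accepted)
```

Keeping the property key and values as bound parameters means user-supplied filter terms cannot alter the SQL text, whichever database driver is used.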
material_properties table for everything

Populate Property Prototypes: POST /api/metadata/properties/populate-prototypes with optional property_key (specific property) and regenerate (boolean) fields.
Validate Metadata: POST /api/metadata/validate with a metadata object containing property key-value pairs. Returns a validated_metadata object where each field has value, validated (boolean), and confidence.
Get Property Prototypes: GET /api/metadata/properties/{property_key}/prototypes returns the property's prototypes dictionary mapping value names to their description arrays.