Meta Field Aggregation System

Last Updated: 2025-12-03
Status: ✅ Complete


Overview

The Meta Field Aggregation System uses a 3-source redundancy strategy to ensure maximum metadata coverage. It collects and consolidates metadata (colors, textures, finishes, materials, applications) from:

  1. Product Discovery (Stage 0) - Structured product pages
  2. AI Extraction (DynamicMetadataExtractor) - Context-aware semantic understanding
  3. Chunk Aggregation (NEW!) - Comprehensive keyword scanning across ALL chunks

This belt-and-suspenders approach ensures products have complete metadata even when information is scattered across the entire document.


🎯 Why 3 Sources?

The Problem:

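Product information is often scattered across multiple pages, so no single page or extraction pass sees every color, finish, or material that belongs to a product.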

The Solution:

Use 3 complementary extraction methods to catch everything:

| Source | What It Catches | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Product Discovery | Structured product pages | ✅ High accuracy, ✅ Structured format | ❌ Misses scattered info, ❌ Limited to product pages |
| AI Extraction | Context-aware semantics | ✅ Understands context, ✅ Extracts implied info | ❌ Limited to page range, ❌ Expensive (AI calls) |
| Chunk Aggregation | Everything mentioned anywhere | ✅ Comprehensive, ✅ No AI cost, ✅ Fills gaps | ❌ No context, ❌ Keyword-based only |

Result:

Without chunk aggregation (Sources 1 + 2 only), you might get colors: ["beige", "white"], which misses clay and natural. With all 3 sources, the result is the complete list: colors: ["beige", "clay", "natural", "white"].


Architecture

1. Quality Scoring Enhancement ✅

File: mivaa-pdf-extractor/app/services/rag_service.py (Direct Vector DB)

Features:

Meta Keywords Detected:


2. Meta Field Aggregation ✅

File: mivaa-pdf-extractor/app/services/product_creation_service.py (lines 1732-1802)

Method: _aggregate_meta_fields_from_chunks(document_id, product_name)

Process:

  1. Fetch all chunks for the document
  2. Filter chunks that mention the product name
  3. Extract meta keywords from each chunk
  4. Deduplicate and sort values
  5. Return structured dictionary

Example Output:

The method returns a dictionary with keys colors, textures, finishes, materials, and applications, each containing a sorted, deduplicated list of values found across all chunks mentioning the product.
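The five steps above can be sketched as follows. This is a minimal illustration, not the code in product_creation_service.py: the keyword vocabulary in META_KEYWORDS is hypothetical, and the function takes chunk texts directly instead of fetching them by document_id as the real method does.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical keyword vocabulary, for illustration only.
# The actual keyword lists used by the service are not shown in this document.
META_KEYWORDS: Dict[str, List[str]] = {
    "colors": ["white", "beige", "clay", "natural"],
    "textures": ["smooth", "rough"],
    "finishes": ["matte", "glazed"],
    "materials": ["ceramic", "porcelain"],
    "applications": ["indoor", "outdoor", "floor", "wall", "waterproof"],
}


def aggregate_meta_fields(chunks: Iterable[str], product_name: str) -> Dict[str, List[str]]:
    """Collect meta keywords from every chunk that mentions the product."""
    found = {field: set() for field in META_KEYWORDS}
    name = product_name.lower()
    for text in chunks:
        lowered = text.lower()
        if name not in lowered:
            continue  # step 2: skip chunks that do not mention the product
        for field, keywords in META_KEYWORDS.items():
            for keyword in keywords:
                # Whole-word match, so "wall" does not fire on "wallpaper".
                if re.search(rf"\b{re.escape(keyword)}\b", lowered):
                    found[field].add(keyword)  # step 3, deduplicated by the set
    # steps 4-5: sort each set into a list and return a plain dictionary
    return {field: sorted(values) for field, values in found.items()}
```

Calling aggregate_meta_fields(["NOVA available in White and beige", "NOVA ceramic tiles, matte finish"], "NOVA") yields colors ["beige", "white"], materials ["ceramic"], finishes ["matte"], and empty lists for the remaining fields. A strict product-name filter like this one drops any chunk that does not name the product, which is one reason keyword scanning alone cannot replace the other two sources.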


3. Integration into Product Creation ✅

File: mivaa-pdf-extractor/app/services/product_creation_service.py (lines 1855-1875)

Integration Point: After dimension aggregation, before building product data

Merge Logic (Prevents Duplication):

  1. No existing data → Use aggregated values from chunks
  2. Existing string → Convert to list, merge with aggregated values (case-insensitive)
  3. Existing list → Merge with aggregated values (case-insensitive deduplication)
  4. Existing dict/other → Keep AI extraction as-is (takes priority)

Example Deduplication:

AI extraction provides ['White', 'Beige']. Chunk aggregation finds ['white', 'clay', 'natural']. The result after the case-insensitive merge is ['beige', 'clay', 'natural', 'white']: there are no duplicates, and 'White' and 'white' collapse into a single entry.
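The four merge rules can be expressed as one small function. This is a sketch under stated assumptions: the name merge_meta_field is hypothetical, and lowercasing the merged output is inferred from the example above rather than confirmed from the service code.

```python
from typing import Any, List


def merge_meta_field(existing: Any, aggregated: List[str]) -> Any:
    """Merge chunk-aggregated values into an existing field value."""
    # Rule 1: no existing data, so use the aggregated values.
    if existing is None or existing == "" or existing == []:
        return sorted({value.lower() for value in aggregated})
    # Rule 2: a single string is treated as a one-element list.
    if isinstance(existing, str):
        existing = [existing]
    # Rule 3: lists are merged with case-insensitive deduplication.
    if isinstance(existing, list):
        merged = {str(value).lower() for value in existing}
        merged |= {value.lower() for value in aggregated}
        return sorted(merged)
    # Rule 4: dicts and other types are kept as-is (AI extraction wins).
    return existing
```

With the values from the example, merge_meta_field(['White', 'Beige'], ['white', 'clay', 'natural']) returns ['beige', 'clay', 'natural', 'white'], while a dict such as {'value': 'White', 'confidence': 0.95} is returned unchanged.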


Comparison: Dimensions vs Meta Fields

| Feature | Dimensions | Meta Fields |
| --- | --- | --- |
| Quality Boost | ✅ +0.3 | ✅ +0.3 |
| Aggregation Method | _aggregate_dimensions_from_chunks() | _aggregate_meta_fields_from_chunks() |
| Storage Location | metadata['available_sizes'] | metadata['colors'], metadata['textures'], etc. |
| Deduplication | ✅ Yes | ✅ Yes |
| Merge Logic | ✅ Yes | ✅ Yes |

Result: Dimensions and meta fields now follow the same pattern: a quality boost, chunk aggregation, deduplication, and merge.
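The +0.3 quality boost listed in the comparison can be pictured roughly as below. Only the 0.3 value comes from this document. The term set, the function name boosted_quality, and the cap at 1.0 are assumptions made for illustration, and the real scoring in rag_service.py may differ.

```python
import re

META_BOOST = 0.3  # value taken from the comparison table

# Hypothetical sample of meta terms; the real keyword list is not shown here.
SAMPLE_META_TERMS = {"white", "beige", "matte", "ceramic", "indoor"}


def boosted_quality(base_score: float, text: str) -> float:
    """Raise a chunk's quality score when it mentions at least one meta term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & SAMPLE_META_TERMS:
        # Assumed cap, so the boosted score stays within [0, 1].
        return min(1.0, base_score + META_BOOST)
    return base_score
```

Under these assumptions, a short chunk that would otherwise fall below a rejection threshold is lifted above it when it carries meta information, which is the effect the Benefits section describes as meta-rich chunks no longer being rejected.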


Example: NOVA Product

Chunks:

  1. "NOVA available in white and beige"
  2. "NOVA 15×38 cm ceramic tiles"
  3. "Matte finish, waterproof for indoor use"

Aggregated Metadata:

The NOVA product would have: available_sizes with one entry (15×38 cm), colors: ["beige", "white"], materials: ["ceramic"], finishes: ["matte"], applications: ["indoor", "waterproof"].


Benefits

  1. No Data Loss - All meta field mentions are captured
  2. Comprehensive Metadata - Products have complete information
  3. Consistent Architecture - Same pattern as dimensions
  4. Better Search - More accurate multi-vector search results
  5. Quality Preservation - Meta-rich chunks no longer rejected

Data Sources & Merge Priority

Three Sources of Metadata:

  1. AI Extraction (DynamicMetadataExtractor) - Lines 1844-1847

    • Uses Claude Sonnet 4.5 or GPT-4o
    • Extracts from product-specific text
    • Returns: enrichment_data['colors'], enrichment_data['materials'], etc.
    • Format: Can be string, list, or dict with confidence scores
  2. Chunk Aggregation - Lines 1855-1875

    • Scans ALL chunks mentioning the product
    • Keyword-based extraction (simple but comprehensive)
    • Returns: meta_fields['colors'], meta_fields['materials'], etc.
    • Format: Always a list of strings
  3. Product Discovery - Earlier in pipeline

    • Initial metadata from Stage 0 discovery
    • Stored in product.metadata before enrichment
    • Format: Various (depends on discovery method)

Merge Priority:

Product Discovery (highest priority) → AI Extraction (medium priority) → Chunk Aggregation (lowest priority, fills gaps)

Why This Order?

Deduplication Strategy:

Case-Insensitive Merge: When AI extraction provides ['White', 'Beige'] and chunk aggregation finds ['white', 'clay', 'natural', 'BEIGE'], the result is ['beige', 'clay', 'natural', 'white'], with every case variant collapsed into a single lowercase entry.

String to List Conversion: When AI extraction returns a single string value (e.g., 'Matte') and chunk aggregation finds a list, the string is converted to a list and merged.

AI Dict Format Preserved: When AI extraction returns a dict with confidence scores (e.g., {'value': 'White', 'confidence': 0.95}), that format is preserved and chunk aggregation values are not merged into it.


Testing

Test Case: Upload the Harmony PDF and verify the metadata of the NOVA product.

Expected result for NOVA: colors: ["beige", "clay", "natural", "white"], materials: ["ceramic", "porcelain"], finishes: ["glazed", "matte"], applications: ["floor", "indoor", "wall"], plus available_sizes with 15×38 and 20×40 cm entries.
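One way to automate this check is a helper that reports which expected values are missing from a product's metadata. The sketch below is hypothetical: missing_meta_values is not part of the codebase, it assumes each field is stored as a list of strings, and it leaves out available_sizes because the structure of those entries is not specified here.

```python
from typing import Dict, List

# Expected meta fields for NOVA, copied from the test case above.
EXPECTED_NOVA: Dict[str, List[str]] = {
    "colors": ["beige", "clay", "natural", "white"],
    "materials": ["ceramic", "porcelain"],
    "finishes": ["glazed", "matte"],
    "applications": ["floor", "indoor", "wall"],
}


def missing_meta_values(
    actual: Dict[str, List[str]], expected: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Return, per field, the expected values absent from actual (case-insensitive)."""
    gaps: Dict[str, List[str]] = {}
    for field, values in expected.items():
        present = {str(value).lower() for value in (actual.get(field) or [])}
        absent = [value for value in values if value.lower() not in present]
        if absent:
            gaps[field] = absent
    return gaps
```

An empty result means every expected value was found. For a product that has only colors ["White", "Beige"], the helper reports clay and natural as missing colors and lists every expected value for the other three fields.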


Next Steps

  1. Test with Harmony PDF
  2. Verify aggregation works correctly
  3. Check that search results include aggregated meta fields
  4. Validate frontend displays all metadata properly
  5. Monitor for duplicate values in production