Meta Field Aggregation System

Last Updated: 2025-12-03
Status: ✅ Complete

Overview

The Meta Field Aggregation System uses a three-source redundancy strategy to maximize metadata coverage. It collects and consolidates metadata (colors, textures, finishes, materials, applications) from:

Product Discovery (Stage 0) - Structured product pages
AI Extraction (DynamicMetadataExtractor) - Context-aware semantic understanding
Chunk Aggregation (NEW!) - Comprehensive keyword scanning across ALL chunks

This belt-and-suspenders approach lets a product pick up metadata even when the information is scattered across the entire document rather than concentrated on its product page.

🎯 Why 3 Sources?

The Problem:

Product information is often scattered across multiple pages:

Page 12: "NOVA available in White and Beige"
Page 15: "All NOVA tiles feature a matte finish"
Page 23: "NOVA suitable for indoor and outdoor use"
Page 31: "NOVA also available in Clay and Natural tones"

The Solution:

Use 3 complementary extraction methods to catch everything:

Source	What It Catches	Strengths	Weaknesses
Product Discovery	Structured product pages	✅ High accuracy, ✅ Structured format	❌ Misses scattered info, ❌ Limited to product pages
AI Extraction	Context-aware semantics	✅ Understands context, ✅ Extracts implied info	❌ Limited to page range, ❌ Expensive (AI calls)
Chunk Aggregation	Everything mentioned anywhere	✅ Comprehensive, ✅ No AI cost, ✅ Fills gaps	❌ No context, ❌ Keyword-based only

Result:

Without chunk aggregation (Sources 1 + 2 only), the pages above could yield colors: ["beige", "white"], missing the clay and natural tones mentioned on page 31. With all 3 sources, the result is colors: ["beige", "clay", "natural", "white"].

Architecture

1. Quality Scoring Enhancement ✅

File: mivaa-pdf-extractor/app/services/rag_service.py (Direct Vector DB)

Features:

_is_meta_rich_chunk() method detects chunks with 2+ meta categories
Quality boost: +0.3 for meta-rich chunks
Minimum length score: 0.7 (no harsh penalty for short chunks)
Same treatment as dimension chunks

Meta Keywords Detected:

Colors: white, black, gray, beige, brown, blue, green, red, yellow, natural, clay, sand, taupe, ivory, cream, charcoal
Textures: smooth, rough, textured, polished, brushed, embossed, matte, glossy, satin, honed
Finishes: matte, glossy, satin, polished, honed, brushed, natural, unglazed, glazed, semi-gloss
Materials: ceramic, porcelain, stone, marble, granite, wood, metal, glass, concrete, terracotta, slate
Applications: indoor, outdoor, wall, floor, bathroom, kitchen, commercial, residential, waterproof, wet areas
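
As a rough illustration of the detection and scoring rules above, the following sketch counts how many meta categories a chunk touches and applies the +0.3 boost and the 0.7 length floor. The names `META_KEYWORDS`, `count_meta_categories`, `is_meta_rich_chunk`, and `adjust_quality` are hypothetical, whole-word matching is an assumption (the real `_is_meta_rich_chunk()` may match differently), and how the two adjustments combine with the rest of the quality score is also assumed.

```python
import re

# Keyword lists copied from the categories listed above.
META_KEYWORDS = {
    "colors": ["white", "black", "gray", "beige", "brown", "blue", "green", "red",
               "yellow", "natural", "clay", "sand", "taupe", "ivory", "cream",
               "charcoal"],
    "textures": ["smooth", "rough", "textured", "polished", "brushed", "embossed",
                 "matte", "glossy", "satin", "honed"],
    "finishes": ["matte", "glossy", "satin", "polished", "honed", "brushed",
                 "natural", "unglazed", "glazed", "semi-gloss"],
    "materials": ["ceramic", "porcelain", "stone", "marble", "granite", "wood",
                  "metal", "glass", "concrete", "terracotta", "slate"],
    "applications": ["indoor", "outdoor", "wall", "floor", "bathroom", "kitchen",
                     "commercial", "residential", "waterproof", "wet areas"],
}


def _contains(text, keyword):
    # Assumption: whole-word match, so "red" does not fire inside "textured".
    return re.search(rf"\b{re.escape(keyword)}\b", text) is not None


def count_meta_categories(text):
    """Number of meta categories with at least one keyword in the text."""
    lowered = text.lower()
    return sum(
        any(_contains(lowered, keyword) for keyword in keywords)
        for keywords in META_KEYWORDS.values()
    )


def is_meta_rich_chunk(text, min_categories=2):
    """A chunk is meta-rich when it touches 2+ categories."""
    return count_meta_categories(text) >= min_categories


def adjust_quality(base_score, length_score, text):
    """Hypothetical scoring step: boost quality by 0.3 (capped at 1.0) and
    floor the length score at 0.7, but only for meta-rich chunks."""
    if not is_meta_rich_chunk(text):
        return base_score, length_score
    return min(1.0, base_score + 0.3), max(length_score, 0.7)
```

Because "matte" and "natural" each appear in two keyword lists, a single such word is enough to reach the two-category threshold in this sketch; whether the real method behaves the same way is not stated here.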

2. Meta Field Aggregation ✅

File: mivaa-pdf-extractor/app/services/product_creation_service.py (lines 1732-1802)

Method: _aggregate_meta_fields_from_chunks(document_id, product_name)

Process:

Fetch all chunks for the document
Filter chunks that mention the product name
Extract meta keywords from each chunk
Deduplicate and sort values
Return structured dictionary

Example Output:

The method returns a dictionary with keys colors, textures, finishes, materials, and applications, each containing a sorted, deduplicated list of values found across all chunks mentioning the product.
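
The five process steps can be sketched roughly as follows. This is not the code from `product_creation_service.py`: the function `aggregate_meta_fields` is hypothetical, it takes the chunk texts directly instead of a `document_id` (how chunks are fetched is not described here), and the keyword table is abbreviated for readability.

```python
import re

# Abbreviated keyword table for illustration only; the full lists are given
# under "Meta Keywords Detected".
META_KEYWORDS = {
    "colors": ["white", "beige", "clay", "natural"],
    "textures": ["smooth", "matte", "polished"],
    "finishes": ["matte", "glossy", "natural"],
    "materials": ["ceramic", "porcelain"],
    "applications": ["indoor", "outdoor", "waterproof"],
}


def aggregate_meta_fields(chunks, product_name):
    """Collect meta keywords from every chunk that mentions product_name."""
    found = {category: set() for category in META_KEYWORDS}
    name = product_name.lower()
    for chunk in chunks:                      # step 1: all chunks of the document
        text = chunk.lower()
        if name not in text:                  # step 2: keep chunks naming the product
            continue
        for category, keywords in META_KEYWORDS.items():
            for keyword in keywords:          # step 3: keyword scan (whole words)
                if re.search(rf"\b{re.escape(keyword)}\b", text):
                    found[category].add(keyword)
    # steps 4-5: sets deduplicate, sorted() orders, result is a dict of lists
    return {category: sorted(values) for category, values in found.items()}


chunks = [
    "NOVA available in White and Beige",
    "All NOVA tiles feature a matte finish",
    "NOVA suitable for indoor and outdoor use",
    "NOVA also available in Clay and Natural tones",
    "LUNA available in glossy black",         # different product, ignored
]
nova = aggregate_meta_fields(chunks, "NOVA")
```

With these sample chunks, `nova["colors"]` contains all four colors from the scattered pages. The sketch also files "natural" under finishes, which illustrates the "no context" weakness of keyword matching noted in the comparison table.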

3. Integration into Product Creation ✅

File: mivaa-pdf-extractor/app/services/product_creation_service.py (lines 1855-1875)

Integration Point: After dimension aggregation, before building product data

Merge Logic (Prevents Duplication):

No existing data → Use aggregated values from chunks
Existing string → Convert to list, merge with aggregated values (case-insensitive)
Existing list → Merge with aggregated values (case-insensitive deduplication)
Existing dict/other → Keep AI extraction as-is (takes priority)

Example Deduplication:

AI extraction provides ['White', 'Beige']. Chunk aggregation finds ['white', 'clay', 'natural']. After the case-insensitive merge the result is ['beige', 'clay', 'natural', 'white']: 'White' and 'white' collapse into a single entry, and the merged values come back lowercased and sorted.
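
The four merge cases might look like this in code. The function name `merge_meta_field` is hypothetical, lowercasing the merged values is inferred from the example above, and leaving lists with non-string entries untouched is an added assumption.

```python
def merge_meta_field(existing, aggregated):
    """Merge chunk-aggregated values into one existing metadata value."""
    # Case 1: no existing data, so use the aggregated values.
    if not existing:
        return sorted({value.strip().lower() for value in aggregated})

    # Case 2: a single string is treated as a one-element list.
    if isinstance(existing, str):
        existing = [existing]

    # Case 3: lists are merged with case-insensitive deduplication.
    if isinstance(existing, list):
        if not all(isinstance(value, str) for value in existing):
            return existing  # assumption: do not touch lists of dicts etc.
        merged = {value.strip().lower() for value in existing + list(aggregated)}
        return sorted(merged)

    # Case 4: dict or any other type, keep the AI extraction as-is.
    return existing
```

For instance, `merge_meta_field('Matte', ['glazed', 'matte'])` goes through cases 2 and 3, while a dict such as `{'value': 'White', 'confidence': 0.95}` is returned unchanged.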

Comparison: Dimensions vs Meta Fields

Feature	Dimensions	Meta Fields
Quality Boost	✅ +0.3	✅ +0.3
Aggregation Method	✅ `_aggregate_dimensions_from_chunks()`	✅ `_aggregate_meta_fields_from_chunks()`
Storage Location	`metadata['available_sizes']`	`metadata['colors']`, `metadata['textures']`, etc.
Deduplication	✅ Yes	✅ Yes
Merge Logic	✅ Yes	✅ Yes

Result: Dimensions and meta fields now follow the same pattern (quality boost, chunk aggregation, deduplication, merge); only the storage keys differ.

Example: NOVA Product

Chunks:

"NOVA available in white and beige"
"NOVA 15×38 cm ceramic tiles"
"NOVA: matte finish, waterproof for indoor use"

Aggregated Metadata:

The NOVA product would have: available_sizes with one entry (15×38 cm), colors: ["beige", "white"], materials: ["ceramic"], finishes: ["matte"], applications: ["indoor", "waterproof"].

Benefits

Reduced Data Loss - Meta field keywords are captured from every chunk that mentions the product
Comprehensive Metadata - Products have complete information
Consistent Architecture - Same pattern as dimensions
Better Search - More accurate multi-vector search results
Quality Preservation - Meta-rich chunks no longer rejected

Data Sources & Merge Priority

Three Sources of Metadata:

AI Extraction (DynamicMetadataExtractor) - Lines 1844-1847
- Uses Claude Sonnet 4.5 or GPT-4o
- Extracts from product-specific text
- Returns: enrichment_data['colors'], enrichment_data['materials'], etc.
- Format: Can be string, list, or dict with confidence scores
Chunk Aggregation - Lines 1855-1875
- Scans ALL chunks mentioning the product
- Keyword-based extraction (simple but comprehensive)
- Returns: meta_fields['colors'], meta_fields['materials'], etc.
- Format: Always a list of strings
Product Discovery - Earlier in pipeline
- Initial metadata from Stage 0 discovery
- Stored in product.metadata before enrichment
- Format: Various (depends on discovery method)

Merge Priority:

Product Discovery (highest priority) → AI Extraction (medium priority) → Chunk Aggregation (lowest priority, fills gaps)

Why This Order?

Product Discovery: Most accurate (from structured product pages)
AI Extraction: Context-aware (understands semantics)
Chunk Aggregation: Comprehensive (catches everything mentioned)
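
One way to picture the priority order for a list-valued field is to walk the sources from highest to lowest priority and let each later source contribute only values not seen yet. The sketch below is an assumption about what "fills gaps" means. The function `merge_by_priority`, the source keys, and the provenance map are hypothetical and added only to make the ordering visible.

```python
# Sources in the priority order stated above (highest first).
PRIORITY = ["product_discovery", "ai_extraction", "chunk_aggregation"]


def merge_by_priority(values_by_source):
    """Return (merged_values, provenance) for one list-valued meta field.

    provenance maps each normalized value to the highest-priority source
    that supplied it; lower-priority sources only fill gaps.
    """
    provenance = {}
    for source in PRIORITY:
        for value in values_by_source.get(source, []):
            key = value.strip().lower()
            # setdefault keeps the first (highest-priority) source per value.
            provenance.setdefault(key, source)
    return sorted(provenance), provenance


merged, provenance = merge_by_priority({
    "product_discovery": ["White"],
    "ai_extraction": ["white", "Beige"],
    "chunk_aggregation": ["beige", "clay", "natural"],
})
```

Here chunk aggregation contributes only "clay" and "natural", because "white" and "beige" were already supplied by higher-priority sources.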

Deduplication Strategy:

Case-Insensitive Merge: When AI extraction provides ['White', 'Beige'] and chunk aggregation finds ['white', 'clay', 'natural', 'BEIGE'], the result is ['beige', 'clay', 'natural', 'white'] — fully deduplicated.

String to List Conversion: When AI extraction returns a single string value (e.g., 'Matte') and chunk aggregation finds a list, the string is converted to a list and merged.

AI Dict Format Preserved: When AI extraction returns a dict with confidence scores (e.g., {'value': 'White', 'confidence': 0.95}), that format is preserved and chunk aggregation values are not merged into it.

Testing

Test Case: Upload the Harmony PDF and verify that the NOVA product has:

✅ All available sizes aggregated
✅ All colors mentioned across chunks
✅ All textures/finishes/materials/applications collected
✅ No duplicate values (case-insensitive)
✅ Sorted lists for easy reading
✅ AI extraction values preserved when present

Expected result for NOVA: colors: ["beige", "clay", "natural", "white"], materials: ["ceramic", "porcelain"], finishes: ["glazed", "matte"], applications: ["floor", "indoor", "wall"], plus available_sizes with 15×38 and 20×40 cm entries.
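
The "no duplicates" and "sorted" items of the checklist can be checked mechanically. The helper below is a hypothetical sketch, not part of the existing test suite. It assumes the product metadata is available as a plain dict and skips fields that are missing or kept in the AI dict format.

```python
META_FIELDS = ("colors", "textures", "finishes", "materials", "applications")


def check_meta_invariants(metadata):
    """Return a list of problems found in the list-valued meta fields."""
    problems = []
    for field in META_FIELDS:
        values = metadata.get(field)
        # Skip missing fields and values preserved in a non-list format.
        if not isinstance(values, list):
            continue
        if not all(isinstance(value, str) for value in values):
            continue
        lowered = [value.lower() for value in values]
        if len(set(lowered)) != len(lowered):
            problems.append(f"{field}: case-insensitive duplicates in {values}")
        if values != sorted(values):
            problems.append(f"{field}: not sorted: {values}")
    return problems


expected_nova = {
    "colors": ["beige", "clay", "natural", "white"],
    "materials": ["ceramic", "porcelain"],
    "finishes": ["glazed", "matte"],
    "applications": ["floor", "indoor", "wall"],
}
```

Running `check_meta_invariants(expected_nova)` on the expected result should report no problems. The same helper could be pointed at production records to watch for duplicates.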

Next Steps

Test with Harmony PDF
Verify aggregation works correctly
Check that search results include aggregated meta fields
Validate frontend displays all metadata properly
Monitor for duplicate values in production