Last Updated: 2025-12-03 Status: ✅ Complete
The Meta Field Aggregation System uses a 3-source redundancy strategy to ensure maximum metadata coverage. It collects and consolidates metadata (colors, textures, finishes, materials, applications) from:
This belt-and-suspenders approach ensures products have complete metadata even when information is scattered across the entire document.
Product information is often scattered across multiple pages:
Use 3 complementary extraction methods to catch everything:
| Source | What It Catches | Strengths | Weaknesses |
|---|---|---|---|
| Product Discovery | Structured product pages | ✅ High accuracy, ✅ Structured format | ❌ Misses scattered info, ❌ Limited to product pages |
| AI Extraction | Context-aware semantics | ✅ Understands context, ✅ Extracts implied info | ❌ Limited to page range, ❌ Expensive (AI calls) |
| Chunk Aggregation | Everything mentioned anywhere | ✅ Comprehensive, ✅ No AI cost, ✅ Fills gaps | ❌ No context, ❌ Keyword-based only |
Without chunk aggregation (Sources 1 + 2 only), you might get colors: ["beige", "white"], missing clay and natural. With all 3 sources, the result is colors: ["beige", "clay", "natural", "white"] — complete.
File: mivaa-pdf-extractor/app/services/rag_service.py (Direct Vector DB)
Features:
_is_meta_rich_chunk() method detects chunks with 2+ meta categoriesMeta Keywords Detected:
File: mivaa-pdf-extractor/app/services/product_creation_service.py (lines 1732-1802)
Method: _aggregate_meta_fields_from_chunks(document_id, product_name)
Process:
Example Output:
The method returns a dictionary with keys colors, textures, finishes, materials, and applications, each containing a sorted, deduplicated list of values found across all chunks mentioning the product.
File: mivaa-pdf-extractor/app/services/product_creation_service.py (lines 1855-1875)
Integration Point: After dimension aggregation, before building product data
Merge Logic (Prevents Duplication):
Example Deduplication:
AI extraction provides ['White', 'Beige']. Chunk aggregation finds ['white', 'clay', 'natural']. The result after case-insensitive merge is ['beige', 'clay', 'natural', 'white'] — no duplicates, 'White' and 'white' merged correctly.
| Feature | Dimensions | Meta Fields |
|---|---|---|
| Quality Boost | ✅ +0.3 | ✅ +0.3 |
| Aggregation Method | ✅ _aggregate_dimensions_from_chunks() |
✅ _aggregate_meta_fields_from_chunks() |
| Storage Location | metadata['available_sizes'] |
metadata['colors'], metadata['textures'], etc. |
| Deduplication | ✅ Yes | ✅ Yes |
| Merge Logic | ✅ Yes | ✅ Yes |
Result: Both dimensions and meta fields are now handled identically!
Chunks:
Aggregated Metadata:
The NOVA product would have: available_sizes with one entry (15×38 cm), colors: ["beige", "white"], materials: ["ceramic"], finishes: ["matte"], applications: ["indoor", "waterproof"].
AI Extraction (DynamicMetadataExtractor) - Lines 1844-1847
enrichment_data['colors'], enrichment_data['materials'], etc.Chunk Aggregation - Lines 1855-1875
meta_fields['colors'], meta_fields['materials'], etc.Product Discovery - Earlier in pipeline
product.metadata before enrichmentProduct Discovery (highest priority) → AI Extraction (medium priority) → Chunk Aggregation (lowest priority, fills gaps)
Why This Order?
Case-Insensitive Merge: When AI extraction provides ['White', 'Beige'] and chunk aggregation finds ['white', 'clay', 'natural', 'BEIGE'], the result is ['beige', 'clay', 'natural', 'white'] — fully deduplicated.
String to List Conversion: When AI extraction returns a single string value (e.g., 'Matte') and chunk aggregation finds a list, the string is converted to a list and merged.
AI Dict Format Preserved: When AI extraction returns a dict with confidence scores (e.g., {'value': 'White', 'confidence': 0.95}), that format is preserved and chunk aggregation values are not merged into it.
Test Case: Upload Harmony PDF and verify NOVA product has:
Expected result for NOVA: colors: ["beige", "clay", "natural", "white"], materials: ["ceramic", "porcelain"], finishes: ["glazed", "matte"], applications: ["floor", "indoor", "wall"], plus available_sizes with 15×38 and 20×40 cm entries.