📚 Comprehensive Metadata Fields Guide

📋 Overview

The MIVAA platform extracts 200+ metadata fields from PDF catalogs using AI-powered dynamic discovery. All metadata is organized into 9 functional categories and stored in the products.metadata JSONB field in the database.


🎯 Metadata Extraction Architecture

⚙️ How It Works

┌─────────────────────────────────────────────────────┐
│ Stage 0: Product Discovery & Metadata Extraction    │
├─────────────────────────────────────────────────────┤
│                                                     │
│ 0A: Product Discovery (Claude/GPT)                  │
│ ├── Identify product names                          │
│ ├── Extract page ranges                             │
│ ├── Extract basic metadata (designer, dimensions)   │
│ └── Classify content by category                    │
│                                                     │
│ 0B: Metadata Enrichment (DynamicMetadataExtractor)  │
│ ├── For each discovered product:                    │
│ │   ├── Extract product-specific text from PDF      │
│ │   ├── Call DynamicMetadataExtractor (Claude/GPT)  │
│ │   ├── Extract 200+ fields across 9 categories     │
│ │   └── Merge with discovery metadata               │
│ │                                                   │
│ └── Store enriched products in database             │
│                                                     │
└─────────────────────────────────────────────────────┘

🤖 AI Models Used

Metadata Priority

When merging metadata from multiple sources, the system uses this priority:

  1. Discovery Metadata (Highest Priority)

    • Extracted during product discovery (Stage 0A)
    • Includes: product name, designer, dimensions, variants
  2. Critical Metadata (High Priority)

    • Always extracted: material_category, factory_name, factory_group_name
    • Required for product classification
  3. Discovered Metadata (Standard Priority)

    • 200+ dynamic fields extracted by DynamicMetadataExtractor
    • Organized into 9 functional categories

📦 The 9 Metadata Categories

🧱 1. Material Properties

Purpose: Physical and structural characteristics of the material

Fields (11 total):


📏 2. Dimensions

Purpose: Physical measurements and sizing information

Fields (8 total):


🎨 3. Appearance

Purpose: Visual and aesthetic characteristics

Fields (7 total):


✅ 6. Compliance & Certifications

Purpose: Regulatory compliance and environmental certifications

Fields (6 total):


🎨 7. Design

Purpose: Design attribution and aesthetic classification

Fields (6 total):


🏭 8. Manufacturing

Purpose: Production and sourcing information

Fields (6 total):


💰 9. Commercial

Purpose: Business and commercial information

Fields (5 total):


🔧 Technical Implementation

🗄️ Database Schema

All metadata is stored in the metadata JSONB column of the products table. The table has the columns id (UUID), sku, name, description, category, type, status, metadata (JSONB, holding all 200+ metadata fields), properties (JSONB), specifications (JSONB), created_at, and updated_at.

📝 Example Product Metadata

A complete product record has a metadata JSONB field containing fields from all 9 categories: material properties (material_type, composition, texture, finish, pattern, weight, density), dimensions (size, thickness, area), appearance (color, color_code, gloss_level, grain), performance (water_absorption, fire_rating, slip_resistance, wear_rating, breaking_strength), application (recommended_use, installation_method, room_type, traffic_level), compliance (certifications, standards, eco_friendly, voc_rating), design (designer, collection, aesthetic_style), manufacturing (factory, factory_group, country_of_origin), commercial (pricing, availability, warranty), and _extraction_metadata (extraction_timestamp, extraction_method, model_used, confidence_score, validation_passed).
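As a rough illustration, the metadata of such a record might look like the Python dictionary below. The flat key layout is an assumption, suggested by the metadata.slip_resistance style of filter key used by the search endpoint. Values marked as placeholders are invented for the sketch; the rest are taken from the example UI in this guide. The helper user_facing_fields is hypothetical.

```python
# Illustrative only: a flat layout is assumed; the stored structure may differ.
example_metadata = {
    # Material properties
    "material_type": "ceramic",
    "texture": "smooth",
    "finish": "matte",
    # Dimensions
    "size": "15×38 cm",
    "thickness": "8mm",
    # Performance
    "slip_resistance": "R11",
    "fire_rating": "A1",
    "water_absorption": "Class 3",
    "breaking_strength": "1200 N",
    # Manufacturing
    "country_of_origin": "Spain",
    # Extraction bookkeeping (key names from the description above)
    "_extraction_metadata": {
        "extraction_timestamp": "2025-01-12T00:00:00Z",  # placeholder
        "extraction_method": "dynamic",                  # placeholder
        "model_used": "claude",                          # placeholder
        "confidence_score": 0.9,                         # placeholder
        "validation_passed": True,                       # placeholder
    },
}


def user_facing_fields(metadata: dict) -> dict:
    """Return the metadata without bookkeeping keys (those starting with '_')."""
    return {key: value for key, value in metadata.items() if not key.startswith("_")}
```

Separating bookkeeping keys from product fields this way keeps _extraction_metadata out of anything shown to end users.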


🚀 API Usage

📤 Extract Metadata from PDF

Endpoint: POST /api/rag/process-pdf

Upload a PDF file together with the extract_categories parameter. The response contains a job_id, a status, a message, a products_discovered count, and a metadata_extraction status.
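A client might parse that response along the following lines. This is a minimal sketch: the field names come from the description above, but the Python types, the default values, and the sample body are assumptions, and ProcessPdfResponse is a hypothetical name.

```python
import json
from dataclasses import dataclass


@dataclass
class ProcessPdfResponse:
    """Fields named in the endpoint description; the types are assumed."""
    job_id: str
    status: str
    message: str
    products_discovered: int
    metadata_extraction: str


def parse_process_pdf_response(body: str) -> ProcessPdfResponse:
    """Parse a JSON response body, tolerating absent optional fields."""
    data = json.loads(body)
    return ProcessPdfResponse(
        job_id=str(data["job_id"]),
        status=str(data["status"]),
        message=str(data.get("message", "")),
        products_discovered=int(data.get("products_discovered", 0)),
        metadata_extraction=str(data.get("metadata_extraction", "")),
    )


# Hypothetical response body, for illustration only.
sample_body = json.dumps({
    "job_id": "job-123",
    "status": "processing",
    "message": "PDF accepted",
    "products_discovered": 0,
    "metadata_extraction": "pending",
})
parsed = parse_process_pdf_response(sample_body)
```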

📥 Get Product with Metadata

Endpoint: GET /api/products/{product_id}

Returns the product record with its complete metadata object containing all extracted fields.

🔍 Search Products by Metadata

Endpoint: POST /api/search/products

Send a filters object with dot-notation keys like "metadata.slip_resistance": "R11", "metadata.fire_rating": "A1", or "metadata.country_of_origin": "Spain" to filter products by their metadata values.
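The sketch below builds such a request body and shows one plausible reading of the filter semantics against an in-memory product. It assumes that multiple filters combine with AND and match by exact equality, which the endpoint description does not state. The function names are hypothetical.

```python
from typing import Any


def build_search_body(**metadata_filters: str) -> dict[str, Any]:
    """Build a request body whose filter keys use the 'metadata.<field>' form."""
    return {"filters": {f"metadata.{key}": value for key, value in metadata_filters.items()}}


def lookup(record: Any, dotted_key: str) -> Any:
    """Resolve a dot-notation key against nested dicts; None if any part is missing."""
    current = record
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(product: dict, filters: dict[str, Any]) -> bool:
    """Assumed semantics: every filter must match exactly (logical AND)."""
    return all(lookup(product, key) == expected for key, expected in filters.items())


body = build_search_body(slip_resistance="R11", fire_rating="A1")
product = {"name": "NOVA", "metadata": {"slip_resistance": "R11", "fire_rating": "A1"}}
is_match = matches(product, body["filters"])
```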


📊 Frontend Display

ProductDetailModal Component

The frontend displays metadata organized by category in the ProductDetailModal component:

Location: src/components/AI/ProductDetailModal.tsx

Features:

Example UI:

┌─────────────────────────────────────────────┐
│ NOVA - Product Details                      │
├─────────────────────────────────────────────┤
│                                             │
│ [Product Image]                             │
│                                             │
│ ┌─────────────────────────────────────────┐ │
│ │ Material Properties                     │ │
│ ├─────────────────────────────────────────┤ │
│ │ Material Type: ceramic                  │ │
│ │ Texture: smooth                         │ │
│ │ Finish: matte                           │ │
│ │ Weight: 800 kg/m³                       │ │
│ └─────────────────────────────────────────┘ │
│                                             │
│ ┌─────────────────────────────────────────┐ │
│ │ Dimensions                              │ │
│ ├─────────────────────────────────────────┤ │
│ │ Size: 15×38 cm                          │ │
│ │ Thickness: 8mm                          │ │
│ │ Area: 0.057 m²                          │ │
│ └─────────────────────────────────────────┘ │
│                                             │
│ ┌─────────────────────────────────────────┐ │
│ │ Performance                             │ │
│ ├─────────────────────────────────────────┤ │
│ │ Slip Resistance: R11                    │ │
│ │ Fire Rating: A1                         │ │
│ │ Water Absorption: Class 3               │ │
│ │ Breaking Strength: 1200 N               │ │
│ └─────────────────────────────────────────┘ │
│                                             │
│ ... (6 more categories)                     │
│                                             │
└─────────────────────────────────────────────┘


🔍 How Metadata Extraction Works

Step-by-Step Process

1. PDF Upload

User uploads PDF → MIVAA API receives file → Job created

2. Product Discovery (Stage 0A)

The ProductDiscoveryService analyzes the PDF and returns products with basic metadata including name, page_range, and initial fields (designer, dimensions, variants).

3. Metadata Enrichment (Stage 0B)

For each product, the system extracts the product-specific text from its page range, initializes DynamicMetadataExtractor, and runs the extraction to obtain 200+ fields. The result is organized into three groups: critical (material_category, factory_name, factory_group_name), discovered (all dynamic fields), and metadata (extraction tracking info).

4. Metadata Merging

Metadata is merged in order of increasing priority: discovered fields form the base, critical fields override them, and discovery metadata (highest priority) overrides both. The _extraction_metadata block is added separately.
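The merge order can be sketched in a few lines of Python. This is not the platform's actual code: merge_metadata and its parameter names are hypothetical, and skipping None values (so that an empty higher-priority field does not erase a lower-priority value) is an assumption the guide does not confirm.

```python
from typing import Any, Mapping


def _present(fields: Mapping[str, Any]) -> dict[str, Any]:
    """Drop None values so they cannot overwrite real data (assumed behaviour)."""
    return {key: value for key, value in fields.items() if value is not None}


def merge_metadata(
    discovery: Mapping[str, Any],
    critical: Mapping[str, Any],
    discovered: Mapping[str, Any],
    extraction_info: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge sources from lowest to highest priority; later updates win."""
    merged = _present(discovered)        # base: dynamic fields
    merged.update(_present(critical))    # critical fields override the base
    merged.update(_present(discovery))   # Stage 0A discovery metadata wins overall
    merged["_extraction_metadata"] = dict(extraction_info)  # added separately
    return merged


merged = merge_metadata(
    discovery={"designer": "Designer A", "dimensions": None},
    critical={"material_category": "ceramic", "designer": "Designer B"},
    discovered={"designer": "Designer C", "finish": "matte",
                "material_category": "stone", "dimensions": "15×38 cm"},
    extraction_info={"model_used": "placeholder-model"},
)
```

In this example the designer comes from discovery, material_category from the critical group, and finish and dimensions survive from the discovered base.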

5. Database Storage

The product record is stored with its complete metadata JSONB containing all 200+ fields.

6. Frontend Display

The ProductDetailModal component reads the metadata object and renders each category section dynamically, showing only categories that have data.
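The "only categories that have data" rule can be illustrated independently of the UI framework. The sketch below is in Python for consistency with the other examples in this guide, whereas the real component is a TSX file. The category-to-field mapping is taken from the example record description. The function name and the definition of "empty" (None, empty string, empty list) are assumptions.

```python
from typing import Any

# Field names per category, as listed in the example product metadata description.
CATEGORY_FIELDS: dict[str, list[str]] = {
    "Material Properties": ["material_type", "composition", "texture", "finish",
                            "pattern", "weight", "density"],
    "Dimensions": ["size", "thickness", "area"],
    "Appearance": ["color", "color_code", "gloss_level", "grain"],
    "Performance": ["water_absorption", "fire_rating", "slip_resistance",
                    "wear_rating", "breaking_strength"],
    "Application": ["recommended_use", "installation_method", "room_type",
                    "traffic_level"],
    "Compliance": ["certifications", "standards", "eco_friendly", "voc_rating"],
    "Design": ["designer", "collection", "aesthetic_style"],
    "Manufacturing": ["factory", "factory_group", "country_of_origin"],
    "Commercial": ["pricing", "availability", "warranty"],
}


def sections_to_render(metadata: dict[str, Any]) -> list[tuple[str, list[tuple[str, Any]]]]:
    """Return (category, rows) pairs, skipping categories with no populated field."""
    sections = []
    for category, fields in CATEGORY_FIELDS.items():
        rows = [(name, metadata[name]) for name in fields
                if metadata.get(name) not in (None, "", [])]
        if rows:  # a category with no data is not rendered at all
            sections.append((category, rows))
    return sections


sections = sections_to_render(
    {"material_type": "ceramic", "finish": "matte", "slip_resistance": "R11", "color": ""}
)
```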


🎯 Confidence Scoring

Each extracted metadata field has a confidence score (0.0-1.0):

Confidence scores are stored alongside field values, tracking both the value and the source location (e.g., "page 6, line 23" or "inferred from image description").
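One way to model a field that carries its value, confidence, and source together is shown below. This is a sketch under assumptions: ExtractedField, needs_review, and the 0.7 review threshold are all hypothetical and not taken from the platform.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ExtractedField:
    value: Any
    confidence: float  # expected range: 0.0 to 1.0
    source: str        # e.g. "page 6, line 23"

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


def needs_review(fields: dict[str, ExtractedField], threshold: float = 0.7) -> list[str]:
    """Return the names of fields whose confidence is below the (assumed) threshold."""
    return sorted(name for name, field in fields.items() if field.confidence < threshold)


fields = {
    "slip_resistance": ExtractedField("R11", 0.95, "page 6, line 23"),
    "color": ExtractedField("white", 0.55, "inferred from image description"),
}
flagged = needs_review(fields)
```

Keeping the source next to the value lets a reviewer jump straight to the page that produced a low-confidence field.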


📝 Best Practices

For PDF Catalog Creators

  1. Be Explicit: Clearly state all technical specifications
  2. Use Standard Terminology: Use industry-standard terms (R11, PEI 4, etc.)
  3. Organize by Product: Group all product information together
  4. Include Units: Always include units (mm, kg/m³, etc.)
  5. Provide Certifications: List all certifications and standards

For Platform Users

  1. Review Extracted Metadata: Always review AI-extracted metadata for accuracy
  2. Use Filters: Filter products by metadata fields for precise searches
  3. Check Confidence Scores: Pay attention to confidence scores for critical fields
  4. Report Issues: Report incorrect extractions to improve AI models

For Developers

  1. Validate Critical Fields: Always validate critical fields (material_category, factory_name, factory_group_name)
  2. Handle Missing Data: Gracefully handle missing metadata fields
  3. Use JSONB Queries: Leverage PostgreSQL JSONB queries for efficient filtering
  4. Monitor Extraction Quality: Track extraction accuracy and confidence scores
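The first three developer practices can be sketched as follows. The helper names are hypothetical, the %s placeholder style assumes a DB-API driver such as psycopg, and the selected columns are only an example. The ->> operator is standard PostgreSQL and returns a JSONB value as text.

```python
import re

CRITICAL_FIELDS = ("material_category", "factory_name", "factory_group_name")
# JSONB keys are interpolated into SQL text, so only plain identifiers are allowed.
_SAFE_KEY = re.compile(r"[a-z_][a-z0-9_]*")


def missing_critical_fields(metadata: dict) -> list[str]:
    """Return the critical fields that are absent or empty, without raising."""
    return [name for name in CRITICAL_FIELDS if not metadata.get(name)]


def jsonb_filter_query(filters: dict[str, str]) -> tuple[str, list[str]]:
    """Build a parameterized query filtering products on metadata text values."""
    clauses: list[str] = []
    params: list[str] = []
    for key, value in filters.items():
        if not _SAFE_KEY.fullmatch(key):
            raise ValueError(f"unsafe metadata key: {key!r}")
        clauses.append(f"metadata->>'{key}' = %s")  # value is bound, never inlined
        params.append(value)
    where = " AND ".join(clauses) or "TRUE"
    return f"SELECT id, name, metadata FROM products WHERE {where}", params


sql, params = jsonb_filter_query({"slip_resistance": "R11", "country_of_origin": "Spain"})
missing = missing_critical_fields({"material_category": "ceramic"})
```

For large tables, a GIN index on the metadata column combined with the @> containment operator is a common alternative to per-key ->> comparisons.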

🔧 Troubleshooting

Common Issues

Issue: Metadata not extracted

Issue: Incorrect metadata values

Issue: Missing metadata fields

Issue: Low confidence scores


📚 Related Documentation


Last Updated: 2025-01-12
Version: 2.0 (Comprehensive Metadata Extraction)

⚡ 4. Performance

Purpose: Technical performance metrics and ratings

Fields (8 total):


🔧 5. Application

Purpose: Usage recommendations and installation guidance

Fields (6 total):