Metafield Extraction & Processing - Complete Guide

A comprehensive guide to how the Material Kai Vision Platform identifies, extracts, and processes metafields (structured metadata) from PDF catalogs using AI-powered multi-stage processing.


Overview

Metafields are structured metadata attributes extracted from PDF catalogs and linked to products, chunks, and images. The platform supports 200+ metafield types with AI-powered identification, extraction, and processing across 5 dedicated stages.

Key Capabilities:


What Are Metafields?

Metafields are dynamic, structured data attributes that describe material properties and characteristics.

Real-World Examples


🔄 Complete Processing Pipeline

```
PDF Upload
    ↓
┌──────────────────────────────────────────────────────────────────┐
│ STAGE 0: Product Discovery & Metafield Identification (0-15%)    │
│ AI Model: Claude Sonnet 4.5 / GPT-4o                             │
│ Purpose: Identify products and metafield types                   │
│ Output: Product catalog with metafield types                     │
└──────────────────────────────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────────────────────────────┐
│ STAGE 1: Focused Extraction (15-30%)                             │
│ Process: Extract ONLY pages containing identified products       │
│ Output: Focused PDF with product content                         │
└──────────────────────────────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────────────────────────────┐
│ STAGE 2: Semantic Chunking with Metafield Preservation (30-50%)  │
│ AI: Anthropic Claude                                             │
│ Purpose: Create chunks, preserve metafield context               │
│ Output: Chunks with metafield metadata                           │
└──────────────────────────────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────────────────────────────┐
│ STAGE 3: Image Processing & Visual Metafield Extraction (50-70%) │
│ AI: Qwen Vision 4 Scout 17B + CLIP                               │
│ Purpose: Extract images, analyze for visual metafields           │
│ Output: Images with detected colors, texture, finish             │
└──────────────────────────────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────────────────────────────┐
│ STAGE 4: Product Creation & Metafield Consolidation (70-90%)     │
│ AI: Claude Haiku 4.5 → Claude Sonnet 4.5                         │
│ Purpose: Create products, consolidate all metafields             │
│ Output: Product records with consolidated metafields             │
└──────────────────────────────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────────────────────────────┐
│ STAGE 12: Metafield Extraction & Database Linking (95-97%)       │
│ Process: Extract & link metafields to database                   │
│ Purpose: Create metafield_values records, link to products       │
│ Output: metafield_values linked to products/chunks/images        │
└──────────────────────────────────────────────────────────────────┘
    ↓
✅ COMPLETE - All metafields extracted and linked
```


🔍 Stage 0: Product Discovery - Initial Metafield Identification

AI Model & Process

Claude Sonnet 4.5 / GPT-4o

The AI analyzes the entire PDF to:

  1. Identify product boundaries - Detect where each product starts/ends
  2. Extract product names - Get official product names
  3. Detect metafield types - Identify what metadata is present (material, color, dimensions, etc.)
  4. Map images to products - Link product images to product records
  5. Create product catalog - Build preliminary product structure with metafields
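The steps above produce a preliminary product catalog. A minimal sketch of its shape follows; the key names (`products`, `page_range`, `metafield_types`, `image_refs`) are illustrative assumptions, not the platform's actual schema:

```python
# Illustrative Stage 0 output -- field names are assumptions.
stage0_catalog = {
    "products": [
        {
            "name": "VALENOVA",                # official product name
            "page_range": [12, 15],            # detected product boundaries
            "metafield_types": [               # metadata detected as present
                "material", "color", "dimensions", "finish",
            ],
            "image_refs": ["page12_img1"],     # images mapped to this product
        }
    ]
}

# Sanity check: every discovered product should carry metafield types
assert all(p["metafield_types"] for p in stage0_catalog["products"])
```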

How AI Identifies Metafields

Text-Based Identification:

Context-Based Identification:

Accuracy Metrics


📄 Stage 1: Focused Extraction - Extract Product Pages

Process

  1. Extract only pages containing identified products (from Stage 0)
  2. Preserve metafield context - Keep all metadata information
  3. Prepare for detailed analysis in next stages
  4. Optimize for processing efficiency
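The page-selection step can be sketched as follows. This is a simplified stand-in under stated assumptions (1-indexed, inclusive page ranges from Stage 0); the actual extraction would hand these indices to a PDF library:

```python
def focused_pages(product_page_ranges, total_pages):
    """Collect the 0-indexed pages to keep, given 1-indexed inclusive
    (start, end) ranges identified in Stage 0."""
    keep = set()
    for start, end in product_page_ranges:
        keep.update(range(start - 1, min(end, total_pages)))
    return sorted(keep)

# Pages 3-4 and 7 of a 10-page catalog contain products
print(focused_pages([(3, 4), (7, 7)], 10))  # → [2, 3, 6]
```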

Output


📝 Stage 2: Semantic Chunking with Metafield Preservation

AI Model & Process

Anthropic Claude (Semantic Chunking)

The AI creates semantic chunks while preserving metafield information:

  1. Create semantic text chunks (1000 tokens, 200 overlap)
  2. Preserve metafield information in chunk metadata
  3. Link chunks to products - Maintain product relationships
  4. Generate text embeddings (1024D Voyage AI embeddings, updated 2026-04)
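The sliding-window step behind item 1 can be sketched as below, using words as a stand-in for tokens. The platform's real chunker is Claude-driven and semantic, so this only illustrates the 1000/200 size-and-overlap mechanics:

```python
def chunk_text(words, size=1000, overlap=200):
    """Sliding-window chunking: `size`-token chunks with `overlap`
    tokens shared between neighbours (words stand in for tokens)."""
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break
    return chunks

chunks = chunk_text(list(range(2200)))
print(len(chunks))    # → 3
print(chunks[1][0])   # → 800 (second chunk starts 200 tokens early)
```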

How Metafields Are Preserved in Chunks

Metadata Extraction:

Each chunk stores its product_name, page_range, and a metafields dictionary (e.g., dimensions, material, finish, colors, patterns) along with a metafield_sources dictionary tracking where each value came from (e.g., text_extraction).
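A hypothetical chunk record following that description (the values are taken from the VALENOVA example later in this guide; the exact key names are assumptions about the storage schema):

```python
# Hypothetical chunk metadata record -- keys follow the description above.
chunk = {
    "product_name": "VALENOVA",
    "page_range": "12-13",
    "metafields": {
        "material": "White Body Tile",
        "finish": "matte",
        "colors": ["clay", "sand", "white", "taupe"],
        "dimensions": "11.8×11.8",
    },
    "metafield_sources": {           # provenance per metafield
        "material": "text_extraction",
        "finish": "text_extraction",
    },
}

# Every tracked source should correspond to a stored metafield
assert set(chunk["metafield_sources"]) <= set(chunk["metafields"])
```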

Metafield Linking in Chunks


🖼️ Stage 3: Image Processing & Visual Metafield Extraction

AI Models & Process

The AI extracts images and analyzes them for visual metafields:

  1. Extract images from product pages
  2. Analyze images for material properties and visual characteristics
  3. Identify visual metafields:
    • Color detection
    • Texture analysis
    • Finish identification
    • Pattern recognition
    • Material appearance
  4. Perform OCR on images to extract text
  5. Generate CLIP embeddings (512D) for visual search

How AI Identifies Visual Metafields

Color Detection:

Texture Analysis:

Material Recognition:

Pattern Recognition:

Visual Metafield Linking


🏭 Stage 4: Product Creation & Metafield Consolidation

AI Models & Process

The AI creates product records and consolidates metafields from all sources:

  1. Create product records from chunks and images
  2. Consolidate metafields from multiple sources (text, images, OCR)
  3. Validate completeness of metafield data
  4. Enrich with additional metadata (designer, studio, category)
  5. Link chunks and images to products
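One plausible consolidation policy for step 2 is "highest confidence wins" across sources; the platform's exact conflict-resolution rules may differ, so treat this as a sketch:

```python
def consolidate(sources):
    """Merge metafield candidates from text / image / OCR sources,
    keeping the highest-confidence value per field."""
    merged = {}
    for source, fields in sources.items():
        for name, (value, conf) in fields.items():
            if name not in merged or conf > merged[name][1]:
                merged[name] = (value, conf, source)
    return merged

result = consolidate({
    "text":  {"material": ("White Body Tile", 0.98)},
    "image": {"finish":   ("matte", 0.92),
              "material": ("ceramic", 0.80)},  # loses to text (0.98)
})
print(result["material"])  # → ('White Body Tile', 0.98, 'text')
```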

How AI Consolidates Metafields

Multi-Source Consolidation:

Conflict Resolution:

Enrichment Process:

Metafield Consolidation Details


🔗 Stage 12: Metafield Extraction & Database Linking

Process & Purpose

The final stage extracts structured metafields from product records and creates database relationships:

  1. Parse metafield values from product metadata
  2. Identify metafield types (200+ types supported)
  3. Create metafield records if they don't already exist
  4. Create metafield_values records for each value
  5. Link to products, chunks, and images
  6. Store confidence scores and extraction method
  7. Enable search and filtering by metafields

How Metafields Are Linked

Product Linking:

Chunk Linking:

Image Linking:

Supported Metafield Types (200+)

Material Properties (25+ types)

Dimensions & Size (15+ types)

Appearance (20+ types)

Performance (20+ types)

Application & Use (25+ types)

Compliance & Certifications (20+ types)

Commercial & Availability (25+ types)

Design & Aesthetics (20+ types)

Product Information (20+ types)

Technical Specifications (20+ types)

Visual & Sensory (15+ types)

Packaging & Delivery (15+ types)

Maintenance & Care (15+ types)

Linking Relationships Diagram

```
Product (VALENOVA)
├── product_metafield_values
│   ├── material: "White Body Tile" (confidence: 0.98)
│   ├── dimensions: "11.8×11.8" (confidence: 0.95)
│   ├── finish: "matte" (confidence: 0.92)
│   └── colors: ["clay", "sand", "white", "taupe"] (confidence: 0.91-0.96)
│
├── document_chunks
│   └── chunk_123
│       └── chunk_metafield_values
│           ├── material: "White Body Tile" (confidence: 0.98)
│           └── dimensions: "11.8×11.8" (confidence: 0.95)
│
└── document_images
    └── img_789
        └── image_metafield_values
            ├── finish: "matte" (confidence: 0.92)
            └── colors: ["clay", "sand"] (confidence: 0.93-0.94)
```


🔄 Metafield Linking Process

Metafield values are inserted into the product_metafield_values, chunk_metafield_values, and image_metafield_values tables, respectively. Each record contains the entity ID (product, chunk, or image), the field_id, the extracted value_text, a confidence_score, the extraction_method, and a timestamp.
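Routing a value to the right link table can be sketched like this. The three table names come from the paragraph above; everything else (function name, column keys) is illustrative:

```python
# Table names are from the docs; the routing helper is a sketch.
LINK_TABLES = {
    "product": "product_metafield_values",
    "chunk":   "chunk_metafield_values",
    "image":   "image_metafield_values",
}

def link_record(entity_type, entity_id, field_id, value, conf, method, ts):
    """Build (table_name, row) for one metafield link."""
    return (LINK_TABLES[entity_type], {
        f"{entity_type}_id": entity_id,
        "field_id": field_id,
        "value_text": value,
        "confidence_score": conf,
        "extraction_method": method,
        "created_at": ts,
    })

table, row = link_record("chunk", "chunk_123", "mf_material",
                         "White Body Tile", 0.98, "ai_text", "2026-04-01")
print(table)  # → chunk_metafield_values
```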


📊 Accuracy & Performance

Extraction Accuracy

Processing Speed

Success Rate


🔍 Searching by Metafields

Property Search API

GET /api/search/properties?material=ceramic&color=white&limit=20 returns matching products with their metafield values (material, color, dimensions, etc.) and a response time in milliseconds.
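Composing that request from filter parameters might look as follows; the endpoint path is from the example above, while the base URL and helper are illustrative:

```python
from urllib.parse import urlencode

def property_search_url(base, **filters):
    """Build the property-search GET URL from keyword filters."""
    return f"{base}/api/search/properties?{urlencode(filters)}"

url = property_search_url("https://example.test",
                          material="ceramic", color="white", limit=20)
print(url)  # → https://example.test/api/search/properties?material=ceramic&color=white&limit=20
```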

Metafield Filtering


📈 Metafield Management

Create Metafield

POST /api/metafields with a JSON body containing name, type, and workspace_id returns the created metafield id and created_at timestamp.

Get Metafield Values

GET /api/products/{product_id}/metafields returns the product's metafields array, each entry containing field_id, name, value, and confidence_score.
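A sketch of building the creation payload for POST /api/metafields. The body fields (name, type, workspace_id) follow the description above; the allowed type vocabulary here is an assumption ("multiselect" appears elsewhere in this guide, the rest are guesses), so check the real API:

```python
import json

def create_metafield_payload(name, type_, workspace_id):
    """JSON body for POST /api/metafields.
    The type vocabulary below is assumed, not confirmed."""
    assert type_ in {"text", "number", "multiselect", "boolean"}
    return json.dumps({
        "name": name,
        "type": type_,
        "workspace_id": workspace_id,
    })

print(create_metafield_payload("finish", "text", "ws_1"))
```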


✅ Best Practices

  1. Validate Confidence Scores - Only use metafields with confidence > 0.85
  2. Link Multiple Sources - Link same metafield to product, chunks, and images
  3. Support Multiple Values - Use multiselect for colors, patterns, variants
  4. Track Extraction Method - Document whether extracted by AI, OCR, or manual
  5. Monitor Accuracy - Track extraction accuracy over time
  6. Update Regularly - Refresh metafields when products are updated
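Best practice 1 can be applied as a simple filter over the metafields returned by the product endpoint; the record shape follows the GET response described above:

```python
def trusted_metafields(metafields, threshold=0.85):
    """Keep only metafields whose confidence exceeds the threshold
    (best practice 1 above)."""
    return [m for m in metafields if m["confidence_score"] > threshold]

fields = [
    {"name": "material", "value": "ceramic", "confidence_score": 0.98},
    {"name": "pattern",  "value": "veined",  "confidence_score": 0.60},
]
print([m["name"] for m in trusted_metafields(fields)])  # → ['material']
```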

🚀 Integration Points


📊 Complete Processing Summary

5-Stage Metafield Processing Pipeline

| Stage | AI Model | Input | Process | Output | Accuracy |
|-------|----------|-------|---------|--------|----------|
| 0 | Claude Sonnet 4.5 / GPT-4o | Full PDF | Identify products & metafield types | Product catalog with metafield types | 88%+ |
| 2 | Anthropic Claude | Product pages | Create chunks, preserve metafields | Chunks with metafield metadata | 88%+ |
| 3 | Qwen Vision + CLIP | Images | Analyze for visual metafields | Images with colors, texture, finish | 85-94% |
| 4 | Claude Haiku 4.5 → Sonnet 4.5 | Chunks + Images | Consolidate metafields | Product records with consolidated metafields | 95%+ |
| 12 | Extract & Link | Product metadata | Create database records, link to products/chunks/images | metafield_values linked | 100% |

Key Metrics

Extraction Accuracy:

Processing Performance:

Success Rate:

Metafield Types Supported (200+)

Material Properties (20+ types): Material composition, Texture, Finish, Pattern, Weight, Density, Durability, Water resistance

Dimensions & Size (10+ types): Length, Width, Height, Thickness, Diameter, Area, Volume, Weight per unit

Appearance (15+ types): Color, Gloss level, Surface treatment, Transparency, Pattern type, Grain direction

Performance (15+ types): Durability rating, Water resistance, Fire rating, Slip resistance, Wear rating, Stain resistance

Application (20+ types): Recommended use, Installation method, Maintenance, Care instructions, Compatibility, Limitations

Compliance (15+ types): Certifications, Standards, Environmental, Safety ratings, Compliance marks

Commercial (20+ types): Pricing, Availability, Lead time, Supplier, SKU, Variants

Other (20+ types): Designer, Studio, Category, Related products, Variants, Specifications

How Materials Are Handled

Material Identification:

  1. Stage 0: Claude identifies material types from PDF (e.g., "White Body Tile", "Ceramic")
  2. Stage 2: Chunks preserve material information in metadata
  3. Stage 3: Qwen Vision analyzes material appearance (texture, finish, gloss)
  4. Stage 4: Claude consolidates material data from all sources
  5. Stage 12: Material metafield linked to product, chunks, and images

Material Properties Extracted:

Material Linking:

Example: VALENOVA Material Processing

Each stage produces progressively richer output:


✨ Key Features Summary

✅ Automatic Identification - AI identifies metafield types in PDFs
✅ Multi-Source Extraction - Extract from chunks, images, and text
✅ Confidence Scoring - Track extraction confidence (0.0-1.0)
✅ Extraction Method Tracking - Know if extracted by AI, OCR, or manual
✅ 200+ Metafield Types - Support comprehensive material properties
✅ Relationship Linking - Link to products, chunks, and images
✅ Search Integration - Filter and find by metafields
✅ Dynamic Creation - Create new metafield types as needed
✅ Type Validation - Validate metafield values by type
✅ Multi-Value Support - Support multiple values per metafield
✅ Material Handling - Specialized processing for material properties
✅ Visual Analysis - Extract visual properties from images
✅ Performance Optimized - Fast extraction and linking
✅ Production Ready - Enterprise-grade implementation


📚 Related Documentation