Metafield Extraction & Processing - Complete Guide
This guide explains how the Material Kai Vision Platform identifies, extracts, and processes metafields (structured metadata) from PDF catalogs using AI-powered multi-stage processing.
Overview
Metafields are structured metadata attributes extracted from PDF catalogs and linked to products, chunks, and images. The platform supports 200+ metafield types with AI-powered identification, extraction, and processing across 5 dedicated stages (Stages 0, 2, 3, 4, and 12 of the full pipeline).
Key Capabilities:
- Automatic identification of metafield types using Claude AI
- Multi-source extraction (text chunks, images, OCR)
- Confidence scoring (0.0-1.0) for each extracted value
- Extraction method tracking (AI, OCR, manual)
- Relationship linking to products, chunks, and images
- 200+ metafield types supported
- 88%+ extraction accuracy
What Are Metafields?
Metafields are dynamic, structured data attributes that describe material properties and characteristics.
Real-World Examples
- Material Composition: "100% Wool", "Polyester Blend", "White Body Tile"
- Dimensions: "11.8×11.8 inches", "2.5cm thickness", "Length × Width"
- Weight: "250g/m²", "5kg", "Weight per unit"
- Color: "Clay", "Sand", "White", "Taupe"
- Texture: "Matte", "Glossy", "Embossed", "Smooth"
- Application: "Wall Tiles", "Floor Tiles", "Decorative", "Recommended use"
- Care Instructions: "Dry Clean Only", "Machine Wash", "Maintenance"
- Certifications: "ISO 9001", "LEED Certified", "Safety ratings"
- Pricing: "$45.99/unit", "€35.50", "Lead time"
- Availability: "In Stock", "Made to Order", "Supplier info"
🔄 Complete Processing Pipeline
PDF Upload
↓
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 0: Product Discovery & Metafield Identification (0-15%) │
│ AI Model: Claude Sonnet 4.5 / GPT-4o │
│ Purpose: Identify products and metafield types │
│ Output: Product catalog with metafield types │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: Focused Extraction (15-30%) │
│ Process: Extract ONLY pages containing identified products │
│ Output: Focused PDF with product content │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: Semantic Chunking with Metafield Preservation (30-50%)│
│ AI: Anthropic Claude │
│ Purpose: Create chunks, preserve metafield context │
│ Output: Chunks with metafield metadata │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 3: Image Processing & Visual Metafield Extraction (50-70%)│
│ AI: Qwen Vision 4 Scout 17B + CLIP │
│ Purpose: Extract images, analyze for visual metafields │
│ Output: Images with detected colors, texture, finish │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 4: Product Creation & Metafield Consolidation (70-90%) │
│ AI: Claude Haiku 4.5 → Claude Sonnet 4.5 │
│ Purpose: Create products, consolidate all metafields │
│ Output: Product records with consolidated metafields │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 12: Metafield Extraction & Database Linking (95-97%) │
│ Process: Extract & Link metafields to database │
│ Purpose: Create metafield_values records, link to products │
│ Output: metafield_values linked to products/chunks/images │
└─────────────────────────────────────────────────────────────────┘
↓
✅ COMPLETE - All metafields extracted and linked
🔍 Stage 0: Product Discovery - Initial Metafield Identification
AI Model & Process
Claude Sonnet 4.5 / GPT-4o
The AI analyzes the entire PDF to:
- Identify product boundaries - Detect where each product starts/ends
- Extract product names - Get official product names
- Detect metafield types - Identify what metadata is present (material, color, dimensions, etc.)
- Map images to products - Link product images to product records
- Create product catalog - Build preliminary product structure with metafields
How AI Identifies Metafields
Text-Based Identification:
- Looks for specification tables (Material, Dimensions, Weight, etc.)
- Identifies property lists (Colors: Clay, Sand, White)
- Recognizes dimension patterns (11.8×11.8, 2.5cm, etc.)
- Detects material descriptions (White Body Tile, Ceramic, etc.)
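The dimension patterns listed above can be illustrated with a small regular-expression sketch. This is not how the platform's AI models identify dimensions; it only shows the kind of surface pattern involved. The function name `find_dimension_patterns` and the unit list are assumptions made for this example.

```python
import re

# Hypothetical unit vocabulary for this example only.
_UNIT = r"(?:mm|cm|m|inches|inch|in)"

# Two alternatives: "A×B" with an optional unit, or a single number with a unit.
_DIMENSION_RE = re.compile(
    rf"\d+(?:\.\d+)?\s*[×x]\s*\d+(?:\.\d+)?(?:\s*{_UNIT}\b)?"  # e.g. 11.8×11.8 inches
    rf"|\d+(?:\.\d+)?\s*{_UNIT}\b",                             # e.g. 2.5cm
    re.IGNORECASE,
)


def find_dimension_patterns(text: str) -> list[str]:
    """Return every substring of `text` that looks like a dimension."""
    return [match.group(0).strip() for match in _DIMENSION_RE.finditer(text)]


print(find_dimension_patterns("Size: 11.8×11.8 inches, 2.5cm thickness"))
```

A pattern like this only finds candidates; deciding which metafield a match belongs to (dimensions versus thickness, for instance) still depends on the surrounding context.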
Context-Based Identification:
- Analyzes product descriptions for properties
- Identifies care instructions and certifications
- Recognizes pricing and availability information
- Detects application and use recommendations
Accuracy Metrics
- Product Detection: 95%+
- Metafield Type Identification: 88%+
- Confidence Score Range: 0.85-0.99
- Success Rate: 95%+
📄 Stage 1: Focused Extraction - Extract Product Pages
Process
- Extract only pages containing identified products (from Stage 0)
- Preserve metafield context - Keep all metadata information
- Prepare for detailed analysis in next stages
- Optimize for processing efficiency
Output
- Product pages extracted (focused PDF)
- Metafield context preserved
- Ready for semantic chunking
- Reduced processing scope (only product pages)
📝 Stage 2: Semantic Chunking with Metafield Preservation
AI Model & Process
Anthropic Claude (Semantic Chunking)
The AI creates semantic chunks while preserving metafield information:
- Create semantic text chunks (1000 tokens with a 200-token overlap)
- Preserve metafield information in chunk metadata
- Link chunks to products - Maintain product relationships
- Generate text embeddings (1024D Voyage AI embeddings, updated 2026-04)
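To make the chunk size and overlap figures concrete, here is a minimal sliding-window sketch. The platform's chunking is semantic and AI-driven, so this shows only the size/overlap arithmetic, not the actual boundary selection. The function name `chunk_tokens` is an assumption, and tokens are treated as an already-split list.

```python
def chunk_tokens(tokens: list[str], size: int = 1000, overlap: int = 200) -> list[list[str]]:
    """Split `tokens` into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap  # each new window starts this many tokens after the last one
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the input
    return chunks


tokens = [str(i) for i in range(2500)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the default values each window begins 800 tokens after the previous one, so the last 200 tokens of one chunk are repeated as the first 200 tokens of the next.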
How Metafields Are Preserved in Chunks
Metadata Extraction:
- Extract metafield values from chunk content
- Identify which metafields are mentioned in each chunk
- Store metafield references in chunk metadata
- Maintain context for later linking
Each chunk stores its product_name, page_range, and a metafields dictionary (e.g., dimensions, material, finish, colors, patterns) along with a metafield_sources dictionary tracking where each value came from (e.g., text_extraction).
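The chunk metadata described above could be modeled roughly as follows. The class name `ChunkMetadata`, the `add_metafield` helper, and the page range in the usage lines are hypothetical; only the four field names come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkMetadata:
    product_name: str
    page_range: tuple[int, int]
    # Metafield name -> extracted value (a string, or a list for multi-value fields).
    metafields: dict[str, Any] = field(default_factory=dict)
    # Metafield name -> where the value came from, e.g. "text_extraction".
    metafield_sources: dict[str, str] = field(default_factory=dict)

    def add_metafield(self, name: str, value: Any, source: str) -> None:
        """Record a value and its source together so the two dictionaries stay in sync."""
        self.metafields[name] = value
        self.metafield_sources[name] = source


chunk = ChunkMetadata(product_name="VALENOVA", page_range=(12, 14))  # page range is made up
chunk.add_metafield("material", "White Body Tile", "text_extraction")
chunk.add_metafield("colors", ["clay", "sand", "white", "taupe"], "text_extraction")
```

Writing both dictionaries through one helper is a design choice of this sketch: it guarantees that every stored value has a recorded source.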
Metafield Linking in Chunks
- Each chunk stores metafield values found in its content
- Metafield sources tracked (text extraction, OCR, etc.)
- Chunks linked to products for relationship tracking
- Enables searching by metafields within chunks
🖼️ Stage 3: Image Processing & Visual Metafield Extraction
AI Models & Process
- Qwen Vision 4 Scout 17B: Advanced image analysis
- CLIP: Image embeddings (512D) for visual similarity
The AI extracts images and analyzes them for visual metafields:
- Extract images from product pages
- Analyze images for material properties and visual characteristics
- Identify visual metafields:
  - Color detection
  - Texture analysis
  - Finish identification
  - Pattern recognition
  - Material appearance
- Perform OCR on images to extract text
- Generate CLIP embeddings (512D) for visual search
How AI Identifies Visual Metafields
Color Detection:
- Analyzes pixel data to identify dominant colors
- Matches colors to material color names (clay, sand, white, taupe)
- Detects color variations and patterns
- Accuracy: 94%+
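As a rough illustration of the first two steps above (finding a dominant color and matching it to a material color name), the sketch below buckets pixels and picks the nearest named color by squared RGB distance. The platform relies on a vision model for this; the reference RGB values, the bucket size, and both function names are assumptions made for the example.

```python
from collections import Counter

# Illustrative reference values only; not taken from the platform.
NAMED_COLORS = {
    "clay": (182, 106, 80),
    "sand": (194, 178, 128),
    "white": (255, 255, 255),
    "taupe": (72, 60, 50),
}


def dominant_rgb(pixels: list[tuple[int, int, int]], bucket: int = 32) -> tuple[int, int, int]:
    """Return the center of the most common color bucket among `pixels`."""
    counts = Counter((r // bucket, g // bucket, b // bucket) for r, g, b in pixels)
    (rb, gb, bb), _ = counts.most_common(1)[0]
    half = bucket // 2
    return (rb * bucket + half, gb * bucket + half, bb * bucket + half)


def nearest_color_name(rgb: tuple[int, int, int]) -> str:
    """Return the named color with the smallest squared RGB distance to `rgb`."""
    return min(
        NAMED_COLORS,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, NAMED_COLORS[name])),
    )


pixels = [(250, 250, 250)] * 6 + [(190, 175, 130)] * 3
print(nearest_color_name(dominant_rgb(pixels)))
```

Plain RGB distance is a simplification; a perceptual color space would usually give matches closer to how people name colors.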
Texture Analysis:
- Analyzes surface patterns and characteristics
- Identifies texture types (matte, glossy, embossed, smooth)
- Detects surface treatment and finish
- Accuracy: 85%+
Material Recognition:
- Identifies material type from visual appearance
- Recognizes ceramic, tile, fabric, wood, etc.
- Detects material properties (porosity, gloss, etc.)
- Accuracy: 90%+
Pattern Recognition:
- Identifies patterns and designs
- Counts pattern variations
- Detects geometric or organic patterns
- Accuracy: 88%+
Visual Metafield Linking
- Each image stores detected visual metafields
- Confidence scores track extraction reliability
- Images linked to products for relationship tracking
- Enables visual similarity search by color, texture, material
🏭 Stage 4: Product Creation & Metafield Consolidation
AI Models & Process
- Claude Haiku 4.5: Initial product creation from chunks
- Claude Sonnet 4.5: Validation, enrichment, and consolidation
The AI creates product records and consolidates metafields from all sources:
- Create product records from chunks and images
- Consolidate metafields from multiple sources (text, images, OCR)
- Validate completeness of metafield data
- Enrich with additional metadata (designer, studio, category)
- Link chunks and images to products
How AI Consolidates Metafields
Multi-Source Consolidation:
- Merges metafields from chunks (text-based)
- Merges metafields from images (visual-based)
- Resolves conflicts (e.g., multiple color values)
- Validates data consistency
Conflict Resolution:
- When multiple sources provide different values, the AI selects the value with the highest confidence
- Combines multi-value fields (colors, patterns, variants)
- Validates against known material properties
- Tracks which source provided each value
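The conflict-resolution rules above can be sketched as a small merge function. This is an illustration of the stated rules (highest confidence wins for single-value fields, multi-value fields are combined, the source is kept), not the platform's implementation. The `Candidate` class, the `consolidate` function, and the set of multi-value field names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str          # metafield name, e.g. "finish"
    value: str
    confidence: float  # 0.0-1.0
    source: str        # e.g. "text", "image", "ocr"


# Assumed list of fields that may hold several values at once.
MULTI_VALUE_FIELDS = {"colors", "patterns", "variants"}


def consolidate(candidates: list[Candidate]) -> dict:
    """Merge candidates into one entry per field, keeping the source of each kept value."""
    result: dict = {}
    for cand in candidates:
        if cand.name in MULTI_VALUE_FIELDS:
            entries = result.setdefault(cand.name, [])
            existing = next((e for e in entries if e.value == cand.value), None)
            if existing is None:
                entries.append(cand)  # new distinct value: combine
            elif cand.confidence > existing.confidence:
                entries[entries.index(existing)] = cand  # same value, better evidence
        else:
            best = result.get(cand.name)
            if best is None or cand.confidence > best.confidence:
                result[cand.name] = cand  # single-value field: most confident wins
    return result


merged = consolidate([
    Candidate("finish", "matte", 0.92, "image"),
    Candidate("finish", "glossy", 0.60, "text"),
    Candidate("colors", "clay", 0.93, "image"),
    Candidate("colors", "sand", 0.94, "image"),
    Candidate("colors", "clay", 0.96, "text"),
])
```

Because each kept entry is a full `Candidate`, the consolidated record still shows which source supplied the value and how confident that source was.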
Enrichment Process:
- Extracts designer and studio information
- Identifies product category
- Detects variants and options
- Adds related product information
Metafield Consolidation Details
- All metafields consolidated from chunks and images
- Source tracking for each metafield value
- Confidence scores maintained
- Variants properly structured
- Ready for database linking
🔗 Stage 12: Metafield Extraction & Database Linking
Process & Purpose
The final stage extracts structured metafields from product records and creates database relationships:
- Parse metafield values from product metadata
- Identify metafield types (200+ types supported)
- Create metafield records if they do not already exist
- Create metafield_values records for each value
- Link to products, chunks, and images
- Store confidence scores and extraction method
- Enable search and filtering by metafields
How Metafields Are Linked
Product Linking:
- Extract metafield values from product metadata
- Create product_metafield_values records
- Link each value to the product
- Store confidence score and extraction method
Chunk Linking:
- Extract metafield values from chunk metadata
- Create chunk_metafield_values records
- Link each value to the chunk
- Track which chunks contain which metafields
Image Linking:
- Extract visual metafields from image metadata
- Create image_metafield_values records
- Link each value to the image
- Enable visual search by metafields
Supported Metafield Types (200+)
Material Properties (25+ types)
- Material composition, material type, material blend, fiber content, yarn type, yarn weight, yarn count, texture, texture type, surface texture, finish, finish type, surface finish, pattern, pattern type, pattern repeat, weight, weight per unit, density, durability rating, wear resistance, pilling resistance, shrinkage rate, color fastness, flammability rating
Dimensions & Size (15+ types)
- Length, width, height, thickness, diameter, radius, circumference, area, volume, weight, depth, size, size range, aspect ratio, scale
Appearance (20+ types)
- Color, color name, color code, color family, color variation, gloss level, gloss type, sheen, surface treatment, surface type, transparency, opacity, grain, grain direction, grain pattern, texture appearance, finish appearance, pattern appearance, visual effect, aesthetic style
Performance (20+ types)
- Durability rating, durability class, water resistance, water repellency, stain resistance, stain protection, fire rating, fire resistance, flammability, slip resistance, slip rating, wear rating, wear class, abrasion resistance, tensile strength, tear strength, pilling resistance, fading resistance, moisture resistance, chemical resistance
Application & Use (25+ types)
- Recommended use, application, application area, suitable for, not suitable for, installation method, installation type, installation difficulty, mounting type, orientation, placement, room type, traffic level, maintenance, care instructions, cleaning method, cleaning products, washing instructions, drying instructions, storage instructions, compatibility, compatible with, incompatible with, limitations, restrictions
Compliance & Certifications (20+ types)
- Certifications, certification type, standards, standard compliance, environmental certification, eco-friendly, sustainability rating, recycled content, recyclable, biodegradable, VOC rating, safety rating, safety standards, compliance marks, testing standards, quality standards, industry standards, regulatory compliance, health & safety, allergen information
Commercial & Availability (25+ types)
- Pricing, price per unit, price range, currency, availability, stock status, in stock, out of stock, lead time, delivery time, supplier, supplier name, manufacturer, manufacturer name, brand, brand name, SKU, product code, product ID, variant code, batch number, production date, expiration date, warranty, warranty period, return policy
Design & Aesthetics (20+ types)
- Designer, designer name, studio, studio name, design style, design era, design movement, aesthetic, aesthetic style, visual style, artistic style, inspiration, inspired by, collection, collection name, series, series name, limited edition, edition number, collaboration, collaborator
Product Information (20+ types)
- Product name, product type, product category, category, subcategory, product line, product family, product group, description, product description, features, key features, benefits, unique selling points, variants, variant type, variant options, related products, complementary products, accessories, replacement parts
Technical Specifications (20+ types)
- Specifications, technical specs, composition, construction, construction method, manufacturing process, production method, quality level, grade, class, rating, certification level, performance level, specification sheet, technical documentation, test results, test data, compliance documentation, safety documentation, environmental documentation
Visual & Sensory (15+ types)
- Color palette, color scheme, color combination, texture feel, surface feel, touch sensation, visual weight, visual balance, visual harmony, aesthetic appeal, design appeal, sensory experience, tactile quality, visual quality, overall impression
Packaging & Delivery (15+ types)
- Packaging type, packaging material, packaging size, packaging weight, shipping weight, shipping dimensions, shipping method, delivery method, delivery options, handling instructions, storage requirements, storage conditions, temperature range, humidity range, special handling
Maintenance & Care (15+ types)
- Maintenance level, maintenance frequency, maintenance requirements, care level, care difficulty, cleaning frequency, cleaning difficulty, special care, professional cleaning, DIY cleaning, maintenance cost, maintenance products, recommended products, prohibited products, lifespan
Linking Relationships Diagram
Product (VALENOVA)
├── product_metafield_values
│   ├── material: "White Body Tile" (confidence: 0.98)
│   ├── dimensions: "11.8×11.8" (confidence: 0.95)
│   ├── finish: "matte" (confidence: 0.92)
│   └── colors: ["clay", "sand", "white", "taupe"] (confidence: 0.91-0.96)
│
├── document_chunks
│   └── chunk_123
│       └── chunk_metafield_values
│           ├── material: "White Body Tile" (confidence: 0.98)
│           └── dimensions: "11.8×11.8" (confidence: 0.95)
│
└── document_images
    └── img_789
        └── image_metafield_values
            ├── finish: "matte" (confidence: 0.92)
            └── colors: ["clay", "sand"] (confidence: 0.93-0.94)
🔄 Metafield Linking Process
Metafield values are inserted into the product_metafield_values, chunk_metafield_values, and image_metafield_values tables, depending on the entity they describe. Each record contains the entity ID (product, chunk, or image), the field_id, the extracted value_text, a confidence_score, the extraction_method, and a timestamp.
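A record of this shape could look like the following in-memory SQLite sketch for the product table. The column names `product_id` and `created_at`, the column types, and the helper names are assumptions; the real schema is documented separately and may differ.

```python
import sqlite3
from datetime import datetime, timezone


def create_schema(conn: sqlite3.Connection) -> None:
    """Create an illustrative product_metafield_values table (assumed column types)."""
    conn.execute(
        """
        CREATE TABLE product_metafield_values (
            product_id        TEXT NOT NULL,  -- assumed name for the entity ID
            field_id          TEXT NOT NULL,
            value_text        TEXT NOT NULL,
            confidence_score  REAL CHECK (confidence_score BETWEEN 0.0 AND 1.0),
            extraction_method TEXT,
            created_at        TEXT NOT NULL   -- assumed name for the timestamp
        )
        """
    )


def insert_product_metafield(conn, product_id, field_id, value_text,
                             confidence_score, extraction_method) -> None:
    """Insert one metafield value with a UTC timestamp."""
    conn.execute(
        "INSERT INTO product_metafield_values VALUES (?, ?, ?, ?, ?, ?)",
        (product_id, field_id, value_text, confidence_score, extraction_method,
         datetime.now(timezone.utc).isoformat()),
    )


conn = sqlite3.connect(":memory:")
create_schema(conn)
insert_product_metafield(conn, "prod_1", "material", "White Body Tile", 0.98, "AI")
```

The chunk and image tables would follow the same pattern with a chunk or image ID in place of the product ID.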
📊 Accuracy & Performance
Extraction Accuracy
- Metafield Extraction: 88%+
- Material Recognition: 90%+
- Dimension Extraction: 92%+
- Color Detection: 94%+
Processing Speed
- Product Discovery: 3-5 seconds
- Image Analysis: 2-4 seconds per image
- Metafield Extraction: 1-2 seconds per product
Success Rate
- Complete Extraction: 95%+
- Partial Extraction: 4%
- Failed Extraction: 1%
🔍 Searching by Metafields
Property Search API
GET /api/search/properties?material=ceramic&color=white&limit=20 returns matching products with their metafield values (material, color, dimensions, etc.) and a response time in milliseconds.
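A request URL for this endpoint can be assembled as in the sketch below. The host `https://example.invalid` and the function name are placeholders; only the path and the `material`, `color`, and `limit` parameters come from the example above.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.invalid"  # placeholder host, not the real API address


def build_property_search_url(base_url: str, limit: int = 20, **filters: str) -> str:
    """Build a property-search URL, skipping filters that are None."""
    params = {key: value for key, value in filters.items() if value is not None}
    params["limit"] = limit
    return f"{base_url}/api/search/properties?{urlencode(params)}"


print(build_property_search_url(BASE_URL, material="ceramic", color="white"))
```

Using `urlencode` rather than string concatenation ensures that filter values containing spaces or symbols are escaped correctly.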
Metafield Filtering
- Filter by material type
- Filter by color
- Filter by texture
- Filter by dimensions
- Filter by application
- Combine multiple filters
📈 Metafield Management
Create Metafield
POST /api/metafields with a JSON body containing name, type, and workspace_id returns the created metafield id and created_at timestamp.
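The request described above could be prepared as follows using only the standard library. The host, the example values (`"multiselect"`, `"ws_123"`), and the function name are hypothetical, and any authentication the API requires is not shown. The sketch builds the request without sending it.

```python
import json
from urllib.request import Request


def build_create_metafield_request(base_url: str, name: str, type_: str, workspace_id: str) -> Request:
    """Build (but do not send) a POST /api/metafields request with a JSON body."""
    body = json.dumps({"name": name, "type": type_, "workspace_id": workspace_id}).encode("utf-8")
    return Request(
        f"{base_url}/api/metafields",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and example values; sending would use urllib.request.urlopen(req).
req = build_create_metafield_request("https://example.invalid", "colors", "multiselect", "ws_123")
print(req.get_method(), req.full_url)
```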
Get Metafield Values
GET /api/products/{product_id}/metafields returns the product's metafields array, each entry containing field_id, name, value, and confidence_score.
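A client might parse that response into typed objects along these lines. The top-level key `metafields`, the class name, and the sample payload are assumptions for illustration; the four per-entry fields are the ones listed above.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class MetafieldValue:
    field_id: str
    name: str
    value: Any
    confidence_score: float


def parse_product_metafields(payload: str) -> list[MetafieldValue]:
    """Parse a JSON response body into MetafieldValue objects."""
    data = json.loads(payload)
    return [
        MetafieldValue(
            field_id=item["field_id"],
            name=item["name"],
            value=item["value"],
            confidence_score=float(item["confidence_score"]),
        )
        for item in data.get("metafields", [])  # assumed top-level key
    ]


# Made-up sample payload in the assumed shape.
sample = '{"metafields": [{"field_id": "f1", "name": "material", "value": "White Body Tile", "confidence_score": 0.98}]}'
values = parse_product_metafields(sample)
```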
✅ Best Practices
- Validate Confidence Scores - Only use metafields with confidence > 0.85
- Link Multiple Sources - Link same metafield to product, chunks, and images
- Support Multiple Values - Use multiselect for colors, patterns, variants
- Track Extraction Method - Document whether extracted by AI, OCR, or manual
- Monitor Accuracy - Track extraction accuracy over time
- Update Regularly - Refresh metafields when products are updated
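The first practice above can be applied with a simple filter. This sketch assumes each metafield entry is a dictionary with a `confidence_score` key, as in the Get Metafield Values response; the function name and the idea of routing low-confidence values to review are illustrative.

```python
def split_by_confidence(values: list[dict], threshold: float = 0.85) -> tuple[list[dict], list[dict]]:
    """Separate entries with confidence above `threshold` from those that need review."""
    accepted = [v for v in values if v["confidence_score"] > threshold]
    needs_review = [v for v in values if v["confidence_score"] <= threshold]
    return accepted, needs_review


accepted, needs_review = split_by_confidence([
    {"name": "material", "value": "White Body Tile", "confidence_score": 0.98},
    {"name": "finish", "value": "matte", "confidence_score": 0.80},
])
```

Keeping the low-confidence entries in a separate list, rather than discarding them, leaves room for manual correction later.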
🚀 Integration Points
- Search: Filter and find materials by metafields
- Analytics: Track metafield usage and trends
- Admin: Manage metafield definitions
- API: Query and update metafields
- Export: Include metafields in product exports
📊 Complete Processing Summary
5-Stage Metafield Processing Pipeline
| Stage | AI Model | Input | Process | Output | Accuracy |
|---|---|---|---|---|---|
| 0 | Claude Sonnet 4.5 / GPT-4o | Full PDF | Identify products & metafield types | Product catalog with metafield types | 88%+ |
| 2 | Anthropic Claude | Product pages | Create chunks, preserve metafields | Chunks with metafield metadata | 88%+ |
| 3 | Qwen Vision + CLIP | Images | Analyze for visual metafields | Images with colors, texture, finish | 85-94% |
| 4 | Claude Haiku 4.5 → Sonnet 4.5 | Chunks + Images | Consolidate metafields | Product records with consolidated metafields | 95%+ |
| 12 | Extract & Link | Product metadata | Create database records, link to products/chunks/images | metafield_values linked | 100% |
Key Metrics
Extraction Accuracy:
- Product Detection: 95%+
- Material Recognition: 90%+
- Metafield Extraction: 88%+
- Dimension Extraction: 92%+
- Color Detection: 94%+
- Texture Detection: 85%+
Processing Performance:
- Product Discovery: 5-10 seconds per PDF
- Chunk Creation: 2-5 seconds per product
- Image Analysis: 1-3 seconds per image
- Product Creation: 2-4 seconds per product
- Metafield Extraction: 1-2 seconds per product
Success Rate:
- Complete Extraction: 95%+
- Partial Extraction: 4%
- Failed Extraction: 1%
Metafield Types Supported (200+)
Material Properties (20+ types): Material composition, Texture, Finish, Pattern, Weight, Density, Durability, Water resistance
Dimensions & Size (10+ types): Length, Width, Height, Thickness, Diameter, Area, Volume, Weight per unit
Appearance (15+ types): Color, Gloss level, Surface treatment, Transparency, Pattern type, Grain direction
Performance (15+ types): Durability rating, Water resistance, Fire rating, Slip resistance, Wear rating, Stain resistance
Application (20+ types): Recommended use, Installation method, Maintenance, Care instructions, Compatibility, Limitations
Compliance (15+ types): Certifications, Standards, Environmental, Safety ratings, Compliance marks
Commercial (20+ types): Pricing, Availability, Lead time, Supplier, SKU, Variants
Other (20+ types): Designer, Studio, Category, Related products, Variants, Specifications
How Materials Are Handled
Material Identification:
- Stage 0: Claude identifies material types from PDF (e.g., "White Body Tile", "Ceramic")
- Stage 2: Chunks preserve material information in metadata
- Stage 3: Qwen Vision analyzes material appearance (texture, finish, gloss)
- Stage 4: Claude consolidates material data from all sources
- Stage 12: Material metafield linked to product, chunks, and images
Material Properties Extracted:
- Material composition (e.g., "100% Wool", "Ceramic")
- Material type (e.g., "Tile", "Fabric", "Wood")
- Material appearance (texture, finish, gloss)
- Material performance (durability, water resistance, fire rating)
- Material care (maintenance, cleaning instructions)
- Material certifications (ISO, LEED, safety standards)
Material Linking:
- Product level: Material metafield linked to product record
- Chunk level: Material references in chunks linked to metafield
- Image level: Visual material properties linked to images
- Search level: Filter products by material type
Example: VALENOVA Material Processing
Each stage produces progressively richer output:
- Stage 0: Identifies "White Body Tile" as the material with "Ceramic" as type (confidence 0.98)
- Stage 2: Chunk preserves material and material_type in its metafields
- Stage 3: Visual analysis detects texture "matte", finish "smooth", appearance "ceramic", gloss_level "low"
- Stage 4: Consolidates all into a complete product record with consolidation_status "complete"
- Stage 12: All values inserted into the metafield_values table linked to the product
✨ Key Features Summary
✅ Automatic Identification - AI identifies metafield types in PDFs
✅ Multi-Source Extraction - Extract from chunks, images, and text
✅ Confidence Scoring - Track extraction confidence (0.0-1.0)
✅ Extraction Method Tracking - Know if extracted by AI, OCR, or manual
✅ 200+ Metafield Types - Support comprehensive material properties
✅ Relationship Linking - Link to products, chunks, and images
✅ Search Integration - Filter and find by metafields
✅ Dynamic Creation - Create new metafield types as needed
✅ Type Validation - Validate metafield values by type
✅ Multi-Value Support - Support multiple values per metafield
✅ Material Handling - Specialized processing for material properties
✅ Visual Analysis - Extract visual properties from images
✅ Performance Optimized - Fast extraction and linking
✅ Production Ready - Enterprise-grade implementation
📚 Related Documentation
- PDF Processing Pipeline: docs/pdf-processing-pipeline.md - Complete 14-stage pipeline
- Database Schema: docs/database-schema-complete.md - Full database structure
- AI Models Guide: docs/ai-models-guide.md - AI models used in platform
- API Endpoints: docs/api-endpoints.md - All API endpoints
- Features Guide: docs/features-guide.md - Platform features overview
- System Architecture: docs/system-architecture.md - System design