Extract Categories Guide

Overview

The extract_categories parameter controls which content types are extracted from a PDF during processing. It lets you focus extraction on specific categories (products today; certificates, logos, and specifications are planned) and skip irrelevant content.

How It Works

1. Product Discovery (Stage 0)

Claude/GPT analyzes the entire PDF and classifies content into categories:

Products: Product specifications, materials, dimensions
Certificates: Environmental certifications (EPD, LEED, etc.)
Logos: Company logos, brand marks
Specifications: Technical specifications, installation guides
Other: Marketing content, company history, administrative pages

2. Focused Extraction (Stage 1)

Page selection depends on the focused_extraction and extract_categories parameters:

If focused_extraction=True: Extract only pages matching extract_categories
If focused_extraction=False: Extract all pages (ignore extract_categories)
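The selection rule above can be sketched as a small function. This is an illustration only, not the code in rag_routes.py: select_pages, total_pages, and pages_by_category are hypothetical names, pages are assumed to be numbered from 1, and the "all" category is treated like focused_extraction=False, as described in the category list in this guide.

```python
from typing import Dict, Iterable, Set


def select_pages(
    total_pages: int,
    pages_by_category: Dict[str, Set[int]],
    focused_extraction: bool = True,
    extract_categories: Iterable[str] = ("products",),
) -> Set[int]:
    """Return the 1-based page numbers that Stage 1 would extract."""
    all_pages = set(range(1, total_pages + 1))
    requested = {name.strip().lower() for name in extract_categories}

    # Unfocused runs, or the special "all" category, keep every page.
    if not focused_extraction or "all" in requested:
        return all_pages

    # Otherwise keep only pages that discovery assigned to a requested category.
    selected: Set[int] = set()
    for name in requested:
        selected |= pages_by_category.get(name, set())
    return selected & all_pages


# Hypothetical Stage 0 result: products were found on pages 5-11 of a 12-page PDF.
discovered = {"products": {5, 6, 7, 8, 9, 10, 11}}
print(sorted(select_pages(12, discovered)))                         # product pages only
print(len(select_pages(12, discovered, focused_extraction=False)))  # count of all pages
```

A category with no discovered pages simply contributes nothing, so requesting a planned category in this sketch yields an empty selection rather than an error.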

3. Content Processing (Stages 2-5)

Chunks: Created only from extracted pages
Images: Saved only from extracted pages (filtered by category)
Products: Created from the Stage 0 discovery results
Embeddings: Generated for extracted content only

API Parameters

`focused_extraction` (boolean, default: `True`)

Controls whether content is filtered by category.

True: Process only pages matching extract_categories
False: Process entire PDF (all pages, all images)

`extract_categories` (string, default: `"products"`)

Comma-separated list of categories to extract, for example "products,certificates". Applied only when focused_extraction is True.

Available Categories:

products - Product pages (Fully Implemented)
certificates - Certification pages (Planned)
logos - Logo/branding pages (Planned)
specifications - Technical specification pages (Planned)
all - All content (same as focused_extraction=False)
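As a rough sketch of how a comma-separated extract_categories value could be normalized and checked against the list above: the helper names (parse_extract_categories, not_yet_implemented) are hypothetical, and rejecting unknown names with an error is an assumption, since this guide does not say how the service handles them.

```python
from typing import List

# Category names and their status, taken from the list above.
IMPLEMENTED = {"products"}
PLANNED = {"certificates", "logos", "specifications"}
KNOWN = IMPLEMENTED | PLANNED | {"all"}


def parse_extract_categories(value: str = "products") -> List[str]:
    """Split a comma-separated category string into unique, normalized names."""
    names = [part.strip().lower() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in KNOWN]
    if unknown:
        # Assumption: unknown names are an error rather than silently ignored.
        raise ValueError(f"Unknown categories: {', '.join(unknown)}")
    # dict.fromkeys removes duplicates while preserving the caller's order.
    return list(dict.fromkeys(names))


def not_yet_implemented(names: List[str]) -> List[str]:
    """Return the requested categories that this guide marks as planned."""
    return [name for name in names if name in PLANNED]


requested = parse_extract_categories("products, Certificates")
print(requested)
print(not_yet_implemented(requested))
```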

Use Cases

Use Case 1: Product Catalog (Default)

Extract only product information, skip marketing/admin content.

Result:

Chunks from product pages only
Images from product pages only
Products created from discovery
Marketing content skipped
Company history skipped

Use Case 2: Products + Certificates

Extract products and environmental certifications.

Result (once the certificates category is implemented):

Chunks from product pages
Chunks from certificate pages
Images from product pages
Images from certificate pages
Products created
Certificates extracted and linked to products

Use Case 3: Full PDF Processing

Extract everything from the PDF by setting focused_extraction=False or extract_categories="all".

Result:

Chunks from all pages
Images from all pages
Products created from discovery
All content processed
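The three use cases above differ only in the two extraction parameters. The sketch below builds those parameters as plain form fields; build_params and the use-case keys are hypothetical names, and sending booleans as lowercase "true"/"false" strings is an assumption about the request encoding. The endpoint itself is not shown because this guide does not name it.

```python
from typing import Dict, Sequence


def build_params(
    focused_extraction: bool = True,
    extract_categories: Sequence[str] = ("products",),
) -> Dict[str, str]:
    """Build the two extraction parameters as string form fields."""
    return {
        # Assumption: booleans are sent as lowercase "true"/"false".
        "focused_extraction": "true" if focused_extraction else "false",
        "extract_categories": ",".join(extract_categories),
    }


use_cases = {
    # Use Case 1: the documented defaults.
    "product_catalog": build_params(),
    # Use Case 2: certificates is a planned category.
    "products_and_certificates": build_params(
        extract_categories=("products", "certificates")
    ),
    # Use Case 3: extract_categories is ignored when focused_extraction is off.
    "full_pdf": build_params(focused_extraction=False),
}

for name, params in use_cases.items():
    print(name, params)
```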

Implementation Status

Fully Implemented: Products Category

How it works:

Example: Claude analyzes the PDF → identifies products on pages 5-11
product_pages = {5, 6, 7, 8, 9, 10, 11}
If extract_categories="products":
- Chunks created from pages 5-11 only
- Images saved from pages 5-11 only
- Products created from discovery
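The flow above amounts to a membership test against product_pages. A minimal sketch, assuming each chunk and image record carries a page number under a hypothetical "page" key (the actual record shapes in rag_routes.py are not shown in this guide):

```python
from typing import Dict, List, Set

# Pages that discovery identified as product pages in the example above.
product_pages: Set[int] = {5, 6, 7, 8, 9, 10, 11}


def keep_on_pages(items: List[Dict], pages: Set[int]) -> List[Dict]:
    """Keep only the items whose page number is in the allowed set."""
    return [item for item in items if item["page"] in pages]


# Illustrative records; the field names and values are made up for this sketch.
chunks = [
    {"page": 2, "text": "Company history"},
    {"page": 5, "text": "Product A dimensions"},
]
images = [
    {"page": 1, "file": "logo.png"},
    {"page": 7, "file": "product_a.jpg"},
]

kept_chunks = keep_on_pages(chunks, product_pages)
kept_images = keep_on_pages(images, product_pages)
print([chunk["text"] for chunk in kept_chunks])
print([image["file"] for image in kept_images])
```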

Code Location: mivaa-pdf-extractor/app/api/rag_routes.py

Lines 2057-2071: Product page filtering
Lines 2202-2264: Image filtering by category

Planned Categories

The certificates, logos, and specifications categories are planned but not yet implemented.

Implementation Requirements:

Extend the Product Discovery Service to classify content into the planned categories
Add category-specific page sets alongside product_pages (certificate_pages, logo_pages, etc.)
Update image filtering logic to handle multiple categories
Create database tables for certificates, logos, specifications
Add API endpoints to retrieve category-specific content

Current Implementation: The codebase includes placeholder support for additional categories. The products category is fully functional and serves as the template for implementing additional categories.

Database Schema

Image Metadata

Each image record's metadata JSONB field includes category information:

category: string (e.g., "product")
extract_categories: list (e.g., ["products"])
focused_extraction: boolean
product_page: boolean
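For illustration, the metadata for one image might be assembled as follows. build_image_metadata is a hypothetical helper, and the "other" value for non-product pages is an assumption, since this guide only shows "product" as an example value.

```python
import json
from typing import Any, Dict, List, Set


def build_image_metadata(
    page_number: int,
    product_pages: Set[int],
    focused_extraction: bool,
    extract_categories: List[str],
) -> Dict[str, Any]:
    """Assemble the category-related metadata fields for one image."""
    is_product_page = page_number in product_pages
    return {
        # Assumption: "other" for non-product pages; the guide only shows "product".
        "category": "product" if is_product_page else "other",
        "extract_categories": list(extract_categories),
        "focused_extraction": focused_extraction,
        "product_page": is_product_page,
    }


metadata = build_image_metadata(
    page_number=7,
    product_pages={5, 6, 7, 8, 9, 10, 11},
    focused_extraction=True,
    extract_categories=["products"],
)
# The dictionary serializes directly to JSON for storage in a JSONB column.
print(json.dumps(metadata, indent=2))
```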

Migration Path

Phase 1: Products Only (Current)

Extract products
Filter images by product pages
Skip non-product content

Phase 2: Add Certificates

Update Claude prompt to identify certificate pages
Create certificates table
Link certificates to products
Filter images by certificate pages

Phase 3: Add Logos & Specifications

Identify logo pages (usually first few pages)
Identify specification pages (technical details)
Create appropriate database tables
Link to products

Phase 4: Advanced Classification

AI-powered content classification
Automatic category detection
Smart filtering based on user preferences

Testing

Test 1: Products Only (Default)

Expected:

Images only from product pages (5-11)
Chunks only from product pages
Non-product pages skipped
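The expectations for Test 1 could be checked with a small helper like the one below. It is a sketch under assumptions: find_off_page_items and check_products_only are hypothetical names, and each image or chunk record is assumed to expose its source page under a "page" key.

```python
from typing import Dict, List, Set


def find_off_page_items(items: List[Dict], allowed_pages: Set[int]) -> List[Dict]:
    """Return items that come from pages outside the allowed set."""
    return [item for item in items if item["page"] not in allowed_pages]


def check_products_only(
    images: List[Dict], chunks: List[Dict], product_pages: Set[int]
) -> None:
    """Raise AssertionError if any image or chunk comes from a non-product page."""
    assert not find_off_page_items(images, product_pages), "image from non-product page"
    assert not find_off_page_items(chunks, product_pages), "chunk from non-product page"


product_pages = set(range(5, 12))  # pages 5-11, as in the expected result above

# Passes silently: every record comes from a product page.
check_products_only(
    images=[{"page": 5}, {"page": 11}],
    chunks=[{"page": 6}],
    product_pages=product_pages,
)
```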

Test 2: Full PDF