Extract Categories Guide

Overview

The extract_categories parameter allows you to control what content is extracted from PDFs during processing. This enables focused extraction of specific content types (products, certificates, logos, specifications) while skipping irrelevant content.


How It Works

1. Product Discovery (Stage 0)

Claude/GPT analyzes the entire PDF, classifies its content into categories, and records the pages each category occupies (for products, the product_pages set).

2. Focused Extraction (Stage 1)

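The focused_extraction and extract_categories parameters determine which pages are processed. With focused_extraction enabled, chunks and images are taken only from pages that belong to the requested categories. With focused_extraction=false or extract_categories=all, the whole PDF is processed.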
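
The page-selection rule can be sketched as follows. This is a minimal illustration, not the code in rag_routes.py: the function name select_pages, its signature, and the category_pages mapping are assumptions made for the example. Only the parameter names and the "all" value come from this guide.

```python
from typing import Dict, List, Set


def select_pages(
    total_pages: int,
    focused_extraction: bool,
    extract_categories: List[str],
    category_pages: Dict[str, Set[int]],
) -> Set[int]:
    """Return the 1-based page numbers that should be processed (hypothetical helper)."""
    every_page = set(range(1, total_pages + 1))

    # Full processing: focusing is switched off, or "all" was requested.
    if not focused_extraction or "all" in extract_categories:
        return every_page

    # Focused processing: union of the pages discovered for each requested category.
    selected: Set[int] = set()
    for category in extract_categories:
        selected |= category_pages.get(category, set())
    return selected


# Assumed discovery result: products were found on pages 5-11 of a 20-page PDF.
discovered = {"products": {5, 6, 7, 8, 9, 10, 11}}
focused = select_pages(20, True, ["products"], discovered)
full = select_pages(20, False, ["products"], discovered)
print(sorted(focused))  # [5, 6, 7, 8, 9, 10, 11]
print(len(full))        # 20
```

A category with no discovered pages contributes nothing, so requesting only a category that is not yet implemented yields an empty page set in this sketch.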

3. Content Processing (Stages 2-5)


API Parameters

focused_extraction (boolean, default: True)

Controls whether to filter content by categories.

extract_categories (string, default: "products")

Comma-separated list of categories to extract.
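
A comma-separated value like this is typically normalized into a list before use. The sketch below shows one way to do that. The helper name parse_extract_categories is hypothetical, and trimming whitespace, lowercasing, and dropping duplicates are assumptions rather than documented behavior of the API.

```python
from typing import List


def parse_extract_categories(raw: str = "products") -> List[str]:
    """Split a comma-separated category string into a clean list (hypothetical helper)."""
    categories: List[str] = []
    for part in raw.split(","):
        name = part.strip().lower()  # assumption: names are case-insensitive
        # Skip empty entries (e.g. from a trailing comma) and duplicates.
        if name and name not in categories:
            categories.append(name)
    return categories


print(parse_extract_categories("products, certificates"))  # ['products', 'certificates']
print(parse_extract_categories())                           # ['products']
```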

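Available Categories:

  • products: product information (fully implemented)
  • certificates: e.g., environmental certifications (planned)
  • logos: planned
  • specifications: planned
  • all: extract everything from the PDF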


Use Cases

Use Case 1: Product Catalog (Default)

Extract only product information, skip marketing/admin content.

Result:


Use Case 2: Products + Certificates

Extract products and environmental certifications.

Result (when certificates category is implemented):


Use Case 3: Full PDF Processing

Extract everything from the PDF by setting focused_extraction=false or extract_categories=all.

Result:


Implementation Status

Fully Implemented: Products Category

How it works:

  1. Claude analyzes PDF → identifies products on pages 5-11
  2. product_pages = {5, 6, 7, 8, 9, 10, 11}
  3. If extract_categories="products":
    • Chunks created from pages 5-11 only
    • Images saved from pages 5-11 only
    • Products created from discovery
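
The page filter in step 3 can be illustrated with a short sketch. The PageItem type and the keep_product_items function are hypothetical names introduced for this example. The actual filtering lives in the code location given below and may be structured differently.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class PageItem:
    """A chunk or image together with the page it came from (hypothetical type)."""
    page: int
    kind: str  # "chunk" or "image"


def keep_product_items(items: List[PageItem], product_pages: Set[int]) -> List[PageItem]:
    """Keep only the items whose page number is in product_pages."""
    return [item for item in items if item.page in product_pages]


product_pages = set(range(5, 12))  # pages 5-11, as in the example above
items = [
    PageItem(2, "image"),
    PageItem(5, "chunk"),
    PageItem(11, "image"),
    PageItem(12, "chunk"),
]
kept = keep_product_items(items, product_pages)
print([(item.page, item.kind) for item in kept])  # [(5, 'chunk'), (11, 'image')]
```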

Code Location: mivaa-pdf-extractor/app/api/rag_routes.py


Planned Categories

The certificates, logos, and specifications categories are planned but not yet implemented.

Implementation Requirements:

  1. Update Product Discovery Service to classify content into categories
  2. Add category-specific page sets alongside product_pages (certificate_pages, logo_pages, etc.)
  3. Update image filtering logic to handle multiple categories
  4. Create database tables for certificates, logos, specifications
  5. Add API endpoints to retrieve category-specific content
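
Requirements 2 and 3 could take a shape like the sketch below. It is only an illustration of the idea, not a design the codebase commits to: the CategoryPages class and its category_for method are hypothetical, and specification_pages is an assumed name that follows the pattern of product_pages, certificate_pages, and logo_pages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class CategoryPages:
    """One page set per category (hypothetical container)."""
    product_pages: Set[int] = field(default_factory=set)
    certificate_pages: Set[int] = field(default_factory=set)
    logo_pages: Set[int] = field(default_factory=set)
    specification_pages: Set[int] = field(default_factory=set)  # assumed name

    def category_for(self, page: int, requested: List[str]) -> Optional[str]:
        """Return the first requested category that contains this page, else None."""
        lookup: Dict[str, Set[int]] = {
            "products": self.product_pages,
            "certificates": self.certificate_pages,
            "logos": self.logo_pages,
            "specifications": self.specification_pages,
        }
        for name in requested:
            if page in lookup.get(name, set()):
                return name
        return None


pages = CategoryPages(product_pages={5, 6}, certificate_pages={20})
print(pages.category_for(20, ["products", "certificates"]))  # certificates
print(pages.category_for(20, ["products"]))                  # None
```

An image on a page that matches none of the requested categories returns None here, which an image filter could treat as "skip".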

Current Implementation: The codebase includes placeholder support for additional categories. The products category is fully functional and serves as the template for implementing additional categories.


Database Schema

Image Metadata

Each image record stores category information in its metadata JSONB field:

  • category (e.g., "product")
  • extract_categories (e.g., ["products"])
  • focused_extraction (boolean)
  • product_page (boolean)
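
As an illustration, the metadata for a single image might be assembled like this. The four keys are the ones listed above. The function name build_image_metadata and its arguments are hypothetical, and deriving product_page from membership in product_pages is an assumption about how that flag is set.

```python
import json
from typing import Any, Dict, List, Set


def build_image_metadata(
    category: str,
    page: int,
    product_pages: Set[int],
    extract_categories: List[str],
    focused_extraction: bool,
) -> Dict[str, Any]:
    """Assemble the category fields of an image's metadata (hypothetical helper)."""
    return {
        "category": category,
        "extract_categories": extract_categories,
        "focused_extraction": focused_extraction,
        # Assumption: the flag records whether the image's page is a product page.
        "product_page": page in product_pages,
    }


metadata = build_image_metadata(
    category="product",
    page=7,
    product_pages={5, 6, 7, 8, 9, 10, 11},
    extract_categories=["products"],
    focused_extraction=True,
)
# Serialize the same way a JSONB column would receive it.
print(json.dumps(metadata, indent=2))
```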


Migration Path

Phase 1: Products Only (Current)

Phase 2: Add Certificates

Phase 3: Add Logos & Specifications

Phase 4: Advanced Classification


Testing

Test 1: Products Only (Default)

Expected:

Test 2: Full PDF

Expected:


Summary

Current Behavior

Future Enhancements

Key Benefits