Modular PDF Processing Pipeline - API Endpoints Documentation

Overview

The PDF processing pipeline has been refactored into six modular internal API endpoints that run in sequence. Each endpoint receives the output of the previous stage and processes it further. All endpoints support dynamic AI model configuration via an optional ai_config parameter.

Key Concept: Each endpoint knows what to process because it receives filtered/processed data from the previous endpoint, not raw data.


🔄 Data Flow Between Endpoints

  1. classify-images → Receives: ALL extracted images ↓ Returns: material_images + non_material_images

  2. upload-images → Receives: ONLY material_images (from step 1) ↓ Returns: uploaded_images with storage URLs

  3. save-images-db → Receives: ONLY uploaded material images (from step 2) ↓ Returns: images_saved + visual embeddings count (SLIG 768D)

  4. extract-metadata → Receives: product_ids + PDF text ↓ Returns: enriched products with extracted metadata

  5. create-chunks → Receives: Full extracted text + product_ids ↓ Returns: chunks + text embeddings + relationships

  6. create-relationships → Receives: document_id + product_ids ↓ Returns: chunk-image + product-image relationships

Important: Each endpoint receives pre-filtered data from the orchestrator, which calls them in sequence and passes the output of one as input to the next.
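The hand-off pattern above can be sketched as follows. The stage functions here are simplified stand-ins for the internal endpoints; their names, signatures, and the image fields are illustrative assumptions, not the real implementation.

```python
# Sketch of the orchestrator's hand-off pattern: each stage's output
# becomes the next stage's input (stage functions are stand-ins).

def classify_images(images):
    # Stage 1: split ALL extracted images into material / non-material.
    material = [img for img in images if img["kind"] == "material"]
    non_material = [img for img in images if img["kind"] != "material"]
    return {"material_images": material, "non_material_images": non_material}

def upload_images(material_images):
    # Stage 2: receives ONLY material images, attaches storage URLs.
    return [{**img, "url": f"storage://{img['name']}"} for img in material_images]

def run_pipeline(extracted_images):
    # The orchestrator calls each stage in sequence and forwards its output.
    classified = classify_images(extracted_images)
    uploaded = upload_images(classified["material_images"])
    # Stages 3-6 (save-images-db, extract-metadata, create-chunks,
    # create-relationships) continue the same pattern.
    return uploaded

result = run_pipeline([
    {"name": "tile.png", "kind": "material"},
    {"name": "logo.png", "kind": "decorative"},
])
```

Note how the non-material image never reaches upload_images: downstream stages never have to re-filter.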


🤖 AI Model Configuration

All endpoints support dynamic AI model configuration via the optional ai_config parameter. This allows you to customize which AI models are used for different stages of the pipeline.

Configuration Parameters

The ai_config object accepts the following optional fields:

Pre-configured Profiles

DEFAULT_AI_CONFIG (Balanced):

FAST_CONFIG (Speed Optimized):

HIGH_ACCURACY_CONFIG (Quality Optimized):

COST_OPTIMIZED_CONFIG (Budget Friendly):

If ai_config is not provided, the endpoint uses DEFAULT_AI_CONFIG.
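A possible shape of the config and its fallback behavior is sketched below. Only the profile names are documented here; the individual field names (discovery_model, classification_model, and so on) are illustrative assumptions.

```python
# Hypothetical ai_config shape and fallback logic. Field names are
# assumptions; model names follow the "AI Models Used" section below.
DEFAULT_AI_CONFIG = {
    "discovery_model": "claude-sonnet-4.5",        # or "gpt-5"
    "classification_model": "qwen3-vl-32b-instruct",
    "validation_model": "claude-sonnet-4.5",
    "text_embedding_model": "voyage-3.5",          # 1024D text embeddings
}

def resolve_ai_config(ai_config=None):
    # Fall back to DEFAULT_AI_CONFIG when ai_config is omitted, and
    # merge any caller-supplied overrides on top of the defaults.
    merged = dict(DEFAULT_AI_CONFIG)
    merged.update(ai_config or {})
    return merged
```

Merging overrides onto the defaults means a caller can swap a single model without restating the whole profile.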


🏗️ Architecture

Main Orchestrator

Internal Endpoints

All internal endpoints are prefixed with /api/internal/ and tagged as "Internal Pipeline Stages" in OpenAPI docs.


📋 Endpoint Details

1. POST /api/internal/classify-images/{job_id} (10-20%)

Purpose: Classify ALL extracted images as material or non-material using two-stage AI classification

What It Receives:

How It Knows What to Process:

AI Processing:

Defaults:

Progress: Updates job to 10-20%

What Happens Next: The material_images list is passed to the next endpoint (upload-images)
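An illustrative classification result is shown below. Beyond the documented material_images / non_material_images split, the per-image fields (page, index, confidence) are assumptions.

```python
# Illustrative classify-images result: only material_images moves on
# to the upload-images stage (field names are assumptions).
classification_result = {
    "material_images": [
        {"page": 2, "index": 0, "confidence": 0.94},
        {"page": 3, "index": 2, "confidence": 0.91},
    ],
    "non_material_images": [
        {"page": 1, "index": 1, "confidence": 0.88},  # e.g. a company logo
    ],
}

# The orchestrator forwards ONLY this list to upload-images.
to_upload = classification_result["material_images"]
```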


2. POST /api/internal/upload-images/{job_id} (20-30%)

Purpose: Upload material images to Supabase Storage

What It Receives:

How It Knows to Process ONLY Material Images:

Processing:

Defaults:

Progress: Updates job to 20-30%

What Happens Next: The uploaded_images list (with storage URLs) is passed to the next endpoint (save-images-db)
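The upload step can be sketched as below. The storage path scheme and URL format are assumptions; the real endpoint writes to Supabase Storage.

```python
# Illustrative upload step for the pre-filtered material images.
# Path scheme and URL format are assumptions, not the real bucket layout.
def upload_material_images(job_id, material_images):
    uploaded = []
    for i, img in enumerate(material_images):
        path = f"{job_id}/images/{i}.png"
        uploaded.append({**img,
                         "storage_path": path,
                         "url": f"https://storage.example/{path}"})
    return uploaded

uploaded = upload_material_images("job-1", [{"page": 2}, {"page": 3}])
```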


3. POST /api/internal/save-images-db/{job_id} (30-50%)

Purpose: Save images to database and generate visual embeddings

What It Receives:

How It Knows to Process ONLY Material Images:

AI Processing - Visual Embeddings (SLIG via HuggingFace Endpoint):

Technical Details:

Storage:

Example calculation: 65 images × 5 embeddings per image = 325 total SLIG embeddings (768D each)
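The embedding bookkeeping follows directly from the five embedding types per saved image:

```python
# Visual-embedding bookkeeping: 5 SLIG embedding types per saved image,
# each a 768-dimensional vector.
EMBEDDING_TYPES_PER_IMAGE = 5
EMBEDDING_DIM = 768

def total_visual_embeddings(image_count):
    # Total SLIG embedding rows written for a batch of saved images.
    return image_count * EMBEDDING_TYPES_PER_IMAGE
```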

Defaults:

Progress: Updates job to 30-50%

What Happens Next: Images and embeddings are now in the database, ready for relationship creation


4. POST /api/internal/extract-metadata/{job_id} (50-60%)

Purpose: Extract comprehensive metadata from PDF text for products using AI

What It Receives:

How It Knows What to Process:

AI Processing:

Defaults:

Progress: Updates job to 50-60%

What Happens Next: Products are enriched with metadata, ready for chunking


5. POST /api/internal/create-chunks/{job_id} (60-80%)

Purpose: Create semantic chunks from extracted text and generate text embeddings

What It Receives:

How It Knows What to Process:

AI Processing:

Defaults:

Progress: Updates job to 60-80%

What Happens Next: Chunks and text embeddings are in the database, ready for relationship creation
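The chunk-then-embed step can be sketched with a naive fixed-size chunker. This is an assumption standing in for the real semantic chunker; in the pipeline each resulting chunk then receives a 1024D voyage-3.5 text embedding.

```python
# Naive fixed-size chunker with overlap, standing in for the real
# semantic chunker (an assumption). Each chunk would then get a
# 1024D voyage-3.5 text embedding.
def make_chunks(text, size=200, overlap=40):
    chunks, start = [], 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks

chunks = make_chunks("x" * 500, size=200, overlap=40)
```

The overlap keeps context that straddles a chunk boundary retrievable from either side.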


6. POST /api/internal/create-relationships/{job_id} (80-100%)

Purpose: Create chunk-image and product-image relationships

What It Receives:

How It Knows What to Process:

Processing:

Defaults:

Progress: Updates job to 80-100%

What Happens Next: Pipeline complete! All data is in the database with relationships.
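Similarity-based relationship creation can be sketched with cosine similarity over embedding pairs. The threshold value and the exact pairing rule are assumptions; the real endpoint links chunks, images, and products using the embeddings already stored in the database.

```python
import math

# Sketch of similarity-based linking (threshold and pairing rule are
# assumptions): emit a (chunk_index, image_index) pair for every
# chunk/image whose cosine similarity clears the threshold.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def link_by_similarity(chunk_embs, image_embs, threshold=0.8):
    return [
        (ci, ii)
        for ci, c in enumerate(chunk_embs)
        for ii, im in enumerate(image_embs)
        if cosine(c, im) >= threshold
    ]

links = link_by_similarity([[1.0, 0.0]], [[1.0, 0.1], [0.0, 1.0]])
```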


🎯 Default Behavior

Focused Extraction (Default: ENABLED)

Extract Categories

AI Models Used

  1. Product Discovery: Claude Sonnet 4.5 or GPT-5 (configurable)
  2. Image Classification: Qwen3-VL-32B-Instruct (HuggingFace Endpoint) → Claude Sonnet 4.5 (validation)
  3. Visual Embeddings: SLIG (SigLIP2) via HuggingFace Endpoint - 5 types per image, 768D each
  4. Text Embeddings: Voyage AI voyage-3.5 (1024D) — sole provider (updated 2026-04)

Thresholds

Retry & Timeout


📊 Progress Tracking

Pipeline Stages (10-100%)

Infrastructure Integration


✅ Summary

What Each Endpoint Does

  1. classify-images: AI classification (Qwen3-VL-32B → Claude) to filter material vs non-material images
  2. upload-images: Upload material images to Supabase Storage (receives pre-filtered list)
  3. save-images-db: Save to DB + generate 5 visual embeddings per image (SLIG 768D via HuggingFace endpoint)
  4. extract-metadata: AI extraction of product metadata from PDF text
  5. create-chunks: Semantic chunking + text embeddings (Voyage AI voyage-3.5 1024D)
  6. create-relationships: Chunk-image and product-image relationships via similarity

Default Processing Flow

  1. Extract ALL images from PDF
  2. Classify ALL images with AI (Qwen3-VL-32B + Claude) → material_images + non_material_images
  3. Upload ONLY material_images to storage → uploaded_images
  4. Save ONLY uploaded_images to database
  5. Generate 5 visual embeddings per saved image (SLIG 768D via HuggingFace endpoint)
  6. Extract product metadata from the PDF text with AI
  7. Create semantic chunks from text
  8. Generate text embeddings for chunks (Voyage AI voyage-3.5 1024D)
  9. Create relationships between chunks, images, and products

Key Features