Modular PDF Processing Pipeline - API Endpoints Documentation
Overview
The PDF processing pipeline has been refactored into 6 modular internal API endpoints that run in sequence: each endpoint receives the output of the previous stage and processes it further. All endpoints support dynamic AI model configuration via an optional ai_config parameter.
Key Concept: Each endpoint knows what to process because it receives filtered/processed data from the previous endpoint, not raw data.
🔄 Data Flow Between Endpoints
classify-images → Receives: ALL extracted images
↓ Returns: material_images + non_material_images
upload-images → Receives: ONLY material_images (from classify-images)
↓ Returns: uploaded_images with storage URLs
save-images-db → Receives: ONLY uploaded material images (from upload-images)
↓ Returns: images_saved + visual embeddings count (SLIG 768D)
extract-metadata → Receives: product_ids + PDF text
↓ Returns: enriched products with extracted metadata
create-chunks → Receives: Full extracted text + product_ids
↓ Returns: chunks + text embeddings + relationships
create-relationships → Receives: document_id + product_ids
↓ Returns: chunk-image + product-image relationships
Important: Each endpoint receives pre-filtered data from the orchestrator, which calls them in sequence and passes the output of one as input to the next.
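The chaining described above can be sketched as follows. The stage functions here are illustrative stand-ins for the internal HTTP endpoints (the real stages are AI-driven; the classification heuristic below is fake), so only the data flow between stages is meant to be accurate:

```python
# Hypothetical sketch of the orchestrator's data flow: each stage consumes
# the filtered output of the previous one. Stage bodies are stand-ins.

def classify_images(images):
    # Stand-in for AI classification: pretend odd-page images are material.
    material = [i for i in images if i["page_number"] % 2 == 1]
    non_material = [i for i in images if i["page_number"] % 2 == 0]
    return {"material_images": material, "non_material_images": non_material}

def upload_images(material_images):
    # Stand-in for Supabase Storage upload: attach a fake storage URL.
    return {"uploaded_images": [
        {**img, "storage_url": f"https://storage.example/{img['filename']}"}
        for img in material_images
    ]}

def save_images_db(uploaded_images):
    # Each saved image gets 5 visual embeddings (visual/color/texture/style/material).
    return {"images_saved": len(uploaded_images),
            "embeddings": len(uploaded_images) * 5}

def run_pipeline(extracted_images):
    classified = classify_images(extracted_images)           # stage 1: ALL images in
    uploaded = upload_images(classified["material_images"])  # stage 2: material only
    return save_images_db(uploaded["uploaded_images"])       # stage 3: uploaded only

result = run_pipeline([
    {"filename": "a.png", "page_number": 1},
    {"filename": "b.png", "page_number": 2},
    {"filename": "c.png", "page_number": 3},
])
print(result)  # {'images_saved': 2, 'embeddings': 10}
```

Note that no stage decides what to skip: non-material images simply never reach upload, and failed uploads never reach the database stage.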
🤖 AI Model Configuration
All endpoints support dynamic AI model configuration via the optional ai_config parameter. This allows you to customize which AI models are used for different stages of the pipeline.
Configuration Parameters
The ai_config object accepts the following optional fields:
visual_embedding_model — Default: "SLIG"
visual_embedding_dimensions — Default: 768
text_embedding_model — Default: "voyage-3.5"
text_embedding_dimensions — Default: 1024
text_embedding_input_type — Default: "document"
classification_primary_model — Default: "Qwen/Qwen3-VL-32B-Instruct"
classification_validation_model — Default: "claude-sonnet-4-6-20260217"
classification_confidence_threshold — Default: 0.7
discovery_model — Default: "claude-sonnet-4-6-20260217"
metadata_extraction_model — Default: "claude"
chunking_model — Default: "gpt-4o"
discovery_temperature — Default: 0.1
classification_temperature — Default: 0.1
metadata_temperature — Default: 0.1
discovery_max_tokens — Default: 4096
classification_max_tokens — Default: 512
metadata_max_tokens — Default: 4096
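Collected into one object, the documented defaults above look like this (a plain dict for illustration; this doc does not show the server-side representation):

```python
# The ai_config fields listed above, with their documented defaults.
DEFAULT_AI_CONFIG = {
    "visual_embedding_model": "SLIG",
    "visual_embedding_dimensions": 768,
    "text_embedding_model": "voyage-3.5",
    "text_embedding_dimensions": 1024,
    "text_embedding_input_type": "document",
    "classification_primary_model": "Qwen/Qwen3-VL-32B-Instruct",
    "classification_validation_model": "claude-sonnet-4-6-20260217",
    "classification_confidence_threshold": 0.7,
    "discovery_model": "claude-sonnet-4-6-20260217",
    "metadata_extraction_model": "claude",
    "chunking_model": "gpt-4o",
    "discovery_temperature": 0.1,
    "classification_temperature": 0.1,
    "metadata_temperature": 0.1,
    "discovery_max_tokens": 4096,
    "classification_max_tokens": 512,
    "metadata_max_tokens": 4096,
}
```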
Pre-configured Profiles
DEFAULT_AI_CONFIG (Balanced):
- Best overall accuracy and reliability
- Uses Claude Sonnet 4.5 for discovery and metadata
- Uses SLIG (SigLIP2) for visual embeddings (768D, HuggingFace endpoint)
- Uses Voyage AI voyage-3.5 for text embeddings (1024D)
- Uses Qwen3-VL-32B-Instruct for vision classification (HuggingFace endpoint) with Claude validation
FAST_CONFIG (Speed Optimized):
- Uses GPT-4o instead of Claude for faster processing
- Uses Claude Haiku for validation (faster than Sonnet)
- Reduced max tokens for faster responses
HIGH_ACCURACY_CONFIG (Quality Optimized):
- Uses GPT-5 for discovery (most accurate)
- Higher confidence threshold (0.8) for better quality
- Increased max tokens for more context
COST_OPTIMIZED_CONFIG (Budget Friendly):
- Uses GPT-4o and Claude Haiku (cheaper models)
- Lower confidence threshold (0.6) to reduce validation calls
- Reduced max tokens to minimize costs
If ai_config is not provided, the endpoint uses DEFAULT_AI_CONFIG.
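Because every ai_config field is optional, a caller only needs to send overrides; merging them over the defaults yields the effective config. A minimal sketch, with abbreviated profile dicts whose exact contents are illustrative (only the fields shown follow the profile descriptions above):

```python
# Sketch: a profile is a partial override merged over the defaults.
# Abbreviated dicts; field values follow the profile descriptions above.
DEFAULT_AI_CONFIG = {
    "discovery_model": "claude-sonnet-4-6-20260217",
    "classification_confidence_threshold": 0.7,
    "metadata_max_tokens": 4096,
}

HIGH_ACCURACY_CONFIG = {
    "discovery_model": "gpt-5",                  # most accurate discovery
    "classification_confidence_threshold": 0.8,  # higher quality bar
}

def resolve_ai_config(ai_config=None):
    # No ai_config -> pure defaults; otherwise overrides win field-by-field.
    return {**DEFAULT_AI_CONFIG, **(ai_config or {})}

effective = resolve_ai_config(HIGH_ACCURACY_CONFIG)
print(effective["discovery_model"])      # gpt-5
print(effective["metadata_max_tokens"])  # 4096 (inherited from defaults)
```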
🏗️ Architecture
Main Orchestrator
- Endpoint: POST /api/rag/documents/upload
- Purpose: Main entry point that orchestrates the complete pipeline
- Calls: All 6 internal endpoints sequentially
- Infrastructure: Manages ProgressTracker, CheckpointRecoveryService, job_storage, heartbeat monitoring
Internal Endpoints
All internal endpoints are prefixed with /api/internal/ and tagged as "Internal Pipeline Stages" in OpenAPI docs.
📋 Endpoint Details
1. POST /api/internal/classify-images/{job_id} (10-20%)
Purpose: Classify ALL extracted images as material or non-material using two-stage AI classification
What It Receives:
extracted_images: ALL images extracted from the PDF (no filtering)
- Each image has: filename, path, page_number, width, height
How It Knows What to Process:
- It receives ALL images and classifies each one
- Uses AI to determine if each image is material-related or not
AI Processing:
Defaults:
- Uses DEFAULT_AI_CONFIG if ai_config not provided
- Timeout: 600s (10 minutes)
Progress: Updates job to 10-20%
What Happens Next: The material_images list is passed to the next endpoint (upload-images)
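As a shape reference, a call to this endpoint might look like the following. This is an illustrative sketch built from the fields named above, not the exact request/response schema:

```python
# Illustrative request/response shapes for classify-images (not the exact schema).
request_body = {
    "extracted_images": [
        {"filename": "p3_img1.png", "path": "/tmp/p3_img1.png",
         "page_number": 3, "width": 1200, "height": 800},
    ],
    "ai_config": {"classification_confidence_threshold": 0.8},  # optional override
}

# A successful response splits the input into two disjoint lists.
response_body = {
    "material_images": [
        {"filename": "p3_img1.png", "path": "/tmp/p3_img1.png", "page_number": 3},
    ],
    "non_material_images": [],
}

# Every input image lands in exactly one of the two output lists;
# only material_images moves on to upload-images.
assert len(response_body["material_images"]) + len(response_body["non_material_images"]) \
    == len(request_body["extracted_images"])
```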
2. POST /api/internal/upload-images/{job_id} (20-30%)
Purpose: Upload material images to Supabase Storage
What It Receives:
material_images: ONLY the images classified as material-related by the classify-images endpoint
- Each image has: filename, path, page_number (from classification step)
How It Knows to Process ONLY Material Images:
- It doesn't decide - it receives a pre-filtered list from the orchestrator
- The orchestrator passes only the material_images output from the classify-images endpoint
- Non-material images are never sent to this endpoint
Processing:
- Uploads each image to Supabase Storage
- Storage bucket: material-images
- Path format: {document_id}/(unknown)
- Parallel uploads with rate limiting
- Generates public URLs for each image
Defaults:
- Concurrency: 5 parallel uploads
- Timeout: 600s (10 minutes)
- Retry: 3 attempts with exponential backoff
Progress: Updates job to 20-30%
What Happens Next: The uploaded_images list (with storage URLs) is passed to the next endpoint (save-images-db)
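The concurrency limit above can be sketched with an asyncio semaphore. The uploader below is a stand-in for the real Supabase Storage client call, so only the rate-limiting pattern is meant to be accurate:

```python
import asyncio

CONCURRENCY = 5  # default: 5 parallel uploads

async def upload_one(sem, img):
    # Stand-in for the real Supabase Storage upload call.
    async with sem:             # at most CONCURRENCY uploads in flight at once
        await asyncio.sleep(0)  # simulate network I/O
        return {**img,
                "storage_url": f"https://storage.example/material-images/{img['filename']}"}

async def upload_all(material_images):
    sem = asyncio.Semaphore(CONCURRENCY)
    tasks = [upload_one(sem, img) for img in material_images]
    return await asyncio.gather(*tasks)  # preserves input order

uploaded = asyncio.run(upload_all([{"filename": f"img{i}.png"} for i in range(12)]))
print(len(uploaded))  # 12
```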
3. POST /api/internal/save-images-db/{job_id} (30-50%)
Purpose: Save images to database and generate visual embeddings
What It Receives:
material_images: ONLY the successfully uploaded material images from the upload-images endpoint
- Each image has: filename, storage_url, page_number
How It Knows to Process ONLY Material Images:
- It receives a pre-filtered list from the orchestrator
- The orchestrator passes only the uploaded_images output from the upload-images endpoint
- These are already filtered twice: (1) AI classification, (2) successful upload
AI Processing - Visual Embeddings (SLIG via HuggingFace Endpoint):
- Model: SLIG (SigLIP2) via HuggingFace Inference Endpoint
- Cloud-based GPU inference for consistent performance
- Auto-pause enabled to reduce costs
- Superior to CLIP for material understanding
- Embedding Dimension: 768D per embedding
- 5 Embedding Types Per Image:
- Visual (768D): General visual features from SLIG image encoder (image_embedding mode)
- Color (768D): Text-guided SLIG embedding optimized for color matching (text_embedding mode)
- Texture (768D): Text-guided SLIG embedding optimized for texture matching (text_embedding mode)
- Style (768D): Text-guided SLIG embedding optimized for style matching (text_embedding mode)
- Material (768D): Text-guided SLIG embedding optimized for material type matching (text_embedding mode)
Technical Details:
- Uses SLIG HuggingFace endpoint exclusively for all visual embeddings
- Generates base visual embedding (768D) using SLIG image_embedding mode
- Creates 4 text-guided specialized embeddings using SLIG text_embedding mode with prompts
- Each embedding is normalized to unit vector (L2 normalization)
- Total: 5 × 768D = 3,840 dimensions per image
- All embeddings use SLIG endpoint (mh-siglip2)
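The L2 normalization step above is standard; a minimal version follows, with toy 4D vectors standing in for the five 768D SLIG embeddings of one image:

```python
import math

def l2_normalize(vec):
    # Scale the embedding to unit length so cosine similarity
    # reduces to a plain dot product.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

EMBEDDING_TYPES = ["visual", "color", "texture", "style", "material"]

# Toy 4D stand-ins for the five 768D SLIG embeddings of one image.
raw = {t: [1.0, 2.0, 2.0, 0.0] for t in EMBEDDING_TYPES}
normalized = {t: l2_normalize(v) for t, v in raw.items()}

norm_of_visual = math.sqrt(sum(x * x for x in normalized["visual"]))
print(round(norm_of_visual, 6))  # 1.0
```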
Storage:
- Saves to the document_images table (PostgreSQL) with embedding columns:
- visual_slig_768 (SLIG base embedding)
- color_slig_768 (SLIG color embedding)
- texture_slig_768 (SLIG texture embedding)
- style_slig_768 (SLIG style embedding)
- material_slig_768 (SLIG material embedding)
Example calculation: 65 images × 5 embeddings = 325 total SLIG embeddings (768D each)
Defaults:
- Embeddings per image: 5 (visual, color, texture, style, material)
- Timeout: 600s (10 minutes)
- Batch processing: All embeddings generated in parallel
Progress: Updates job to 30-50%
What Happens Next: Images and embeddings are now in the database, ready for relationship creation
4. POST /api/internal/extract-metadata/{job_id} (50-60%)
Purpose: Extract comprehensive metadata from PDF text for products using AI
What It Receives:
product_ids: List of product IDs to enrich with metadata
pdf_text: Full PDF text content
document_id: ID of the document being processed
How It Knows What to Process:
- It receives product IDs from product discovery stage
- Extracts product-specific text based on page ranges
- Uses AI to extract structured metadata from text
AI Processing:
- Model: Configurable via ai_config.metadata_extraction_model
- Default: claude (Claude Sonnet 4.5)
- Alternative: gpt (GPT-4o or GPT-5)
- Temperature: Configurable via ai_config.metadata_temperature (default: 0.1)
- Max Tokens: Configurable via ai_config.metadata_max_tokens (default: 4096)
- Extraction Method: Dynamic metadata extraction with category hints
- Metadata Fields: Dimensions, colors, patterns, materials, finishes, applications, certifications, etc.
Defaults:
metadata_extraction_model: "claude"
metadata_temperature: 0.1
metadata_max_tokens: 4096
- Timeout: 600s (10 minutes)
Progress: Updates job to 50-60%
What Happens Next: Products are enriched with metadata, ready for chunking
5. POST /api/internal/create-chunks/{job_id} (60-80%)
Purpose: Create semantic chunks from extracted text and generate text embeddings
What It Receives:
extracted_text: Full text content extracted from the PDF
product_ids: List of product IDs discovered in the PDF
chunk_size: Size of each chunk in characters
chunk_overlap: Overlap between chunks in characters
How It Knows What to Process:
- It receives the full extracted text from the PDF
- Uses product_ids to create chunk-to-product relationships
- Respects product boundaries (doesn't mix content from different products)
AI Processing:
Text Chunking:
- Method: Semantic chunking with overlap
- Respects product boundaries (doesn't mix content from different products)
- Excludes index/navigation pages
- Creates chunks of specified size with overlap
Text Embeddings:
- Model: Voyage AI voyage-3.5
- Dimension: 1024D
- Input Type: document
- One embedding per chunk
- Uses Voyage AI API (sole text embedder; OpenAI fallback retired 2026-04)
Relationships:
- Creates chunk-to-product relationships
- Links chunks to products based on page ranges
Defaults:
chunk_size: 512 characters
chunk_overlap: 50 characters
- Timeout: 600s (10 minutes)
- Batch processing: All embeddings generated in parallel
Progress: Updates job to 60-80%
What Happens Next: Chunks and text embeddings are in the database, ready for relationship creation
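A minimal character-level chunker illustrates the size/overlap mechanics above. The real endpoint does semantic chunking that also respects product boundaries and skips index pages; this sketch shows only how chunk_size and chunk_overlap interact:

```python
def chunk_text(text, chunk_size=512, chunk_overlap=50):
    # Slide a window of chunk_size, stepping by chunk_size - chunk_overlap,
    # so consecutive chunks share exactly chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(1, len(text) - chunk_overlap), step)]

chunks = chunk_text("a" * 1000, chunk_size=512, chunk_overlap=50)
print(len(chunks))     # 3
print(len(chunks[0]))  # 512
# Consecutive chunks share the last/first 50 characters:
assert chunks[0][-50:] == chunks[1][:50]
```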
6. POST /api/internal/create-relationships/{job_id} (80-100%)
Purpose: Create chunk-image and product-image relationships
What It Receives:
document_id: ID of the document being processed
product_ids: List of product IDs discovered in the PDF
similarity_threshold: Minimum similarity score for relationships
How It Knows What to Process:
- It queries the database for all chunks and images belonging to this document
- Uses the document_id to find all relevant entities
- Creates relationships based on similarity and page ranges
Processing:
Defaults:
similarity_threshold: 0.5 (50% minimum similarity)
- Timeout: 600s (10 minutes)
Progress: Updates job to 80-100%
What Happens Next: Pipeline complete! All data is in the database with relationships.
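The threshold gating can be sketched as follows. How each chunk-image similarity score is computed is not specified in this doc, so the candidate scores below are illustrative inputs; only the filtering against similarity_threshold is shown:

```python
def filter_relationships(candidates, similarity_threshold=0.5):
    # Keep only candidate pairs whose score clears the threshold.
    return [c for c in candidates if c["similarity"] >= similarity_threshold]

# Illustrative candidate pairs with precomputed similarity scores.
candidates = [
    {"chunk_id": "c1", "image_id": "i1", "similarity": 0.82},
    {"chunk_id": "c1", "image_id": "i2", "similarity": 0.41},
    {"chunk_id": "c2", "image_id": "i1", "similarity": 0.50},  # boundary case: kept
]
kept = filter_relationships(candidates)
print([(c["chunk_id"], c["image_id"]) for c in kept])  # [('c1', 'i1'), ('c2', 'i1')]
```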
🎯 Default Behavior
Focused Extraction (Default: ENABLED)
- Purpose: Process ONLY material-related images, skip non-material content
- Default: focused_extraction=True
- Categories: ['products'] (only product-related content)
- Behavior:
- Classifies ALL images with AI
- Uploads ONLY material images to storage
- Saves ONLY material images to database
- Generates SLIG embeddings (768D) ONLY for material images
- Skips non-material images (faces, logos, charts, text)
Extract Categories
- Options: ['products', 'certificates', 'logos', 'specifications', 'all']
- Default: ['products']
- Behavior:
products: Only product pages and images
certificates: Only certificate pages and images
logos: Only logo pages and images
specifications: Only specification pages and images
all: Process entire PDF (disables focused extraction)
AI Models Used
- Product Discovery: Claude Sonnet 4.5 or GPT-5 (configurable)
- Image Classification: Qwen3-VL-32B-Instruct (HuggingFace Endpoint) → Claude Sonnet 4.5 (validation)
- Visual Embeddings: SLIG (SigLIP2) via HuggingFace Endpoint - 5 types per image, 768D each
- Text Embeddings: Voyage AI voyage-3.5 (1024D) — sole provider (updated 2026-04)
Thresholds
- Image Classification Confidence: 0.7 (70% minimum)
- Relationship Similarity: 0.5 (50% minimum)
- Chunk Size: 512 characters
- Chunk Overlap: 50 characters
Retry & Timeout
- Max Retries: 3 attempts per endpoint
- Timeout: 600s (10 minutes) per endpoint
- Backoff: Exponential (2^attempt seconds)
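The retry policy above (3 attempts, 2^attempt-second backoff) can be sketched as a small wrapper; the sleep is injectable and stubbed out here so the example runs instantly:

```python
def call_with_retry(fn, max_retries=3, sleep=lambda s: None):
    # Try fn up to max_retries times, waiting 2^attempt seconds between failures.
    delays = []
    for attempt in range(max_retries):
        try:
            return fn(), delays
        except Exception:
            if attempt == max_retries - 1:
                raise                # out of attempts: propagate the error
            delay = 2 ** attempt     # exponential backoff: 1s, 2s, ...
            delays.append(delay)
            sleep(delay)

attempts = {"n": 0}
def flaky():
    # Fails twice, then succeeds: exercises the backoff path.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result, delays = call_with_retry(flaky)
print(result, delays)  # ok [1, 2]
```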
📊 Progress Tracking
Pipeline Stages (10-100%)
- 10-20%: Image Classification (Qwen3-VL-32B + Claude)
- 20-30%: Image Upload (Supabase Storage)
- 30-50%: Save Images & SLIG Embeddings (5 per image, 768D each)
- 50-60%: Metadata Extraction (configurable Claude/GPT model)
- 60-80%: Chunking & Text Embeddings (Voyage AI 1024D)
- 80-100%: Relationships (chunk-image, product-image)
Infrastructure Integration
- ProgressTracker: Real-time progress updates, heartbeat monitoring (30s interval)
- CheckpointRecoveryService: Checkpoint creation at each stage for recovery
- JobTracker: Database sync after each stage
- job_storage: In-memory tracking for fast access
- Sentry: Error tracking and monitoring
✅ Summary
What Each Endpoint Does
- classify-images: AI classification (Qwen3-VL-32B → Claude) to filter material vs non-material images
- upload-images: Upload material images to Supabase Storage (receives pre-filtered list)
- save-images-db: Save to DB + generate 5 visual embeddings per image (SLIG 768D via HuggingFace endpoint)
- extract-metadata: AI extraction of structured product metadata from PDF text
- create-chunks: Semantic chunking + text embeddings (Voyage AI voyage-3.5 1024D)
- create-relationships: Chunk-image and product-image relationships via similarity
Default Processing Flow
- Extract ALL images from PDF
- Classify ALL images with AI (Qwen3-VL-32B + Claude) → material_images + non_material_images
- Upload ONLY material_images to storage → uploaded_images
- Save ONLY uploaded_images to database
- Generate 5 visual embeddings per saved image (SLIG 768D via HuggingFace endpoint)
- Extract structured metadata for discovered products (Claude/GPT)
- Create semantic chunks from text
- Generate text embeddings for chunks (Voyage AI voyage-3.5 1024D)
- Create relationships between chunks, images, and products
Key Features
- ✅ Focused extraction by default (only material images)
- ✅ Two-stage AI classification (Qwen3-VL-32B fast, Claude validation)
- ✅ 5 visual embeddings per image (SLIG 768D via HuggingFace endpoint: visual, color, texture, style, material)
- ✅ High-quality text embeddings (Voyage AI voyage-3.5 1024D)
- ✅ Semantic chunking with product boundary respect
- ✅ Comprehensive progress tracking (5% increments)
- ✅ Checkpoint creation at each stage for recovery
- ✅ Retry logic with exponential backoff
- ✅ Error handling with Sentry integration
- ✅ Real-time database sync after each stage
- ✅ Pre-filtered data flow (each endpoint receives processed output from previous stage)