Image Embedding Generation
Image embedding generation system with batching, retry logic, and checkpoint recovery for reliable CLIP embedding coverage.
Overview
The image embedding system generates visual embeddings for all processed images using CLIP models. The system includes batch processing, automatic retry with exponential backoff, and checkpoint recovery to ensure complete embedding coverage.
Features
1. Batch Processing
Implementation:
- Process images in batches of 20 (configurable)
- Reduces memory pressure
- Better progress tracking
- Enables checkpoint recovery per batch
Benefits:
- More efficient resource usage
- Clearer progress reporting
- Easier to resume from failures
2. Retry Logic with Exponential Backoff
Implementation:
- Up to 3 retries per failed image (configurable)
- Exponential backoff: 2^retry_count seconds (2s, 4s, 8s)
- Detailed logging for each retry attempt
Benefits:
- Handles transient failures (network, API rate limits)
- Prevents permanent data loss
- Comprehensive error tracking
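The retry loop described above can be sketched as follows. This is an illustrative helper, not the actual implementation; the `sleep` parameter is injectable so the backoff schedule can be observed in tests:

```python
import time


def retry_with_backoff(fn, max_retries=3, sleep=time.sleep):
    """Call fn(); on failure, wait 2**retry_count seconds (2s, 4s, 8s)
    and retry, up to max_retries retries after the initial attempt.

    Returns (result, error_message): error_message is None on success,
    and the last exception's message if all retries are exhausted.
    """
    retry_count = 0
    while True:
        try:
            return fn(), None
        except Exception as exc:  # transient failures: timeouts, rate limits
            if retry_count >= max_retries:
                return None, str(exc)  # permanent failure after all retries
            retry_count += 1
            sleep(2 ** retry_count)  # exponential backoff: 2, 4, 8 seconds
```

With `max_retries=3` this makes at most four calls, sleeping 2s, 4s, and 8s between them, matching the schedule described above.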
3. Checkpoint Recovery
Implementation:
- Queries database for existing embeddings before processing
- Skips already-processed images
- Resumes from last successful batch
Benefits:
- Safe to restart after failures
- No duplicate processing
- Efficient resource usage
4. Detailed Error Tracking
Implementation:
- Returns a failed_images array with index, path, page_number, and error for each failure
- Logs first 5 failures in detail
- Tracks which images fail and why
Benefits:
- Easy debugging
- Clear visibility into failures
- Actionable error messages
Implementation Details
New Methods
_get_embedding_checkpoint(document_id: str) -> Optional[int]
Queries the document_images table to count images with has_slig_embedding = TRUE for the given document. (Updated 2026-04: the legacy visual_clip_embedding_512 column was dropped; VECS is now the single source of truth for image vectors, and per-image presence is tracked via boolean flags on document_images.) Returns the count as an integer checkpoint index.
_process_single_image_with_retry(...) -> Tuple[bool, bool, Optional[str]]
Processes a single image, retrying in a loop up to max_retries times. On each failure, it waits 2^retry_count seconds before retrying (exponential backoff). Returns a tuple of (image_saved, embedding_generated, error_message).
save_images_and_generate_clips(...) -> Dict[str, Any]
Main method with batching + retry + checkpointing. Signature: save_images_and_generate_clips(material_images, document_id, workspace_id, batch_size=20, max_retries=3). First checks the checkpoint to skip already-processed images, then processes remaining images in batches, calling _process_single_image_with_retry for each. Returns a dict with images_saved, clip_embeddings_generated, and failed_images.
Configuration
Default Parameters
batch_size: 20 images per batch
max_retries: 3 retry attempts per image
- Exponential backoff: 2^retry_count seconds
Customization
All parameters are configurable via method arguments. For memory-constrained environments, use a smaller batch_size (e.g., 10). For unreliable networks, increase max_retries (e.g., 5).
Performance Impact
Before (Sequential Processing)
- Processing Time: ~2-3 seconds per image
- Success Rate: 51.6% (132/256 images)
- Failure Handling: Silent failures, no retry
- Recovery: Manual intervention required
After (Batched with Retry)
- Processing Time: ~2-3 seconds per image (same)
- Success Rate: 95%+ (expected with retry logic)
- Failure Handling: Up to 3 retries with exponential backoff
- Recovery: Automatic checkpoint recovery
Resource Usage
- Memory: Slightly lower (batch processing)
- Network: More efficient (retry logic handles transient failures)
- Database: Same (checkpoint query is lightweight)
Testing Results
NOVA Test Case
Before Fix:
- Total Images: 256
- Images with Embeddings: 132 (51.6%)
- Missing Embeddings: 124 (48.4%)
After Fix (Expected):
- Total Images: 256
- Images with Embeddings: 243+ (95%+)
- Failed Images: <13 (5%)
- All failures logged with detailed error messages
Error Handling
Retry Scenarios
- Network Timeout - Retries with exponential backoff
- API Rate Limit - Waits and retries
- Temporary Service Unavailable - Retries after delay
- Invalid Image Data - Fails after max retries, logs error
Permanent Failures
Images that fail after all retries are:
- Logged with detailed error messages
- Included in the failed_images array
- Reported in final summary
- Can be manually retried later
Monitoring
Log Output
The log shows progress per batch: saving each image to DB with its UUID, generating CLIP embeddings per image, and batch completion messages. The final summary reports total images saved, total CLIP embeddings generated, and a list of failed images with their page numbers and error reasons (e.g., "Network timeout after 3 retries", "Invalid image format").
Integration
Pipeline Integration
The improved method is automatically used in the PDF processing pipeline at Stage 30: save-images-db (POST /api/internal/save-images-db/{job_id}), which calls save_images_and_generate_clips with the document's material images, document ID, and workspace ID.
Manual Usage
The service can also be called directly for reprocessing existing documents. After calling save_images_and_generate_clips, inspect the returned dict for clip_embeddings_generated, images_saved, and failed_images counts.
Understanding Embeddings (Qwen → Voyage AI)
Overview
Understanding embeddings capture the structured knowledge from Qwen3-VL's vision analysis. Rather than embedding the raw image pixels (which SLIG does), understanding embeddings embed the semantic description of what was detected: material types, colors, textures, dimensions, finishes, and OCR text.
How It Works
- Qwen3-VL Analysis → Produces structured JSON (vision_analysis) with material type, colors, textures, properties, and OCR text
- JSON → Text Conversion → Converts structured fields into descriptive text (e.g., "Material: porcelain tile. Colors: white, grey. Texture: matte. Dimensions: 60x120cm.")
- Voyage AI Embedding → Embeds the text via voyage-3.5 with input_type="document" → 1024D vector
- VECS Storage → Stored in the image_understanding_embeddings collection (1024D, HNSW index)
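The JSON → text conversion step can be sketched as a pure function. The field names on the vision_analysis dict are illustrative assumptions; the output format follows the example above:

```python
def vision_analysis_to_text(analysis):
    """Convert structured Qwen vision_analysis JSON into the descriptive
    text that gets embedded by Voyage AI. Only fields that are present
    contribute, so sparse analyses still produce usable text."""
    parts = []
    if analysis.get("material_type"):
        parts.append(f"Material: {analysis['material_type']}.")
    if analysis.get("colors"):
        parts.append(f"Colors: {', '.join(analysis['colors'])}.")
    if analysis.get("texture"):
        parts.append(f"Texture: {analysis['texture']}.")
    if analysis.get("dimensions"):
        parts.append(f"Dimensions: {analysis['dimensions']}.")
    if analysis.get("ocr_text"):
        parts.append(f"Text: {analysis['ocr_text']}")
    return " ".join(parts)
```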
Search Flow
- Query → Embedded via Voyage AI with input_type="query" → 1024D vector
- VECS Search → Similarity search against understanding collection
- Score Fusion → Combined with 6 other embedding scores using weighted fusion
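The weighted score fusion step can be sketched as a normalized weighted sum. The collection names and weights here are placeholders (the real system fuses seven embedding scores); only the weighted-sum shape is taken from the description above:

```python
def fuse_scores(scores, weights):
    """Fuse per-collection similarity scores with a weighted average,
    normalized over the weights of the collections that actually
    returned a score (missing collections contribute nothing)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```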
Pipeline Integration (updated 2026-04)
- Phase 1 image pipeline (inline): Generates the understanding embedding directly after Qwen analysis, in the same pass that writes SLIG embeddings to VECS. The former asynchronous "Phase 2 background processor" (background_image_processor.py) was deleted in 2026-04 because it was silently broken and produced no output.
- CLIP Job Service: Generates understanding embedding for images with existing vision_analysis
- Regeneration Endpoint: Includes understanding in embedding regeneration
- Backfill Script: scripts/backfill_understanding_embeddings.py for existing images
Benefits
- Spec-based search: Find "porcelain tile 60x120cm" or "R10 slip rating" through semantic matching
- OCR-aware: Text detected in images is included in the embedding
- Property-aware: Material properties, dimensions, finishes are all searchable
- Complements SLIG: SLIG captures visual appearance; understanding captures semantic knowledge
Future Enhancements
- Parallel Batch Processing - Process multiple batches concurrently
- Adaptive Batch Size - Adjust batch size based on available memory
- Smart Retry Strategy - Different retry logic for different error types
- Automatic Reprocessing - Background job to retry failed images
- Metrics Dashboard - Real-time monitoring of embedding generation
Related Documentation