Image Embedding Generation
Image embedding generation system with batching, retry logic, and checkpoint recovery for reliable CLIP embedding coverage.
Overview
The image embedding system generates visual embeddings for all processed images using CLIP models. The system includes batch processing, automatic retry with exponential backoff, and checkpoint recovery to ensure complete embedding coverage.
Features
1. Batch Processing
Implementation:
- Process images in batches of 20 (configurable)
- Reduces memory pressure
- Better progress tracking
- Enables checkpoint recovery per batch
Benefits:
- More efficient resource usage
- Clearer progress reporting
- Easier to resume from failures
2. Retry Logic with Exponential Backoff
Implementation:
- Up to 3 retries per failed image (configurable)
- Exponential backoff: 2^retry_count seconds (2s, 4s, 8s)
- Detailed logging for each retry attempt
Benefits:
- Handles transient failures (network, API rate limits)
- Prevents permanent data loss
- Comprehensive error tracking
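The retry loop described above can be sketched as follows. This is an illustrative helper, not the actual implementation; the `sleep` parameter is injectable so the backoff schedule can be observed in tests:

```python
import time


def retry_with_backoff(fn, max_retries=3, sleep=time.sleep):
    """Call fn(); on failure, wait 2**retry_count seconds (2s, 4s, 8s)
    and retry, up to max_retries retries after the initial attempt.

    Returns (result, error_message): error_message is None on success,
    and the last exception's message if all retries are exhausted.
    """
    retry_count = 0
    while True:
        try:
            return fn(), None
        except Exception as exc:  # transient failures: timeouts, rate limits
            if retry_count >= max_retries:
                return None, str(exc)  # permanent failure after all retries
            retry_count += 1
            sleep(2 ** retry_count)  # exponential backoff: 2, 4, 8 seconds
```

With `max_retries=3` this makes at most four calls, sleeping 2s, 4s, and 8s between them, matching the schedule described above.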
3. Checkpoint Recovery
Implementation:
- Queries database for existing embeddings before processing
- Skips already-processed images
- Resumes from last successful batch
Benefits:
- Safe to restart after failures
- No duplicate processing
- Efficient resource usage
4. Detailed Error Tracking
Implementation:
- Returns a failed_images array with index, path, page_number, and error for each failure
- Logs first 5 failures in detail
- Tracks which images fail and why
Benefits:
- Easy debugging
- Clear visibility into failures
- Actionable error messages
Implementation Details
New Methods
_get_embedding_checkpoint(document_id: str) -> Optional[int]
Queries the document_images table to count images with has_slig_embedding = TRUE for the given document. (Updated 2026-04: the legacy visual_clip_embedding_512 column was dropped; VECS is now the single source of truth for image vectors, and per-image presence is tracked via boolean flags on document_images.) Returns the count as an integer checkpoint index.
_process_single_image_with_retry(...) -> Tuple[bool, bool, Optional[str]]
Processes a single image, retrying in a loop up to max_retries times. On each failure, it waits 2^retry_count seconds before retrying (exponential backoff). Returns a tuple of (image_saved, embedding_generated, error_message).
save_images_and_generate_clips(...) -> Dict[str, Any]
Main method with batching + retry + checkpointing. Signature: save_images_and_generate_clips(material_images, document_id, workspace_id, batch_size=20, max_retries=3). First checks the checkpoint to skip already-processed images, then processes remaining images in batches, calling _process_single_image_with_retry for each. Returns a dict with images_saved, clip_embeddings_generated, and failed_images.
Configuration
Default Parameters
batch_size: 20 images per batch
max_retries: 3 retry attempts per image
- Exponential backoff: 2^retry_count seconds
Customization
All parameters are configurable via method arguments. For memory-constrained environments, use a smaller batch_size (e.g., 10). For unreliable networks, increase max_retries (e.g., 5).
Performance Impact
Before (Sequential Processing)
- Processing Time: ~2-3 seconds per image
- Success Rate: 51.6% (132/256 images)
- Failure Handling: Silent failures, no retry
- Recovery: Manual intervention required
After (Batched with Retry)
- Processing Time: ~2-3 seconds per image (same)
- Success Rate: 95%+ (expected with retry logic)
- Failure Handling: Up to 3 retries with exponential backoff
- Recovery: Automatic checkpoint recovery
Resource Usage
- Memory: Slightly lower (batch processing)
- Network: More efficient (retry logic handles transient failures)
- Database: Same (checkpoint query is lightweight)
Testing Results
NOVA Test Case
Before Fix:
- Total Images: 256
- Images with Embeddings: 132 (51.6%)
- Missing Embeddings: 124 (48.4%)
After Fix (Expected):
- Total Images: 256
- Images with Embeddings: 243+ (95%+)
- Failed Images: <13 (5%)
- All failures logged with detailed error messages
Error Handling
Retry Scenarios
- Network Timeout - Retries with exponential backoff
- API Rate Limit - Waits and retries
- Temporary Service Unavailable - Retries after delay
- Invalid Image Data - Fails after max retries, logs error
Permanent Failures
Images that fail after all retries are:
- Logged with detailed error messages
- Included in the failed_images array
- Reported in final summary
- Can be manually retried later
Monitoring
Log Output
The log shows progress per batch: saving each image to DB with its UUID, generating CLIP embeddings per image, and batch completion messages. The final summary reports total images saved, total CLIP embeddings generated, and a list of failed images with their page numbers and error reasons (e.g., "Network timeout after 3 retries", "Invalid image format").
Integration
Pipeline Integration
The improved method is automatically used in the PDF processing pipeline at Stage 30: save-images-db (POST /api/internal/save-images-db/{job_id}), which calls save_images_and_generate_clips with the document's material images, document ID, and workspace ID.
Manual Usage
The service can also be called directly for reprocessing existing documents. After calling save_images_and_generate_clips, inspect the returned dict for clip_embeddings_generated, images_saved, and failed_images counts.
Understanding Embeddings (Qwen → Voyage AI)
Overview
Understanding embeddings capture the structured knowledge from Qwen3-VL's vision analysis. Rather than embedding the raw image pixels (which SLIG does), understanding embeddings embed the semantic description of what was detected: material types, colors, textures, dimensions, finishes, and OCR text.
How It Works
- Qwen3-VL Analysis → Produces structured JSON (vision_analysis) with material type, colors, textures, properties, and OCR text
- JSON → Text Conversion → Converts structured fields into descriptive text (e.g., "Material: porcelain tile. Colors: white, grey. Texture: matte. Dimensions: 60x120cm.")
- Voyage AI Embedding → Embeds the text via voyage-3.5 with input_type="document" → 1024D vector
- VECS Storage → Stored in the image_understanding_embeddings collection (1024D, HNSW index)
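The JSON → text conversion step can be sketched as a pure function. The field names on the vision_analysis dict are illustrative assumptions; the output format follows the example above:

```python
def vision_analysis_to_text(analysis):
    """Convert structured Qwen vision_analysis JSON into the descriptive
    text that gets embedded by Voyage AI. Only fields that are present
    contribute, so sparse analyses still produce usable text."""
    parts = []
    if analysis.get("material_type"):
        parts.append(f"Material: {analysis['material_type']}.")
    if analysis.get("colors"):
        parts.append(f"Colors: {', '.join(analysis['colors'])}.")
    if analysis.get("texture"):
        parts.append(f"Texture: {analysis['texture']}.")
    if analysis.get("dimensions"):
        parts.append(f"Dimensions: {analysis['dimensions']}.")
    if analysis.get("ocr_text"):
        parts.append(f"Text: {analysis['ocr_text']}")
    return " ".join(parts)
```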
Search Flow
- Query → Embedded via Voyage AI with input_type="query" → 1024D vector
- VECS Search → Similarity search against understanding collection
- Score Fusion → Combined with 6 other embedding scores using weighted fusion
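The weighted score fusion step can be sketched as a normalized weighted sum. The collection names and weights here are placeholders (the real system fuses seven embedding scores); only the weighted-sum shape is taken from the description above:

```python
def fuse_scores(scores, weights):
    """Fuse per-collection similarity scores with a weighted average,
    normalized over the weights of the collections that actually
    returned a score (missing collections contribute nothing)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```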
Pipeline Integration (updated 2026-04)
- Phase 1 image pipeline (inline): Generates the understanding embedding directly after Qwen analysis, in the same pass that writes SLIG embeddings to VECS. The former asynchronous "Phase 2 background processor" (background_image_processor.py) was deleted in 2026-04 because it was silently broken and produced no output.
- CLIP Job Service: Generates understanding embedding for images with existing vision_analysis
- Regeneration Endpoint: Includes understanding in embedding regeneration
- Backfill Script: scripts/backfill_understanding_embeddings.py for existing images
Benefits
- Spec-based search: Find "porcelain tile 60x120cm" or "R10 slip rating" through semantic matching
- OCR-aware: Text detected in images is included in the embedding
- Property-aware: Material properties, dimensions, finishes are all searchable
- Complements SLIG: SLIG captures visual appearance; understanding captures semantic knowledge
Future Enhancements
- Parallel Batch Processing - Process multiple batches concurrently
- Adaptive Batch Size - Adjust batch size based on available memory
- Smart Retry Strategy - Different retry logic for different error types
- Automatic Reprocessing - Background job to retry failed images
- Metrics Dashboard - Real-time monitoring of embedding generation
Related Documentation