Architecture: Supabase-Native with Custom Recovery Layer

Last Updated: December 21, 2025
The Material Kai Vision Platform uses a unified job queue system across all data import pipelines with custom checkpoint-based recovery for resilient processing. This hybrid approach combines Supabase's reliability with custom recovery logic for fault tolerance.
```
┌─────────────────────────────────────────────────────────────────┐
│                        Frontend (React)                         │
│ - PDF Upload (/admin/data-import)                               │
│ - Web Scraping UI (/scraper)                                    │
│ - XML Import (/admin/data-import)                               │
│ - Unified Job Monitor (/admin/async-queue-monitor)              │
└────────────────┬────────────────────────────────────────────────┘
                 │
┌────────────────▼────────────────────────────────────────────────┐
│                        Processing Layer                         │
│                                                                 │
│ MIVAA Backend (FastAPI)          Edge Functions (Deno)          │
│ - PDF Processing Service         - scrape-session-manager       │
│ - Web Scraping Service           - xml-import-orchestrator      │
│ - Checkpoint Recovery            - scrape-single-page           │
│ - Job Monitor (ALL TYPES)                                       │
└────────────────┬────────────────────────────────────────────────┘
                 │
┌────────────────▼────────────────────────────────────────────────┐
│                       Supabase PostgreSQL                       │
│ - background_jobs (unified job tracking)                        │
│ - scraping_sessions (web scraping jobs)                         │
│ - scraping_pages (page-level tracking)                          │
│ - data_import_jobs (XML import jobs)                            │
│ - webhook_calls (API call tracking)                             │
│ - job_checkpoints (PDF recovery data)                           │
│ - ai_analysis_queue (AI analysis jobs)                          │
└─────────────────────────────────────────────────────────────────┘
```

`background_jobs`: the main job tracking table. Key columns: id, workspace_id, document_id, job_type, status (pending/processing/completed/failed/retrying/cancelled), progress_percent (0-100), created_at, started_at, completed_at, error_message, and metadata JSONB. A unique constraint exists on (workspace_id, document_id, job_type).
Job Types:
- `pdf_processing`: Main PDF extraction and processing
- `image_analysis`: Image analysis and embedding generation
- `product_creation`: Product record creation
- `metadata_extraction`: Metafield extraction

Statuses:

- `pending`: Waiting to be processed
- `processing`: Currently being processed
- `completed`: Successfully completed
- `failed`: Failed after all retries
- `retrying`: Retrying after failure
- `cancelled`: Manually cancelled

`job_progress`: real-time progress tracking for each stage. Key columns: id, job_id, stage, progress_percent, current_step, details JSONB, created_at. A unique constraint exists on (job_id, stage).
Stages:
- `initialized`: Job created
- `pdf_extracted`: PDF text extracted
- `chunks_created`: Text chunks created
- `text_embeddings_generated`: Text embeddings generated
- `images_extracted`: Images extracted from PDF
- `image_embeddings_generated`: Image embeddings generated
- `products_detected`: Products identified
- `products_created`: Product records created
- `completed`: All processing complete

`job_checkpoints`: checkpoint data for recovery. Key columns: id, job_id, stage, checkpoint_data JSONB, metadata JSONB, created_at. A unique constraint exists on (job_id, stage).
Checkpoint Data stored per stage:
- `chunk_ids`: IDs of created chunks
- `image_ids`: IDs of extracted images
- `embedding_ids`: IDs of generated embeddings
- `product_ids`: IDs of created products
- `metadata`: Stage-specific metadata

Queue table for image processing jobs. Key columns: id, document_id, image_id, status, priority (normal/high/critical), retry_count, max_retries, created_at, updated_at. A unique constraint exists on image_id.
`ai_analysis_queue`: queue for AI analysis jobs. Key columns: id, document_id, chunk_id, analysis_type, status, priority, retry_count, created_at, updated_at. A unique constraint exists on (chunk_id, analysis_type).
The frontend uploads a PDF, the backend creates a job record in background_jobs with status pending, and returns the job_id to the frontend for tracking.
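The create-then-track flow can be sketched as follows, using an in-memory dict as a stand-in for the `background_jobs` table. The unique constraint on (workspace_id, document_id, job_type) makes job creation idempotent; the helper below is hypothetical, not the backend's actual code:

```python
import uuid

# In-memory stand-in for background_jobs, keyed by the unique
# (workspace_id, document_id, job_type) constraint.
_jobs: dict[tuple[str, str, str], dict] = {}

def create_job(workspace_id: str, document_id: str, job_type: str) -> dict:
    """Create a pending job, or return the existing one for this key."""
    key = (workspace_id, document_id, job_type)
    if key in _jobs:
        # Unique constraint: at most one job per (workspace, document, type).
        return _jobs[key]
    job = {
        "id": str(uuid.uuid4()),
        "workspace_id": workspace_id,
        "document_id": document_id,
        "job_type": job_type,
        "status": "pending",
        "progress_percent": 0,
    }
    _jobs[key] = job
    return job
```

The returned `id` is what the frontend polls against while the job monitor does the actual work.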
The job monitor detects the pending job and starts processing. It updates the status to processing, then executes the 14-stage pipeline (Stage 0: Product Discovery at 0-15%, through Stage 13: Quality Enhancement at 97-100%). At each stage, the system creates a checkpoint, updates job_progress, and updates background_jobs.progress_percent.
After each successful stage, the checkpoint recovery service stores the stage's output data (e.g., chunk IDs, total chunks, average chunk size) in the job_checkpoints table. This allows the job to resume from that point if it fails.
The job monitor runs every 60 seconds and detects jobs that have been stuck for more than 30 minutes without progress. For each stuck job, it checks whether a valid checkpoint exists, restarts from the checkpoint if available, or marks the job as failed if not.
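The stuck-job check amounts to a timestamp comparison; a minimal sketch, assuming each job row carries a `last_progress_at` timestamp (field name is an assumption):

```python
from datetime import datetime, timedelta

STUCK_TIMEOUT = timedelta(minutes=30)

def find_stuck_jobs(jobs: list[dict], now: datetime) -> list[dict]:
    """Jobs in 'processing' whose last progress update is older than the timeout."""
    return [
        j for j in jobs
        if j["status"] == "processing" and now - j["last_progress_at"] > STUCK_TIMEOUT
    ]
```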
The recovery service checks if a job can resume from its last checkpoint, validates the checkpoint data against the database (verifying that referenced chunks/images still exist), then either auto-restarts the job from the checkpoint (setting status back to pending so it gets picked up) or marks it as failed if the checkpoint is invalid.
Manages job queuing and queue operations. Provides methods for queue_image_processing_jobs(), queue_ai_analysis_jobs(), and update_job_progress() with stage, progress percentage, and item counts.
Handles checkpoint creation and recovery. Provides create_checkpoint(), get_last_checkpoint(), can_resume_from_checkpoint(), auto_restart_stuck_job(), and verify_checkpoint_data().
Monitors jobs and performs auto-recovery. Provides start() to begin monitoring, get_health_status() returning monitor status, job counts by status, stuck job count, and overall health string, and force_restart_job() for manual intervention.
GET /api/v1/admin/job-monitor/health returns a JSON object with monitor_running boolean, stats (checks performed, stuck jobs detected, jobs restarted, jobs failed, last check timestamp), job_counts (by status), stuck_jobs_count, and overall health string.
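An illustrative response shape for this endpoint, sketched from the fields listed above; the exact key names inside `stats` are assumptions where the prose does not pin them down:

```python
# Hypothetical example of a GET /api/v1/admin/job-monitor/health response body.
example_health = {
    "monitor_running": True,
    "stats": {
        "checks_performed": 120,
        "stuck_jobs_detected": 3,
        "jobs_restarted": 2,
        "jobs_failed": 1,
        "last_check": "2025-12-21T10:00:00Z",
    },
    "job_counts": {"pending": 4, "processing": 2, "completed": 51, "failed": 1},
    "stuck_jobs_count": 0,
    "health": "healthy",
}
```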
Maximum 3 attempts with exponential backoff: 1s base delay, 30s max delay, multiplier of 2, with jitter enabled. Resulting delays: Attempt 1: 1s, Attempt 2: 2s, Attempt 3: 4s.
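The delay schedule above follows directly from the formula `base * multiplier ** (attempt - 1)`, capped at the maximum. A minimal sketch (the function is illustrative; the jitter strategy shown is "full jitter", which is an assumption about how the platform randomizes):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, multiplier: float = 2.0,
                  max_delay: float = 30.0, jitter: bool = False) -> float:
    """Delay before retry `attempt` (1-based): base * multiplier**(attempt-1), capped."""
    delay = min(base * multiplier ** (attempt - 1), max_delay)
    if jitter:
        # Full jitter: randomize within [0, delay] to spread out concurrent retries.
        delay = random.uniform(0.0, delay)
    return delay
```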
The JobMonitorService is configured with check_interval_seconds=60 (check every minute), stuck_job_timeout_minutes=30 (mark as stuck after 30 minutes), and auto_restart_enabled=True.
Priority levels:

- `critical`: Processed immediately
- `high`: Processed before normal jobs
- `normal`: Standard processing
- `low`: Processed when resources available

Symptoms: Job processing >30 minutes without progress

Solution:

`POST /api/v1/admin/jobs/{job_id}/restart`

Symptoms: Job fails to restart from checkpoint
Solution:
Symptoms: Job monitor consuming excessive memory
Solution:
The JobMonitorService continuously monitors ALL job types:
PDF Processing Jobs:
Web Scraping Sessions:
XML Import Jobs:
All job failures are automatically reported to Sentry with:
See monitoring-and-alerting.md for complete details.
The Material Kai Vision Platform uses a production-ready, unified job queue system with:
This hybrid approach combines Supabase's reliability with custom recovery logic to ensure robust PDF processing even in the face of failures.