Job Queue System & Async Processing

Architecture: Supabase-Native with Custom Recovery Layer
Last Updated: December 21, 2025


Overview

The Material Kai Vision Platform uses a unified job queue system across all data import pipelines with custom checkpoint-based recovery for resilient processing. This hybrid approach combines Supabase's reliability with custom recovery logic for fault tolerance.

Supported Job Types

  1. PDF Processing - Extract materials from PDF catalogs with 9-stage checkpoint recovery
  2. Web Scraping - Discover materials from websites with page-level tracking
  3. XML Import - Import materials from XML feeds with webhook retry logic

Key Features

  - Unified job tracking across all import pipelines (background_jobs)
  - Checkpoint-based recovery for PDF processing (job_checkpoints)
  - Automatic stuck-job detection and auto-restart (JobMonitorService)
  - Retry with exponential backoff
  - Real-time progress tracking (job_progress) and health monitoring
  - Sentry reporting for job failures


🏗️ Architecture

Multi-Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        Frontend (React)                         │
│  - PDF Upload (/admin/data-import)                              │
│  - Web Scraping UI (/scraper)                                   │
│  - XML Import (/admin/data-import)                              │
│  - Unified Job Monitor (/admin/async-queue-monitor)             │
└────────────────┬────────────────────────────────────────────────┘
                 │
┌────────────────▼────────────────────────────────────────────────┐
│                        Processing Layer                         │
│                                                                 │
│  MIVAA Backend (FastAPI)         Edge Functions (Deno)          │
│  - PDF Processing Service        - scrape-session-manager       │
│  - Web Scraping Service          - xml-import-orchestrator      │
│  - Checkpoint Recovery           - scrape-single-page           │
│  - Job Monitor (ALL TYPES)                                      │
└────────────────┬────────────────────────────────────────────────┘
                 │
┌────────────────▼────────────────────────────────────────────────┐
│                      Supabase PostgreSQL                        │
│  - background_jobs      (unified job tracking)                  │
│  - scraping_sessions    (web scraping jobs)                     │
│  - scraping_pages       (page-level tracking)                   │
│  - data_import_jobs     (XML import jobs)                       │
│  - webhook_calls        (API call tracking)                     │
│  - job_checkpoints      (PDF recovery data)                     │
│  - ai_analysis_queue    (AI analysis jobs)                      │
└─────────────────────────────────────────────────────────────────┘
```


📊 Database Tables

background_jobs

Main job tracking table. Key columns: id, workspace_id, document_id, job_type, status (pending/processing/completed/failed/retrying/cancelled), progress_percent (0-100), created_at, started_at, completed_at, error_message, and metadata JSONB. A unique constraint exists on (workspace_id, document_id, job_type).

Job Types: PDF processing, web scraping, and XML import (see Supported Job Types above).

Statuses: pending, processing, completed, failed, retrying, cancelled.


job_progress

Real-time progress tracking for each stage. Key columns: id, job_id, stage, progress_percent, current_step, details JSONB, created_at. A unique constraint exists on (job_id, stage).

Stages: one row per stage of the 14-stage processing pipeline (Stage 0: Product Discovery through Stage 13: Quality Enhancement).


job_checkpoints

Checkpoint data for recovery. Key columns: id, job_id, stage, checkpoint_data JSONB, metadata JSONB, created_at. A unique constraint exists on (job_id, stage).

Checkpoint Data stored per stage: the stage's output needed to resume from that point (e.g., chunk IDs, total chunks, average chunk size).


image_processing_queue

Queue for image processing jobs. Key columns: id, document_id, image_id, status, priority (normal/high/critical), retry_count, max_retries, created_at, updated_at. A unique constraint exists on image_id.


ai_analysis_queue

Queue for AI analysis jobs. Key columns: id, document_id, chunk_id, analysis_type, status, priority, retry_count, created_at, updated_at. A unique constraint exists on (chunk_id, analysis_type).


🔄 Processing Flow

1. Job Creation

The frontend uploads a PDF, the backend creates a job record in background_jobs with status pending, and returns the job_id to the frontend for tracking.
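The step above can be sketched as a small helper that builds the row inserted into background_jobs. The column names come from the table description earlier in this document; the job-type string and the function name are hypothetical:

```python
from datetime import datetime, timezone

def build_job_record(workspace_id: str, document_id: str, job_type: str) -> dict:
    """Assemble a background_jobs row for a newly created job (illustrative sketch)."""
    return {
        "workspace_id": workspace_id,
        "document_id": document_id,
        "job_type": job_type,
        "status": "pending",           # picked up later by the job monitor
        "progress_percent": 0,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "metadata": {},
    }

job = build_job_record("ws-1", "doc-1", "pdf_processing")
```

The unique constraint on (workspace_id, document_id, job_type) means an insert of this row for the same document and job type would conflict rather than create a duplicate job.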

2. Job Processing

The job monitor detects the pending job and starts processing. It updates the status to processing, then executes the 14-stage pipeline (Stage 0: Product Discovery at 0-15%, through Stage 13: Quality Enhancement at 97-100%). At each stage, the system creates a checkpoint, updates job_progress, and updates background_jobs.progress_percent.
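One way to map per-stage progress onto the job's overall progress_percent is to give each stage a fixed band of the 0-100 scale, as the stage ranges above suggest. Only the Stage 0 (0-15%) and Stage 13 (97-100%) bands are documented; the helper itself is an illustrative sketch:

```python
def overall_progress(stage_start: int, stage_end: int, stage_fraction: float) -> int:
    """Map completion within one stage onto the job's overall 0-100 scale.

    stage_start/stage_end bound the band this stage owns (e.g. 0-15 for
    Stage 0: Product Discovery); stage_fraction is how far through the
    stage the job is, clamped to [0, 1].
    """
    stage_fraction = min(max(stage_fraction, 0.0), 1.0)
    return round(stage_start + (stage_end - stage_start) * stage_fraction)

overall_progress(0, 15, 0.5)    # halfway through Stage 0
overall_progress(97, 100, 1.0)  # Stage 13 complete → 100
```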

3. Checkpoint Creation

After each successful stage, the checkpoint recovery service stores the stage's output data (e.g., chunk IDs, total chunks, average chunk size) in the job_checkpoints table. This allows the job to resume from that point if it fails.
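A minimal sketch of the row such a create_checkpoint call might write, using the example fields mentioned above; the exact JSONB shape and field names are assumptions:

```python
def build_checkpoint(job_id: str, stage: str,
                     chunk_ids: list[str], chunk_sizes: list[int]) -> dict:
    """Assemble a job_checkpoints row after a successful stage (illustrative shape)."""
    avg_size = sum(chunk_sizes) / len(chunk_sizes) if chunk_sizes else 0
    return {
        "job_id": job_id,
        "stage": stage,
        "checkpoint_data": {
            "chunk_ids": chunk_ids,        # what a resumed job would reuse
            "total_chunks": len(chunk_ids),
            "avg_chunk_size": avg_size,
        },
        "metadata": {},
    }
```

Because of the unique constraint on (job_id, stage), writing a checkpoint for the same stage twice replaces (upserts) rather than duplicates it.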

4. Stuck Job Detection

The job monitor runs every 60 seconds and detects jobs that have been stuck for more than 30 minutes without progress. For each stuck job, it checks whether a valid checkpoint exists, restarts from the checkpoint if available, or marks the job as failed if not.
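The detection rule reduces to a small predicate; the field name last_progress_at is hypothetical, while the 30-minute threshold matches the configured stuck_job_timeout_minutes:

```python
from datetime import datetime, timedelta, timezone

def is_stuck(last_progress_at: datetime, now: datetime,
             timeout_minutes: int = 30) -> bool:
    """A job is stuck if it has shown no progress for longer than the timeout."""
    return now - last_progress_at > timedelta(minutes=timeout_minutes)

now = datetime(2025, 12, 21, 12, 0, tzinfo=timezone.utc)
is_stuck(now - timedelta(minutes=45), now)  # → True
is_stuck(now - timedelta(minutes=10), now)  # → False
```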

5. Auto-Recovery

The recovery service checks if a job can resume from its last checkpoint, validates the checkpoint data against the database (verifying that referenced chunks/images still exist), then either auto-restarts the job from the checkpoint (setting status back to pending so it gets picked up) or marks it as failed if the checkpoint is invalid.
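The validate-then-decide logic can be sketched as follows, assuming checkpoints reference chunks by ID; the function and field names are illustrative, not the service's actual API:

```python
def can_resume(checkpoint_data: dict, existing_chunk_ids: set[str]) -> bool:
    """Valid only if every chunk the checkpoint references still exists in the DB."""
    referenced = checkpoint_data.get("chunk_ids", [])
    return bool(referenced) and all(cid in existing_chunk_ids for cid in referenced)

def recover(job: dict, checkpoint_data: dict, existing_chunk_ids: set[str]) -> dict:
    """Requeue the job from its checkpoint, or mark it failed if invalid."""
    if can_resume(checkpoint_data, existing_chunk_ids):
        job["status"] = "pending"   # job monitor will pick it up again
    else:
        job["status"] = "failed"
        job["error_message"] = "invalid checkpoint"
    return job
```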


🎯 Key Services

AsyncQueueService

Manages job queuing and queue operations. Provides methods for queue_image_processing_jobs(), queue_ai_analysis_jobs(), and update_job_progress() with stage, progress percentage, and item counts.

CheckpointRecoveryService

Handles checkpoint creation and recovery. Provides create_checkpoint(), get_last_checkpoint(), can_resume_from_checkpoint(), auto_restart_stuck_job(), and verify_checkpoint_data().

JobMonitorService

Monitors jobs and performs auto-recovery. Provides start() to begin monitoring, get_health_status() (returning monitor status, job counts by status, stuck-job count, and an overall health string), and force_restart_job() for manual intervention.


📈 Monitoring & Observability

Health Check Endpoint

GET /api/v1/admin/job-monitor/health returns a JSON object with monitor_running boolean, stats (checks performed, stuck jobs detected, jobs restarted, jobs failed, last check timestamp), job_counts (by status), stuck_jobs_count, and overall health string.
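An illustrative response body and one possible rollup rule for the overall health string; the field names come from the description above, but the nesting, example values, and the rule itself are assumptions:

```python
def overall_health(monitor_running: bool, stuck_jobs_count: int) -> str:
    """One possible rollup rule; the real rule is not specified in this document."""
    if not monitor_running:
        return "unhealthy"
    return "degraded" if stuck_jobs_count > 0 else "healthy"

# Example response shape (values are placeholders, not real metrics):
health = {
    "monitor_running": True,
    "stats": {
        "checks_performed": 1440,
        "stuck_jobs_detected": 3,
        "jobs_restarted": 2,
        "jobs_failed": 1,
        "last_check": "2025-12-21T12:00:00Z",
    },
    "job_counts": {"pending": 4, "processing": 2, "completed": 120, "failed": 1},
    "stuck_jobs_count": 0,
    "health": overall_health(True, 0),
}
```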

Admin Dashboard

The unified job monitor UI (/admin/async-queue-monitor) displays jobs of all types with their current status and progress.


⚙️ Configuration

Retry Policy

Maximum 3 attempts with exponential backoff: 1s base delay, 30s max delay, multiplier of 2, with jitter enabled. Resulting delays: Attempt 1: 1s, Attempt 2: 2s, Attempt 3: 4s.
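The policy above can be computed as follows; the jitter range is an assumption, since the document only says jitter is enabled:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, multiplier: float = 2.0,
                  max_delay: float = 30.0, jitter: bool = False) -> float:
    """Delay before retry `attempt` (1-based): base * multiplier^(attempt-1), capped."""
    delay = min(base * multiplier ** (attempt - 1), max_delay)
    if jitter:
        delay *= random.uniform(0.5, 1.5)  # hypothetical range: spread out retries
    return delay

[backoff_delay(a) for a in (1, 2, 3)]  # → [1.0, 2.0, 4.0]
```

The 30s cap only matters for hypothetical later attempts (e.g. attempt 6 would be min(32, 30) = 30s); within the documented 3-attempt limit it is never reached.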

Job Monitor Configuration

The JobMonitorService is configured with check_interval_seconds=60 (check every minute), stuck_job_timeout_minutes=30 (mark as stuck after 30 minutes), and auto_restart_enabled=True.

Priority Levels

Queue items are processed at one of three priorities: normal, high, or critical (see image_processing_queue above).


🚀 Production Metrics

Performance

Reliability

Capacity


🔧 Troubleshooting

Job Stuck

Symptoms: Job processing >30 minutes without progress

Solution:

  1. Check job_progress table for last update
  2. Verify checkpoint exists
  3. Manual restart: POST /api/v1/admin/jobs/{job_id}/restart

Checkpoint Invalid

Symptoms: Job fails to restart from checkpoint

Solution:

  1. Verify checkpoint_data in the job_checkpoints table
  2. Check whether the referenced chunks/images still exist
  3. Clean up invalid checkpoints
  4. Restart the job from the beginning

High Memory Usage

Symptoms: Job monitor consuming excessive memory

Solution:

  1. Increase check_interval_seconds (run checks less frequently)
  2. Reduce stuck_job_timeout_minutes (clear stuck jobs sooner)
  3. Clean up old checkpoints
  4. Restart the job monitor service

📚 Related Documentation

  - monitoring-and-alerting.md - Monitoring, alerting, and Sentry integration details


🚨 Monitoring & Alerting

Job Monitor Service

The JobMonitorService continuously monitors ALL job types:

PDF Processing Jobs: stuck jobs (no progress for >30 minutes) are detected each monitor cycle and restarted from their last valid checkpoint, or marked failed if none exists.

Web Scraping Sessions: tracked per page via scraping_sessions and scraping_pages, so stalled sessions can be identified at the page level.

XML Import Jobs: tracked in data_import_jobs, with each API call logged in webhook_calls and retried on failure.

Sentry Integration

All job failures are automatically reported to Sentry with full job context. See monitoring-and-alerting.md for complete details.


Summary

The Material Kai Vision Platform uses a production-ready, unified job queue system: a single tracking table (background_jobs) across all pipelines, checkpoint-based recovery, automatic stuck-job detection with auto-restart, and retry with exponential backoff.

This hybrid approach combines Supabase's reliability with custom recovery logic to ensure robust processing across all import pipelines, even in the face of failures.