The web scraping system is fully integrated with the platform's unified job tracking infrastructure, providing:
```
┌─────────────────────────────────────────────────────────────┐
│                      Scraping Session                       │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ scraping_sessions                                      │ │
│  │  - id, source_url, status                              │ │
│  │  - background_job_id  ← Links to job tracking          │ │
│  │  - workspace_id                                        │ │
│  │  - progress_percentage, total_pages                    │ │
│  └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                 Unified Job Tracking System                 │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ background_jobs                                        │ │
│  │  - id, job_type: 'web_scraping'                        │ │
│  │  - status, progress, current_stage                     │ │
│  │  - last_heartbeat (updated every 30s)                  │ │
│  │  - metadata (session_id, scraping_mode, etc.)          │ │
│  └────────────────────────────────────────────────────────┘ │
│                                                             │
│  Features:                                                  │
│  • Automatic retry (3 attempts with exponential backoff)    │
│  • Stuck job detection (no heartbeat > 5 min)               │
│  • Dead letter queue for failed jobs                        │
│  • Circuit breaker for external services                    │
│  • Sentry integration for crash alerts                      │
└─────────────────────────────────────────────────────────────┘
```
When a scraping session is created, the system first inserts a record into background_jobs with job_type: 'web_scraping', status: 'pending', progress: 0, current_stage: 'initializing', and metadata containing session_id, source_url, scraping_mode, and total_pages. Then it inserts a record into scraping_sessions with a background_job_id linking it to the job tracking entry.
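The two linked inserts can be sketched as a pure record builder so the linkage is explicit. The table and column names come from this document; `buildJobAndSession` and its parameter shape are illustrative, not the platform's actual helper:

```typescript
import { randomUUID } from "crypto";

interface SessionParams {
  sourceUrl: string;
  scrapingMode: string;
  totalPages: number;
  workspaceId: string;
}

// Builds the background_jobs record first, then the scraping_sessions
// record pointing back at it via background_job_id.
function buildJobAndSession(params: SessionParams) {
  const sessionId = randomUUID();
  const job = {
    id: randomUUID(),
    job_type: "web_scraping",
    status: "pending",
    progress: 0,
    current_stage: "initializing",
    metadata: {
      session_id: sessionId,
      source_url: params.sourceUrl,
      scraping_mode: params.scrapingMode,
      total_pages: params.totalPages,
    },
  };
  const session = {
    id: sessionId,
    source_url: params.sourceUrl,
    status: "pending",
    workspace_id: params.workspaceId,
    background_job_id: job.id, // links the session to job tracking
  };
  return { job, session };
}
```

In practice each record would then be persisted in order, e.g. `supabase.from('background_jobs').insert(job)` followed by `supabase.from('scraping_sessions').insert(session)`.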
The job monitor service automatically:

- Retries failed jobs with exponential backoff
- Detects stuck jobs via heartbeat monitoring
- Sends Sentry alerts for failures and attempts recovery from the last checkpoint
If a job fails or gets stuck, the system applies automatic retry with exponential backoff (max 3 attempts, 2s base delay, 30s max delay), specifically handling TimeoutError and ConnectionError exceptions.
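The retry policy above (3 attempts, 2s base delay, 30s cap, retrying only timeout and connection errors) can be sketched as follows; `withRetry` and `backoffDelay` are illustrative names, not the platform's actual helpers:

```typescript
const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 30_000;

// attempt 0 → 2s, attempt 1 → 4s, attempt 2 → 8s, capped at 30s
function backoffDelay(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  sleep = (ms: number) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only transient network failures are retried, per the policy above.
      const name = (err as Error).name;
      if (name !== "TimeoutError" && name !== "ConnectionError") throw err;
      if (attempt < MAX_ATTEMPTS - 1) await sleep(backoffDelay(attempt));
    }
  }
  throw lastError;
}
```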
The job monitor service runs continuously and queries background_jobs for records with status = 'processing' and last_heartbeat < NOW() - INTERVAL '5 minutes'. For each stuck job found, it sends a Sentry alert and attempts recovery from the last checkpoint.
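The stuck-job condition above (`status = 'processing' AND last_heartbeat < NOW() - INTERVAL '5 minutes'`) can be expressed as a predicate for testing; `isStuck` is an illustrative name:

```typescript
const HEARTBEAT_TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes

interface JobRow {
  status: string;
  last_heartbeat: string; // ISO timestamp
}

// Mirrors the monitor's SQL filter: processing jobs whose last
// heartbeat is older than the 5-minute timeout are considered stuck.
function isStuck(job: JobRow, now: Date = new Date()): boolean {
  if (job.status !== "processing") return false;
  const age = now.getTime() - new Date(job.last_heartbeat).getTime();
  return age > HEARTBEAT_TIMEOUT_MS;
}
```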
✅ Processing Times: Tracked per page and per session
✅ Success Rates: Calculated from completed vs failed jobs
✅ Error Rates: Tracked in job_history and metrics
✅ Throughput: Jobs processed per minute
✅ Queue Depth: Number of pending jobs
The batch job queue exposes a getMetrics() method returning totalJobs, queuedJobs, processingJobs, completedJobs, failedJobs, throughputPerMinute, errorRate, and averageProcessingTime.
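The return shape of getMetrics() can be sketched as an interface; only the field names come from this document, and the error-rate calculation shown is a hypothetical example of how one field might be derived:

```typescript
interface BatchQueueMetrics {
  totalJobs: number;
  queuedJobs: number;
  processingJobs: number;
  completedJobs: number;
  failedJobs: number;
  throughputPerMinute: number;
  errorRate: number;
  averageProcessingTime: number;
}

// Hypothetical derivation: error rate as the share of finished jobs
// that failed, guarding against division by zero.
function computeErrorRate(completedJobs: number, failedJobs: number): number {
  const finished = completedJobs + failedJobs;
  return finished === 0 ? 0 : failedJobs / finished;
}
```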
Sentry alerts are automatically sent for:

- Job crashes and unhandled exceptions
- Stuck jobs (no heartbeat for more than 5 minutes)
- Jobs moved to the dead letter queue after exhausting all retry attempts
✅ Batch Processing: Process multiple pages concurrently
✅ Connection Pooling: Supabase client uses connection pooling
✅ Circuit Breaker: Prevents cascading failures
✅ Rate Limiting: Configurable delays between batches
✅ Timeout Protection: Prevents hanging requests
Batch Inserts for Pages: Instead of inserting scraping pages one by one, process them in batches of 100 using a loop with supabase.from('scraping_pages').insert(batch).
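The batching loop above can be sketched with a generic chunking helper; `chunk` is an illustrative name, and the insert call in the comment mirrors supabase-js usage but is not executed here:

```typescript
// Splits an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Usage (assuming an initialized Supabase client and a `pages` array):
// for (const batch of chunk(pages, 100)) {
//   const { error } = await supabase.from("scraping_pages").insert(batch);
//   if (error) throw error;
// }
```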
Database Query Optimization: Add a composite index on scraping_pages(session_id, status) to speed up common status queries for a given session.
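The composite index described above could be created with a statement along these lines; the index name is an assumption:

```sql
CREATE INDEX IF NOT EXISTS idx_scraping_pages_session_status
  ON scraping_pages (session_id, status);
```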
Through the background_job_id link in scraping_sessions, the web scraping system now leverages the existing, battle-tested job infrastructure used by PDF processing!