Monitoring and Alerting System

Overview

The Material KAI Vision Platform monitors and alerts on all job types: PDF processing jobs, scraping sessions, and XML import jobs.

All failures are automatically reported to Sentry for real-time alerting and debugging.


🚨 Sentry Integration

Sentry Project

What Gets Reported to Sentry

1. PDF Processing Failures

2. Stuck PDF Jobs

3. Stuck Scraping Sessions

4. Scraping Session Failures

5. Stuck XML Import Jobs

6. XML Import Job Failures


📊 Job Monitor Service

Location

mivaa-pdf-extractor/app/services/job_monitor_service.py

Features

1. Multi-Job Type Monitoring

2. Detection Methods

PDF Jobs:

Scraping Sessions:

Import Jobs:

3. Recovery Strategies

PDF Jobs:

  1. Check for valid checkpoint
  2. If checkpoint exists and is valid → Auto-restart from checkpoint
  3. If no checkpoint or invalid → Mark as failed + Sentry alert
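The three PDF steps above can be sketched as one decision function. This is a minimal sketch, not the code in job_monitor_service.py: recover_pdf_job and the four callables it takes are hypothetical stand-ins for however the service validates a checkpoint, restarts a job, marks it failed and alerts Sentry.

```python
from typing import Callable, Optional


def recover_pdf_job(
    job_id: str,
    checkpoint: Optional[dict],
    checkpoint_is_valid: Callable[[dict], bool],
    restart: Callable[[str, dict], None],
    mark_failed: Callable[[str], None],
    alert: Callable[[str, str], None],
) -> str:
    """Apply the three PDF recovery steps to one stuck job.

    Every callable is a hypothetical stand-in, injected so the sketch
    assumes no real service, database or Sentry API.
    """
    # Steps 1 and 2: a checkpoint that exists and is valid means restart from it.
    if checkpoint is not None and checkpoint_is_valid(checkpoint):
        restart(job_id, checkpoint)
        return "restarted"

    # Step 3: no checkpoint, or an invalid one, means fail the job and alert.
    mark_failed(job_id)
    alert(job_id, "stuck PDF job has no valid checkpoint")
    return "failed"
```

Injecting the callables keeps the decision testable without a database or a Sentry client.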

Scraping Sessions:

  1. Mark session as failed
  2. Send Sentry alert with session details
  3. Update linked background_job if exists

Import Jobs:

  1. Mark job as failed
  2. Send Sentry alert with job details
  3. Update linked background_job if exists
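Scraping sessions and import jobs follow the same three steps, so one function can sketch both. It is an illustration under stated assumptions: rows are plain dicts, the status and background_job_id keys are hypothetical, the linked background job is assumed to be marked failed as well, and alert stands in for the Sentry call.

```python
from typing import Callable, Optional


def fail_stuck_record(
    kind: str,
    record: dict,
    background_jobs: dict,
    alert: Callable[[str, dict], None],
) -> Optional[dict]:
    """Fail a stuck scraping session or import job.

    `record` and `background_jobs` are in-memory stand-ins for database
    rows. The "status" and "background_job_id" keys are hypothetical.
    Returns the linked background job if one was updated, else None.
    """
    # Step 1: mark the session or job as failed.
    record["status"] = "failed"

    # Step 2: send an alert carrying a copy of the record's details.
    alert(f"stuck {kind} marked as failed", dict(record))

    # Step 3: update the linked background job, if there is one.
    # Assumption: the linked job mirrors the failed status.
    linked = background_jobs.get(record.get("background_job_id"))
    if linked is not None:
        linked["status"] = "failed"
    return linked
```

A record with no linked background job is still failed and reported. Only step 3 is skipped.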

4. Configuration

The JobMonitorService is configured with four settings:

  1. check_interval=60 - check every 60 seconds
  2. stuck_timeout=5 - 5-minute timeout for PDF jobs
  3. auto_restart=True - auto-restart enabled for PDF jobs
  4. max_restart_attempts=3 - at most 3 restart attempts per job
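How the four settings might fit together is sketched below. It is a minimal sketch, not the JobMonitorService itself: MonitorSettings, is_stuck and may_restart are hypothetical names, and the sketch assumes that stuck_timeout is counted in minutes from a job's last recorded activity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MonitorSettings:
    """Hypothetical container for the four documented values."""

    check_interval: int = 60       # seconds between checks
    stuck_timeout: int = 5         # minutes before a PDF job counts as stuck
    auto_restart: bool = True      # restart stuck PDF jobs automatically
    max_restart_attempts: int = 3  # restart ceiling per job


def is_stuck(last_activity: datetime, now: datetime, settings: MonitorSettings) -> bool:
    """Assumption: a job is stuck once it has been silent longer than stuck_timeout."""
    return now - last_activity > timedelta(minutes=settings.stuck_timeout)


def may_restart(attempts_so_far: int, settings: MonitorSettings) -> bool:
    """Assumption: restarts stop once max_restart_attempts have been used."""
    return settings.auto_restart and attempts_so_far < settings.max_restart_attempts
```

Under these assumptions, a PDF job that has been silent for six minutes counts as stuck, and a job that has already been restarted three times is not restarted again.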


🔄 Checkpoint Recovery System

Location

mivaa-pdf-extractor/app/services/checkpoint_recovery_service.py

Features

Checkpoint Stages

  1. pdf_loaded - PDF file loaded and validated
  2. text_extracted - Text extraction completed
  3. tiles_generated - Image tiles generated
  4. embeddings_created - Vector embeddings created
  5. materials_extracted - Materials extracted and saved
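The stage order is what makes a restart from checkpoint possible. The sketch below assumes that stages always run in the listed order and that a restart resumes at the stage after the last checkpoint. remaining_stages is a hypothetical helper, not a function from checkpoint_recovery_service.py.

```python
from typing import List

# Stage names in the documented order.
CHECKPOINT_STAGES = [
    "pdf_loaded",
    "text_extracted",
    "tiles_generated",
    "embeddings_created",
    "materials_extracted",
]


def remaining_stages(last_checkpoint: str) -> List[str]:
    """Return the stages still to run after the last saved checkpoint.

    An unknown stage name raises ValueError instead of guessing where
    to resume.
    """
    index = CHECKPOINT_STAGES.index(last_checkpoint)  # ValueError if unknown
    return CHECKPOINT_STAGES[index + 1:]
```

Under these assumptions, a job checkpointed at text_extracted resumes with tiles_generated, embeddings_created and materials_extracted. A job checkpointed at materials_extracted has nothing left to run.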

🌐 Edge Function Monitoring

Sentry Helper

Location: supabase/functions/_shared/sentry.ts

Functions:

Edge functions use the helper in four steps:

  1. Import captureException from the shared module.
  2. Wrap the processing logic in a try/catch.
  3. In the catch, call captureException with tags (function name, error type), extra context (job ID, error message), and a fingerprint for Sentry grouping.
  4. Re-throw the error.

Monitored Edge Functions

  1. scrape-session-manager - Web scraping orchestration
  2. xml-import-orchestrator - XML import processing

📈 Monitoring Dashboard

Sentry Dashboard

Access at: https://sentry.io/organizations/basilis-kanonidis/issues/

Key Metrics:

Database Tables

background_jobs:

scraping_sessions:

data_import_jobs:


🔧 Troubleshooting

Common Issues

1. Stuck Jobs Not Recovering

2. Sentry Alerts Not Appearing

3. High Failure Rate


📝 Best Practices

  1. Monitor Sentry Daily: Check for new error patterns
  2. Review Stuck Jobs: Investigate why jobs get stuck
  3. Optimize Timeouts: Adjust based on actual processing times
  4. Test Recovery: Periodically test checkpoint recovery
  5. Update Fingerprints: Keep Sentry fingerprints unique for proper grouping
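Practice 5 depends on how Sentry groups events: events that share a fingerprint land in the same issue. The sketch below is one way to keep fingerprints distinct per kind of failure, assuming a fingerprint built from the job type and the failure category. build_fingerprint and its job-monitor prefix are hypothetical.

```python
from typing import List


def build_fingerprint(job_type: str, failure: str) -> List[str]:
    """Build one fingerprint per (job type, failure category) pair.

    Hypothetical helper. Job IDs are left out on purpose: a per-job
    value would give every failed job its own Sentry issue.
    """
    return ["job-monitor", job_type, failure]
```

Two stuck PDF jobs then share one issue, while a stuck scraping session opens a separate one.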