Monitoring & Analytics System

Comprehensive real-time monitoring and analytics infrastructure for the Material Kai Vision Platform.


🎯 Overview

The platform includes a complete monitoring and analytics system that tracks:


📊 Admin Dashboards

1. PDF Processing Monitor (/admin/async-queue-monitor)

Purpose: Real-time monitoring of PDF processing jobs and pipeline stages

Features:

Data Sources:

Metrics Tracked: For pdf_processing, the system tracks pending, processing, completed, failed, retrying, and total job counts, plus success rate and average processing time. At the platform level it tracks total documents, total products created, total chunks created, and total images extracted.
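
As an illustration of how the derived figures above could be computed from raw job counts, here is a minimal sketch. The `JobCounts` class and `average_processing_time_ms` helper are hypothetical names, and the success-rate definition (completed divided by completed plus failed) is an assumption, since the source does not state the formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobCounts:
    """Counts per status for one job type (hypothetical container)."""
    pending: int = 0
    processing: int = 0
    completed: int = 0
    failed: int = 0
    retrying: int = 0

    @property
    def total(self) -> int:
        return (self.pending + self.processing + self.completed
                + self.failed + self.retrying)

    @property
    def success_rate(self) -> float:
        # Assumed definition: share of finished jobs that completed, in percent.
        finished = self.completed + self.failed
        return 100.0 * self.completed / finished if finished else 0.0


def average_processing_time_ms(durations_ms: List[float]) -> float:
    """Mean duration of finished jobs; 0.0 when there are none."""
    return sum(durations_ms) / len(durations_ms) if durations_ms else 0.0
```

Excluding pending and in-flight jobs from the denominator keeps the rate from dipping while a large batch is still being processed; a dashboard that divides by the total instead would report a different number.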

Real-Time Updates:


2. Analytics Dashboard (/admin/analytics)

Purpose: Comprehensive analytics across search, API usage, PDF processing, and quality metrics

Tabs:

Search Analytics:

API Usage:

Agent Chat Analytics:

Quality Metrics:

PDF Processing:


3. AI Monitoring Dashboard (/admin/ai-monitoring)

Purpose: Track AI model usage, costs, and performance

Metrics:

Time Periods: 24h, 7d, 30d, 90d

Models Tracked:


🔄 Real-Time Job Tracking

Background Jobs System

Table: background_jobs

Columns: id, workspace_id, document_id, job_type (pdf_processing, image_analysis, etc.), status (pending, processing, completed, failed, retrying, cancelled), progress (0-100), created_at, started_at, completed_at, failed_at, error, and metadata JSONB.

The metadata JSONB field contains: filename, stage, products_discovered, chunks_created, images_extracted, embeddings_generated, processing_time_ms, ai_model, and retry_count.
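
The shape of a row can be sketched as a Python dataclass. This is only an in-memory illustration: the `BackgroundJob` class and its validation rules are assumptions, the column types are guesses, and the timestamp columns are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Status values listed for the background_jobs table.
VALID_STATUSES = {"pending", "processing", "completed",
                  "failed", "retrying", "cancelled"}


@dataclass
class BackgroundJob:
    """Hypothetical in-memory mirror of one background_jobs row."""
    id: str
    workspace_id: str
    document_id: str
    job_type: str                      # e.g. "pdf_processing", "image_analysis"
    status: str = "pending"
    progress: int = 0                  # 0-100
    error: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)  # JSONB column

    def __post_init__(self) -> None:
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        if not 0 <= self.progress <= 100:
            raise ValueError("progress must be between 0 and 100")
```

A job in the chunking stage might then carry `metadata={"filename": "catalog.pdf", "stage": "chunking", "retry_count": 0}`, where the values are made up for illustration.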


📈 Monitoring Integration

Stage 0: Product Discovery

Metrics Tracked:

Stage 0 logs completion details and saves checkpoint metadata including all of the above metrics.


Stage 1: Focused Extraction

Metrics Tracked:

Stage 1 logs pages extracted, extraction rate percentage, and text length in characters.


Stage 3.5: Embedding-to-Text Conversion

Purpose: Convert visual embeddings to text descriptions for enhanced search

Metrics Tracked:

Stage 3.5 logs successful conversions, failed conversions, and AI calls made.


Stage 4: Metadata Consolidation

Purpose: Consolidate metadata from all sources (discovery, extraction, embeddings)

Metrics Tracked:

Stage 4 logs products consolidated, sources merged, and AI calls made.


🎯 Checkpoint System

9 Processing Checkpoints

All stages save comprehensive metrics to checkpoints for recovery:

  1. INITIALIZED - Job created
  2. PDF_EXTRACTED - Stage 1 complete (focused extraction)
  3. CHUNKS_CREATED - Stage 2 complete (chunking)
  4. TEXT_EMBEDDINGS_GENERATED - Stage 3 complete (text embeddings)
  5. IMAGES_EXTRACTED - Stage 5 complete (image extraction)
  6. IMAGE_EMBEDDINGS_GENERATED - Stage 7 complete (CLIP embeddings)
  7. PRODUCTS_DETECTED - Stage 0 complete (product discovery)
  8. PRODUCTS_CREATED - Stage 9 complete (product creation)
  9. COMPLETED - All stages complete

Each checkpoint saves a stage identifier, checkpoint_data (e.g., document_id, images_extracted, material_images), and metadata (e.g., processing_time_ms, ai_model, success_rate).
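
To make the recovery idea concrete, the sketch below models the nine checkpoints as an enum and picks the checkpoint to resume toward. The names `Checkpoint`, `build_checkpoint`, and `next_checkpoint` are hypothetical, and the resume rule (continue after the highest checkpoint saved) is an assumption about how recovery might work, not a description of the platform's code.

```python
from enum import Enum
from typing import Any, Dict, Iterable, Optional


class Checkpoint(Enum):
    INITIALIZED = 1
    PDF_EXTRACTED = 2
    CHUNKS_CREATED = 3
    TEXT_EMBEDDINGS_GENERATED = 4
    IMAGES_EXTRACTED = 5
    IMAGE_EMBEDDINGS_GENERATED = 6
    PRODUCTS_DETECTED = 7
    PRODUCTS_CREATED = 8
    COMPLETED = 9


def build_checkpoint(stage: Checkpoint,
                     checkpoint_data: Dict[str, Any],
                     metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble the three parts a checkpoint is described as saving."""
    return {"stage": stage.name,
            "checkpoint_data": checkpoint_data,
            "metadata": metadata}


def next_checkpoint(saved: Iterable[Checkpoint]) -> Optional[Checkpoint]:
    """Return the checkpoint to work toward next, or None if the job is done."""
    ordered = list(Checkpoint)                      # definition order, 1..9
    latest = max(saved, key=lambda c: c.value, default=None)
    if latest is None:
        return ordered[0]
    if latest is Checkpoint.COMPLETED:
        return None
    return ordered[latest.value]                    # values are 1-based
```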


📊 Sentry Integration

Exception Tracking

All stages integrate with Sentry for exception capture. Any exception raised during processing is captured by sentry_sdk.capture_exception(), logged, and re-raised. Sentry receives the job ID, document ID, current stage, processing metrics, and error stack trace as context.
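
The capture, log, and re-raise flow described above might look roughly like the following sketch. `run_stage`, `sentry_capture`, and the `"pipeline_job"` context key are hypothetical; only `sentry_sdk.set_context()` and `sentry_sdk.capture_exception()` are real SDK calls. The capture function is injectable so the wrapper can be exercised without a Sentry project.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("pdf_pipeline")

CaptureFn = Callable[[BaseException, Dict[str, Any]], None]


def sentry_capture(exc: BaseException, context: Dict[str, Any]) -> None:
    """Default capture hook; needs the sentry-sdk package and sentry_sdk.init()."""
    import sentry_sdk  # imported lazily so the sketch loads without the package
    sentry_sdk.set_context("pipeline_job", context)  # context key is an assumption
    sentry_sdk.capture_exception(exc)                # stack trace travels with exc


def run_stage(stage: str, job_id: str, document_id: str,
              work: Callable[[], Any],
              metrics: Optional[Dict[str, Any]] = None,
              capture: CaptureFn = sentry_capture) -> Any:
    """Run one stage; on failure capture, log, and re-raise the exception."""
    try:
        return work()
    except Exception as exc:
        context = {"job_id": job_id, "document_id": document_id,
                   "stage": stage, "metrics": metrics or {}}
        capture(exc, context)
        logger.exception("Stage %s failed for job %s", stage, job_id)
        raise
```

Re-raising after capture lets the job runner still mark the job as failed or schedule a retry, while Sentry keeps the stage-level context.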


🔍 Search Analytics

Query Tracking

Table: search_queries

Metrics:

Analytics:


💰 Cost Tracking

AI Model Costs

Per Model Pricing:

Cost calculation multiplies token counts by per-million rates for input and output separately, then sums them.
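
The calculation can be sketched in a few lines. The function name and parameter names are hypothetical, and any rates passed in are placeholders: the source does not list actual per-model prices here.

```python
def calculate_cost(input_tokens: int, output_tokens: int,
                   input_rate_per_million: float,
                   output_rate_per_million: float) -> float:
    """Cost = input tokens and output tokens, each priced per million, summed."""
    input_cost = input_tokens / 1_000_000 * input_rate_per_million
    output_cost = output_tokens / 1_000_000 * output_rate_per_million
    return input_cost + output_cost
```

With placeholder rates of 3.00 per million input tokens and 15.00 per million output tokens, a call that uses 1,000,000 input tokens and 500,000 output tokens would cost 3.00 + 7.50 = 10.50.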


📈 Performance Metrics

System-Wide Metrics

  • Uptime: 99.5%+
  • Users: 5,000+
  • Search Response: 200-800ms
  • PDF Processing: 1-15 minutes (size-dependent)
  • Concurrent Jobs: Unlimited queue

Accuracy:


🔔 Alerts & Notifications

Alert Types

Critical Alerts:

Warning Alerts:

Notification Channels:


🏥 System Health Monitoring

Overview

A health monitoring system that tracks database performance, the job monitoring service, and overall system reliability.

Access: /admin/analytics → System Health tab

Features

1. Database Health Monitoring

Metrics: a healthy boolean, connection_test_ms, query_test_ms, error_count, consecutive_failures, uptime_seconds, and a performance object containing avg_query_time_ms, max_query_time_ms, slow_query_count, and slow_query_threshold_ms (1000 ms).

2. Job Monitor Health

Metrics: a monitor_running boolean, stuck_jobs_count, and a health string ('healthy', 'degraded', or 'unhealthy').

3. Query Performance Metrics

Metrics: total_queries, slow_queries, slow_query_percentage, avg_query_time_ms, max_query_time_ms, and table_metrics (per-table counts, avg/max times, and slow query counts).
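
One way such metrics could be accumulated is sketched below. The `QueryMetrics` class is hypothetical, and treating a query as slow when it strictly exceeds the threshold is an assumption; the 1000 ms default matches the slow_query_threshold_ms value listed under Database Health Monitoring.

```python
from collections import defaultdict
from typing import Any, Dict, List


class QueryMetrics:
    """Hypothetical in-process collector for query timings."""

    def __init__(self, slow_query_threshold_ms: float = 1000.0) -> None:
        self.slow_query_threshold_ms = slow_query_threshold_ms
        self._durations: Dict[str, List[float]] = defaultdict(list)

    def record(self, table: str, duration_ms: float) -> None:
        self._durations[table].append(duration_ms)

    def reset(self) -> None:
        """Clear all recorded timings."""
        self._durations.clear()

    def _summarize(self, durations: List[float]) -> Dict[str, Any]:
        count = len(durations)
        slow = sum(1 for d in durations if d > self.slow_query_threshold_ms)
        return {"count": count,
                "slow_queries": slow,
                "avg_query_time_ms": sum(durations) / count if count else 0.0,
                "max_query_time_ms": max(durations, default=0.0)}

    def snapshot(self) -> Dict[str, Any]:
        everything = [d for ds in self._durations.values() for d in ds]
        overall = self._summarize(everything)
        total = overall["count"]
        return {"total_queries": total,
                "slow_queries": overall["slow_queries"],
                "slow_query_percentage":
                    100.0 * overall["slow_queries"] / total if total else 0.0,
                "avg_query_time_ms": overall["avg_query_time_ms"],
                "max_query_time_ms": overall["max_query_time_ms"],
                "table_metrics": {t: self._summarize(ds)
                                  for t, ds in self._durations.items()}}
```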

4. Circuit Breaker Status

States:

Metrics: state ('closed', 'open', or 'half_open') and failure_count.
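
For readers unfamiliar with the pattern, the three states can be illustrated with a small state machine. This is a generic sketch, not the platform's implementation: the `CircuitBreaker` class, the failure threshold of 5, and the 30-second recovery timeout are all assumed values.

```python
import time
from typing import Callable, Dict, Optional, Union


class CircuitBreaker:
    """Generic closed / open / half_open breaker (illustrative only)."""

    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._clock = clock
        self.state = "closed"
        self.failure_count = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Closed and half_open let calls through; open blocks until timeout."""
        if self.state == "open":
            if self._clock() - self._opened_at >= self.recovery_timeout_s:
                self.state = "half_open"   # let one trial call through
                return True
            return False
        return True

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self.failure_count += 1
        # A failed trial call, or too many failures, (re)opens the circuit.
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()

    def status(self) -> Dict[str, Union[str, int]]:
        return {"state": self.state, "failure_count": self.failure_count}
```

The injectable `clock` exists only so the timeout can be tested without sleeping.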

Health Check API Endpoints

GET /health/

Basic health check: returns HTTP 200 if the service is running.

GET /health/detailed

Comprehensive health status with all subsystems: overall_status, database, job_monitor, query_metrics, circuit_breaker, and timestamp.

GET /health/database

Database connection health only.

GET /health/job-monitor

Job monitoring service health only.

GET /health/metrics

Query performance metrics only.

GET /health/circuit-breakers

Circuit breaker status for all protected services.

POST /health/metrics/reset

Reset query performance metrics (useful for testing).

Database Performance Optimizations

Indexes Added (2025-01-20)

Six critical indexes to optimize job monitoring queries:

  1. Stuck Job Detection (idx_background_jobs_status_updated_at)

    • Query: WHERE status = 'processing' AND updated_at < cutoff_time
    • Performance: 500-900ms → 5-20ms (95-98% faster)
  2. Heartbeat Timeout Detection (idx_background_jobs_status_heartbeat)

    • Query: WHERE status = 'processing' AND last_heartbeat < cutoff_time
    • Performance: 500-900ms → 5-20ms (95-98% faster)
  3. Workspace + Status Queries (idx_background_jobs_workspace_status)

    • Query: WHERE workspace_id = ? AND status = ?
    • Composite index on (workspace_id, status, created_at DESC)
  4. Job Cleanup (idx_background_jobs_status_completed_at)

    • Query: WHERE status = 'completed' AND completed_at < cutoff_time
    • Partial index for completed jobs only
  5. Checkpoint Queries (idx_job_checkpoints_job_created)

    • Query: WHERE job_id = ? ORDER BY created_at DESC
    • Composite index on (job_id, created_at DESC)
  6. Progress Tracking (idx_job_progress_document_updated)

    • Query: WHERE document_id = ? ORDER BY updated_at DESC
    • Composite index on (document_id, updated_at DESC)
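
The index shapes above can be tried out locally with a sketch like the one below, which uses Python's built-in sqlite3 module and an in-memory database. Several things here are assumptions: the production database engine is not stated in this document, the table definitions are minimal stand-ins, and the column lists for indexes 1, 2, and 4 are inferred from the WHERE clauses rather than copied from the real DDL.

```python
import sqlite3
from typing import List

# Index names come from the list above; column lists are partly inferred.
INDEX_DDL = [
    "CREATE INDEX idx_background_jobs_status_updated_at "
    "ON background_jobs (status, updated_at)",
    "CREATE INDEX idx_background_jobs_status_heartbeat "
    "ON background_jobs (status, last_heartbeat)",
    "CREATE INDEX idx_background_jobs_workspace_status "
    "ON background_jobs (workspace_id, status, created_at DESC)",
    # Partial index: only rows for completed jobs are indexed.
    "CREATE INDEX idx_background_jobs_status_completed_at "
    "ON background_jobs (completed_at) WHERE status = 'completed'",
    "CREATE INDEX idx_job_checkpoints_job_created "
    "ON job_checkpoints (job_id, created_at DESC)",
    "CREATE INDEX idx_job_progress_document_updated "
    "ON job_progress (document_id, updated_at DESC)",
]


def create_schema(conn: sqlite3.Connection) -> None:
    """Create minimal stand-in tables, then the six indexes."""
    conn.executescript("""
        CREATE TABLE background_jobs (
            id TEXT PRIMARY KEY, workspace_id TEXT, status TEXT,
            created_at TEXT, updated_at TEXT,
            last_heartbeat TEXT, completed_at TEXT);
        CREATE TABLE job_checkpoints (
            id INTEGER PRIMARY KEY, job_id TEXT, created_at TEXT);
        CREATE TABLE job_progress (
            id INTEGER PRIMARY KEY, document_id TEXT, updated_at TEXT);
    """)
    for ddl in INDEX_DDL:
        conn.execute(ddl)


def index_names(conn: sqlite3.Connection) -> List[str]:
    """List the explicitly created indexes (names starting with idx)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND name LIKE 'idx%'").fetchall()
    return sorted(row[0] for row in rows)
```

Timings measured against SQLite will not match the production figures quoted above; the sketch is only meant to show which columns each index covers.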

Impact:

Resilience Features

1. Retry Logic with Exponential Backoff
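
A minimal sketch of retrying with exponential backoff, under assumed defaults. The function names, the attempt count, the base delay, the growth factor, and the delay cap are all hypothetical and not taken from the platform's code; jitter is left out to keep the delays deterministic.

```python
import time
from typing import Any, Callable, List


def backoff_delays(max_attempts: int, base_delay_s: float = 0.5,
                   factor: float = 2.0, max_delay_s: float = 30.0) -> List[float]:
    """Delays between attempts: base, base*factor, ... capped at max_delay_s."""
    return [min(base_delay_s * factor ** i, max_delay_s)
            for i in range(max_attempts - 1)]


def retry(operation: Callable[[], Any], max_attempts: int = 3,
          base_delay_s: float = 0.5, factor: float = 2.0,
          max_delay_s: float = 30.0,
          sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call operation until it succeeds or max_attempts is reached."""
    delays = backoff_delays(max_attempts, base_delay_s, factor, max_delay_s)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(delays[attempt])         # wait longer after each failure
```

In practice a caller would usually narrow the `except` clause to transient errors such as timeouts, so that programming errors fail immediately instead of being retried.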

2. Circuit Breaker Pattern

3. Graceful Degradation

Monitoring & Alerts

Recommended Alerts

  1. Database Health

    • Alert if database.healthy = false for > 5 minutes
    • Alert if avg_query_time_ms > 500 for > 10 minutes
  2. Circuit Breaker

    • Alert if circuit_breaker.state = "open" for > 5 minutes
  3. Slow Queries

    • Alert if slow_query_percentage > 20%
  4. Job Monitor

    • Alert if stuck_jobs_count > 5
    • Alert if monitor_running = false
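
The thresholds above could be checked against a health snapshot with a sketch like this one. The `evaluate_alerts` function, the alert names it returns, and the nesting of the payload (modelled on the field names listed for GET /health/detailed) are assumptions. The duration conditions ("for > 5 minutes") are not modelled; a caller would need to track how long each condition has held before notifying anyone.

```python
from typing import Any, Dict, List


def evaluate_alerts(health: Dict[str, Any]) -> List[str]:
    """Return the names of alert conditions that hold in this snapshot."""
    alerts: List[str] = []

    database = health.get("database", {})
    if database.get("healthy") is False:
        alerts.append("database_unhealthy")
    if database.get("performance", {}).get("avg_query_time_ms", 0) > 500:
        alerts.append("database_slow")

    if health.get("circuit_breaker", {}).get("state") == "open":
        alerts.append("circuit_breaker_open")

    if health.get("query_metrics", {}).get("slow_query_percentage", 0) > 20:
        alerts.append("slow_queries")

    job_monitor = health.get("job_monitor", {})
    if job_monitor.get("stuck_jobs_count", 0) > 5:
        alerts.append("stuck_jobs")
    if job_monitor.get("monitor_running") is False:
        alerts.append("job_monitor_stopped")

    return alerts
```

Missing fields are treated as healthy here, which keeps the check from firing on partial payloads; a stricter monitor might instead alert when a subsystem is absent from the response.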

Sentry Issues Fixed


Last Updated: 2025-01-20
Version: 2.0.0
Status: Production
Coverage: All pipeline stages, admin dashboards, monitoring systems, and health checks