Comprehensive real-time monitoring and analytics infrastructure for the Material Kai Vision Platform.
The platform includes a complete monitoring and analytics system that tracks:
Access: /admin/async-queue-monitor
Purpose: Real-time monitoring of PDF processing jobs and pipeline stages
Features:
Data Sources:
background_jobs table - Main job tracking
job_checkpoints table - Stage-by-stage progress
Metrics Tracked: For pdf_processing, the system tracks pending, processing, completed, failed, retrying, and total job counts, plus success rate and average processing time. At the platform level it tracks total documents, total products created, total chunks created, and total images extracted.
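The per-job-type metrics above can be derived from rows of background_jobs. The following is a minimal sketch, not the platform's implementation: the function name `summarize_jobs` is hypothetical, and it assumes that success rate means completed / (completed + failed) and that average processing time is taken over completed jobs using metadata.processing_time_ms.

```python
from typing import Iterable, Mapping

STATUSES = ("pending", "processing", "completed", "failed", "retrying")


def summarize_jobs(jobs: Iterable[Mapping]) -> dict:
    """Aggregate per-status counts, success rate, and mean processing time."""
    counts = {status: 0 for status in STATUSES}
    durations_ms = []
    total = 0
    for job in jobs:
        total += 1
        status = job.get("status")
        if status in counts:
            counts[status] += 1
        # processing_time_ms is read from the metadata JSONB field.
        elapsed = (job.get("metadata") or {}).get("processing_time_ms")
        if status == "completed" and elapsed is not None:
            durations_ms.append(elapsed)

    # Assumption: only finished jobs (completed or failed) count toward the rate.
    finished = counts["completed"] + counts["failed"]
    success_rate = counts["completed"] / finished * 100 if finished else None
    avg_ms = sum(durations_ms) / len(durations_ms) if durations_ms else None
    return {
        **counts,
        "total": total,
        "success_rate": success_rate,
        "avg_processing_time_ms": avg_ms,
    }


# Illustrative rows only; real rows come from the background_jobs table.
example_jobs = [
    {"status": "completed", "metadata": {"processing_time_ms": 1000}},
    {"status": "completed", "metadata": {"processing_time_ms": 3000}},
    {"status": "failed", "metadata": {}},
    {"status": "pending"},
]
stats = summarize_jobs(example_jobs)
```

Returning None rather than 0 when no job has finished keeps an empty queue from being reported as a 0% success rate.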
Real-Time Updates:
background_jobs table
Access: /admin/analytics
Purpose: Comprehensive analytics across search, API usage, PDF processing, and quality metrics
Tabs:
Search Analytics:
The search_query_tracking table stores weight_profile, dynamic_weights (JSONB), and weight_profile_source (default, query_understanding, or manual_override).
API Usage:
Agent Chat Analytics:
Quality Metrics:
PDF Processing:
Access: /admin/ai-monitoring
Purpose: Track AI model usage, costs, and performance
Metrics:
Time Periods: 24h, 7d, 30d, 90d
Models Tracked:
Table: background_jobs
Columns: id, workspace_id, document_id, job_type (pdf_processing, image_analysis, etc.), status (pending, processing, completed, failed, retrying, cancelled), progress (0-100), created_at, started_at, completed_at, failed_at, error, and metadata JSONB.
The metadata JSONB field contains: filename, stage, products_discovered, chunks_created, images_extracted, embeddings_generated, processing_time_ms, ai_model, and retry_count.
Metrics Tracked:
products_discovered - Number of products found
certificates_discovered - Number of certificates found
logos_discovered - Number of logos found
specifications_discovered - Number of specifications found
total_entities - Total entities across all categories
discovery_time_ms - Processing time
discovery_model - AI model used
confidence_score - Overall confidence
Stage 0 logs completion details and saves checkpoint metadata including all of the above metrics.
Metrics Tracked:
extracted_pages_count - Number of pages extracted
total_pages_count - Total pages in PDF
text_length - Length of extracted text
extraction_rate - Percentage of pages extracted
focused_extraction - Boolean flag
Stage 1 logs pages extracted, extraction rate percentage, and text length in characters.
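As an illustration of how the Stage 1 metrics relate to each other, here is a small sketch. The function name `stage1_metrics` is hypothetical, and rounding the rate to one decimal place is an assumption rather than documented behavior.

```python
def stage1_metrics(
    extracted_pages_count: int,
    total_pages_count: int,
    text: str,
    focused_extraction: bool,
) -> dict:
    """Build the Stage 1 metrics dictionary from raw extraction results."""
    # Guard against an empty PDF so the rate never divides by zero.
    if total_pages_count:
        rate = extracted_pages_count / total_pages_count * 100
    else:
        rate = 0.0
    return {
        "extracted_pages_count": extracted_pages_count,
        "total_pages_count": total_pages_count,
        "text_length": len(text),  # length in characters
        "extraction_rate": round(rate, 1),  # assumption: one decimal place
        "focused_extraction": focused_extraction,
    }


# Illustrative values: 12 of 48 pages extracted in focused mode.
metrics = stage1_metrics(12, 48, "sample extracted text", True)
```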
Purpose: Convert visual embeddings to text descriptions for enhanced search
Metrics Tracked:
embedding_to_text_count - Successful conversions
embedding_to_text_failed - Failed conversions
embedding_to_text_ai_calls - AI API calls made
visual_metadata_extracted - Boolean flag
Stage 3.5 logs successful conversions, failed conversions, and AI calls made.
Purpose: Consolidate metadata from all sources (discovery, extraction, embeddings)
Metrics Tracked:
metadata_consolidation_count - Products consolidated
metadata_consolidation_failed - Failed consolidations
metadata_consolidation_ai_calls - AI API calls made
metadata_sources_merged - Number of sources merged
Stage 4 logs products consolidated, sources merged, and AI calls made.
All stages save comprehensive metrics to checkpoints for recovery:
Each checkpoint saves a stage identifier, checkpoint_data (e.g., document_id, images_extracted, material_images), and metadata (e.g., processing_time_ms, ai_model, success_rate).
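A checkpoint record of this shape might be assembled as follows. This is a sketch under assumptions: `build_checkpoint` is a hypothetical helper, the job_id and created_at fields are inferred from the job_checkpoints index described later in this document, and the example values are invented for illustration.

```python
import json
from datetime import datetime, timezone


def build_checkpoint(
    job_id: str, stage: str, checkpoint_data: dict, metadata: dict
) -> dict:
    """Assemble one checkpoint record ready to be stored as JSON."""
    record = {
        "job_id": job_id,
        "stage": stage,
        "checkpoint_data": checkpoint_data,
        "metadata": metadata,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    # Fail early with TypeError if a value cannot be stored as JSON.
    json.dumps(record)
    return record


# Illustrative values only.
checkpoint = build_checkpoint(
    job_id="job-123",
    stage="stage_2",
    checkpoint_data={"document_id": "doc-456", "images_extracted": 14},
    metadata={"processing_time_ms": 5200, "ai_model": "example-model"},
)
```

Validating serializability at build time means a bad value surfaces in the stage that produced it, not later during recovery.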
All stages integrate with Sentry for exception capture. Any exception raised during processing is captured by sentry_sdk.capture_exception(), logged, and re-raised. Sentry receives the job ID, document ID, current stage, processing metrics, and error stack trace as context.
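The capture, log, and re-raise pattern described above could look roughly like this. The wrapper `run_stage`, the logger name, the context key "pipeline_job", and the injectable `capture` parameter are all hypothetical; only sentry_sdk.capture_exception() is named in this document, and sentry_sdk.set_context() is used here as one way to attach the job details.

```python
import logging
from typing import Any, Callable, Mapping, Optional

logger = logging.getLogger("pdf_pipeline")  # hypothetical logger name


def run_stage(
    job_id: str,
    document_id: str,
    stage: str,
    fn: Callable[[], Any],
    metrics: Optional[Mapping[str, Any]] = None,
    capture: Optional[Callable[[BaseException], Any]] = None,
) -> Any:
    """Run one pipeline stage; on failure capture, log, and re-raise."""
    try:
        return fn()
    except Exception as exc:
        context = {
            "job_id": job_id,
            "document_id": document_id,
            "stage": stage,
            "metrics": dict(metrics or {}),
        }
        if capture is None:
            # Imported lazily so the sketch also runs where Sentry is absent
            # and a custom capture function is supplied instead.
            import sentry_sdk

            sentry_sdk.set_context("pipeline_job", context)
            capture = sentry_sdk.capture_exception
        capture(exc)  # the exception carries its own stack trace
        logger.exception("Stage %s failed for job %s", stage, job_id)
        raise  # propagate so the job can be marked failed or retried
```

Passing `capture` explicitly makes the wrapper easy to exercise in tests without a Sentry DSN configured.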
Table: search_queries
Metrics:
Analytics:
Per Model Pricing:
Cost calculation multiplies token counts by per-million rates for input and output separately, then sums them.
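The calculation can be sketched as below. The function name and the example rates are hypothetical placeholders, not the platform's actual pricing.

```python
def calculate_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Return the cost of one AI call given per-million-token rates."""
    input_cost = input_tokens / 1_000_000 * input_rate_per_million
    output_cost = output_tokens / 1_000_000 * output_rate_per_million
    return input_cost + output_cost


# Hypothetical rates: 3.00 per million input tokens, 15.00 per million output.
# 200,000 input tokens cost 0.60 and 50,000 output tokens cost 0.75.
example_cost = calculate_cost(200_000, 50_000, 3.00, 15.00)
```

With these placeholder rates the example call comes to roughly 1.35 in whatever currency the rates are quoted in.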
Uptime: 99.5%+
Users: 5,000+
Search Response: 200-800ms
PDF Processing: 1-15 minutes (size-dependent)
Concurrent Jobs: Unlimited queue
Accuracy:
Critical Alerts:
Warning Alerts:
Notification Channels:
Comprehensive health monitoring system that tracks database performance, job monitoring service, and system reliability.
Access: /admin/analytics → System Health tab
Metrics: healthy boolean, connection_test_ms, query_test_ms, error_count, consecutive_failures, uptime_seconds, and performance object with avg_query_time_ms, max_query_time_ms, slow_query_count, and slow_query_threshold_ms (1000).
Metrics: monitor_running boolean, stuck_jobs_count, and health string ('healthy', 'degraded', or 'unhealthy').
Metrics: total_queries, slow_queries, slow_query_percentage, avg_query_time_ms, max_query_time_ms, and table_metrics (per-table counts, avg/max times, and slow query counts).
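One way these query metrics could be accumulated is shown below. The class `QueryMetrics` and its method names are hypothetical. The sketch assumes a query counts as slow when it takes strictly longer than the 1000 ms slow_query_threshold_ms listed under the database metrics.

```python
from collections import defaultdict

SLOW_QUERY_THRESHOLD_MS = 1000  # matches slow_query_threshold_ms


class QueryMetrics:
    """In-memory accumulator for query timings, grouped by table."""

    def __init__(self, slow_threshold_ms: float = SLOW_QUERY_THRESHOLD_MS):
        self.slow_threshold_ms = slow_threshold_ms
        self._times = defaultdict(list)  # table name -> durations in ms

    def record(self, table: str, duration_ms: float) -> None:
        self._times[table].append(duration_ms)

    def _summarize(self, times: list) -> dict:
        slow = sum(1 for t in times if t > self.slow_threshold_ms)
        return {
            "count": len(times),
            "avg_query_time_ms": sum(times) / len(times) if times else 0.0,
            "max_query_time_ms": max(times, default=0.0),
            "slow_query_count": slow,
        }

    def snapshot(self) -> dict:
        all_times = [t for times in self._times.values() for t in times]
        overall = self._summarize(all_times)
        total = overall["count"]
        slow = overall["slow_query_count"]
        return {
            "total_queries": total,
            "slow_queries": slow,
            "slow_query_percentage": slow / total * 100 if total else 0.0,
            "avg_query_time_ms": overall["avg_query_time_ms"],
            "max_query_time_ms": overall["max_query_time_ms"],
            "table_metrics": {
                table: self._summarize(times)
                for table, times in self._times.items()
            },
        }

    def reset(self) -> None:
        """Clear all recorded timings."""
        self._times.clear()


# Illustrative timings only.
qm = QueryMetrics()
qm.record("background_jobs", 100)
qm.record("background_jobs", 1500)
qm.record("job_checkpoints", 200)
qm.record("job_checkpoints", 200)
snapshot = qm.snapshot()
```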
States:
Metrics: state ('closed', 'open', or 'half_open') and failure_count.
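For readers unfamiliar with the pattern, a generic circuit breaker exposing the same state and failure_count fields can be sketched as follows. The failure threshold, the recovery timeout, and every method name are assumptions for illustration; the platform's actual transition rules are not documented here.

```python
import time


class CircuitBreaker:
    """Generic three-state circuit breaker: closed, open, half_open."""

    def __init__(self, failure_threshold=5, recovery_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.failure_count = 0
        self._opened_at = None

    def allow_request(self) -> bool:
        """Return True if a call may proceed in the current state."""
        if self.state == "open":
            if self._clock() - self._opened_at >= self.recovery_timeout_s:
                # Timeout elapsed: start allowing trial requests.
                self.state = "half_open"
                return True
            return False
        return True

    def record_success(self) -> None:
        # Any success closes the breaker and clears the failure history.
        self.state = "closed"
        self.failure_count = 0

    def record_failure(self) -> None:
        self.failure_count += 1
        # A failed trial request, or too many failures, opens the breaker.
        if (self.state == "half_open"
                or self.failure_count >= self.failure_threshold):
            self.state = "open"
            self._opened_at = self._clock()

    def status(self) -> dict:
        return {"state": self.state, "failure_count": self.failure_count}
```

A caller would check `allow_request()` before each protected call and report the outcome with `record_success()` or `record_failure()`.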
GET /health/ - Basic health check; returns 200 if the service is running.
GET /health/detailed - Comprehensive health status with all subsystems: overall_status, database, job_monitor, query_metrics, circuit_breaker, and timestamp.
GET /health/database - Database connection health only.
GET /health/job-monitor - Job monitoring service health only.
GET /health/metrics - Query performance metrics only.
GET /health/circuit-breakers - Circuit breaker status for all protected services.
POST /health/metrics/reset - Reset query performance metrics (useful for testing).
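The shape of the /health/detailed response can be illustrated with a small builder. How overall_status is derived is not specified in this document; the sketch assumes the worst subsystem status wins and that a breaker in any state other than closed counts as degraded. The function name and the severity ordering are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: higher number means worse status.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def build_detailed_health(
    database: dict, job_monitor: dict, query_metrics: dict, circuit_breaker: dict
) -> dict:
    """Combine subsystem reports into one detailed health payload."""
    monitor_health = job_monitor.get("health")
    if monitor_health not in SEVERITY:
        monitor_health = "unhealthy"  # treat missing or unknown as worst case

    statuses = [
        "healthy" if database.get("healthy") else "unhealthy",
        monitor_health,
        "healthy" if circuit_breaker.get("state") == "closed" else "degraded",
    ]
    # Worst subsystem status wins.
    overall = max(statuses, key=SEVERITY.__getitem__)
    return {
        "overall_status": overall,
        "database": database,
        "job_monitor": job_monitor,
        "query_metrics": query_metrics,
        "circuit_breaker": circuit_breaker,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Illustrative subsystem reports.
payload = build_detailed_health(
    database={"healthy": True, "query_test_ms": 12},
    job_monitor={"monitor_running": True, "stuck_jobs_count": 0,
                 "health": "degraded"},
    query_metrics={"total_queries": 0},
    circuit_breaker={"state": "closed", "failure_count": 0},
)
```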
Six critical indexes to optimize job monitoring queries:
Stuck Job Detection (idx_background_jobs_status_updated_at): WHERE status = 'processing' AND updated_at < cutoff_time
Heartbeat Timeout Detection (idx_background_jobs_status_heartbeat): WHERE status = 'processing' AND last_heartbeat < cutoff_time
Workspace + Status Queries (idx_background_jobs_workspace_status): WHERE workspace_id = ? AND status = ?
Job Cleanup (idx_background_jobs_status_completed_at): WHERE status = 'completed' AND completed_at < cutoff_time
Checkpoint Queries (idx_job_checkpoints_job_created): WHERE job_id = ? ORDER BY created_at DESC
Progress Tracking (idx_job_progress_document_updated): WHERE document_id = ? ORDER BY updated_at DESC
Impact:
Database Health: database.healthy = false for > 5 minutes; avg_query_time_ms > 500 for > 10 minutes
Circuit Breaker: circuit_breaker.state = "open" for > 5 minutes
Slow Queries: slow_query_percentage > 20%
Job Monitor: stuck_jobs_count > 5; monitor_running = false
Last Updated: 2025-01-20
Version: 2.0.0
Status: Production
Coverage: All pipeline stages, admin dashboards, monitoring systems, and health checks