Monitoring System
Kai includes a comprehensive monitoring system that provides real-time insights into the health, performance, and operation of the platform. This system is designed to help administrators identify issues, track system performance, and ensure optimal operation.
Features
System Health Monitoring
- Real-time Health Metrics: Track CPU usage, memory consumption, and service statuses
- Environment Variable Validation: Automatic validation of required environment variables
- Service Status: Monitor individual service health across the platform
- Rate Limit Statistics: Track API usage and rate limiting across different endpoints
Comprehensive Logging
- Centralized Log Collection: All system logs are collected in a central location
- Log Filtering: Filter logs by level, module, date range, and text content
- Error Distribution Analysis: Track error frequency by module to identify problem areas
Admin Dashboard
The monitoring system includes a dedicated admin dashboard that provides:
- System Health Visualization: Real-time charts and metrics for system health
- Log Explorer: Interactive interface for exploring and filtering logs
- Error Analysis: Visual breakdown of errors by module and time period
- Rate Limit Monitoring: Track API usage and rate limiting
Architecture
The monitoring system consists of:
- Backend Services: Collect metrics, logs, and health data
- Admin API: Provides access to monitoring data through dedicated endpoints
- Frontend Dashboard: Visualizes monitoring data for administrators
Prometheus Integration
The KAI platform uses Prometheus for metrics collection, aggregation, and storage. Prometheus is deployed as part of the monitoring stack in the monitoring
namespace.
Key Components
- Prometheus Server: Collects and stores time-series metrics data
- Alert Manager: Handles alerts sent by Prometheus server
- Grafana: Provides visualization and dashboards for Prometheus metrics
- Prometheus Adapter: Exposes Prometheus metrics to Kubernetes for HPA
Metrics Collection
Services expose metrics through annotations in their Kubernetes manifests:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
These annotations enable Prometheus to automatically discover and scrape metrics from the services.
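For example, these annotations are typically attached to the pod template of a service's Deployment so that Prometheus's Kubernetes service discovery finds the pods. The manifest below is an illustrative sketch; the service name, image, and replica count are placeholders rather than an actual KAI manifest.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service          # placeholder name, not a real KAI service
  namespace: kai-ml
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        prometheus.io/scrape: "true"    # opt the pods in to scraping
        prometheus.io/port: "8080"      # port where metrics are served
        prometheus.io/path: "/metrics"  # metrics endpoint path
    spec:
      containers:
        - name: example-service
          image: example-service:latest  # placeholder image
          ports:
            - containerPort: 8080
```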
Custom Metrics API
The platform uses the Prometheus Adapter to expose custom metrics to the Kubernetes API, enabling advanced autoscaling based on application-specific metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: kai-ml
data:
  config.yaml: |
    rules:
      # API Request Rate Metrics
      - seriesQuery: 'http_requests_total'
        resources:
          overrides:
            kubernetes_namespace: {resource: "namespace"}
            kubernetes_pod_name: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>[2m])) by (<<.GroupBy>>)'
      # Queue Depth Metrics for Coordinator
      - seriesQuery: 'kai_coordinator_queue_depth'
        resources:
          overrides:
            kubernetes_namespace: {resource: "namespace"}
            kubernetes_pod_name: {resource: "pod"}
        name:
          matches: "kai_coordinator_queue_depth"
          as: "coordinator_queue_depth"
        metricsQuery: 'sum(<<.Series>>) by (<<.GroupBy>>)'
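Once the adapter exposes `coordinator_queue_depth` through the custom metrics API, a HorizontalPodAutoscaler can scale on it. The manifest below is an illustrative sketch of that wiring; the target Deployment name, replica bounds, and target value are assumptions, not values taken from the platform's manifests.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coordinator-hpa            # illustrative name
  namespace: kai-ml
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coordinator              # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: coordinator_queue_depth   # name exposed by the adapter rule above
        target:
          type: AverageValue
          averageValue: "10"              # assumed target queue depth per pod
```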
Available Metrics
The platform exposes various metrics through the monitoring service:
- Workflow Metrics:
  - `workflow_started_total`: Counter for started workflows
  - `workflow_completed_total`: Counter for completed workflows
  - `workflow_duration_seconds`: Histogram for workflow durations
  - `workflow_error_total`: Counter for workflow errors
- Resource Metrics:
  - `workflow_cpu_usage_cores`: Gauge for CPU usage
  - `workflow_memory_usage_bytes`: Gauge for memory usage
  - `workflow_gpu_usage_percent`: Gauge for GPU utilization
- Coordinator Metrics:
  - `kai_coordinator_queue_depth`: Gauge for queue depth by priority
  - `kai_coordinator_active_workflows`: Gauge for active workflows by type
  - `kai_coordinator_workflow_duration_seconds`: Histogram for workflow durations
  - `kai_coordinator_workflow_completed_total`: Counter for completed workflows
  - `kai_coordinator_workflow_error_total`: Counter for workflow errors
  - `kai_coordinator_resource_utilization`: Gauge for resource utilization
- Database Connection Metrics:
  - `kai_supabase_connection_pool_active`: Gauge for active connections
  - `kai_supabase_connection_pool_idle`: Gauge for idle connections
  - `kai_supabase_connection_pool_total`: Gauge for total connections
  - `kai_supabase_connection_pool_utilization`: Gauge for connection pool utilization
  - `kai_supabase_connection_pool_waiting_acquires`: Gauge for waiting connection acquires
  - `kai_supabase_connection_pool_acquire_success_rate`: Gauge for connection acquisition success rate
  - `kai_supabase_connection_pool_average_acquire_time`: Gauge for average connection acquisition time
  - `kai_supabase_connection_pool_connection_errors`: Gauge for connection errors
- Cache Metrics:
  - `workflow_cache_hit_total`: Counter for cache hits
  - `workflow_stage_duration_seconds`: Histogram for stage durations
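The actual instrumentation lives inside the individual services, but the general pattern with the Node.js prom-client library looks roughly like the sketch below. Metric names come from the list above; the Express wiring, label names, and bucket boundaries are assumptions for illustration.

```typescript
import express from "express";
import client from "prom-client";

// Collect default Node.js process metrics alongside the custom ones.
client.collectDefaultMetrics();

// Custom metrics matching names from the list above.
const workflowStarted = new client.Counter({
  name: "workflow_started_total",
  help: "Counter for started workflows",
  labelNames: ["type"],
});

const queueDepth = new client.Gauge({
  name: "kai_coordinator_queue_depth",
  help: "Gauge for queue depth by priority",
  labelNames: ["priority"],
});

const workflowDuration = new client.Histogram({
  name: "workflow_duration_seconds",
  help: "Histogram for workflow durations",
  labelNames: ["type"],
  buckets: [1, 5, 15, 60, 300, 900], // assumed bucket boundaries
});

// Expose the endpoint that the prometheus.io/* annotations point at.
const app = express();
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});
app.listen(8080);

// Example usage inside a workflow handler:
workflowStarted.inc({ type: "training" });
queueDepth.set({ priority: "high" }, 3);
workflowDuration.observe({ type: "training" }, 42.5);
```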
Accessing Grafana
Grafana provides visualization of all metrics collected by Prometheus. Here's how to access and use Grafana:
Access Methods
Method 1: Domain Access (if configured)
If Ingress has been set up:
- Navigate to https://grafana.yourdomain.com in your browser
- You'll be presented with the Grafana login screen
Method 2: Port Forwarding
For direct access:
# Start port-forwarding to access Grafana UI locally
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
Then access Grafana at http://localhost:3000
in your browser.
Login Credentials
- Username: admin
- Password: Set during installation
If you don't know the password, retrieve it with:
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Available Dashboards
The following pre-configured dashboards are available:
- Kubernetes Dashboard:
  - Shows cluster-wide metrics
  - Navigate to Dashboards → Browse → Default → Kubernetes Dashboard
- ML Workflows Dashboard:
  - Shows execution times and resource usage of ML pipelines
  - Navigate to Dashboards → Browse → Default → ML Workflows Dashboard
- ML Processing Dashboard:
  - Shows metrics for different processing stages
  - Navigate to Dashboards → Browse → Default → ML Processing Dashboard
- Supabase Connection Pool Dashboard:
  - Shows database connection pool metrics
  - Monitors connection counts, utilization, and performance
  - Tracks connection acquisition times and error rates
  - Navigate to Dashboards → Browse → Default → Supabase Connection Pool
- Kubernetes HPA Metrics Dashboard:
  - Shows Horizontal Pod Autoscaler metrics
  - Monitors replica counts, scaling events, and custom metrics
  - Visualizes CPU/memory utilization and queue depths
  - Navigate to Dashboards → Browse → Default → Kubernetes HPA Metrics
- Coordinator Service Dashboard:
  - Shows metrics for the Coordinator service
  - Monitors queue depths, workflow durations, and error rates
  - Tracks resource utilization and processing performance
  - Navigate to Dashboards → Browse → Default → Coordinator Service
Exploring Metrics
To explore specific metrics:
- From the left menu, select "Explore"
- Select "Prometheus" as the data source
- Enter PromQL queries to retrieve specific metrics
- Example queries:
  - `rate(workflow_completed_total[5m])` - Workflow completion rate
  - `avg(workflow_duration_seconds) by (type)` - Average duration by workflow type
  - `sum(workflow_error_total) by (type)` - Total errors by workflow type
Creating Custom Dashboards
You can create custom dashboards for specific monitoring needs:
- Click the "+" icon in the left sidebar
- Select "Dashboard"
- Click "Add new panel"
- Configure the panel with Prometheus queries and appropriate visualizations
Troubleshooting Grafana Access
If you're unable to access Grafana:
- Check if pods are running: `kubectl get pods -n monitoring`
- Verify services: `kubectl get svc -n monitoring`
- Check ingress (if using domain access): `kubectl get ingress -n monitoring`
- Check for port-forwarding issues
API Endpoints
Health Endpoints
Basic Health Check
GET /health
Provides basic system health information including:
- System status
- Uptime information
- Memory usage
- Node.js version
- Environment health status
This endpoint is public and does not require authentication, making it suitable for automated health checks from load balancers or monitoring services.
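A response from this endpoint can be pictured roughly as follows; the exact field names are illustrative, not the literal API contract, but the categories match the list above.

```json
{
  "status": "ok",
  "uptime": 86400,
  "memory": {
    "rss": 183500800,
    "heapUsed": 95420416
  },
  "nodeVersion": "v18.19.0",
  "environment": {
    "healthy": true,
    "missingVariables": []
  }
}
```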
Detailed Health Check
GET /health/detailed
Provides comprehensive system health data including:
- Detailed system status
- CPU and memory usage statistics
- Component-by-component health status
- Environment variable validation status
This endpoint requires authentication to protect sensitive system information.
Admin Monitoring API
Get System Logs
POST /api/admin/monitoring/logs
Retrieves system logs with filtering options:
- Filter by log level (debug, info, warn, error)
- Filter by module
- Filter by date range
- Full-text search within logs
- Pagination support
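For example, a request body combining several of these filters might look like the following; the field names are illustrative, so consult the API schema for the exact contract.

```json
{
  "level": "error",
  "module": "coordinator",
  "startDate": "2024-01-01T00:00:00Z",
  "endDate": "2024-01-07T23:59:59Z",
  "searchText": "timeout",
  "page": 1,
  "limit": 50
}
```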
Get Error Distribution
GET /api/admin/monitoring/errors
Retrieves error distribution by module over a specified time period.
Get Health Metrics
GET /api/admin/monitoring/health
Retrieves detailed health metrics including CPU usage, memory utilization, service statuses, and rate limit statistics.
Rate Limiting
The system includes a sophisticated rate limiting mechanism to prevent abuse and ensure stability:
- Default API Rate Limit: 100 requests per minute for general API endpoints
- Authentication Rate Limit: 20 requests per minute for authentication endpoints to prevent brute force attacks
- ML Processing Rate Limit: 10 requests per minute for resource-intensive ML operations
- Agent API Rate Limit: 30 requests per minute for AI agent interactions
- PDF Processing Rate Limit: 5 requests per 10 minutes for resource-intensive PDF processing
Rate limit statistics are tracked and visible in the monitoring dashboard.
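The limits above could be wired up with a standard window-based middleware. The sketch below uses the widely used express-rate-limit package purely to illustrate the configuration; it is an assumption about the implementation, not the platform's actual code, and the route paths in the comments are placeholders.

```typescript
import rateLimit from "express-rate-limit";

// General API endpoints: 100 requests per minute.
export const defaultLimiter = rateLimit({ windowMs: 60 * 1000, max: 100 });

// Authentication endpoints: 20 requests per minute to slow brute-force attempts.
export const authLimiter = rateLimit({ windowMs: 60 * 1000, max: 20 });

// Resource-intensive ML operations: 10 requests per minute.
export const mlLimiter = rateLimit({ windowMs: 60 * 1000, max: 10 });

// AI agent interactions: 30 requests per minute.
export const agentLimiter = rateLimit({ windowMs: 60 * 1000, max: 30 });

// PDF processing: 5 requests per 10 minutes.
export const pdfLimiter = rateLimit({ windowMs: 10 * 60 * 1000, max: 5 });

// Example wiring (route paths are illustrative):
// app.use("/api", defaultLimiter);
// app.use("/api/auth", authLimiter);
// app.use("/api/ml", mlLimiter);
```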
Environment Validation
The monitoring system includes a sophisticated environment variable validation mechanism:
- Requirement Levels: Variables can be marked as required, optional, development-only, or production-only
- Custom Validators: Each variable can have a custom validation function
- Health Reporting: Environment validation status is included in health checks
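The validation rules can be pictured as a small declarative list of variables with a requirement level and an optional validator function. The sketch below is an illustrative TypeScript shape for such rules, not the platform's actual validator; the variable names in it are placeholders.

```typescript
// Requirement levels described above.
type Requirement = "required" | "optional" | "development-only" | "production-only";

interface EnvRule {
  name: string;
  requirement: Requirement;
  validate?: (value: string) => boolean; // optional custom validator
}

// Placeholder rules; real variable names belong to the platform's configuration.
const rules: EnvRule[] = [
  { name: "DATABASE_URL", requirement: "required", validate: (v) => v.startsWith("postgres") },
  { name: "LOG_LEVEL", requirement: "optional", validate: (v) => ["debug", "info", "warn", "error"].includes(v) },
  { name: "DEBUG_TRACING", requirement: "development-only" },
];

// Returns the names of variables that are missing or fail their validator,
// so the result can be surfaced in the health check's environment status.
export function validateEnv(env: NodeJS.ProcessEnv, isProduction: boolean): string[] {
  const failures: string[] = [];
  for (const rule of rules) {
    // Skip rules that do not apply in the current environment.
    if (rule.requirement === "development-only" && isProduction) continue;
    if (rule.requirement === "production-only" && !isProduction) continue;

    const value = env[rule.name];
    if (value === undefined) {
      // Only "optional" variables may be absent without failing the check.
      if (rule.requirement !== "optional") failures.push(rule.name);
      continue;
    }
    if (rule.validate && !rule.validate(value)) failures.push(rule.name);
  }
  return failures;
}
```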
Setup and Configuration
To enable all monitoring features, ensure the following:
- Configure environment variables according to the validation rules
- Ensure the logger is properly configured
- Grant appropriate admin access to users who need monitoring capabilities
Best Practices
- Regular Monitoring: Check the monitoring dashboard regularly to identify potential issues
- Alert Configuration: Set up alerts for critical error thresholds
- Log Rotation: Configure log rotation to prevent storage issues
- Permission Management: Restrict monitoring access to authorized administrators
ML Training Monitoring Integration
The monitoring system integrates with the ML Training Monitoring System, providing specialized visualizations and controls for machine learning training processes:
- Training Metrics Visualization: Real-time charts showing loss, accuracy, and custom metrics
- Checkpoint Management: Interface for creating, comparing, and rolling back to model checkpoints
- Parameter Tuning: Controls for adjusting hyperparameters during training
- Training Job Control: Status monitoring and control for training jobs
For complete details on these capabilities, see the Training Monitoring System documentation.