Analytics System
This document outlines the analytics system architecture, implementation details, and data storage considerations for tracking searches, Agent AI prompts, and API requests on our platform.
Overview
The analytics system captures and stores user interaction data to provide insights into:
- Search patterns and most common search queries
- Agent AI usage and popular prompts
- API request volumes, patterns, and performance metrics
- Material view trends and popular materials
This data enables trend analysis, performance monitoring, feature optimization, and predictive analytics based on actual usage patterns.
Data Captured
The analytics system collects the following data points:
- Event Type: The type of interaction (search, agent_prompt, api_request, material_view)
- Timestamp: When the interaction occurred
- User ID: Which user performed the action (if authenticated)
- Resource Type: What resource was accessed (materials, collections, etc.)
- Query: The actual search query or agent prompt text
- Response Time: How long the operation took to complete (ms)
- Response Status: HTTP status code for API requests
- Additional Context: JSON field for variable additional data
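These fields map naturally onto a record type. Below is a minimal TypeScript sketch of the event shape; the field names are assumptions that mirror the list above, and the canonical column definitions live in the migration file referenced later in this document:

```typescript
// Hypothetical shape of a single analytics event; actual column names
// are defined in the analytics_events migration.
type AnalyticsEventType = 'search' | 'agent_prompt' | 'api_request' | 'material_view';

interface AnalyticsEvent {
  id: string;                      // UUID primary key
  event_type: AnalyticsEventType;  // Type of interaction
  timestamp: string;               // ISO-8601 time of the interaction
  user_id?: string;                // Present only for authenticated users
  resource_type?: string;          // e.g. 'materials', 'collections'
  query?: string;                  // Search query or agent prompt text
  response_time_ms?: number;       // Operation duration in milliseconds
  response_status?: number;        // HTTP status code for API requests
  parameters?: Record<string, unknown>; // Variable additional context (JSON)
}
```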
Implementation Architecture
The analytics system consists of these key components:
- Analytics Middleware: An Express middleware that automatically captures all API requests (a sketch appears after this list)
- Analytics Service: Service methods to track specific events (searches, agent prompts, etc.)
- Database Schema: Table structure and indexes for storing analytics data
- Database Functions: PostgreSQL functions that process raw data into insights
- Predictive Analytics: Advanced capabilities for forecasting, anomaly detection, and user behavior prediction
- Real-Time Analytics: WebSocket-based system for processing and delivering analytics events in real-time
- Admin Dashboard: UI for visualizing analytics data, trends, and predictions
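To make the middleware component concrete, here is a minimal sketch of an Express middleware that records each request once the response has finished. The import path and the trackApiRequest method name are assumptions; the real service exposes the tracking methods shown later in this document:

```typescript
import { Request, Response, NextFunction } from 'express';
// Assumed import path; the actual service location may differ.
import { analyticsService } from '../services/analytics/analyticsService';

// Minimal sketch: record method, path, status code, and latency for every
// API request once the response has finished.
export function analyticsMiddleware(req: Request, res: Response, next: NextFunction): void {
  const startedAt = Date.now();

  res.on('finish', () => {
    // trackApiRequest is a hypothetical method name; fire-and-forget so that
    // analytics failures never block or fail the request itself.
    analyticsService
      .trackApiRequest(req.method, req.originalUrl, res.statusCode, Date.now() - startedAt)
      .catch(() => { /* intentionally swallow analytics errors */ });
  });

  next();
}
```

Recording on the response's 'finish' event, rather than before next(), lets the middleware capture the final status code and total response time without adding latency to the request itself.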
Data Storage Options
Option 1: Supabase PostgreSQL (Current Implementation)
Our current implementation uses Supabase's PostgreSQL database:
Advantages
- Seamless Integration: Works within our existing Supabase infrastructure
- Simplicity: Single database system for application and analytics data
- Real-time Access: Low latency for dashboard visualizations
- SQL Power: Full PostgreSQL capabilities for complex queries
- Data Integrity: ACID compliance ensures reliable analytics
- Implementation Speed: Faster to implement with existing Supabase knowledge
Limitations
- Scaling Challenges: May struggle with very high volumes (billions of records)
- Cost Structure: Storage costs increase linearly with data volume
- Mixed Workload: Analytics queries compete with operational queries
Implementation Details
The Supabase implementation uses:
- A dedicated analytics_events table with appropriate indexes
- PostgreSQL functions for aggregation and trend analysis
- Row-level security policies to protect sensitive data
Database schema details are defined in the migration file:
packages/server/src/services/supabase/migrations/005_analytics_system.sql
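For illustration, writing an event to this table with the supabase-js client might look like the sketch below; the column names are assumptions that mirror the Data Captured section and should be checked against the migration file:

```typescript
import { createClient } from '@supabase/supabase-js';

// Assumes SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY are set in the environment.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);

// Minimal sketch: insert a single search event. Column names are assumptions
// mirroring the Data Captured section; verify against 005_analytics_system.sql.
async function recordSearchEvent(query: string, userId?: string): Promise<void> {
  const { error } = await supabase.from('analytics_events').insert({
    event_type: 'search',
    resource_type: 'materials',
    query,
    user_id: userId ?? null,
    timestamp: new Date().toISOString(),
  });
  if (error) throw error;
}
```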
Option 2: BigQuery Alternative
For larger-scale deployments, Google BigQuery provides an alternative:
Advantages
- Massive Scale: Handles petabytes of data with ease
- Query Performance: Superior performance for complex analytical queries
- Separation of Concerns: Analytics workload isolated from operational database
- Cost Efficiency: Pay-per-query pricing can be cost-effective for intermittent analysis
- Advanced Features: Machine learning integrations and advanced analytics
- No Storage Management: Serverless architecture requires no capacity planning
Limitations
- Implementation Complexity: Requires ETL pipelines to load data from application
- Higher Latency: Not ideal for real-time dashboard updates
- Learning Curve: Team needs to learn BigQuery-specific SQL dialect
- Additional Service: Adds another external dependency
Implementation Approach
To implement BigQuery storage:
- Create a BigQuery dataset and table schema matching our analytics structure
- Implement a data export service (sketched after this list) that regularly:
  - Queries recent analytics events from Supabase
  - Transforms data if needed
  - Loads data into BigQuery tables
- Update the admin dashboard to query BigQuery for historical analytics
- Optionally implement a hybrid approach:
  - Recent data (30-90 days) stays in Supabase for real-time access
  - Historical data moves to BigQuery for long-term storage and analysis
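The export service step could be sketched as follows with the official @google-cloud/bigquery client. The dataset and table names, the batch size, and the watermark parameter are illustrative assumptions:

```typescript
import { BigQuery } from '@google-cloud/bigquery';
import { createClient } from '@supabase/supabase-js';

const bigquery = new BigQuery(); // authenticates via GOOGLE_APPLICATION_CREDENTIALS
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);

// Minimal sketch of the scheduled export job: pull events newer than the last
// exported timestamp from Supabase and write them to BigQuery. The dataset and
// table names, batch size, and watermark parameter are assumptions.
export async function exportRecentEvents(lastExportedAt: string): Promise<number> {
  const { data: events, error } = await supabase
    .from('analytics_events')
    .select('*')
    .gt('timestamp', lastExportedAt)
    .order('timestamp', { ascending: true })
    .limit(10000);
  if (error) throw error;
  if (!events || events.length === 0) return 0;

  // Streaming insert for brevity; a batched load job is usually cheaper
  // for sustained high volume.
  await bigquery.dataset('analytics').table('analytics_events').insert(events);
  return events.length;
}
```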
Hybrid Approach Considerations
For systems with high analytics volume, consider a hybrid approach:
- Real-time Tier: Keep recent data (last 30-90 days) in Supabase
  - Powers real-time dashboards with low latency
  - Handles rapid writes efficiently
  - Enables immediate analysis of current activity
- Historical Tier: Archive older data to BigQuery
  - Stores historical data cost-effectively
  - Enables complex analytical queries over large datasets
  - Removes pressure from the operational database
This approach provides the benefits of both systems while minimizing their limitations.
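One way to realize the split is a thin routing layer that directs each query to the tier owning the requested time range. A minimal sketch, where the 90-day boundary and both tier query functions are hypothetical:

```typescript
// Minimal sketch of tier routing for the hybrid approach. The 90-day
// boundary and both tier query functions are illustrative assumptions.
const HOT_TIER_DAYS = 90;

function isWithinHotTier(from: Date): boolean {
  const cutoff = new Date(Date.now() - HOT_TIER_DAYS * 24 * 60 * 60 * 1000);
  return from >= cutoff;
}

async function queryAnalytics(from: Date, to: Date): Promise<unknown[]> {
  // Route to the tier that owns the start of the requested range.
  return isWithinHotTier(from)
    ? querySupabaseTier(from, to)   // recent data: low latency
    : queryBigQueryTier(from, to);  // historical data: scale and cost
}

// Placeholder implementations; real versions would use supabase-js and
// @google-cloud/bigquery respectively.
async function querySupabaseTier(_from: Date, _to: Date): Promise<unknown[]> { return []; }
async function queryBigQueryTier(_from: Date, _to: Date): Promise<unknown[]> { return []; }
```

A range that straddles the boundary would need to fan out to both tiers and merge the results; that case is omitted here for brevity.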
Setup Instructions
Supabase Setup
- Run the migration script:

  ```bash
  cd packages/server
  npm run migration:run
  ```

- The migration creates:
  - The analytics_events table
  - Required indexes
  - PostgreSQL functions for aggregation
  - Row-level security policies
- The analytics service will use the new table automatically
BigQuery Setup (Optional)
- Create a BigQuery project and dataset
- Create tables matching our analytics schema
- Set up the data export service to run on schedule
- Update dashboard services to query BigQuery for historical data
Using the Analytics System
The system automatically tracks all API requests through the middleware. For additional manual tracking:
```typescript
// Track a search operation
await analyticsService.trackSearch(
  query,          // The search query
  resourceType,   // Type of resource being searched
  userId,         // Optional user ID
  parameters,     // Optional additional parameters
  responseTimeMs, // Optional response time
  responseStatus  // Optional HTTP status code
);

// Track an agent prompt
await analyticsService.trackAgentPrompt(
  prompt,    // The agent prompt
  agentType, // Type of agent
  userId,    // Optional user ID
  sessionId, // Optional session ID
  parameters // Optional additional parameters
);

// Track a material view
await analyticsService.trackMaterialView(
  materialId, // ID of the material being viewed
  userId,     // Optional user ID
  parameters  // Optional additional parameters
);
```
Accessing Analytics Data
Analytics data can be accessed through:
- Admin Dashboard: Navigate to /admin/analytics for visualizations
- Analytics API: Endpoints for programmatic access (an example call follows this list):
  - GET /api/admin/analytics/events - List raw events
  - GET /api/admin/analytics/stats - Get summary statistics
  - GET /api/admin/analytics/trends - Get time-based trends
  - GET /api/admin/analytics/top-searches - Get most common searches
  - GET /api/admin/analytics/top-prompts - Get most common agent prompts
  - GET /api/admin/analytics/top-materials - Get most viewed materials
  - POST /api/analytics/predictive/forecast - Generate time-series forecasts
  - POST /api/analytics/predictive/anomalies - Detect anomalies in analytics data
  - POST /api/analytics/predictive/user-behavior - Predict user behavior patterns
  - POST /api/analytics/real-time/event - Track real-time analytics events
  - GET /api/analytics/real-time/events - Get recent real-time events
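As an example of programmatic access, a call to the stats endpoint might look like the following sketch; the bearer-token header and the query parameter are assumptions:

```typescript
// Minimal sketch: fetch summary statistics from the admin analytics API.
// The bearer-token header and the query parameter are assumptions.
async function fetchAnalyticsStats(token: string): Promise<unknown> {
  const res = await fetch('/api/admin/analytics/stats?eventType=search', {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`Analytics request failed: ${res.status}`);
  return res.json();
}
```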
Best Practices
- Data Retention: Define a clear data retention policy:
  - How long to keep detailed events
  - When to aggregate and discard raw data
  - Legal compliance considerations (GDPR, etc.)
- Privacy Considerations:
  - Store only necessary data
  - Anonymize sensitive information
  - Ensure proper authorization for analytics access
- Performance Optimization:
  - Use time-based partitioning for large datasets
  - Create materialized views for common queries
  - Consider a read replica for analytics queries
- Monitoring:
  - Set up alerts for unusual patterns
  - Monitor storage growth and query performance
  - Periodically audit access to analytics data
Making the Decision: Supabase vs BigQuery
The choice between Supabase PostgreSQL and BigQuery depends on several factors:
Choose Supabase When:
- Expected analytics volume is under 100GB or ~50 million events
- Real-time analytics are critical
- Simplicity and fast implementation are priorities
- Team is already familiar with PostgreSQL
- Cost predictability is important
Choose BigQuery When:
- Expected analytics volume exceeds 100GB or ~50 million events
- Complex analytical queries are frequent
- Real-time access is less important than historical analysis
- Separating analytical workloads from the operational database is a goal
- Team is comfortable with data pipeline management
Choose Hybrid Approach When:
- Both real-time and historical analytics are important
- Data volume is growing rapidly
- Different teams have different analysis needs
- Cost optimization for large datasets is a priority
Predictive Analytics Features
The analytics system includes advanced predictive capabilities that leverage historical data to provide forward-looking insights:
1. Time-Series Forecasting
The time-series forecasting feature predicts future trends based on historical analytics data:
- Forecasting Models: Uses statistical models to predict future values
- Configurable Parameters: Adjustable forecast periods and intervals
- Event Type Filtering: Generate forecasts for specific event types
- Resource Type Filtering: Focus on particular resource categories
- MCP Integration: Leverages the Model Context Protocol for advanced forecasting
- Visualization: Interactive charts in the admin dashboard
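A forecast request might be issued as in the sketch below; the body field names (eventType, resourceType, forecastPeriods, interval) are assumptions that illustrate the configurable parameters listed above:

```typescript
// Minimal sketch: request a time-series forecast. Field names in the body
// are assumptions illustrating the configurable parameters described above.
async function requestForecast(token: string): Promise<unknown> {
  const res = await fetch('/api/analytics/predictive/forecast', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({
      eventType: 'search',       // forecast search volume
      resourceType: 'materials', // optional resource filter
      forecastPeriods: 14,       // number of future intervals to predict
      interval: 'day',           // granularity of each period
    }),
  });
  if (!res.ok) throw new Error(`Forecast request failed: ${res.status}`);
  return res.json();
}
```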
2. Anomaly Detection
The anomaly detection system identifies unusual patterns in analytics data:
- Statistical Analysis: Uses standard deviation-based detection
- Severity Classification: Categorizes anomalies as low, medium, or high
- Threshold Configuration: Adjustable sensitivity settings
- Visual Indicators: Highlights anomalies in time-series charts
- Alert Potential: Foundation for anomaly-based alerting
- MCP Integration: Advanced detection through the Model Context Protocol
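To make the standard-deviation approach concrete, here is a self-contained sketch of z-score based detection with low/medium/high severity buckets; the threshold default and severity cutoffs are illustrative assumptions, not the system's actual settings:

```typescript
interface Anomaly {
  index: number;            // position in the series
  value: number;            // observed value
  zScore: number;           // distance from the mean in standard deviations
  severity: 'low' | 'medium' | 'high';
}

// Minimal sketch of standard-deviation (z-score) anomaly detection.
// The threshold default and severity cutoffs are illustrative assumptions.
function detectAnomalies(series: number[], threshold = 2): Anomaly[] {
  if (series.length === 0) return [];

  const mean = series.reduce((sum, v) => sum + v, 0) / series.length;
  const variance = series.reduce((sum, v) => sum + (v - mean) ** 2, 0) / series.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return []; // constant series: nothing to flag

  const anomalies: Anomaly[] = [];
  series.forEach((value, index) => {
    const zScore = Math.abs(value - mean) / stdDev;
    if (zScore >= threshold) {
      anomalies.push({
        index,
        value,
        zScore,
        severity: zScore >= 4 ? 'high' : zScore >= 3 ? 'medium' : 'low',
      });
    }
  });
  return anomalies;
}
```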
3. User Behavior Prediction
The user behavior prediction system anticipates user actions and preferences:
- Next Action Prediction: Forecasts likely user actions
- Churn Risk Assessment: Identifies users at risk of disengagement
- Engagement Scoring: Predicts user engagement levels
- Content Preferences: Recommends content based on predicted interests
- User Insights: Provides activity level, interests, and usage patterns
- MCP Integration: Sophisticated prediction through the Model Context Protocol
4. Real-Time Analytics Processing
The real-time analytics system processes events as they happen:
- WebSocket Delivery: Instant updates via WebSocket connections
- Event Buffering: Efficient processing of event batches
- Subscription Model: Clients can subscribe to specific event types
- Filtering Capabilities: Target specific event types or resources
- MCP Integration: Enhanced processing through the Model Context Protocol
- Low Latency: Minimal delay between event occurrence and processing
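A client-side subscription could look like the following sketch using the browser WebSocket API; the endpoint path and the subscribe message schema are assumptions illustrating the subscription model described above:

```typescript
// Minimal sketch: subscribe to real-time analytics events over WebSocket.
// The endpoint path and the message schema are assumptions.
const socket = new WebSocket('wss://example.com/api/analytics/real-time');

socket.addEventListener('open', () => {
  // Subscription model: request only the event types this client cares about.
  socket.send(JSON.stringify({ action: 'subscribe', eventTypes: ['search', 'agent_prompt'] }));
});

socket.addEventListener('message', (event) => {
  const analyticsEvent = JSON.parse(event.data as string);
  console.log('Real-time analytics event:', analyticsEvent);
});
```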
Conclusion
The analytics system provides valuable insights into user behavior while maintaining system performance. The initial Supabase PostgreSQL implementation offers a good balance of simplicity and functionality, with options to migrate to BigQuery or implement a hybrid approach as data volume grows. The addition of predictive analytics and real-time processing capabilities enhances the system's value by providing forward-looking insights and immediate feedback on user activity.