Analytics System
This document outlines the analytics system architecture, implementation details, and data storage considerations for tracking searches, Agent AI prompts, and API requests on our platform.
Overview
The analytics system captures and stores user interaction data to provide insights into:
- Search patterns and most common search queries
- Agent AI usage and popular prompts
- API request volumes, patterns, and performance metrics
- Material view trends and popular materials
This data enables trend analysis, performance monitoring, feature optimization, and predictive analytics based on actual usage patterns.
Data Captured
The analytics system collects the following data points:
- Event Type: The type of interaction (search, agent_prompt, api_request, material_view)
- Timestamp: When the interaction occurred
- User ID: Which user performed the action (if authenticated)
- Resource Type: What resource was accessed (materials, collections, etc.)
- Query: The actual search query or agent prompt text
- Response Time: How long the operation took to complete (ms)
- Response Status: HTTP status code for API requests
- Additional Context: JSON field for variable additional data
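These fields map naturally onto a record type. Below is a minimal TypeScript sketch of the event shape; the field names are assumptions that mirror the list above, and the canonical column definitions live in the migration file referenced later in this document:

```typescript
// Hypothetical shape of a single analytics event; actual column names
// are defined in the analytics_events migration.
type AnalyticsEventType = 'search' | 'agent_prompt' | 'api_request' | 'material_view';

interface AnalyticsEvent {
  id: string;                      // UUID primary key
  event_type: AnalyticsEventType;  // Type of interaction
  timestamp: string;               // ISO-8601 time of the interaction
  user_id?: string;                // Present only for authenticated users
  resource_type?: string;          // e.g. 'materials', 'collections'
  query?: string;                  // Search query or agent prompt text
  response_time_ms?: number;       // Operation duration in milliseconds
  response_status?: number;        // HTTP status code for API requests
  parameters?: Record<string, unknown>; // Variable additional context (JSON)
}
```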
Implementation Architecture
The analytics system consists of these key components:
- Analytics Middleware: An Express middleware that automatically captures all API requests (a sketch appears after this list)
- Analytics Service: Service methods to track specific events (searches, agent prompts, etc.)
- Database Schema: Table structure and indexes for storing analytics data
- Database Functions: PostgreSQL functions that process raw data into insights
- Predictive Analytics: Advanced capabilities for forecasting, anomaly detection, and user behavior prediction
- Real-Time Analytics: WebSocket-based system for processing and delivering analytics events in real-time
- Admin Dashboard: UI for visualizing analytics data, trends, and predictions
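To make the middleware component concrete, here is a minimal sketch of an Express middleware that records each request once the response has finished. The import path and the trackApiRequest method name are assumptions; the real service exposes the tracking methods shown later in this document:

```typescript
import { Request, Response, NextFunction } from 'express';
// Assumed import path; the actual service location may differ.
import { analyticsService } from '../services/analytics/analyticsService';

// Minimal sketch: record method, path, status code, and latency for every
// API request once the response has finished.
export function analyticsMiddleware(req: Request, res: Response, next: NextFunction): void {
  const startedAt = Date.now();

  res.on('finish', () => {
    // trackApiRequest is a hypothetical method name; fire-and-forget so that
    // analytics failures never block or fail the request itself.
    analyticsService
      .trackApiRequest(req.method, req.originalUrl, res.statusCode, Date.now() - startedAt)
      .catch(() => { /* intentionally swallow analytics errors */ });
  });

  next();
}
```

Recording on the response's 'finish' event, rather than before next(), lets the middleware capture the final status code and total response time without adding latency to the request itself.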
Data Storage Options
Option 1: Supabase PostgreSQL (Current Implementation)
Our current implementation uses Supabase's PostgreSQL database:
Advantages
- Seamless Integration: Works within our existing Supabase infrastructure
- Simplicity: Single database system for application and analytics data
- Real-time Access: Low latency for dashboard visualizations
- SQL Power: Full PostgreSQL capabilities for complex queries
- Data Integrity: ACID compliance ensures reliable analytics
- Implementation Speed: Faster to implement with existing Supabase knowledge
Limitations
- Scaling Challenges: May struggle with very high volumes (billions of records)
- Cost Structure: Storage costs increase linearly with data volume
- Mixed Workload: Analytics queries compete with operational queries
Implementation Details
The Supabase implementation uses:
- A dedicated analytics_events table with appropriate indexes
- PostgreSQL functions for aggregation and trend analysis
- Row-level security policies to protect sensitive data
Database schema details are defined in the migration file:
packages/server/src/services/supabase/migrations/005_analytics_system.sql
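For illustration, writing an event to this table with the supabase-js client might look like the sketch below; the column names are assumptions that mirror the Data Captured section and should be checked against the migration file:

```typescript
import { createClient } from '@supabase/supabase-js';

// Assumes SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY are set in the environment.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);

// Minimal sketch: insert a single search event. Column names are assumptions
// mirroring the Data Captured section; verify against 005_analytics_system.sql.
async function recordSearchEvent(query: string, userId?: string): Promise<void> {
  const { error } = await supabase.from('analytics_events').insert({
    event_type: 'search',
    resource_type: 'materials',
    query,
    user_id: userId ?? null,
    timestamp: new Date().toISOString(),
  });
  if (error) throw error;
}
```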
Option 2: BigQuery Alternative
For larger-scale deployments, Google BigQuery provides an alternative:
Advantages
- Massive Scale: Handles petabytes of data with ease
- Query Performance: Superior performance for complex analytical queries
- Separation of Concerns: Analytics workload isolated from operational database
- Cost Efficiency: Pay-per-query pricing can be cost-effective for intermittent analysis
- Advanced Features: Machine learning integrations and advanced analytics
- No Storage Management: Serverless architecture requires no capacity planning
Limitations
- Implementation Complexity: Requires ETL pipelines to load data from application
- Higher Latency: Not ideal for real-time dashboard updates
- Learning Curve: Team needs to learn BigQuery-specific SQL dialect
- Additional Service: Adds another external dependency
Implementation Approach
To implement BigQuery storage:
- Create a BigQuery dataset and table schema matching our analytics structure
- Implement a data export service (sketched after this list) that regularly:
  - Queries recent analytics events from Supabase
  - Transforms data if needed
  - Loads data into BigQuery tables
- Update the admin dashboard to query BigQuery for historical analytics
- Optionally implement a hybrid approach:
  - Recent data (30-90 days) stays in Supabase for real-time access
  - Historical data moves to BigQuery for long-term storage and analysis
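The export service step could be sketched as follows with the official @google-cloud/bigquery client. The dataset and table names, the batch size, and the watermark parameter are illustrative assumptions:

```typescript
import { BigQuery } from '@google-cloud/bigquery';
import { createClient } from '@supabase/supabase-js';

const bigquery = new BigQuery(); // authenticates via GOOGLE_APPLICATION_CREDENTIALS
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);

// Minimal sketch of the scheduled export job: pull events newer than the last
// exported timestamp from Supabase and write them to BigQuery. The dataset and
// table names, batch size, and watermark parameter are assumptions.
export async function exportRecentEvents(lastExportedAt: string): Promise<number> {
  const { data: events, error } = await supabase
    .from('analytics_events')
    .select('*')
    .gt('timestamp', lastExportedAt)
    .order('timestamp', { ascending: true })
    .limit(10000);
  if (error) throw error;
  if (!events || events.length === 0) return 0;

  // Streaming insert for brevity; a batched load job is usually cheaper
  // for sustained high volume.
  await bigquery.dataset('analytics').table('analytics_events').insert(events);
  return events.length;
}
```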
Hybrid Approach Considerations
For systems with high analytics volume, consider a hybrid approach:
- Real-time Tier: Keep recent data (last 30-90 days) in Supabase
  - Powers real-time dashboards with low latency
  - Handles rapid writes efficiently
  - Enables immediate analysis of current activity
- Historical Tier: Archive older data to BigQuery
  - Stores historical data cost-effectively
  - Enables complex analytical queries over large datasets
  - Removes pressure from the operational database
This approach provides the benefits of both systems while minimizing their limitations.
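One way to realize the split is a thin routing layer that directs each query to the tier owning the requested time range. A minimal sketch, where the 90-day boundary and both tier query functions are hypothetical:

```typescript
// Minimal sketch of tier routing for the hybrid approach. The 90-day
// boundary and both tier query functions are illustrative assumptions.
const HOT_TIER_DAYS = 90;

function isWithinHotTier(from: Date): boolean {
  const cutoff = new Date(Date.now() - HOT_TIER_DAYS * 24 * 60 * 60 * 1000);
  return from >= cutoff;
}

async function queryAnalytics(from: Date, to: Date): Promise<unknown[]> {
  // Route to the tier that owns the start of the requested range.
  return isWithinHotTier(from)
    ? querySupabaseTier(from, to)   // recent data: low latency
    : queryBigQueryTier(from, to);  // historical data: scale and cost
}

// Placeholder implementations; real versions would use supabase-js and
// @google-cloud/bigquery respectively.
async function querySupabaseTier(_from: Date, _to: Date): Promise<unknown[]> { return []; }
async function queryBigQueryTier(_from: Date, _to: Date): Promise<unknown[]> { return []; }
```

A range that straddles the boundary would need to fan out to both tiers and merge the results; that case is omitted here for brevity.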
Setup Instructions
Supabase Setup
- Run the migration script:

  ```bash
  cd packages/server
  npm run migration:run
  ```

- The migration creates:
  - The analytics_events table
  - Required indexes
  - PostgreSQL functions for aggregation
  - Row-level security policies
- The analytics service will use the new table automatically
BigQuery Setup (Optional)
- Create a BigQuery project and dataset
- Create tables matching our analytics schema
- Set up the data export service to run on schedule
- Update dashboard services to query BigQuery for historical data
Using the Analytics System
The system automatically tracks all API requests through the middleware. For additional manual tracking:
```typescript
// Track a search operation
await analyticsService.trackSearch(
  query,          // The search query
  resourceType,   // Type of resource being searched
  userId,         // Optional user ID
  parameters,     // Optional additional parameters
  responseTimeMs, // Optional response time
  responseStatus  // Optional HTTP status code
);

// Track an agent prompt
await analyticsService.trackAgentPrompt(
  prompt,    // The agent prompt
  agentType, // Type of agent
  userId,    // Optional user ID
  sessionId, // Optional session ID
  parameters // Optional additional parameters
);

// Track a material view
await analyticsService.trackMaterialView(
  materialId, // ID of the material being viewed
  userId,     // Optional user ID
  parameters  // Optional additional parameters
);
```
Accessing Analytics Data
Analytics data can be accessed through:
- Admin Dashboard: Navigate to /admin/analytics for visualizations
- Analytics API: Endpoints for programmatic access (an example call follows this list):
  - GET /api/admin/analytics/events - List raw events
  - GET /api/admin/analytics/stats - Get summary statistics
  - GET /api/admin/analytics/trends - Get time-based trends
  - GET /api/admin/analytics/top-searches - Get most common searches
  - GET /api/admin/analytics/top-prompts - Get most common agent prompts
  - GET /api/admin/analytics/top-materials - Get most viewed materials
  - POST /api/analytics/predictive/forecast - Generate time-series forecasts
  - POST /api/analytics/predictive/anomalies - Detect anomalies in analytics data
  - POST /api/analytics/predictive/user-behavior - Predict user behavior patterns
  - POST /api/analytics/real-time/event - Track real-time analytics events
  - GET /api/analytics/real-time/events - Get recent real-time events
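As an example of programmatic access, a call to the stats endpoint might look like the following sketch; the bearer-token header and the query parameter are assumptions:

```typescript
// Minimal sketch: fetch summary statistics from the admin analytics API.
// The bearer-token header and the query parameter are assumptions.
async function fetchAnalyticsStats(token: string): Promise<unknown> {
  const res = await fetch('/api/admin/analytics/stats?eventType=search', {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`Analytics request failed: ${res.status}`);
  return res.json();
}
```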
Best Practices
- Data Retention: Define a clear data retention policy:
  - How long to keep detailed events
  - When to aggregate and discard raw data
  - Legal compliance considerations (GDPR, etc.)
- Privacy Considerations:
  - Store only necessary data
  - Anonymize sensitive information
  - Ensure proper authorization for analytics access
- Performance Optimization:
  - Use time-based partitioning for large datasets
  - Create materialized views for common queries
  - Consider a read replica for analytics queries
- Monitoring:
  - Set up alerts for unusual patterns
  - Monitor storage growth and query performance
  - Periodically audit access to analytics data
Making the Decision: Supabase vs BigQuery
The choice between Supabase PostgreSQL and BigQuery depends on several factors:
Choose Supabase When:
- Expected analytics volume is under 100GB or ~50 million events
- Real-time analytics are critical
- Simplicity and fast implementation are priorities
- Team is already familiar with PostgreSQL
- Cost predictability is important
Choose BigQuery When:
- Expected analytics volume exceeds 100GB or ~50 million events
- Complex analytical queries are frequent
- Real-time access is less important than historical analysis
- Separating analytical workloads from the operational database is a goal
- Team is comfortable with data pipeline management
Choose Hybrid Approach When:
- Both real-time and historical analytics are important
- Data volume is growing rapidly
- Different teams have different analysis needs
- Cost optimization for large datasets is a priority
Predictive Analytics Features
The analytics system includes advanced predictive capabilities that leverage historical data to provide forward-looking insights:
1. Time-Series Forecasting
The time-series forecasting feature predicts future trends based on historical analytics data:
- Forecasting Models: Uses statistical models to predict future values
- Configurable Parameters: Adjustable forecast periods and intervals
- Event Type Filtering: Generate forecasts for specific event types
- Resource Type Filtering: Focus on particular resource categories
- MCP Integration: Leverages the Model Context Protocol for advanced forecasting
- Visualization: Interactive charts in the admin dashboard
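A forecast request might be issued as in the sketch below; the body field names (eventType, resourceType, forecastPeriods, interval) are assumptions that illustrate the configurable parameters listed above:

```typescript
// Minimal sketch: request a time-series forecast. Field names in the body
// are assumptions illustrating the configurable parameters described above.
async function requestForecast(token: string): Promise<unknown> {
  const res = await fetch('/api/analytics/predictive/forecast', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({
      eventType: 'search',       // forecast search volume
      resourceType: 'materials', // optional resource filter
      forecastPeriods: 14,       // number of future intervals to predict
      interval: 'day',           // granularity of each period
    }),
  });
  if (!res.ok) throw new Error(`Forecast request failed: ${res.status}`);
  return res.json();
}
```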
2. Anomaly Detection
The anomaly detection system identifies unusual patterns in analytics data:
- Statistical Analysis: Uses standard deviation-based detection
- Severity Classification: Categorizes anomalies as low, medium, or high
- Threshold Configuration: Adjustable sensitivity settings
- Visual Indicators: Highlights anomalies in time-series charts
- Alert Potential: Foundation for anomaly-based alerting
- MCP Integration: Advanced detection through the Model Context Protocol
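To make the standard-deviation approach concrete, here is a self-contained sketch of z-score based detection with low/medium/high severity buckets; the threshold default and severity cutoffs are illustrative assumptions, not the system's actual settings:

```typescript
interface Anomaly {
  index: number;            // position in the series
  value: number;            // observed value
  zScore: number;           // distance from the mean in standard deviations
  severity: 'low' | 'medium' | 'high';
}

// Minimal sketch of standard-deviation (z-score) anomaly detection.
// The threshold default and severity cutoffs are illustrative assumptions.
function detectAnomalies(series: number[], threshold = 2): Anomaly[] {
  if (series.length === 0) return [];

  const mean = series.reduce((sum, v) => sum + v, 0) / series.length;
  const variance = series.reduce((sum, v) => sum + (v - mean) ** 2, 0) / series.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return []; // constant series: nothing to flag

  const anomalies: Anomaly[] = [];
  series.forEach((value, index) => {
    const zScore = Math.abs(value - mean) / stdDev;
    if (zScore >= threshold) {
      anomalies.push({
        index,
        value,
        zScore,
        severity: zScore >= 4 ? 'high' : zScore >= 3 ? 'medium' : 'low',
      });
    }
  });
  return anomalies;
}
```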
3. User Behavior Prediction
The user behavior prediction system anticipates user actions and preferences:
- Next Action Prediction: Forecasts likely user actions
- Churn Risk Assessment: Identifies users at risk of disengagement
- Engagement Scoring: Predicts user engagement levels
- Content Preferences: Recommends content based on predicted interests
- User Insights: Provides activity level, interests, and usage patterns
- MCP Integration: Sophisticated prediction through the Model Context Protocol
4. Real-Time Analytics Processing
The real-time analytics system processes events as they happen:
- WebSocket Delivery: Instant updates via WebSocket connections
- Event Buffering: Efficient processing of event batches
- Subscription Model: Clients can subscribe to specific event types
- Filtering Capabilities: Target specific event types or resources
- MCP Integration: Enhanced processing through the Model Context Protocol
- Low Latency: Minimal delay between event occurrence and processing
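A client-side subscription could look like the following sketch using the browser WebSocket API; the endpoint path and the subscribe message schema are assumptions illustrating the subscription model described above:

```typescript
// Minimal sketch: subscribe to real-time analytics events over WebSocket.
// The endpoint path and the message schema are assumptions.
const socket = new WebSocket('wss://example.com/api/analytics/real-time');

socket.addEventListener('open', () => {
  // Subscription model: request only the event types this client cares about.
  socket.send(JSON.stringify({ action: 'subscribe', eventTypes: ['search', 'agent_prompt'] }));
});

socket.addEventListener('message', (event) => {
  const analyticsEvent = JSON.parse(event.data as string);
  console.log('Real-time analytics event:', analyticsEvent);
});
```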
Conclusion
The analytics system provides valuable insights into user behavior while maintaining system performance. The initial Supabase PostgreSQL implementation offers a good balance of simplicity and functionality, with options to migrate to BigQuery or implement a hybrid approach as data volume grows. The addition of predictive analytics and real-time processing capabilities enhances the system's value by providing forward-looking insights and immediate feedback on user activity.