Advanced Alerting Condition Types

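This document describes the advanced condition types implemented in the alerting service: trend, anomaly, composite, and dynamic threshold conditions, together with additional aggregation functions. They extend fixed-threshold rules so that alerts can fire on gradual drift, statistical outliers, and combinations of signals, which supports proactive monitoring and earlier issue detection.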

Overview

The alerting service supports four advanced condition types:

Trend Condition: Detects trends in metrics over time
Anomaly Condition: Detects anomalies in metrics using statistical methods
Composite Condition: Combines multiple conditions with logical operators
Dynamic Threshold Condition: Uses dynamic thresholds based on historical data

Additionally, the service now supports advanced aggregation functions:

Median: Calculates the median value
Percentiles (P90, P95, P99): Calculates percentile values
Standard Deviation: Calculates the standard deviation
Variance: Calculates the variance

Advanced Condition Types

Trend Condition

The trend condition detects trends in metrics over time. It uses linear regression to calculate the slope of the trend line and compares it to a threshold.

Properties

type: AlertRuleConditionType.TREND
metric: The metric to monitor (required)
timeWindow: The time window in seconds (optional)
properties: Additional properties (required)
- trendDirection: The direction of the trend (required)
  - TrendDirection.INCREASING: Increasing trend
  - TrendDirection.DECREASING: Decreasing trend
  - TrendDirection.STABLE: Stable trend
- trendThreshold: The threshold for the trend slope (optional, default: 0)

Example

const trendCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.TREND,
  metric: 'response_time',
  timeWindow: 3600, // 1 hour
  properties: {
    trendDirection: TrendDirection.INCREASING,
    trendThreshold: 0.1
  }
};

Implementation

The trend condition is implemented as follows:

Filter events within the time window if specified
Extract metric values from events (from properties or measurements)
Calculate the linear regression slope
Compare the slope to the trend threshold based on the trend direction

Anomaly Condition

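The anomaly condition detects anomalies in metrics using statistical methods. It calculates the z-score of each recent value, that is, how many standard deviations the value lies from the mean of the historical values, and triggers if a z-score exceeds a threshold derived from the sensitivity setting.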

Properties

type: AlertRuleConditionType.ANOMALY
metric: The metric to monitor (required)
properties: Additional properties (optional)
- sensitivity: The sensitivity of the anomaly detection (optional, default: 0.5)
- trainingWindow: The training window in seconds (optional, default: 24 hours)

Example

const anomalyCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.ANOMALY,
  metric: 'response_time',
  properties: {
    sensitivity: 0.7,
    trainingWindow: 86400 // 24 hours
  }
};

Implementation

The anomaly condition is implemented as follows:

Filter events within the training window
Split events into training and test sets
Calculate the mean and standard deviation of the training values
Calculate the z-scores of the test values
Check if any z-score exceeds the threshold (based on sensitivity)
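
A minimal sketch of the z-score check is shown below. The document does not state how sensitivity maps to a z-score threshold or how events are split into training and test sets, so `zScoreCutoff` is a hypothetical mapping and the two sets are simply passed in as arrays; all function names are assumptions for illustration.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (divides by n).
function stdDev(values: number[]): number {
  const mu = mean(values);
  const variance =
    values.reduce((sum, v) => sum + (v - mu) * (v - mu), 0) / values.length;
  return Math.sqrt(variance);
}

// Hypothetical mapping from sensitivity (0..1) to a z-score cutoff:
// 0 -> 4 standard deviations, 0.5 -> 3, 1 -> 2.
// The mapping used by the service is not documented here.
function zScoreCutoff(sensitivity: number): number {
  return 4 - 2 * sensitivity;
}

function isAnomalous(
  training: number[],
  recent: number[],
  sensitivity: number = 0.5
): boolean {
  if (training.length < 2) return false; // not enough history
  const mu = mean(training);
  const sigma = stdDev(training);
  if (sigma === 0) return false; // z-score is undefined for a constant baseline
  const cutoff = zScoreCutoff(sensitivity);
  // Trigger if any recent value lies more than `cutoff` deviations from the mean.
  return recent.some((v) => Math.abs(v - mu) / sigma > cutoff);
}

const trainingValues = [10, 12, 11, 9, 10, 11, 12, 9];
const spikeDetected = isAnomalous(trainingValues, [11, 30]); // true
const normalDetected = isAnomalous(trainingValues, [11, 10]); // false
```

Under this assumed mapping, a higher sensitivity lowers the cutoff, so smaller deviations trigger the condition.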

Composite Condition

The composite condition combines multiple conditions with logical operators. Use it for rules that no single condition can express, such as a high response time that coincides with a burst of events.

Properties

type: AlertRuleConditionType.COMPOSITE
logicalOperator: The logical operator to use (required)
- LogicalOperator.AND: All conditions must be met
- LogicalOperator.OR: At least one condition must be met
- LogicalOperator.NOT: The child condition must not be met
conditions: The child conditions to combine (required)

Example

const compositeCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.COMPOSITE,
  logicalOperator: LogicalOperator.AND,
  conditions: [
    {
      type: AlertRuleConditionType.THRESHOLD,
      metric: 'response_time',
      threshold: 1000,
      operator: 'gt'
    },
    {
      type: AlertRuleConditionType.FREQUENCY,
      timeWindow: 300, // 5 minutes
      minCount: 5
    }
  ]
};

Implementation

The composite condition is implemented as follows:

Evaluate each child condition
Combine the results based on the logical operator
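
The two steps can be sketched as a recursive evaluation over a simplified condition tree. The `ConditionNode` type is hypothetical: each leaf stands in for an already evaluated non-composite condition. The handling of NOT with several children and of an empty child list are assumptions, since the document does not specify them.

```typescript
type Operator = 'AND' | 'OR' | 'NOT';

// Hypothetical, simplified tree: a leaf carries the result of a
// non-composite condition that has already been evaluated.
type ConditionNode =
  | { kind: 'leaf'; met: boolean }
  | { kind: 'composite'; operator: Operator; conditions: ConditionNode[] };

function evaluateNode(node: ConditionNode): boolean {
  if (node.kind === 'leaf') return node.met;

  // Step 1: evaluate each child condition (composites nest recursively).
  const results = node.conditions.map((child) => evaluateNode(child));

  // Assumption: a composite without children never fires.
  if (results.length === 0) return false;

  // Step 2: combine the results based on the logical operator.
  if (node.operator === 'AND') return results.every((r) => r);
  if (node.operator === 'OR') return results.some((r) => r);
  // NOT: assumed to mean "none of the child conditions is met".
  return !results.some((r) => r);
}

const exampleRule: ConditionNode = {
  kind: 'composite',
  operator: 'AND',
  conditions: [
    { kind: 'leaf', met: true },
    {
      kind: 'composite',
      operator: 'NOT',
      conditions: [{ kind: 'leaf', met: false }],
    },
  ],
};
const exampleFired = evaluateNode(exampleRule); // true
```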

Dynamic Threshold Condition

The dynamic threshold condition derives its threshold from historical data instead of a fixed value. It calculates the mean and standard deviation of the historical values and triggers when the aggregated current value crosses the resulting threshold in the direction given by operator.

Properties

type: AlertRuleConditionType.DYNAMIC_THRESHOLD
metric: The metric to monitor (required)
operator: The comparison operator (required)
properties: Additional properties (optional)
- baselinePeriod: The baseline period in seconds (optional, default: 24 hours)
- deviationFactor: The deviation factor (optional, default: 2)
- aggregation: The aggregation function for current values (optional, default: 'avg')

Example

const dynamicThresholdCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.DYNAMIC_THRESHOLD,
  metric: 'response_time',
  operator: 'gt',
  properties: {
    baselinePeriod: 86400, // 24 hours
    deviationFactor: 3,
    aggregation: AggregationFunction.AVG
  }
};

Implementation

The dynamic threshold condition is implemented as follows:

Filter events within the baseline period
Calculate the mean and standard deviation of the baseline values
Calculate the dynamic threshold (mean ± deviationFactor * stdDev)
Calculate the aggregate value for current values
Compare the current value to the dynamic threshold
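
The calculation can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the current value is passed in already aggregated, and the choice of the upper band (mean + deviationFactor * stdDev) for 'gt' and the lower band for 'lt' is an assumption about how the ± is resolved.

```typescript
type Comparison = 'gt' | 'lt';

interface BaselineStats {
  mean: number;
  stdDev: number;
}

// Mean and population standard deviation of the baseline values.
function baselineStats(values: number[]): BaselineStats {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Assumption: 'gt' compares against the upper band, 'lt' against the lower band.
function dynamicThreshold(
  stats: BaselineStats,
  deviationFactor: number,
  operator: Comparison
): number {
  const offset = deviationFactor * stats.stdDev;
  return operator === 'gt' ? stats.mean + offset : stats.mean - offset;
}

function crossesDynamicThreshold(
  current: number, // aggregated current value, for example the average
  baselineValues: number[],
  operator: Comparison,
  deviationFactor: number = 2
): boolean {
  if (baselineValues.length === 0) return false; // no baseline to compare against
  const threshold = dynamicThreshold(
    baselineStats(baselineValues),
    deviationFactor,
    operator
  );
  return operator === 'gt' ? current > threshold : current < threshold;
}

// Baseline with mean 100 and standard deviation of about 7.07.
const lastDayValues = [100, 110, 90, 100];
const highFires = crossesDynamicThreshold(130, lastDayValues, 'gt', 3); // true
const mildFires = crossesDynamicThreshold(115, lastDayValues, 'gt', 3); // false
```

With a deviationFactor of 3, the upper threshold for this baseline is roughly 121.2, so 130 triggers the condition and 115 does not.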

Advanced Aggregation Functions

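Each aggregation function below is selected through properties.aggregation on a condition. The examples apply it to a threshold condition, so the aggregated value is what gets compared to threshold.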

Median

Calculates the median of the metric values, that is, the middle value when they are sorted.

const thresholdCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.THRESHOLD,
  metric: 'response_time',
  threshold: 1000,
  operator: 'gt',
  properties: {
    aggregation: AggregationFunction.MEDIAN
  }
};

Percentiles (P90, P95, P99)

Calculates the 90th, 95th, or 99th percentile of the metric values, that is, the value below which 90%, 95%, or 99% of them fall.

const thresholdCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.THRESHOLD,
  metric: 'response_time',
  threshold: 1000,
  operator: 'gt',
  properties: {
    aggregation: AggregationFunction.P95
  }
};
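
For reference, a percentile can be computed as sketched below. The sketch uses linear interpolation between the two closest ranks, which is one of several common definitions; the method the service actually uses is not stated in this document, and the function names are assumptions. The median is the 50th percentile under this definition.

```typescript
// Percentile by linear interpolation between the two closest ranks.
// `p` is given in percent, from 0 to 100.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('cannot take a percentile of an empty set');
  if (p < 0 || p > 100) throw new Error('p must be between 0 and 100');
  // Sort a copy so the caller's array is left untouched.
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  const weight = rank - lower;
  return sorted[lower] + (sorted[upper] - sorted[lower]) * weight;
}

function median(values: number[]): number {
  return percentile(values, 50);
}

const responseTimes = [120, 80, 100, 90, 110];
const medianResponseTime = median(responseTimes); // 100
const maxResponseTime = percentile(responseTimes, 100); // 120
```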

Standard Deviation

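Calculates the standard deviation of the metric values, a measure of how widely they spread around their mean. The result is in the same units as the metric.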

const thresholdCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.THRESHOLD,
  metric: 'response_time',
  threshold: 100,
  operator: 'gt',
  properties: {
    aggregation: AggregationFunction.STDDEV
  }
};

Variance

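Calculates the variance of the metric values, which is the square of the standard deviation. The result is in squared units of the metric, so thresholds are correspondingly larger than for standard deviation.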

const thresholdCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.THRESHOLD,
  metric: 'response_time',
  threshold: 10000,
  operator: 'gt',
  properties: {
    aggregation: AggregationFunction.VARIANCE
  }
};
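
The relationship between the two spread measures can be sketched as follows. The sketch uses the population formulas (dividing by n); whether the service divides by n or by n - 1 is not stated here, so treat that choice and the function names as assumptions.

```typescript
// Population variance: the mean of the squared deviations from the mean.
function variance(values: number[]): number {
  if (values.length === 0) throw new Error('cannot take the variance of an empty set');
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / values.length;
}

// Standard deviation is the square root of the variance.
function standardDeviation(values: number[]): number {
  return Math.sqrt(variance(values));
}

const sampleValues = [2, 4, 4, 4, 5, 5, 7, 9]; // mean 5
const sampleVariance = variance(sampleValues); // 4
const sampleStdDev = standardDeviation(sampleValues); // 2
```

This is why the examples above pair a standard deviation threshold of 100 with a variance threshold of 10000: a standard deviation of 100 corresponds to a variance of 100 * 100.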

Multiple Metrics Support

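The alerting service supports rules that involve multiple metrics: a composite condition can combine child conditions that each monitor a different metric. The example below triggers only when both response_time and error_rate exceed their thresholds.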

const compositeCondition: AlertRuleCondition = {
  type: AlertRuleConditionType.COMPOSITE,
  logicalOperator: LogicalOperator.AND,
  conditions: [
    {
      type: AlertRuleConditionType.THRESHOLD,
      metric: 'response_time',
      threshold: 1000,
      operator: 'gt'
    },
    {
      type: AlertRuleConditionType.THRESHOLD,
      metric: 'error_rate',
      threshold: 0.05,
      operator: 'gt'
    }
  ]
};

Benefits

The implementation of advanced condition types provides several benefits:

Sophisticated Alerting: Rules can go beyond a fixed threshold on a single metric
Trend Detection: Trend conditions detect sustained increases, decreases, or stability in a metric over time
Anomaly Detection: Anomaly conditions flag values that deviate statistically from historical values
Complex Conditions: Composite conditions combine several conditions with AND, OR, and NOT
Dynamic Thresholds: Dynamic threshold conditions adapt to historical data instead of relying on a fixed value
Advanced Aggregation: Median, percentile, standard deviation, and variance allow alerting on the distribution of values rather than only a single aggregate
Multiple Metrics: A single rule can combine conditions on several metrics

Next Steps

The following steps are recommended to further improve the alerting service:

Add More Condition Types: Add support for more condition types (seasonality, correlation, etc.)
Improve Anomaly Detection: Improve the anomaly detection algorithm with more sophisticated methods
Add Support for Machine Learning: Add support for machine learning models for anomaly detection
Add Support for Time Series Forecasting: Add support for time series forecasting for predictive alerting
Add Support for Alert Correlation: Add support for correlating alerts to reduce noise
Add Support for Alert Suppression: Add support for suppressing alerts based on maintenance windows or other criteria
Add Support for Alert Escalation: Add support for escalating alerts based on severity and time
