Advanced Alerting Condition Types
This document describes the advanced condition types implemented in the alerting service. These condition types provide more sophisticated alerting capabilities, enabling proactive monitoring and issue detection.
Overview
The alerting service now supports several advanced condition types:
- Trend Condition: Detects trends in metrics over time
- Anomaly Condition: Detects anomalies in metrics using statistical methods
- Composite Condition: Combines multiple conditions with logical operators
- Dynamic Threshold Condition: Uses dynamic thresholds based on historical data
Additionally, the service now supports advanced aggregation functions:
- Median: Calculates the median value
- Percentiles (P90, P95, P99): Calculates percentile values
- Standard Deviation: Calculates the standard deviation
- Variance: Calculates the variance
Advanced Condition Types
Trend Condition
The trend condition detects trends in metrics over time. It uses linear regression to calculate the slope of the trend line and compares it to a threshold.
Properties
- type:
AlertRuleConditionType.TREND
- metric: The metric to monitor (required)
- timeWindow: The time window in seconds (optional)
- properties: Additional properties (required)
- trendDirection: The direction of the trend (required)
TrendDirection.INCREASING
: Increasing trendTrendDirection.DECREASING
: Decreasing trendTrendDirection.STABLE
: Stable trend
- trendThreshold: The threshold for the trend slope (optional, default: 0)
- trendDirection: The direction of the trend (required)
Example
const trendCondition: AlertRuleCondition = {
type: AlertRuleConditionType.TREND,
metric: 'response_time',
timeWindow: 3600, // 1 hour
properties: {
trendDirection: TrendDirection.INCREASING,
trendThreshold: 0.1
}
};
Implementation
The trend condition is implemented as follows:
- Filter events within the time window if specified
- Extract metric values from events (from properties or measurements)
- Calculate the linear regression slope
- Compare the slope to the trend threshold based on the trend direction
Anomaly Condition
The anomaly condition detects anomalies in metrics using statistical methods. It calculates the z-score of recent values compared to historical values and triggers if the z-score exceeds a threshold.
Properties
- type:
AlertRuleConditionType.ANOMALY
- metric: The metric to monitor (required)
- properties: Additional properties (optional)
- sensitivity: The sensitivity of the anomaly detection (optional, default: 0.5)
- trainingWindow: The training window in seconds (optional, default: 24 hours)
Example
const anomalyCondition: AlertRuleCondition = {
type: AlertRuleConditionType.ANOMALY,
metric: 'response_time',
properties: {
sensitivity: 0.7,
trainingWindow: 86400 // 24 hours
}
};
Implementation
The anomaly condition is implemented as follows:
- Filter events within the training window
- Split events into training and test sets
- Calculate the mean and standard deviation of the training values
- Calculate the z-scores of the test values
- Check if any z-score exceeds the threshold (based on sensitivity)
Composite Condition
The composite condition combines multiple conditions with logical operators. It allows for complex conditions that can't be expressed using a single condition.
Properties
- type:
AlertRuleConditionType.COMPOSITE
- logicalOperator: The logical operator to use (required)
LogicalOperator.AND
: All conditions must be metLogicalOperator.OR
: At least one condition must be metLogicalOperator.NOT
: The condition must not be met
- conditions: The child conditions to combine (required)
Example
const compositeCondition: AlertRuleCondition = {
type: AlertRuleConditionType.COMPOSITE,
logicalOperator: LogicalOperator.AND,
conditions: [
{
type: AlertRuleConditionType.THRESHOLD,
metric: 'response_time',
threshold: 1000,
operator: 'gt'
},
{
type: AlertRuleConditionType.FREQUENCY,
timeWindow: 300, // 5 minutes
minCount: 5
}
]
};
Implementation
The composite condition is implemented as follows:
- Evaluate each child condition
- Combine the results based on the logical operator
Dynamic Threshold Condition
The dynamic threshold condition uses dynamic thresholds based on historical data. It calculates the mean and standard deviation of historical values and triggers if the current value exceeds the dynamic threshold.
Properties
- type:
AlertRuleConditionType.DYNAMIC_THRESHOLD
- metric: The metric to monitor (required)
- operator: The comparison operator (required)
- properties: Additional properties (optional)
- baselinePeriod: The baseline period in seconds (optional, default: 24 hours)
- deviationFactor: The deviation factor (optional, default: 2)
- aggregation: The aggregation function for current values (optional, default: 'avg')
Example
const dynamicThresholdCondition: AlertRuleCondition = {
type: AlertRuleConditionType.DYNAMIC_THRESHOLD,
metric: 'response_time',
operator: 'gt',
properties: {
baselinePeriod: 86400, // 24 hours
deviationFactor: 3,
aggregation: AggregationFunction.AVG
}
};
Implementation
The dynamic threshold condition is implemented as follows:
- Filter events within the baseline period
- Calculate the mean and standard deviation of the baseline values
- Calculate the dynamic threshold (mean ± deviationFactor * stdDev)
- Calculate the aggregate value for current values
- Compare the current value to the dynamic threshold
Advanced Aggregation Functions
The alerting service now supports several advanced aggregation functions:
Median
Calculates the median value of a set of metrics.
const thresholdCondition: AlertRuleCondition = {
type: AlertRuleConditionType.THRESHOLD,
metric: 'response_time',
threshold: 1000,
operator: 'gt',
properties: {
aggregation: AggregationFunction.MEDIAN
}
};
Percentiles (P90, P95, P99)
Calculates the 90th, 95th, or 99th percentile of a set of metrics.
const thresholdCondition: AlertRuleCondition = {
type: AlertRuleConditionType.THRESHOLD,
metric: 'response_time',
threshold: 1000,
operator: 'gt',
properties: {
aggregation: AggregationFunction.P95
}
};
Standard Deviation
Calculates the standard deviation of a set of metrics.
const thresholdCondition: AlertRuleCondition = {
type: AlertRuleConditionType.THRESHOLD,
metric: 'response_time',
threshold: 100,
operator: 'gt',
properties: {
aggregation: AggregationFunction.STDDEV
}
};
Variance
Calculates the variance of a set of metrics.
const thresholdCondition: AlertRuleCondition = {
type: AlertRuleConditionType.THRESHOLD,
metric: 'response_time',
threshold: 10000,
operator: 'gt',
properties: {
aggregation: AggregationFunction.VARIANCE
}
};
Multiple Metrics Support
The alerting service now supports conditions that involve multiple metrics. This allows for more complex conditions that compare multiple metrics.
const compositeCondition: AlertRuleCondition = {
type: AlertRuleConditionType.COMPOSITE,
logicalOperator: LogicalOperator.AND,
conditions: [
{
type: AlertRuleConditionType.THRESHOLD,
metric: 'response_time',
threshold: 1000,
operator: 'gt'
},
{
type: AlertRuleConditionType.THRESHOLD,
metric: 'error_rate',
threshold: 0.05,
operator: 'gt'
}
]
};
Benefits
The implementation of advanced condition types provides several benefits:
- Sophisticated Alerting: Support for advanced condition types allows for more sophisticated alerting
- Trend Detection: Trend conditions allow for detecting trends in metrics over time
- Anomaly Detection: Anomaly conditions allow for detecting anomalies in metrics
- Complex Conditions: Composite conditions allow for complex conditions that combine multiple conditions
- Dynamic Thresholds: Dynamic threshold conditions allow for thresholds that adapt to historical data
- Advanced Aggregation: Support for advanced aggregation functions allows for more precise alerting
- Multiple Metrics: Support for multiple metrics allows for conditions that compare multiple metrics
Next Steps
The following steps are recommended to further improve the alerting service:
- Add More Condition Types: Add support for more condition types (seasonality, correlation, etc.)
- Improve Anomaly Detection: Improve the anomaly detection algorithm with more sophisticated methods
- Add Support for Machine Learning: Add support for machine learning models for anomaly detection
- Add Support for Time Series Forecasting: Add support for time series forecasting for predictive alerting
- Add Support for Alert Correlation: Add support for correlating alerts to reduce noise
- Add Support for Alert Suppression: Add support for suppressing alerts based on maintenance windows or other criteria
- Add Support for Alert Escalation: Add support for escalating alerts based on severity and time