Predictive Incident Detection

Predictive Incident Detection lets Operyn identify issues before they impact users by analyzing telemetry trends and using AI to forecast whether a metric is likely to breach its threshold.

How it Works

  1. Trend Detection: The AnomalyDetector in the Incident Engine monitors incoming metrics for rapid changes (e.g., CPU increasing at >5% per minute).
  2. AI Forecasting: When a suspicious trend is detected, the Incident Engine calls the AI Engine's /forecast endpoint.
  3. Future State Prediction: The ForecastingAgent uses an LLM to analyze historical data points and predict whether a threshold breach is likely within the next 15 minutes.
  4. Early Warning Incidents: If a breach is predicted with high confidence, Operyn creates a HIGH severity incident with the [PREDICTIVE] prefix, allowing SREs to intervene before an outage occurs.
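
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Operyn's actual implementation: every function and parameter name here is hypothetical, the call to the AI Engine's /forecast endpoint is replaced by a simple linear extrapolation stub, and the confidence values and the 0.8 cutoff are placeholders, since the source does not say what "high confidence" means numerically.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One sample is (minutes since first sample, metric value). Hypothetical shape.
Sample = Tuple[float, float]


def slope_per_minute(samples: List[Sample]) -> float:
    """Least-squares slope of the metric, in metric units per minute."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


@dataclass
class Forecast:
    breach_predicted: bool
    confidence: float


def stub_forecast(samples: List[Sample], breach_threshold: float,
                  window_minutes: int) -> Forecast:
    """Stand-in for the AI Engine's /forecast call (assumption).

    The real ForecastingAgent uses an LLM; this stub only extends the
    current trend in a straight line across the prediction window.
    """
    _, last_value = samples[-1]
    projected = last_value + slope_per_minute(samples) * window_minutes
    breach = projected >= breach_threshold
    # Placeholder confidence values, not real model output.
    return Forecast(breach_predicted=breach, confidence=0.9 if breach else 0.1)


def evaluate(metric: str, samples: List[Sample], slope_threshold: float = 5.0,
             breach_threshold: float = 90.0, window_minutes: int = 15,
             min_confidence: float = 0.8) -> Optional[dict]:
    """Return a predictive incident, or None if no early warning is needed."""
    # Step 1: trend detection. Slow-moving metrics never reach the forecaster.
    if slope_per_minute(samples) <= slope_threshold:
        return None
    # Steps 2 and 3: ask the forecaster whether a breach is likely.
    forecast = stub_forecast(samples, breach_threshold, window_minutes)
    if not forecast.breach_predicted or forecast.confidence < min_confidence:
        return None
    # Step 4: early warning incident with the [PREDICTIVE] prefix.
    return {
        "severity": "HIGH",
        "title": f"[PREDICTIVE] {metric} likely to breach "
                 f"{breach_threshold} within {window_minutes} minutes",
    }


rising = [(0, 40.0), (1, 46.0), (2, 52.0), (3, 58.0)]  # climbing 6 points/min
steady = [(0, 40.0), (1, 41.0), (2, 40.0), (3, 41.0)]  # small jitter only
rising_incident = evaluate("cpu_percent", rising)
steady_incident = evaluate("cpu_percent", steady)
```

With these sample inputs, the rising series produces a HIGH severity incident, while the steady series is dropped at the trend detection step and never reaches the forecaster.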

Configuration

Predictive detection behavior can be tuned through the AnomalyDetector configuration:

  • Slope Threshold: Minimum rate of change (for example, 5% per minute for CPU) that triggers AI forecasting.
  • Prediction Window: How far into the future (in minutes) to predict. Default is 15 minutes.
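
As a rough illustration, the two settings could be modeled as a small validated configuration object. The class and field names below are assumptions made for this sketch and may not match Operyn's actual configuration keys. Only the 15-minute default comes from the description above. No default slope threshold is documented, so the sketch requires the caller to supply one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnomalyDetectorConfig:
    """Hypothetical container for the two documented tuning knobs."""

    # Minimum rate of change (metric units per minute) that triggers analysis.
    # Required here because no default value is documented.
    slope_threshold: float
    # How far ahead to predict, in minutes. The documented default is 15.
    prediction_window_minutes: int = 15

    def __post_init__(self) -> None:
        # Reject values that would make the detector meaningless.
        if self.slope_threshold <= 0:
            raise ValueError("slope_threshold must be positive")
        if self.prediction_window_minutes <= 0:
            raise ValueError("prediction_window_minutes must be positive")


# Example: trigger analysis at 5 units per minute, keep the default window.
config = AnomalyDetectorConfig(slope_threshold=5.0)
```

A lower slope threshold sends more trends to the forecaster, while a longer prediction window gives earlier warnings. How each setting affects forecast accuracy is not stated in the description above.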

Benefits

  • Reduced MTTR: Catching issues early gives SREs time to fix them before they become critical.
  • Proactive Scaling: Predictive signals can be used to trigger automated scaling before load causes latency spikes.
  • Noise Reduction: AI filtering suppresses alerts for transient spikes, so an alert fires only when a spike is part of a growing trend.