Incidents
Incidents are the core unit of work in Operyn. They represent a detected problem — something that requires investigation and potentially remediation.
How Events Become Incidents
The flow from raw telemetry to an incident:
- Ingestion: logs and metrics arrive via the Ingestion API and are enqueued.
- Detection: the Incident Engine consumes events and evaluates them against detection rules.
- Deduplication: events matching an existing open incident's deduplication key are merged rather than creating a new incident.
- Correlation: related events (same service, same time window, same error pattern) are grouped using configurable correlation rules.
- Creation: when thresholds are breached and no duplicate exists, a new incident is created in PostgreSQL.
Detection Rules
The Incident Engine applies configurable rules to determine when events constitute an incident:
- Error rate threshold: e.g. if error-level logs for a service exceed N per minute.
- Metric anomaly: e.g. CPU utilisation above ANOMALY_CPU_THRESHOLD (default 90%).
- Pattern matching: known failure signatures in log messages.
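The three rule types above can be sketched as a single evaluation function. This is a minimal illustration, not the Incident Engine's actual code: the `ServiceWindow` shape, the `ERROR_RATE_PER_MINUTE` value, the failure signatures, and the `pattern-match` signal name are all assumptions made for the example. Only the 90% CPU default and the `error-rate` / `cpu-anomaly` names come from this page.

```python
import re
from dataclasses import dataclass

ANOMALY_CPU_THRESHOLD = 90.0   # default documented above
ERROR_RATE_PER_MINUTE = 50     # the "N" in the error rate rule; value assumed
# Hypothetical failure signatures, for illustration only.
FAILURE_SIGNATURES = [
    re.compile(r"OutOfMemoryError"),
    re.compile(r"connection refused", re.IGNORECASE),
]


@dataclass
class ServiceWindow:
    """One minute of telemetry for a single service (illustrative shape)."""
    service: str
    error_log_count: int
    cpu_percent: float
    log_messages: list[str]


def detect_signals(window: ServiceWindow) -> list[str]:
    """Return the signal types that fire for this one-minute window."""
    signals = []
    if window.error_log_count > ERROR_RATE_PER_MINUTE:
        signals.append("error-rate")
    if window.cpu_percent > ANOMALY_CPU_THRESHOLD:
        signals.append("cpu-anomaly")
    if any(sig.search(msg) for sig in FAILURE_SIGNATURES for msg in window.log_messages):
        signals.append("pattern-match")  # hypothetical signal name
    return signals
```

A window with 120 error logs, 95% CPU, and an `OutOfMemoryError` message would fire all three signals in this sketch.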
Deduplication
When a new event triggers incident creation, the engine first checks whether an open incident with the same deduplication key already exists within a configurable time window.
The deduplication key is generated from the signal type and affected service:
{signalType}:{service}
For example, error-rate:billing-service or cpu-anomaly:api-gateway.
If a matching open incident is found:
- The incident's duplicateCount is incremented.
- The lastSeenAt timestamp is updated.
- No new incident is created.
This prevents alert storms from flooding the incident list with identical entries.
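The deduplication check can be sketched as follows. The in-memory dictionary, the 30-minute window, and the choice to measure the window from `lastSeenAt` are assumptions for illustration. The page only states that the window is configurable and that open incidents are matched by key.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEDUP_WINDOW = timedelta(minutes=30)  # assumed value; the real window is configurable


@dataclass
class OpenIncident:
    deduplicationKey: str
    lastSeenAt: datetime
    duplicateCount: int = 0


def dedup_key(signal_type: str, service: str) -> str:
    """Build the {signalType}:{service} key described above."""
    return f"{signal_type}:{service}"


def record_event(
    open_incidents: dict[str, OpenIncident],
    signal_type: str,
    service: str,
    seen_at: datetime,
) -> bool:
    """Return True if a new incident was created, False if the event was merged."""
    key = dedup_key(signal_type, service)
    existing = open_incidents.get(key)
    if existing is not None and seen_at - existing.lastSeenAt <= DEDUP_WINDOW:
        existing.duplicateCount += 1
        existing.lastSeenAt = seen_at
        return False
    # No open match inside the window: start a fresh incident for this key.
    open_incidents[key] = OpenIncident(key, seen_at)
    return True
```

In this sketch, a second `error-rate` event for `billing-service` arriving five minutes after the first is merged, while one arriving two hours later opens a new incident.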
Correlation Rules
Correlation rules define how the system groups related but distinct incidents. Rules are configured per-team and evaluated in priority order.
Each rule has:
| Field | Description |
|---|---|
| name | Human-readable rule name |
| conditions | Match criteria: which fields to compare, time window, optional severity/signal filters |
| action | What to do when incidents match: merge, group, or suppress |
| priority | Evaluation order (higher = evaluated first) |
| enabled | Whether the rule is active |
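As a rough sketch of how these fields fit together, the snippet below models a rule and derives the evaluation order. The `CorrelationRule` class and `evaluation_order` helper are hypothetical names. The only behaviour taken from this page is that disabled rules are skipped and higher priorities are evaluated first.

```python
from dataclasses import dataclass, field


@dataclass
class CorrelationRule:
    name: str
    action: str                 # "merge", "group", or "suppress"
    priority: int               # higher = evaluated first
    enabled: bool = True
    conditions: dict = field(default_factory=dict)


def evaluation_order(rules: list[CorrelationRule]) -> list[CorrelationRule]:
    """Drop disabled rules and sort the rest by descending priority."""
    active = [rule for rule in rules if rule.enabled]
    return sorted(active, key=lambda rule: rule.priority, reverse=True)
```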
Match Conditions
| Condition | Description |
|---|---|
| matchFields | Fields to compare between incidents: service, signalType, severity |
| timeWindowMinutes | How close in time incidents must be to correlate |
| severities | Optional filter: only match incidents of these severities |
| signalTypes | Optional filter: only match specific signal types |
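One way to read these conditions is as a predicate over a pair of incidents. The sketch below makes several assumptions that this page does not confirm: incidents are plain dictionaries with `service`, `signalType`, `severity`, and `createdAt` keys, the optional filters must hold for both incidents, and the time window is measured between `createdAt` values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Conditions:
    matchFields: list[str]
    timeWindowMinutes: int
    severities: Optional[list[str]] = None
    signalTypes: Optional[list[str]] = None


def matches(cond: Conditions, a: dict, b: dict) -> bool:
    """Return True if incidents a and b satisfy every condition."""
    for incident in (a, b):
        if cond.severities and incident["severity"] not in cond.severities:
            return False
        if cond.signalTypes and incident["signalType"] not in cond.signalTypes:
            return False
    # Every listed field must be equal on both incidents.
    if any(a[name] != b[name] for name in cond.matchFields):
        return False
    gap = abs(a["createdAt"] - b["createdAt"])
    return gap <= timedelta(minutes=cond.timeWindowMinutes)
```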
Correlation Actions
| Action | Behaviour |
|---|---|
| merge | Combine into a single incident. The newer incident becomes a child (parentIncidentId is set) |
| group | Link incidents as related without merging. Both remain independent but are queryable via the related incidents API |
| suppress | Discard the newer incident entirely if it matches an existing one |
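The three actions differ in what happens to the newer incident. The following sketch illustrates that difference with plain dictionaries. The `apply_action` function and the `related` lookup table are hypothetical stand-ins, not the storage model actually used.

```python
from typing import Optional


def apply_action(
    action: str,
    existing: dict,
    newer: dict,
    related: dict[str, set[str]],
) -> Optional[dict]:
    """Apply a correlation action; return the newer incident, or None if discarded."""
    if action == "merge":
        # The newer incident becomes a child of the existing one.
        newer["parentIncidentId"] = existing["id"]
        return newer
    if action == "group":
        # Both incidents stay independent; record a two-way link between them.
        related.setdefault(existing["id"], set()).add(newer["id"])
        related.setdefault(newer["id"], set()).add(existing["id"])
        return newer
    if action == "suppress":
        return None
    raise ValueError(f"unknown correlation action: {action}")
```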
Correlation rules can be managed from the Correlation Rules page in the dashboard.
Incident Lifecycle
An incident moves through these statuses:
| Status | Meaning |
|---|---|
| OPEN | Newly detected, not yet investigated |
| INVESTIGATING | An SRE or the AI engine is diagnosing the issue |
| REMEDIATING | A remediation action has been approved and is executing |
| RESOLVED | The fix has been applied and the issue is no longer occurring |
| CLOSED | Final state; the incident is archived |
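The table lists the statuses but does not say which transitions are permitted. The sketch below assumes the simplest reading, a strictly linear progression in the order shown. The real engine may well allow other paths, such as resolving an incident without remediation.

```python
from enum import Enum


class Status(Enum):
    OPEN = "OPEN"
    INVESTIGATING = "INVESTIGATING"
    REMEDIATING = "REMEDIATING"
    RESOLVED = "RESOLVED"
    CLOSED = "CLOSED"


# Enum iteration preserves definition order, which matches the table above.
ORDER = list(Status)


def can_transition(current: Status, target: Status) -> bool:
    """Assumed policy: only a move to the immediately following status is valid."""
    return ORDER.index(target) == ORDER.index(current) + 1
```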
Incident Schema
Each incident contains:
| Field | Description |
|---|---|
| id | Unique identifier (UUID) |
| title | Human-readable summary |
| description | Detailed description of the problem |
| severity | LOW, MEDIUM, HIGH, or CRITICAL |
| status | Current lifecycle status |
| affectedServices | List of impacted service names |
| triggeringEventId | The event that first triggered the incident |
| assigneeId | UUID of the assigned team member (optional) |
| deduplicationKey | Key used for deduplication (e.g. error-rate:billing-service) |
| duplicateCount | Number of duplicate events merged into this incident |
| lastSeenAt | Timestamp of the most recent duplicate event |
| parentIncidentId | UUID of the parent incident if this was correlated via merge |
| externalTicketId | ID of a linked external ticket (Jira, etc.) |
| externalTicketUrl | URL to the external ticket |
| diagnosis | AI-generated diagnosis (optional, see below) |
| resolvedAt | When the incident was resolved (if applicable) |
| createdAt | When the incident was detected |
| updatedAt | Last status change |
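For client code that consumes incidents, the schema might be mirrored as a typed record along these lines. The field names come from the table. The Python types, the defaults, and which fields are treated as optional are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Incident:
    title: str
    description: str
    severity: str                       # LOW, MEDIUM, HIGH, or CRITICAL
    affectedServices: list[str]
    triggeringEventId: str
    deduplicationKey: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "OPEN"
    assigneeId: Optional[str] = None
    duplicateCount: int = 0
    lastSeenAt: Optional[datetime] = None
    parentIncidentId: Optional[str] = None
    externalTicketId: Optional[str] = None
    externalTicketUrl: Optional[str] = None
    diagnosis: Optional[dict] = None
    resolvedAt: Optional[datetime] = None
    createdAt: datetime = field(default_factory=_now)
    updatedAt: datetime = field(default_factory=_now)
```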
AI Diagnosis
When an incident is created, the Incident Engine can request a diagnosis from the AI Engine:
- The AI Engine receives the incident context: recent logs, metrics, affected services.
- It uses LangChain/LangGraph to run a multi-step reasoning process:
  - Gather and summarise evidence.
  - Form a root cause hypothesis.
  - Validate against available data.
  - Optionally request typed diagnostic probe evidence when confidence is low.
  - Retry if validation fails (cyclic graph).
- The result is a structured AiDiagnosis containing:
  - Root cause: what went wrong.
  - Summary: human-readable explanation.
  - Impacted services: which services are affected.
  - Suggested fixes: actionable remediation steps with automation flags.
  - Confidence score: 0 to 1 indicating how confident the AI is.
Probe runs are read-only and allowlisted. They return structured evidence summaries (summary, keyPatterns, rawRef) rather than placing raw, unbounded output in the main incident context.
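The reasoning loop can be approximated in plain Python without any LangGraph code. This is a sketch of the control flow only: the `diagnose` function, the attribute names on `AiDiagnosis`, the `low_confidence` cut-off, and the `max_attempts` limit are all assumptions. The hypothesis, validation, and probe steps are passed in as callables so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SuggestedFix:
    description: str
    automatable: bool        # the "automation flag" mentioned above


@dataclass
class AiDiagnosis:
    rootCause: str
    summary: str
    impactedServices: list[str]
    suggestedFixes: list[SuggestedFix]
    confidence: float        # 0 to 1


def diagnose(
    context: dict,
    hypothesise: Callable[[list[str]], AiDiagnosis],
    validate: Callable[[AiDiagnosis, list[str]], bool],
    probe: Callable[[AiDiagnosis], dict],
    max_attempts: int = 3,
    low_confidence: float = 0.5,
) -> Optional[AiDiagnosis]:
    """Loop: hypothesise, probe when unsure, validate, retry on failure."""
    evidence = list(context["logs"])
    diagnosis = None
    for _ in range(max_attempts):
        diagnosis = hypothesise(evidence)
        if diagnosis.confidence < low_confidence:
            # Only the probe's bounded summary enters the working evidence.
            evidence.append(probe(diagnosis)["summary"])
            continue
        if validate(diagnosis, evidence):
            return diagnosis
    return diagnosis  # best effort after exhausting retries
```

Passing the steps as callables keeps the loop testable with stubs. In a graph framework, the same shape would appear as nodes with a conditional edge back to the hypothesis step.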
Severity Levels
| Severity | Criteria |
|---|---|
| CRITICAL | Complete service outage or data loss risk |
| HIGH | Major degradation affecting many users |
| MEDIUM | Partial degradation or elevated error rates |
| LOW | Minor issues, informational anomalies |
Incident Comments
Team members can add comments to incidents for collaboration and post-incident context. Comments are threaded per-incident and support:
- Adding notes during investigation.
- Recording decisions and observations.
- Deletion of individual comments.
Comments are available via the Incidents API and the incident detail page in the dashboard.
Dashboard Features
The dashboard provides a full-featured incident management interface:
Incident List
- Search — filter by title, service, or description.
- Severity filter — show only incidents of a specific severity level.
- Status filter — filter by lifecycle status (OPEN, INVESTIGATING, etc.).
- Service filter — filter by affected service name.
- Assignee filter — filter by assigned team member.
- Deduplication toggle — group duplicates to show only root incidents.
- Sort — by recency or severity.
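The filters above compose naturally as a single pass over the incident list. The sketch below is a client-side illustration, not the dashboard's query logic. It assumes incidents are dictionaries, that "root incidents" means those without a `parentIncidentId`, and that `createdAt` is an ISO 8601 string so that string ordering matches time ordering.

```python
from typing import Optional

SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def filter_incidents(
    incidents: list[dict],
    severity: Optional[str] = None,
    status: Optional[str] = None,
    service: Optional[str] = None,
    roots_only: bool = False,
    sort_by: str = "recency",
) -> list[dict]:
    """Apply the optional filters, then sort newest-first or most-severe-first."""
    rows = [
        inc for inc in incidents
        if (severity is None or inc["severity"] == severity)
        and (status is None or inc["status"] == status)
        and (service is None or service in inc["affectedServices"])
        and (not roots_only or inc.get("parentIncidentId") is None)
    ]
    if sort_by == "severity":
        return sorted(rows, key=lambda inc: SEVERITY_RANK[inc["severity"]], reverse=True)
    return sorted(rows, key=lambda inc: inc["createdAt"], reverse=True)
```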
Incident Detail
Each incident has a dedicated detail page showing:
- Full incident metadata and timeline.
- AI diagnosis with root cause, confidence score, and suggested fixes.
- Provider-aware diagnostic probes (Kubernetes + AWS) with typed, read-only evidence runs.
- Remediation approval/rejection workflow.
- Related/correlated incidents.
- Comment thread for team collaboration.
- Assignment to team members.
Incident Velocity Chart
The overview dashboard includes an incident velocity chart showing incident frequency over time with configurable ranges:
- 24h — last 24 hours (hourly buckets).
- 7d — last 7 days.
- All — last 30 days.
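The hourly bucketing behind the 24h range can be illustrated with a small counting function. How the chart actually aligns its buckets is not stated here. This sketch assumes buckets are aligned to the top of the hour, that the current partial hour is the last bucket, and that empty hours are reported as zero.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_buckets(created_at: list[datetime], now: datetime) -> list[tuple[datetime, int]]:
    """Count incidents per hour for the 24 hourly buckets ending at the current hour."""
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    start = current_hour - timedelta(hours=23)
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0)
        for ts in created_at
        if start <= ts <= now
    )
    # Emit every bucket, including empty ones, so the series has no gaps.
    return [
        (start + timedelta(hours=h), counts.get(start + timedelta(hours=h), 0))
        for h in range(24)
    ]
```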
Notifications
When an incident is created or its status changes, the core-platform notification flow dispatches alerts to configured channels:
- Slack — posts to a configured channel with incident details.
- Discord — posts incident updates to a Discord channel.
- Jira — core-platform creates or updates a ticket (using encrypted per-org Jira credentials).
- Email — sends an alert to the configured recipient.
- Webhook — fires a generic HTTP POST to a custom endpoint.
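For the generic webhook channel, a receiver can expect an HTTP POST carrying incident data. The payload format is not documented on this page, so the JSON shape, the `event` field, and the event name in the sketch below are purely illustrative assumptions. It shows how such a request could be built with the Python standard library.

```python
import json
import urllib.request


def build_webhook_request(url: str, incident: dict, event: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST describing an incident event."""
    # Hypothetical payload shape: an event name plus the incident fields.
    body = json.dumps({"event": event, "incident": incident}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a matter of passing the request to `urllib.request.urlopen(request, timeout=5)`. Keeping construction separate from delivery makes the payload easy to inspect in tests.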