Incidents

Incidents are the core unit of work in Operyn. Each incident represents a detected problem that requires investigation and, potentially, remediation.

How Events Become Incidents

The flow from raw telemetry to an incident:

  1. Ingestion: logs and metrics arrive via the Ingestion API and are enqueued.
  2. Detection: the Incident Engine consumes events and evaluates them against detection rules.
  3. Deduplication: events matching an existing open incident's deduplication key are merged rather than creating a new incident.
  4. Correlation: related events (same service, same time window, same error pattern) are grouped using configurable correlation rules.
  5. Creation: when thresholds are breached and no duplicate exists, a new incident is created in PostgreSQL.

Detection Rules

The Incident Engine applies configurable rules to determine when events constitute an incident:

  • Error rate threshold — e.g. if error-level logs for a service exceed N per minute.
  • Metric anomaly — e.g. CPU utilisation above ANOMALY_CPU_THRESHOLD (default 90%).
  • Pattern matching — known failure signatures in log messages.
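
The three rule types above can be sketched as a single evaluation function. This is an illustration only, not the Incident Engine's implementation: the value chosen for N, the failure signatures, the ServiceWindow shape, and the "pattern-match" signal name are all assumptions. Only the 90% CPU default and the error-rate and cpu-anomaly names come from this page.

```python
from dataclasses import dataclass
from typing import List

# Default taken from the text above: CPU utilisation above 90% is anomalous.
ANOMALY_CPU_THRESHOLD = 90.0

# Hypothetical value for "N per minute"; this page does not specify N.
ERROR_RATE_THRESHOLD_PER_MINUTE = 50

# Hypothetical failure signatures for the pattern-matching rule.
FAILURE_SIGNATURES = ("OutOfMemoryError", "connection refused")


@dataclass
class ServiceWindow:
    """One minute of telemetry for a single service (illustrative shape)."""
    service: str
    error_log_count: int
    cpu_percent: float
    log_messages: List[str]


def detect_signals(window: ServiceWindow) -> List[str]:
    """Return the signal types whose rule fires for this window."""
    signals = []
    if window.error_log_count > ERROR_RATE_THRESHOLD_PER_MINUTE:
        signals.append("error-rate")
    if window.cpu_percent > ANOMALY_CPU_THRESHOLD:
        signals.append("cpu-anomaly")
    if any(sig in msg for msg in window.log_messages for sig in FAILURE_SIGNATURES):
        signals.append("pattern-match")  # hypothetical signal name
    return signals
```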

Deduplication

When a new event triggers incident creation, the engine first checks whether an open incident with the same deduplication key already exists within a configurable time window.

The deduplication key is generated from the signal type and affected service:

{signalType}:{service}

For example, error-rate:billing-service or cpu-anomaly:api-gateway.

If a matching open incident is found:

  • The incident's duplicateCount is incremented.
  • The lastSeenAt timestamp is updated.
  • No new incident is created.

This prevents alert storms from flooding the incident list with identical entries.
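
A minimal sketch of this check, assuming an in-memory map of open incidents keyed by deduplication key. The 30-minute window is a made-up value, and measuring the window from lastSeenAt is an assumption; the page only says the window is configurable. Field names mirror the incident schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict

# Hypothetical window length; the real value is configurable.
DEDUP_WINDOW = timedelta(minutes=30)


@dataclass
class Incident:
    deduplicationKey: str
    status: str
    lastSeenAt: datetime
    duplicateCount: int = 0


def dedup_key(signal_type: str, service: str) -> str:
    """Build the {signalType}:{service} key described above."""
    return f"{signal_type}:{service}"


def record_event(open_incidents: Dict[str, Incident], signal_type: str,
                 service: str, seen_at: datetime) -> Incident:
    """Merge the event into a matching open incident, or create a new one."""
    key = dedup_key(signal_type, service)
    existing = open_incidents.get(key)
    # Assumption: the window is measured from the last time the key was seen.
    if existing is not None and seen_at - existing.lastSeenAt <= DEDUP_WINDOW:
        existing.duplicateCount += 1
        existing.lastSeenAt = seen_at
        return existing
    incident = Incident(key, "OPEN", seen_at)
    open_incidents[key] = incident
    return incident
```

Calling record_event twice within the window for error-rate and billing-service returns the same Incident object with duplicateCount set to 1; a call outside the window creates a fresh incident.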

Correlation Rules

Correlation rules define how the system groups related but distinct incidents. Rules are configured per-team and evaluated in priority order.

Each rule has:

Field       Description
name        Human-readable rule name
conditions  Match criteria: which fields to compare, time window, optional severity/signal filters
action      What to do when incidents match: merge, group, or suppress
priority    Evaluation order (higher = evaluated first)
enabled     Whether the rule is active

Match Conditions

Condition          Description
matchFields        Fields to compare between incidents: service, signalType, severity
timeWindowMinutes  How close in time incidents must be to correlate
severities         Optional filter: only match incidents of these severities
signalTypes        Optional filter: only match specific signal types
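
To make the conditions concrete, here is one way they could be evaluated against a pair of incidents. It is a sketch under stated assumptions: the IncidentView shape is invented for the example, the time gap is measured between createdAt values, and the optional filters are applied to both incidents. The actual evaluator may differ on each of these points.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IncidentView:
    """Only the fields the match conditions look at (illustrative shape)."""
    service: str
    signalType: str
    severity: str
    createdAt: datetime


@dataclass
class MatchConditions:
    matchFields: List[str]
    timeWindowMinutes: int
    severities: Optional[List[str]] = None
    signalTypes: Optional[List[str]] = None


def matches(cond: MatchConditions, existing: IncidentView, newer: IncidentView) -> bool:
    """Return True when the two incidents satisfy every condition."""
    # Assumption: optional filters must hold for both incidents.
    for inc in (existing, newer):
        if cond.severities is not None and inc.severity not in cond.severities:
            return False
        if cond.signalTypes is not None and inc.signalType not in cond.signalTypes:
            return False
    # Assumption: "close in time" compares the two creation timestamps.
    if abs(newer.createdAt - existing.createdAt) > timedelta(minutes=cond.timeWindowMinutes):
        return False
    return all(getattr(existing, name) == getattr(newer, name) for name in cond.matchFields)
```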

Correlation Actions

Action    Behaviour
merge     Combine into a single incident. The newer incident becomes a child (parentIncidentId is set)
group     Link incidents as related without merging. Both remain independent but are queryable via the related incidents API
suppress  Discard the newer incident entirely if it matches an existing one

Correlation rules can be managed from the Correlation Rules page in the dashboard.
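
The following sketch shows how priority ordering and the three actions might fit together. Several details are assumptions rather than documented behaviour: the first matching rule wins, the rule's conditions are represented by a plain callable, and relatedIncidentIds is a hypothetical field standing in for whatever the related incidents API stores.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    name: str
    action: str        # "merge", "group", or "suppress"
    priority: int      # higher = evaluated first
    enabled: bool
    matches: Callable[[Dict, Dict], bool]  # stand-in for the rule's conditions


def correlate(rules: List[Rule], existing: Dict, newer: Dict) -> Optional[Dict]:
    """Apply the first enabled rule that matches, highest priority first.

    Returns the newer incident, or None when it is suppressed.
    """
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if not rule.enabled or not rule.matches(existing, newer):
            continue
        if rule.action == "merge":
            # The newer incident becomes a child of the existing one.
            newer["parentIncidentId"] = existing["id"]
        elif rule.action == "group":
            # Hypothetical field: link both incidents without merging them.
            existing.setdefault("relatedIncidentIds", []).append(newer["id"])
            newer.setdefault("relatedIncidentIds", []).append(existing["id"])
        elif rule.action == "suppress":
            return None
        return newer
    return newer  # no rule matched; the incident is left untouched
```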

Incident Lifecycle

An incident moves through these statuses:

Status         Meaning
OPEN           Newly detected, not yet investigated
INVESTIGATING  An SRE or the AI engine is diagnosing the issue
REMEDIATING    A remediation action has been approved and is executing
RESOLVED       The fix has been applied and the issue is no longer occurring
CLOSED         Final state; the incident is archived
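
The page lists the statuses but not the permitted transitions between them. The sketch below assumes a forward-only policy (an incident can skip a status but never move backwards), which is a guess made for illustration. The updatedAt and resolvedAt handling follows the field descriptions in the incident schema.

```python
from datetime import datetime

# Status names in the order listed above.
LIFECYCLE = ("OPEN", "INVESTIGATING", "REMEDIATING", "RESOLVED", "CLOSED")


def can_transition(current: str, target: str) -> bool:
    """Assumed policy: an incident may only move forward in the lifecycle."""
    return LIFECYCLE.index(target) > LIFECYCLE.index(current)


def advance(incident: dict, target: str, now: datetime) -> dict:
    """Move an incident to a later status and stamp the change."""
    if not can_transition(incident["status"], target):
        raise ValueError(f"cannot move from {incident['status']} to {target}")
    incident["status"] = target
    incident["updatedAt"] = now          # schema: last status change
    if target == "RESOLVED":
        incident["resolvedAt"] = now     # schema: when the incident was resolved
    return incident
```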

Incident Schema

Each incident contains:

Field              Description
id                 Unique identifier (UUID)
title              Human-readable summary
description        Detailed description of the problem
severity           LOW, MEDIUM, HIGH, or CRITICAL
status             Current lifecycle status
affectedServices   List of impacted service names
triggeringEventId  The event that first triggered the incident
assigneeId         UUID of the assigned team member (optional)
deduplicationKey   Key used for deduplication (e.g. error-rate:billing-service)
duplicateCount     Number of duplicate events merged into this incident
lastSeenAt         Timestamp of the most recent duplicate event
parentIncidentId   UUID of the parent incident if this was correlated via merge
externalTicketId   ID of a linked external ticket (Jira, etc.)
externalTicketUrl  URL to the external ticket
diagnosis          AI-generated diagnosis (optional, see below)
resolvedAt         When the incident was resolved (if applicable)
createdAt          When the incident was detected
updatedAt          Last status change
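
For readers who prefer a typed view, the schema could be modelled roughly as follows. The field names come from the table; the Python types, the defaults, and which fields are optional are assumptions, since the page does not give the storage types.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional
from uuid import UUID, uuid4


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Incident:
    # Required at creation time (assumed).
    title: str
    description: str
    severity: str                      # LOW, MEDIUM, HIGH, or CRITICAL
    affectedServices: List[str]
    triggeringEventId: str             # type assumed; may be a UUID in practice
    deduplicationKey: str              # e.g. "error-rate:billing-service"
    # Generated or defaulted (assumed defaults).
    id: UUID = field(default_factory=uuid4)
    status: str = "OPEN"
    assigneeId: Optional[UUID] = None
    duplicateCount: int = 0
    lastSeenAt: Optional[datetime] = None
    parentIncidentId: Optional[UUID] = None
    externalTicketId: Optional[str] = None
    externalTicketUrl: Optional[str] = None
    diagnosis: Optional[dict] = None
    resolvedAt: Optional[datetime] = None
    createdAt: datetime = field(default_factory=_now)
    updatedAt: datetime = field(default_factory=_now)
```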

AI Diagnosis

When an incident is created, the Incident Engine can request a diagnosis from the AI Engine:

  1. The AI Engine receives the incident context: recent logs, metrics, affected services.
  2. It uses LangChain/LangGraph to run a multi-step reasoning process:
    • Gather and summarise evidence.
    • Form a root cause hypothesis.
    • Validate against available data.
    • Optionally request typed diagnostic probe evidence when confidence is low.
    • Retry if validation fails (cyclic graph).
  3. The result is a structured AiDiagnosis containing:
    • Root cause — what went wrong.
    • Summary — human-readable explanation.
    • Impacted services — which services are affected.
    • Suggested fixes — actionable remediation steps with automation flags.
    • Confidence score — 0 to 1 indicating how confident the AI is.

Probe runs are read-only and restricted to an allowlist. They return structured evidence summaries (summary, keyPatterns, rawRef) rather than placing raw, unbounded output in the main incident context.
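
The shape of the result and the retry cycle can be sketched in plain Python. This deliberately avoids LangChain/LangGraph calls and only mimics the control flow. The attribute names on AiDiagnosis and SuggestedFix, the max_attempts limit, and the hypothesise and validate callables are all hypothetical; only the five listed components and the 0 to 1 confidence range come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SuggestedFix:
    description: str
    automatable: bool  # the "automation flag" (name assumed)


@dataclass
class AiDiagnosis:
    rootCause: str
    summary: str
    impactedServices: List[str]
    suggestedFixes: List[SuggestedFix]
    confidence: float  # 0 to 1

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def diagnose(hypothesise: Callable[[int], AiDiagnosis],
             validate: Callable[[AiDiagnosis], bool],
             max_attempts: int = 3) -> AiDiagnosis:
    """Form a hypothesis, validate it, and retry while validation fails."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result = hypothesise(1)
    attempt = 1
    while not validate(result) and attempt < max_attempts:
        attempt += 1
        result = hypothesise(attempt)
    # After the last attempt the most recent hypothesis is returned as-is.
    return result
```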

Severity Levels

Severity  Criteria
CRITICAL  Complete service outage or data loss risk
HIGH      Major degradation affecting many users
MEDIUM    Partial degradation or elevated error rates
LOW       Minor issues, informational anomalies

Incident Comments

Team members can add comments to incidents for collaboration and post-incident context. Comments are threaded per-incident and support:

  • Adding notes during investigation.
  • Recording decisions and observations.
  • Deleting individual comments.

Comments are available via the Incidents API and the incident detail page in the dashboard.

Dashboard Features

The dashboard provides an incident management interface made up of the views described below.

Incident List

  • Search — filter by title, service, or description.
  • Severity filter — show only incidents of a specific severity level.
  • Status filter — filter by lifecycle status (OPEN, INVESTIGATING, etc.).
  • Service filter — filter by affected service name.
  • Assignee filter — filter by assigned team member.
  • Deduplication toggle — group duplicates to show only root incidents.
  • Sort — by recency or severity.
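
The filters above amount to a chain of predicates followed by a sort. The sketch below applies them to a list of incident dictionaries. It is not the dashboard's query code: the function and parameter names are invented, search is assumed to be a case-insensitive substring match, and the deduplication toggle is assumed to hide incidents that have a parentIncidentId.

```python
from typing import Dict, List, Optional

# Ordering used when sorting by severity.
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def list_incidents(incidents: List[Dict], search: Optional[str] = None,
                   severity: Optional[str] = None, status: Optional[str] = None,
                   service: Optional[str] = None, assignee_id: Optional[str] = None,
                   roots_only: bool = False, sort_by: str = "recency") -> List[Dict]:
    """Filter and sort incidents the way the list view describes."""
    def keep(inc: Dict) -> bool:
        if search:
            text = " ".join([inc["title"], inc["description"], *inc["affectedServices"]])
            if search.lower() not in text.lower():
                return False
        if severity and inc["severity"] != severity:
            return False
        if status and inc["status"] != status:
            return False
        if service and service not in inc["affectedServices"]:
            return False
        if assignee_id and inc.get("assigneeId") != assignee_id:
            return False
        # Assumption: a root incident is one without a parent.
        if roots_only and inc.get("parentIncidentId") is not None:
            return False
        return True

    kept = [inc for inc in incidents if keep(inc)]
    if sort_by == "severity":
        return sorted(kept, key=lambda inc: SEVERITY_RANK[inc["severity"]], reverse=True)
    return sorted(kept, key=lambda inc: inc["createdAt"], reverse=True)
```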

Incident Detail

Each incident has a dedicated detail page showing:

  • Full incident metadata and timeline.
  • AI diagnosis with root cause, confidence score, and suggested fixes.
  • Provider-aware diagnostic probes (Kubernetes + AWS) with typed, read-only evidence runs.
  • Remediation approval/rejection workflow.
  • Related/correlated incidents.
  • Comment thread for team collaboration.
  • Assignment to team members.

Incident Velocity Chart

The overview dashboard includes an incident velocity chart showing incident frequency over time with configurable ranges:

  • 24h — last 24 hours (hourly buckets).
  • 7d — last 7 days.
  • All — last 30 days.
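
As an illustration of the 24h range, incident creation times can be grouped into hourly buckets like this. The function name is hypothetical, and hours with no incidents are simply omitted here; a real chart would probably fill them with zeros. Bucket sizes for the 7d and All ranges are not stated on this page, so they are left out.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable


def hourly_velocity(created_at: Iterable[datetime], now: datetime) -> Dict[datetime, int]:
    """Count incidents per hour for the 24 hours ending at `now`."""
    start = now - timedelta(hours=24)
    counts = Counter(
        # Truncate each timestamp to the start of its hour.
        ts.replace(minute=0, second=0, microsecond=0)
        for ts in created_at
        if start < ts <= now
    )
    return dict(sorted(counts.items()))
```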

Notifications

When an incident is created or its status changes, the core-platform notification flow dispatches alerts to configured channels:

  • Slack — posts to a configured channel with incident details.
  • Discord — posts incident updates to a Discord channel.
  • Jira — core-platform creates or updates a ticket (using encrypted per-org Jira credentials).
  • Email — sends an alert to the configured recipient.
  • Webhook — fires a generic HTTP POST to a custom endpoint.
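
For the generic webhook channel, a receiving endpoint would get an HTTP POST. The payload format is not documented here, so the JSON shape and the event name below are purely illustrative assumptions. The sketch only builds the request with the standard library and does not send it.

```python
import json
import urllib.request
from typing import Dict


def build_webhook_request(url: str, incident: Dict, event: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST describing an incident change."""
    payload = {
        "event": event,  # hypothetical event name, e.g. "incident.created"
        "incident": {
            "id": incident["id"],
            "title": incident["title"],
            "severity": incident["severity"],
            "status": incident["status"],
        },
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would perform the actual delivery.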