AI Workflows
The AI workflows inside core-platform are responsible for diagnosing incidents. They analyse logs, metrics, and context to determine root causes, assess impact, and suggest remediation actions.
Active Investigation Model
Operyn now supports an active investigation path in addition to passive telemetry analysis.
- The AI can request typed diagnostic tools when confidence is low.
- Tool execution is read-only, allowlisted, and incident-scoped.
- Outputs are shaped into structured evidence (summary, keyPatterns, rawRef) instead of raw unbounded logs.
- Each execution is audited with tool type, sanitized parameters, timing, and status.
This keeps the workflow safe while enabling investigator-style diagnosis.
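The guard rails listed above can be pictured with a small sketch. It is not the actual tool runner: the tool names, the sensitive-key list, the `rawRef` format, and the `runTool` signature are all hypothetical, and only the evidence fields (`summary`, `keyPatterns`, `rawRef`) and the audited attributes come from the description.

```typescript
// Sketch only: tool names, sensitive keys, and the rawRef format are hypothetical.
interface ToolRequest {
  incidentId: string;
  tool: string;
  params: Record<string, string>;
}

interface Evidence {
  summary: string;
  keyPatterns: string[];
  rawRef: string; // pointer to the full output, not the output itself
}

interface AuditEntry {
  tool: string;
  params: Record<string, string>; // sanitized copy
  durationMs: number;
  status: "ok" | "rejected" | "error";
}

const ALLOWED_TOOLS = ["log-search", "metric-query"];
const SENSITIVE_KEYS = ["token", "password"];

function sanitize(params: Record<string, string>): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const [key, value] of Object.entries(params)) {
    clean[key] = SENSITIVE_KEYS.includes(key) ? "[redacted]" : value;
  }
  return clean;
}

function runTool(
  req: ToolRequest,
  activeIncidentId: string,
  readLines: (req: ToolRequest) => string[], // read-only data source
  audit: AuditEntry[],
): Evidence | null {
  const started = Date.now();
  const record = (status: AuditEntry["status"]): void => {
    audit.push({
      tool: req.tool,
      params: sanitize(req.params),
      durationMs: Date.now() - started,
      status,
    });
  };

  // Reject anything off the allowlist or outside the current incident.
  if (!ALLOWED_TOOLS.includes(req.tool) || req.incidentId !== activeIncidentId) {
    record("rejected");
    return null;
  }

  try {
    const lines = readLines(req);
    // Keep a bounded set of distinct notable lines instead of the raw output.
    const notable = Array.from(
      new Set(lines.filter((line) => /error|timeout/i.test(line))),
    ).slice(0, 5);
    record("ok");
    return {
      summary: `${lines.length} lines scanned, ${notable.length} distinct notable patterns`,
      keyPatterns: notable,
      rawRef: `evidence://${req.incidentId}/${req.tool}`,
    };
  } catch (_err) {
    record("error");
    return null;
  }
}
```

Every call leaves exactly one audit entry, including rejected ones, so the audit trail also shows what the AI asked for and was refused.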
How Diagnosis Works
When the incidents module creates a new incident, it sends a diagnosis request to the AI workflows with:
- The incident details (title, severity, affected services).
- Recent log events from the affected services.
- Recent metric data points.
The AI workflows process this through a multi-step reasoning pipeline:
Gather Context → Build Prompt → LLM Analysis → Validate Output → Return Diagnosis
       ▲                                              │
       └──────────── Retry on validation failure ─────┘
When needed, the workflow can enrich context via the diagnostic tool runner:
Detect low confidence → Request typed tool run → Receive structured evidence → Refine diagnosis
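The enrichment path above can be sketched as a bounded loop. The threshold of 0.7, the cap of two tool runs, and the function names are assumptions made for illustration; the source does not state the real values.

```typescript
interface Evidence {
  summary: string;
  keyPatterns: string[];
  rawRef: string;
}

interface Draft {
  rootCause: string;
  confidence: number;
}

const CONFIDENCE_THRESHOLD = 0.7; // assumed value
const MAX_TOOL_RUNS = 2; // assumed cap so the loop always terminates

function diagnoseWithTools(
  analyse: (evidence: Evidence[]) => Draft,
  requestTool: (run: number) => Evidence | null, // null: runner refused or failed
): { draft: Draft; evidence: Evidence[] } {
  const evidence: Evidence[] = [];
  let draft = analyse(evidence);

  for (let run = 0; run < MAX_TOOL_RUNS && draft.confidence < CONFIDENCE_THRESHOLD; run++) {
    const extra = requestTool(run);
    if (extra === null) break; // keep the best draft so far
    evidence.push(extra);
    draft = analyse(evidence); // refine with the new structured evidence
  }

  return { draft, evidence };
}
```

Capping the number of tool runs keeps a persistently uncertain diagnosis from looping indefinitely; the workflow returns the latest draft together with whatever evidence it gathered.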
Diagnosis Status Tracking
The system tracks the lifecycle of an AI diagnosis to provide a seamless user experience and persist progress across sessions:
| Status | Description |
|---|---|
| PENDING | Initial state. No diagnosis has been requested yet. |
| IN_PROGRESS | AI analysis is currently processing logs and metrics. |
| COMPLETED | Diagnosis finished successfully and results are available. |
| FAILED | An error occurred during diagnosis (e.g. LLM timeout). |
Status updates are broadcast in real time via Server-Sent Events (SSE), so the dashboard moves from a "Generate" button to a progress spinner and finally to the result card without a manual refresh.
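A minimal sketch of how the four statuses could drive both the SSE payload and the dashboard view. The event name `diagnosis-status`, the allowed transitions, the retry from FAILED, and the `error-with-retry` view are assumptions; only the status names and the button, spinner, and result card sequence come from the description above.

```typescript
type DiagnosisStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED";

// Assumed transitions: the table lists the states, not the allowed moves.
const NEXT: Record<DiagnosisStatus, DiagnosisStatus[]> = {
  PENDING: ["IN_PROGRESS"],
  IN_PROGRESS: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: ["IN_PROGRESS"], // assumed: a failed diagnosis can be requested again
};

function canTransition(from: DiagnosisStatus, to: DiagnosisStatus): boolean {
  return NEXT[from].includes(to);
}

// Formats one message in the SSE wire format: "event:" and "data:" fields
// terminated by a blank line.
function toSseMessage(incidentId: string, status: DiagnosisStatus): string {
  const data = JSON.stringify({ incidentId, diagnosisStatus: status });
  return `event: diagnosis-status\ndata: ${data}\n\n`;
}

type DashboardView = "generate-button" | "spinner" | "result-card" | "error-with-retry";

function viewFor(status: DiagnosisStatus): DashboardView {
  switch (status) {
    case "PENDING":
      return "generate-button";
    case "IN_PROGRESS":
      return "spinner";
    case "COMPLETED":
      return "result-card";
    case "FAILED":
      return "error-with-retry"; // assumed: the source does not describe this view
  }
}
```

Because the status is stored with the incident and not only pushed over the stream, a client that reconnects can call `viewFor` on the persisted value and resume in the right state.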
Multi-step Reasoning with LangGraph
The AI module uses LangGraph (a stateful graph framework built on LangChain) to implement the diagnosis flow as a state machine. This enables:
- Cycles: if the LLM output fails Zod validation or has low confidence, the graph loops back to retry with refined context.
- State persistence: the
DiagnosticStateobject tracks logs, metrics, intermediate LLM responses, and validation results across steps. - Observability: graph traces show exactly where the reasoning process went and where it may have failed.
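To make the cycle concrete, here is a hand-rolled sketch of the analyse and validate loop. It deliberately does not use the LangGraph or Zod APIs: the state fields are a reduced, assumed version of DiagnosticState, `parseOutput` is a plain stand-in for schema validation, and the limit of three attempts is an assumption.

```typescript
// Reduced, assumed shape; the real DiagnosticState also carries metrics and more.
interface DiagnosticState {
  logs: string[];
  llmResponses: string[];
  validationErrors: string[];
  rootCause: string | null;
  trace: string[]; // visited nodes, for observability
}

type Parsed = { ok: true; rootCause: string } | { ok: false; error: string };

// Stand-in for schema validation of the LLM output.
function parseOutput(raw: string): Parsed {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (_err) {
    return { ok: false, error: "output is not valid JSON" };
  }
  if (typeof value === "object" && value !== null) {
    const rootCause = (value as { rootCause?: unknown }).rootCause;
    if (typeof rootCause === "string" && rootCause.length > 0) {
      return { ok: true, rootCause };
    }
  }
  return { ok: false, error: "rootCause is missing or empty" };
}

function runDiagnosisGraph(
  logs: string[],
  callLlm: (state: DiagnosticState) => string,
  maxAttempts = 3, // assumed retry limit
): DiagnosticState {
  let state: DiagnosticState = {
    logs,
    llmResponses: [],
    validationErrors: [],
    rootCause: null,
    trace: [],
  };

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // "analyse" node: the LLM sees earlier responses and validation errors.
    const raw = callLlm(state);
    state = { ...state, llmResponses: [...state.llmResponses, raw], trace: [...state.trace, "analyse"] };

    // "validate" node
    const result = parseOutput(raw);
    state = { ...state, trace: [...state.trace, "validate"] };
    if (result.ok) return { ...state, rootCause: result.rootCause };

    // Conditional edge back to "analyse", carrying the error as extra context.
    state = { ...state, validationErrors: [...state.validationErrors, result.error] };
  }
  return state; // rootCause stays null when every attempt fails validation
}
```

The `trace` array plays the role of the graph trace: after a run it shows how many times the loop went through each node and where it stopped.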
LLM Providers
Operyn supports multiple LLM providers, configured via the LLM_PROVIDER environment variable:
| Provider | Value | Notes |
|---|---|---|
| OpenAI | openai | GPT-4o and similar models |
| Anthropic | anthropic | Claude models |
| Google Gemini | gemini | Gemini Pro and similar |
| Mock | mock | Returns simulated diagnoses for testing |
The domain logic depends only on LangChain's BaseChatModel interface, making provider swaps a single config change.
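The sketch below illustrates why depending on one interface keeps provider swaps cheap. `ChatModel` is a simplified stand-in for LangChain's BaseChatModel (the real interface is richer and does not return plain strings), and `parseProvider`, `createChatModel`, and the factory registry are hypothetical names.

```typescript
// Stand-in for LangChain's BaseChatModel, reduced to one string-in,
// string-out method for this sketch.
interface ChatModel {
  invoke(prompt: string): Promise<string>;
}

type Provider = "openai" | "anthropic" | "gemini" | "mock";
const PROVIDERS: readonly Provider[] = ["openai", "anthropic", "gemini", "mock"];

function parseProvider(value: string | undefined): Provider {
  if (value === undefined || value === "") return "mock"; // documented default
  const match = PROVIDERS.find((p) => p === value);
  if (match === undefined) throw new Error(`Unsupported LLM_PROVIDER: ${value}`);
  return match;
}

class MockChatModel implements ChatModel {
  async invoke(prompt: string): Promise<string> {
    return JSON.stringify({ rootCause: "simulated root cause", promptLength: prompt.length });
  }
}

// Hypothetical registry: real entries would construct the provider-specific model.
type Factories = Partial<Record<Provider, () => ChatModel>>;

function createChatModel(provider: Provider, factories: Factories): ChatModel {
  const factory = factories[provider];
  if (factory === undefined) throw new Error(`No factory registered for ${provider}`);
  return factory();
}

// Domain code sees only the interface, never a concrete provider.
async function diagnose(model: ChatModel, incidentTitle: string): Promise<string> {
  return model.invoke(`Diagnose incident: ${incidentTitle}`);
}
```

With this layout, `diagnose` never changes when the provider does; only the value passed to `parseProvider` and the registered factory differ.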
Diagnosis Output
The AI workflows return a structured AiDiagnosis:
| Field | Type | Description |
|---|---|---|
| incidentId | string | The incident being diagnosed |
| rootCause | string | Identified root cause |
| summary | string | Human-readable summary of the diagnosis |
| impactedServices | string[] | Services affected by the issue |
| suggestedFixes | SuggestedFix[] | Actionable remediation steps |
| diagnosisStatus | string | Current state: PENDING, IN_PROGRESS, COMPLETED, FAILED |
| diagnosis | AiDiagnosis | The structured diagnosis object (only if COMPLETED) |
| createdAt | string | When the diagnosis was generated |
| confidenceScore | number | Confidence in the diagnosis (0.87 in the example below) |
Suggested Fixes
Each suggested fix contains:
| Field | Type | Description |
|---|---|---|
| action | string | Short description of the fix |
| description | string | Detailed explanation |
| automatable | boolean | Whether this can be automated by the remediation workflows |
| remediationActionType | string | Maps to a supported remediation action type (optional) |
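As a rough illustration, the two tables can be written down as TypeScript types with a runtime check. This is a sketch under assumptions: it uses plain type guards as a stand-in for the project's Zod schema, leaves out the diagnosisStatus and diagnosis rows because the example payload does not carry them, and treats confidenceScore (taken from the example payload) as optional.

```typescript
interface SuggestedFix {
  action: string;
  description: string;
  automatable: boolean;
  remediationActionType?: string;
}

interface AiDiagnosis {
  incidentId: string;
  rootCause: string;
  summary: string;
  impactedServices: string[];
  suggestedFixes: SuggestedFix[];
  confidenceScore?: number; // assumed optional
  createdAt: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isSuggestedFix(value: unknown): value is SuggestedFix {
  if (!isRecord(value)) return false;
  const type = value["remediationActionType"];
  return (
    typeof value["action"] === "string" &&
    typeof value["description"] === "string" &&
    typeof value["automatable"] === "boolean" &&
    (type === undefined || typeof type === "string")
  );
}

function isAiDiagnosis(value: unknown): value is AiDiagnosis {
  if (!isRecord(value)) return false;
  const services = value["impactedServices"];
  const fixes = value["suggestedFixes"];
  const score = value["confidenceScore"];
  return (
    typeof value["incidentId"] === "string" &&
    typeof value["rootCause"] === "string" &&
    typeof value["summary"] === "string" &&
    Array.isArray(services) &&
    services.every((s) => typeof s === "string") &&
    Array.isArray(fixes) &&
    fixes.every(isSuggestedFix) &&
    (score === undefined || typeof score === "number") &&
    typeof value["createdAt"] === "string"
  );
}
```

A check like this is what decides whether an LLM response is accepted or sent back through the retry loop.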
Example Diagnosis
For an incident titled "Spike in payment failures on billing-service", the AI workflows might return:
{
  "incidentId": "inc-123",
  "rootCause": "A recent deployment of billing-service v2.4.1 introduced a regression in the payment gateway client, causing connection pool exhaustion under load.",
  "summary": "billing-service v2.4.1 has a connection pool bug that causes payment timeouts when request volume exceeds 500 req/s. Rolling back to v2.4.0 and restarting pods should resolve the issue.",
  "impactedServices": ["billing-service", "checkout-service"],
  "suggestedFixes": [
    {
      "action": "Rollback billing-service to v2.4.0",
      "description": "Revert the deployment to the last known good version to restore payment processing.",
      "automatable": true,
      "remediationActionType": "rollback-deployment"
    },
    {
      "action": "Restart billing-service pods",
      "description": "Clear exhausted connection pools by restarting all pods.",
      "automatable": true,
      "remediationActionType": "restart-service"
    }
  ],
  "confidenceScore": 0.87,
  "createdAt": "2026-03-11T10:15:23.123Z"
}
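A consumer of this payload might pick out the fixes that can be handed to the remediation workflows. The helper name `automatableActionTypes` is hypothetical, and the rule that a fix needs both `automatable: true` and a `remediationActionType` is an assumption.

```typescript
interface SuggestedFix {
  action: string;
  description: string;
  automatable: boolean;
  remediationActionType?: string;
}

// Assumed rule: only fixes flagged automatable AND mapped to an action type
// are eligible for automated remediation.
function automatableActionTypes(fixes: SuggestedFix[]): string[] {
  const types: string[] = [];
  for (const fix of fixes) {
    if (fix.automatable && fix.remediationActionType !== undefined) {
      types.push(fix.remediationActionType);
    }
  }
  return types;
}

const exampleFixes: SuggestedFix[] = [
  {
    action: "Rollback billing-service to v2.4.0",
    description: "Revert the deployment to the last known good version to restore payment processing.",
    automatable: true,
    remediationActionType: "rollback-deployment",
  },
  {
    action: "Restart billing-service pods",
    description: "Clear exhausted connection pools by restarting all pods.",
    automatable: true,
    remediationActionType: "restart-service",
  },
];

console.log(automatableActionTypes(exampleFixes).join(", "));
// prints "rollback-deployment, restart-service"
```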
Configuration
| Variable | Default | Description |
|---|---|---|
| LLM_PROVIDER | mock | Which LLM provider to use |
| OPENAI_API_KEY | — | API key for OpenAI |
| ANTHROPIC_API_KEY | — | API key for Anthropic |
| GEMINI_API_KEY | — | API key for Google Gemini |
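One way to read these variables at startup is sketched below. The `loadLlmConfig` helper is hypothetical, and the rule that the key matching the selected provider must be set is an assumption; the table states only what each key is for. The environment is passed in as a plain object so the function stays easy to test.

```typescript
type Env = Record<string, string | undefined>;

interface LlmConfig {
  provider: string;
  apiKey: string | null; // null for the mock provider
}

// Which variable holds the key for each provider; mock needs none.
const KEY_VARIABLE = new Map<string, string | null>([
  ["openai", "OPENAI_API_KEY"],
  ["anthropic", "ANTHROPIC_API_KEY"],
  ["gemini", "GEMINI_API_KEY"],
  ["mock", null],
]);

function loadLlmConfig(env: Env): LlmConfig {
  const provider = env["LLM_PROVIDER"] || "mock"; // documented default
  const keyVariable = KEY_VARIABLE.get(provider);
  if (keyVariable === undefined) {
    throw new Error(`Unsupported LLM_PROVIDER: ${provider}`);
  }
  if (keyVariable === null) {
    return { provider, apiKey: null };
  }
  const apiKey = env[keyVariable];
  if (apiKey === undefined || apiKey === "") {
    // Assumed behaviour: fail fast instead of failing on the first diagnosis.
    throw new Error(`${keyVariable} must be set when LLM_PROVIDER=${provider}`);
  }
  return { provider, apiKey };
}
```

In a Node.js process this would typically be called once as `loadLlmConfig(process.env)`.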