AI Workflows

The AI workflows inside core-platform are responsible for diagnosing incidents. They analyse logs, metrics, and context to determine root causes, assess impact, and suggest remediation actions.

Active Investigation Model

Operyn now supports an active investigation path in addition to passive telemetry analysis.

  • The AI can request typed diagnostic tools when confidence is low.
  • Tool execution is read-only, allowlisted, and incident-scoped.
  • Outputs are shaped into structured evidence (summary, keyPatterns, rawRef) instead of raw unbounded logs.
  • Each execution is audited with tool type, sanitized parameters, timing, and status.

This keeps the workflow safe while enabling investigator-style diagnosis.
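
The guardrails above can be sketched in a few lines. This is a minimal illustration, not Operyn's actual tool runner: the tool names, the sensitive-key pattern, the rawRef format, and the runTool signature are all assumptions made for the example. Only the evidence fields (summary, keyPatterns, rawRef) and the audited fields come from the description above.

```typescript
// Evidence shape named in the text above.
interface Evidence {
  summary: string;
  keyPatterns: string[];
  rawRef: string;
}

interface ToolRequest {
  incidentId: string;
  tool: string;
  params: Record<string, string>;
}

// Audit fields from the text: tool type, sanitized parameters, timing, status.
interface AuditRecord {
  tool: string;
  params: Record<string, string>;
  durationMs: number;
  status: "ok" | "rejected" | "error";
}

// Hypothetical allowlist; the real tool names are not given in the source.
const ALLOWLIST: ReadonlySet<string> = new Set(["log_search", "metric_query"]);

// Assumed rule: redact any parameter whose key looks like a credential.
const SENSITIVE_KEY = /token|secret|password/i;

function sanitize(params: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(params)) {
    out[key] = SENSITIVE_KEY.test(key) ? "[redacted]" : value;
  }
  return out;
}

function runTool(
  req: ToolRequest,
  activeIncidentId: string,
  execute: (req: ToolRequest) => string[], // read-only fetch of matching lines
): { evidence?: Evidence; audit: AuditRecord } {
  const started = Date.now();
  const audit = (status: AuditRecord["status"]): AuditRecord => ({
    tool: req.tool,
    params: sanitize(req.params),
    durationMs: Date.now() - started,
    status,
  });

  // Reject tools outside the allowlist and requests for another incident.
  if (!ALLOWLIST.has(req.tool) || req.incidentId !== activeIncidentId) {
    return { audit: audit("rejected") };
  }

  try {
    const lines = execute(req);
    const evidence: Evidence = {
      summary: `${lines.length} matching lines from ${req.tool}`,
      // Distinct lines capped at five: a stand-in for real pattern extraction.
      keyPatterns: Array.from(new Set(lines)).slice(0, 5),
      rawRef: `evidence://${req.incidentId}/${req.tool}`, // hypothetical reference format
    };
    return { evidence, audit: audit("ok") };
  } catch {
    return { audit: audit("error") };
  }
}
```

Every call produces an audit record, including rejected ones, so a denied request is still visible afterwards.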

How Diagnosis Works

When the incidents module creates a new incident, it sends a diagnosis request to the AI workflows with:

  • The incident details (title, severity, affected services).
  • Recent log events from the affected services.
  • Recent metric data points.

The AI workflows process this through a multi-step reasoning pipeline:

Gather Context → Build Prompt → LLM Analysis → Validate Output → Return Diagnosis
       ▲                                              │
       └──────────── Retry on validation failure ─────┘

When needed, the workflow can enrich context via the diagnostic tool runner:

Detect low confidence → Request typed tool run → Receive structured evidence → Refine diagnosis
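
As a rough sketch of that enrichment loop: the threshold value, the cap on tool runs, and the function names below are assumptions, since the source only says the tool runner is used "when confidence is low".

```typescript
interface DraftDiagnosis {
  rootCause: string;
  confidence: number;
}

interface StructuredEvidence {
  summary: string;
  keyPatterns: string[];
  rawRef: string;
}

// Assumed values; the source does not state a threshold or a run limit.
const CONFIDENCE_THRESHOLD = 0.7;
const MAX_TOOL_RUNS = 2;

function investigate(
  initial: DraftDiagnosis,
  requestTool: (draft: DraftDiagnosis) => StructuredEvidence,
  refine: (draft: DraftDiagnosis, evidence: StructuredEvidence) => DraftDiagnosis,
): { diagnosis: DraftDiagnosis; toolRuns: number } {
  let diagnosis = initial;
  let toolRuns = 0;

  // Keep gathering evidence while confidence is low, up to the run limit.
  while (diagnosis.confidence < CONFIDENCE_THRESHOLD && toolRuns < MAX_TOOL_RUNS) {
    const evidence = requestTool(diagnosis);
    diagnosis = refine(diagnosis, evidence);
    toolRuns += 1;
  }
  return { diagnosis, toolRuns };
}
```

Capping the number of tool runs keeps a diagnosis that never gains confidence from looping indefinitely.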

Diagnosis Status Tracking

The system tracks the lifecycle of each AI diagnosis so that progress persists across sessions and the dashboard can always show the current state:

Status       Description
PENDING      Initial state. No diagnosis has been requested yet.
IN_PROGRESS  AI analysis is currently processing logs and metrics.
COMPLETED    Diagnosis finished successfully and results are available.
FAILED       An error occurred during diagnosis (e.g. LLM timeout).

Status updates are broadcast in real time via Server-Sent Events (SSE), so the dashboard moves from a "Generate" button to a progress spinner and finally to the result card without a manual refresh.
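
A small sketch of how the status lifecycle and an SSE frame could be modelled. The four status values come from the table above. The allowed transitions, the event name diagnosis-status, and the payload shape are assumptions made for illustration. The frame layout (an "event:" line, a "data:" line, then a blank line) follows the standard SSE wire format.

```typescript
type DiagnosisStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED";

// Assumed transitions; the source lists the states but not the allowed moves.
const NEXT_STATES: Record<DiagnosisStatus, DiagnosisStatus[]> = {
  PENDING: ["IN_PROGRESS"],
  IN_PROGRESS: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: ["IN_PROGRESS"], // assumes a failed diagnosis can be retried
};

function canTransition(from: DiagnosisStatus, to: DiagnosisStatus): boolean {
  return NEXT_STATES[from].includes(to);
}

// Serialises one status update as a Server-Sent Events frame.
function toSseFrame(incidentId: string, status: DiagnosisStatus): string {
  const data = JSON.stringify({ incidentId, diagnosisStatus: status });
  return `event: diagnosis-status\ndata: ${data}\n\n`;
}
```

On the browser side, an EventSource listener for the same event name would parse the data field with JSON.parse and switch the widget accordingly.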

Multi-step Reasoning with LangGraph

The AI workflows use LangGraph (a stateful graph framework built on LangChain) to implement the diagnosis flow as a state machine. This enables:

  • Cycles: if the LLM output fails Zod validation or has low confidence, the graph loops back to retry with refined context.
  • State persistence: the DiagnosticState object tracks logs, metrics, intermediate LLM responses, and validation results across steps.
  • Observability: graph traces show exactly where the reasoning process went and where it may have failed.
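
To make the cycle, the shared state, and the trace concrete, here is a dependency-free sketch. It deliberately does not use the LangGraph API: a plain loop stands in for the graph, a hand-written check stands in for Zod validation, and the llm parameter is a hypothetical synchronous stand-in for the model call. The DiagnosticState fields beyond those named above (attempts, trace) are assumptions.

```typescript
interface DiagnosticState {
  logs: string[];
  metrics: number[];
  llmResponses: string[];
  validationErrors: string[];
  attempts: number;
  trace: string[]; // visited steps, for observability
}

type Step = "gather" | "analyse" | "validate" | "end";

// Stand-in for schema validation: returns an error message, or null when valid.
function validateOutput(raw: string): string | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    const ok =
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { rootCause?: unknown }).rootCause === "string";
    return ok ? null : "missing rootCause";
  } catch {
    return "not valid JSON";
  }
}

function runDiagnosisGraph(
  llm: (prompt: string) => string,
  logs: string[],
  metrics: number[],
  maxAttempts = 3,
): DiagnosticState {
  const state: DiagnosticState = {
    logs,
    metrics,
    llmResponses: [],
    validationErrors: [],
    attempts: 0,
    trace: [],
  };
  let step: Step = "gather";

  while (step !== "end") {
    state.trace.push(step);
    if (step === "gather") {
      step = "analyse";
    } else if (step === "analyse") {
      state.attempts += 1;
      // Earlier validation errors are fed back in as refined context.
      const hints = state.validationErrors.join("; ");
      state.llmResponses.push(
        llm(`logs=${state.logs.length} metrics=${state.metrics.length} previousErrors=${hints}`),
      );
      step = "validate";
    } else {
      const last = state.llmResponses[state.llmResponses.length - 1] ?? "";
      const error = validateOutput(last);
      if (error === null) {
        step = "end";
      } else {
        state.validationErrors.push(error);
        // The cycle: loop back to gather until the attempt budget is spent.
        step = state.attempts < maxAttempts ? "gather" : "end";
      }
    }
  }
  return state;
}
```

The returned trace lists every step visited, which is the kind of information a graph trace exposes when a diagnosis needs debugging.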

LLM Providers

Operyn supports multiple LLM providers, configured via the LLM_PROVIDER environment variable:

Provider       Value      Notes
OpenAI         openai     GPT-4o and similar models
Anthropic      anthropic  Claude models
Google Gemini  gemini     Gemini Pro and similar
Mock           mock       Returns simulated diagnoses for testing

The domain logic depends only on LangChain's BaseChatModel interface, making provider swaps a single config change.
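
The sketch below illustrates why depending on a single interface keeps provider swaps cheap. ChatModel is a deliberately simplified stand-in for LangChain's BaseChatModel (the real interface is richer and returns message objects rather than strings), and MockChatModel, createModel, and diagnose are hypothetical names used only for this example.

```typescript
// Simplified stand-in for BaseChatModel; an assumption for illustration only.
interface ChatModel {
  invoke(prompt: string): Promise<string>;
}

type ModelFactory = () => ChatModel;

// Plays the role of the "mock" provider: returns a canned diagnosis.
class MockChatModel implements ChatModel {
  async invoke(_prompt: string): Promise<string> {
    return JSON.stringify({ rootCause: "simulated root cause" });
  }
}

// Looks up the factory registered for the configured provider value.
function createModel(
  provider: string,
  factories: Record<string, ModelFactory | undefined>,
): ChatModel {
  const factory = factories[provider];
  if (factory === undefined) {
    throw new Error(`No model factory registered for LLM_PROVIDER=${provider}`);
  }
  return factory();
}

// Domain logic sees only the ChatModel interface, never a concrete provider.
async function diagnose(model: ChatModel, incidentTitle: string): Promise<string> {
  return model.invoke(`Diagnose incident: ${incidentTitle}`);
}
```

Under this arrangement, adding a provider means registering one more factory; diagnose itself does not change.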

Diagnosis Output

The AI workflows return a structured AiDiagnosis:

Field             Type            Description
incidentId        string          The incident being diagnosed
rootCause         string          Identified root cause
summary           string          Human-readable summary of the diagnosis
impactedServices  string[]        Services affected by the issue
suggestedFixes    SuggestedFix[]  Actionable remediation steps
diagnosisStatus   string          Current state: PENDING, IN_PROGRESS, COMPLETED, FAILED
diagnosis         AiDiagnosis     The structured diagnosis object (only if COMPLETED)
createdAt         string          When the diagnosis was generated

Suggested Fixes

Each suggested fix contains:

Field                  Type     Description
action                 string   Short description of the fix
description            string   Detailed explanation
automatable            boolean  Whether this can be automated by the remediation workflows
remediationActionType  string   Maps to a supported remediation action type (optional)
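
For consumers of the diagnosis, the two tables translate naturally into TypeScript types. The sketch below models only the fields it needs, and automatableActionTypes is a hypothetical helper showing how a caller might pick out the fixes that the remediation workflows could run.

```typescript
interface SuggestedFix {
  action: string;
  description: string;
  automatable: boolean;
  remediationActionType?: string; // optional, per the table above
}

// Subset of the AiDiagnosis fields; enough for this example.
interface AiDiagnosis {
  incidentId: string;
  rootCause: string;
  summary: string;
  impactedServices: string[];
  suggestedFixes: SuggestedFix[];
  createdAt: string;
}

// Returns the action types of fixes that are automatable and mapped to an action.
function automatableActionTypes(diagnosis: AiDiagnosis): string[] {
  const types: string[] = [];
  for (const fix of diagnosis.suggestedFixes) {
    if (fix.automatable && fix.remediationActionType !== undefined) {
      types.push(fix.remediationActionType);
    }
  }
  return types;
}
```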

Example Diagnosis

For an incident titled "Spike in payment failures on billing-service", the AI workflows might return:

{
  "incidentId": "inc-123",
  "rootCause": "A recent deployment of billing-service v2.4.1 introduced a regression in the payment gateway client, causing connection pool exhaustion under load.",
  "summary": "billing-service v2.4.1 has a connection pool bug that causes payment timeouts when request volume exceeds 500 req/s. Rolling back to v2.4.0 and restarting pods should resolve the issue.",
  "impactedServices": ["billing-service", "checkout-service"],
  "suggestedFixes": [
    {
      "action": "Rollback billing-service to v2.4.0",
      "description": "Revert the deployment to the last known good version to restore payment processing.",
      "automatable": true,
      "remediationActionType": "rollback-deployment"
    },
    {
      "action": "Restart billing-service pods",
      "description": "Clear exhausted connection pools by restarting all pods.",
      "automatable": true,
      "remediationActionType": "restart-service"
    }
  ],
  "confidenceScore": 0.87,
  "createdAt": "2026-03-11T10:15:23.123Z"
}

Configuration

Variable           Default  Description
LLM_PROVIDER       mock     Which LLM provider to use
OPENAI_API_KEY     (none)   API key for OpenAI
ANTHROPIC_API_KEY  (none)   API key for Anthropic
GEMINI_API_KEY     (none)   API key for Google Gemini
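
A minimal sketch of reading these variables at startup. The variable names and the mock default come from the table above. The rule that the matching API key must be present for a non-mock provider is an assumption, and loadLlmConfig is a hypothetical helper. It takes the environment as a plain object, so in Node.js it could be called with process.env.

```typescript
type Provider = "openai" | "anthropic" | "gemini" | "mock";

// Which variable holds the key for each provider (null means no key needed).
const API_KEY_VAR: Record<Provider, string | null> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  gemini: "GEMINI_API_KEY",
  mock: null,
};

interface LlmConfig {
  provider: Provider;
  apiKey?: string;
}

function isProvider(value: string): value is Provider {
  return Object.prototype.hasOwnProperty.call(API_KEY_VAR, value);
}

function loadLlmConfig(env: Record<string, string | undefined>): LlmConfig {
  const raw = env["LLM_PROVIDER"] ?? "mock"; // documented default
  if (!isProvider(raw)) {
    throw new Error(`Unsupported LLM_PROVIDER: ${raw}`);
  }
  const keyVar = API_KEY_VAR[raw];
  if (keyVar === null) {
    return { provider: raw };
  }
  const apiKey = env[keyVar];
  // Assumed rule: fail fast when the selected provider has no key configured.
  if (!apiKey) {
    throw new Error(`${keyVar} must be set when LLM_PROVIDER=${raw}`);
  }
  return { provider: raw, apiKey };
}
```

Failing at startup surfaces a missing key immediately instead of during the first diagnosis request.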