What is Operyn?
Operyn is an AI-powered incident response and self-healing platform. It ingests observability signals — logs, metrics, and events — from your infrastructure, correlates them into actionable incidents, diagnoses root causes using AI, and orchestrates automated or human-approved remediations.
Why Operyn?
Modern infrastructure generates massive volumes of telemetry. SRE teams deal with:
- Alert fatigue — hundreds of noisy alerts that bury real incidents.
- Slow diagnosis — manually correlating logs, metrics, and traces across services.
- Repetitive fixes — applying the same runbooks to the same failure modes.
Operyn addresses all three by closing the loop from detection to resolution:
Telemetry → Ingestion → Incident Detection → AI Diagnosis → Remediation → Resolution
Core Components
Ingestion Service
The entry point for all telemetry. It accepts logs and metrics via a REST API, normalises them into a consistent schema, indexes them in OpenSearch so they can be queried, and enqueues them for downstream processing.
- POST /events/logs: single log event
- POST /events/logs/batch: batch of log events
- POST /events/metrics: single metric data point
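As a rough illustration of calling the log endpoints, the sketch below builds and sends a request with `fetch`. Only the two endpoint paths come from this page. The `LogEvent` fields, the base URL, the helper names, and the assumption that the batch endpoint accepts a bare JSON array are all hypothetical; check the Ingestion API reference for the real schema.

```typescript
// Hypothetical payload shape; the real schema is defined by the Ingestion API.
interface LogEvent {
  service: string;
  level: "debug" | "info" | "warn" | "error";
  message: string;
  timestamp: string; // ISO 8601
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Assumed base URL for a local deployment; adjust to your environment.
const OPERYN_BASE_URL = "http://localhost:3000";

// Picks the single or batch endpoint depending on whether an array is passed.
function buildLogRequest(
  events: LogEvent | LogEvent[],
  baseUrl: string = OPERYN_BASE_URL,
): BuiltRequest {
  const path = Array.isArray(events) ? "/events/logs/batch" : "/events/logs";
  return {
    url: `${baseUrl}${path}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(events),
    },
  };
}

// Sends the request with the global fetch (available in Bun and Node 18+)
// and returns the HTTP status code.
async function sendLogs(events: LogEvent | LogEvent[]): Promise<number> {
  const { url, init } = buildLogRequest(events);
  const response = await fetch(url, init);
  return response.status;
}
```

Separating request construction from sending keeps the endpoint selection easy to test without a running service.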
Incident Module (core-platform)
Consumes events from the ingestion queue, applies detection rules and anomaly thresholds, and creates incidents when conditions are met. Persists incidents to PostgreSQL and streams updates via SSE.
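To make the idea of a detection threshold concrete, here is a minimal sketch of one possible rule: open an incident when the most recent N samples of a metric all exceed a limit. The types, field names, and the "consecutive samples" policy are assumptions for illustration, not Operyn's actual rule format.

```typescript
// All shapes below are hypothetical, chosen only for this example.
interface MetricEvent {
  service: string;
  name: string;
  value: number;
  timestamp: number; // epoch milliseconds
}

interface ThresholdRule {
  metric: string;
  threshold: number;
  consecutive: number; // how many recent samples must breach the threshold
}

interface DetectedIncident {
  service: string;
  metric: string;
  openedAt: number;
}

// Evaluates one rule against the events of a single service.
// Returns an incident only if the latest `consecutive` samples all breach.
function evaluateRule(rule: ThresholdRule, events: MetricEvent[]): DetectedIncident | null {
  if (rule.consecutive < 1) return null;
  const samples = events
    .filter((e) => e.name === rule.metric)
    .sort((a, b) => a.timestamp - b.timestamp);
  if (samples.length < rule.consecutive) return null;

  const recent = samples.slice(-rule.consecutive);
  if (!recent.every((s) => s.value > rule.threshold)) return null;

  const last = recent[recent.length - 1];
  return { service: last.service, metric: rule.metric, openedAt: last.timestamp };
}
```

Requiring several consecutive breaches is one common way to avoid opening incidents on a single noisy sample.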
AI Diagnosis Module (core-platform)
Receives diagnosis requests from the incident module. Uses LangChain and LangGraph to orchestrate multi-step reasoning: gather context, hypothesise a root cause, validate against evidence, and produce a structured diagnosis with confidence scores and suggested fixes. When confidence is low, the workflow can request typed, read-only diagnostic probe runs (Kubernetes + AWS) through the safe tool runner.
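The control flow described above (gather, hypothesise, validate, and fall back to probes when confidence is low) can be sketched in plain TypeScript. This deliberately does not use the LangChain or LangGraph APIs; the step interface, the 0.7 confidence threshold, and the single probe round are assumptions made for the example.

```typescript
interface Diagnosis {
  rootCause: string;
  confidence: number; // 0..1
  suggestedFixes: string[];
  probeEvidence: string[];
}

// Hypothetical step functions; in practice these would wrap LLM and tool calls.
interface DiagnosisSteps {
  gatherContext(incidentId: string): Promise<string[]>;
  hypothesise(evidence: string[]): Promise<{ rootCause: string; suggestedFixes: string[] }>;
  validate(rootCause: string, evidence: string[]): Promise<number>;
  runProbes(incidentId: string): Promise<string[]>; // read-only probes only
}

const CONFIDENCE_THRESHOLD = 0.7; // assumed cut-off for "low confidence"
const MAX_PROBE_ROUNDS = 1; // assumed bound so the loop always terminates

async function diagnose(incidentId: string, steps: DiagnosisSteps): Promise<Diagnosis> {
  let evidence = await steps.gatherContext(incidentId);
  let hypothesis = await steps.hypothesise(evidence);
  let confidence = await steps.validate(hypothesis.rootCause, evidence);
  const probeEvidence: string[] = [];

  // When confidence is low, collect probe evidence and re-run the reasoning steps.
  for (let round = 0; round < MAX_PROBE_ROUNDS && confidence < CONFIDENCE_THRESHOLD; round++) {
    const found = await steps.runProbes(incidentId);
    probeEvidence.push(...found);
    evidence = [...evidence, ...found];
    hypothesis = await steps.hypothesise(evidence);
    confidence = await steps.validate(hypothesis.rootCause, evidence);
  }

  return { ...hypothesis, confidence, probeEvidence };
}
```

Injecting the steps as an interface keeps the loop testable with stubs and independent of any particular model provider.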
Remediation Module (core-platform)
Executes approved remediation actions against your infrastructure. Supports actions like restarting services, scaling pods, rolling back deployments, clearing caches, and running custom scripts. Operates in simulate mode (log-only) or live mode (execute on Kubernetes). Every action is recorded in an audit log.
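The difference between simulate and live mode, and the rule that every action is audited, might look roughly like the sketch below. The action names mirror the list above, but the types, the `Executor` callback, and the audit entry shape are hypothetical. A live executor would call the Kubernetes API, which is left out here.

```typescript
type Mode = "simulate" | "live";

interface RemediationAction {
  type: "restart_service" | "scale_pods" | "rollback_deployment" | "clear_cache" | "run_script";
  target: string;
  approvedBy: string;
}

interface AuditEntry {
  action: RemediationAction;
  mode: Mode;
  outcome: "simulated" | "succeeded" | "failed";
  at: string; // ISO 8601 timestamp
}

// Hypothetical executor: performs the action against real infrastructure.
type Executor = (action: RemediationAction) => Promise<void>;

async function remediate(
  action: RemediationAction,
  mode: Mode,
  execute: Executor,
  auditLog: AuditEntry[],
): Promise<AuditEntry> {
  let outcome: AuditEntry["outcome"] = "simulated";

  // Only live mode touches infrastructure; simulate mode just records intent.
  if (mode === "live") {
    try {
      await execute(action);
      outcome = "succeeded";
    } catch {
      outcome = "failed";
    }
  }

  // An entry is written in every case, including failures.
  const entry: AuditEntry = { action, mode, outcome, at: new Date().toISOString() };
  auditLog.push(entry);
  return entry;
}
```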
Change Insight (NEW)
The Change Insight engine automatically correlates incidents with recent infrastructure, deployment, and configuration changes. It helps SREs answer "What changed recently that could have caused this?" by providing a timeline of events leading up to the incident.
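One simple way to answer "what changed recently?" is to select changes inside a lookback window before the incident and rank them. The sketch below assumes a one-hour default window and a ranking that puts same-service changes first, then most recent first. The types and the ranking are illustrative assumptions, not a description of the Change Insight engine's actual logic.

```typescript
// Hypothetical shapes for this example.
interface ChangeEvent {
  kind: "deployment" | "config" | "infrastructure";
  service: string;
  description: string;
  at: number; // epoch milliseconds
}

interface IncidentRef {
  service: string;
  startedAt: number; // epoch milliseconds
}

const ONE_HOUR_MS = 60 * 60 * 1000;

// Returns changes that happened within `lookbackMs` before the incident,
// ordered with the incident's own service first and newer changes first.
function candidateChanges(
  incident: IncidentRef,
  changes: ChangeEvent[],
  lookbackMs: number = ONE_HOUR_MS,
): ChangeEvent[] {
  const windowStart = incident.startedAt - lookbackMs;
  return changes
    .filter((c) => c.at >= windowStart && c.at <= incident.startedAt)
    .sort((a, b) => {
      const aSame = a.service === incident.service ? 1 : 0;
      const bSame = b.service === incident.service ? 1 : 0;
      return bSame - aSame || b.at - a.at;
    });
}
```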
Predictive Detection (NEW)
Operyn uses trend analysis and AI forecasting to identify potential incidents before they impact users. By detecting rapid deterioration in metrics (for example, steady memory growth from a leak, or a load spike), it can create early-warning incidents so that teams can intervene proactively.
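As an illustration of trend analysis, the sketch below fits a least-squares line to recent samples and estimates how long until the metric crosses a threshold. Linear extrapolation is a stand-in chosen for this example; the forecasting method Operyn actually uses is not specified here, and the names are hypothetical.

```typescript
interface Sample {
  t: number; // seconds since some reference point
  value: number;
}

// Ordinary least-squares fit of value = slope * t + intercept.
function fitLine(samples: Sample[]): { slope: number; intercept: number } | null {
  const n = samples.length;
  if (n < 2) return null;
  const meanT = samples.reduce((sum, s) => sum + s.t, 0) / n;
  const meanV = samples.reduce((sum, s) => sum + s.value, 0) / n;
  let cov = 0;
  let varT = 0;
  for (const s of samples) {
    cov += (s.t - meanT) * (s.value - meanV);
    varT += (s.t - meanT) ** 2;
  }
  if (varT === 0) return null; // all samples share one timestamp
  const slope = cov / varT;
  return { slope, intercept: meanV - slope * meanT };
}

// Estimated seconds from the last sample until the fitted line reaches
// `threshold`. Returns null when the trend is flat or falling.
function secondsUntilThreshold(samples: Sample[], threshold: number): number | null {
  const line = fitLine(samples);
  if (line === null || line.slope <= 0) return null;
  const lastT = samples[samples.length - 1].t;
  const crossingT = (threshold - line.intercept) / line.slope;
  return Math.max(0, crossingT - lastT);
}
```

For example, memory usage rising by 10 points per minute from 50 to 80 over three minutes would be projected to reach 100 about two minutes after the last sample, which is the kind of signal that could justify an early-warning incident.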
Automated Postmortems (NEW)
After an incident is resolved, Operyn generates an AI-written postmortem report that summarises the timeline, root cause, impact, and remediation steps taken, which reduces the manual documentation work that follows an incident.
Notification Service
Dispatches incident alerts and status updates to your team's communication channels: Slack, Jira, Email, and generic webhooks.
Dashboard
A Next.js operations console for SREs. Provides:
- Real-time incident feed with advanced filtering (severity, status, service, assignee) and deduplication grouping.
- Incident detail views with AI diagnosis, remediation approval workflows, comment threads, and team assignment.
- Provider-aware investigation probes in incident detail, with structured evidence summaries and run history.
- Incident velocity chart with configurable time ranges (24h, 7d, All).
- Correlation rules management for configuring how related incidents are grouped.
- Team management with on-call tracking and incident assignment.
Architecture Diagram
 Logs / Metrics
       │
       ▼
┌─────────────┐     BullMQ      ┌──────────────────────────────┐
│  Ingestion  │ ──────────────► │        Core Platform         │
│   Service   │                 │  detect/correlate/diagnose   │
└─────────────┘                 │  diagnostic probes + notify  │
                                │      remediate + audit       │
                                └──────────────────────────────┘
                                               │
                                               ▼
                                      ┌──────────────────┐
                                      │    Dashboard     │
                                      │   (Next.js 15)   │
                                      └──────────────────┘
Tech Stack
| Layer | Technology |
|---|---|
| Runtime | Bun |
| Monorepo | Turborepo |
| Services | NestJS + Fastify |
| Frontend | Next.js 15 + Tailwind CSS v4 |
| AI | LangChain + LangGraph (OpenAI / Anthropic / Gemini) |
| Database | PostgreSQL |
| Queue | Redis (BullMQ) |
| Search | OpenSearch |
Next Steps
- Quickstart — get Operyn running locally in minutes.
- Ingestion API — send your first events.
- Incidents API Reference — manage incidents, correlation rules, team members, and diagnostic investigations.
- Service Catalog API — map services to teams, tier, and escalation context.
- Pricing & Billing — understand tiers, service limits, and retention.
- AWS CloudWatch Integration — connect your AWS infrastructure.
- Engineering Observability — monitor queues, traces, and system health.