Remediation
Operyn's remediation module executes fixes against your infrastructure — either automatically or with human approval. Every action is recorded in an immutable audit log.
Remediation Modes
The system operates in one of two modes, controlled by the REMEDIATION_MODE environment variable:
| Mode | Behaviour |
|---|---|
| simulate | Logs what the remediation would do without executing. Safe for testing and initial rollout. |
| live | Executes the action against real infrastructure (Kubernetes, scripts, etc.). |
Start with simulate to build confidence, then switch to live when you trust the detection and diagnosis pipeline.
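As a minimal sketch of how a worker might honour REMEDIATION_MODE, the Python below reads the variable, rejects any value other than the two documented modes, and only calls the real executor in live mode. The function names, the log format, and the fallback to simulate when the variable is unset are illustrative assumptions, not Operyn's actual implementation.

```python
import os
from enum import Enum


class RemediationMode(Enum):
    SIMULATE = "simulate"
    LIVE = "live"


def load_remediation_mode(env=None):
    """Read REMEDIATION_MODE and reject anything other than the two documented values."""
    env = os.environ if env is None else env
    # Assumption: an unset variable falls back to the safe mode.
    # The product's real default is not stated in this document.
    raw = env.get("REMEDIATION_MODE", "simulate").strip()
    try:
        return RemediationMode(raw)
    except ValueError:
        raise ValueError(
            f"REMEDIATION_MODE must be 'simulate' or 'live', got {raw!r}"
        ) from None


def run_action(description, execute, mode):
    """Describe the action in simulate mode; call `execute` only in live mode."""
    if mode is RemediationMode.SIMULATE:
        return f"[simulate] would run: {description}"
    return execute()
```

Passing the environment as a parameter keeps the sketch easy to test without touching the real process environment.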
Read-only diagnosis probes (Kubernetes + AWS) are a separate investigation path and do not mutate infrastructure state.
Supported Actions
| Action Type | Description | Example |
|---|---|---|
| restart-service | Restarts pods or containers for a service | Restart billing-service in prod |
| scale-pods | Adjusts replica count | Scale api-gateway from 3 to 6 replicas |
| rollback-deployment | Reverts to a previous deployment revision | Rollback billing-service to revision 4 |
| clear-cache | Purges a cache layer | Clear Redis cache for session-store |
| run-script | Executes a custom remediation script | Run fix-payment-gateway.sh |
Approval Workflow
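Every remediation passes through an approval step before execution. Depending on policy, an action is either auto-approved and moves straight to QUEUED, or waits in PENDING_APPROVAL for an SRE to approve or reject it: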
    AI suggests fix → PENDING_APPROVAL → SRE approves → QUEUED → EXECUTING → COMPLETED / FAILED
                              │
                              └→ SRE rejects → REJECTED
Statuses
| Status | Meaning |
|---|---|
| PENDING_APPROVAL | Action proposed, waiting for human approval |
| QUEUED | Approved or auto-approved, waiting for worker execution |
| EXECUTING | Worker is currently executing the remediation |
| COMPLETED | Execution completed successfully |
| FAILED | Execution failed |
| REJECTED | Rejected by an SRE, no action taken |
Legacy audit exports may contain APPROVED, RUNNING, SUCCEEDED, or SIMULATED; new remediation writes use the statuses above.
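The status flow above can be read as a small state machine. The sketch below encodes the documented statuses and the transitions shown in the workflow diagram; the `ALLOWED` table and `transition` helper are hypothetical names, and treating COMPLETED, FAILED, and REJECTED as terminal (no retries) is an assumption.

```python
from enum import Enum


class Status(Enum):
    PENDING_APPROVAL = "PENDING_APPROVAL"
    QUEUED = "QUEUED"
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    REJECTED = "REJECTED"


# Transitions taken from the workflow diagram. Auto-approved actions are
# created directly in QUEUED, so they never pass through PENDING_APPROVAL.
ALLOWED = {
    Status.PENDING_APPROVAL: {Status.QUEUED, Status.REJECTED},
    Status.QUEUED: {Status.EXECUTING},
    Status.EXECUTING: {Status.COMPLETED, Status.FAILED},
    # Assumption: these three are terminal; retries are not modelled here.
    Status.COMPLETED: set(),
    Status.FAILED: set(),
    Status.REJECTED: set(),
}


def transition(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

For example, `transition(Status.PENDING_APPROVAL, Status.QUEUED)` succeeds, while attempting to move a REJECTED action to EXECUTING raises an error.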
Approving via the Dashboard
- Navigate to the incident detail page.
- Review the AI diagnosis and suggested fixes.
- Review the policy note to understand whether the action was blocked, auto-approved, or sent for manual approval.
- Click Approve or Reject on each proposed action.
- Optionally enter a decision reason so the audit trail captures the operator's intent.
- Approved actions move to QUEUED; worker execution then updates status to EXECUTING, COMPLETED, or FAILED.
You can also manage the same queue from /console/remediation, which provides a central approval surface across incidents.
Policy Guardrails
Operyn evaluates remediation requests against layered guardrails before an action executes:
- Global rules by environment, action type, and severity.
- Service-specific overrides for higher-risk services or stricter production controls.
- Environment-aware evaluation so the same action can be allowed in staging and gated in production.
Policy outcomes are visible in the dashboard so responders can see not only that approval is required, but also which rule matched.
When a service-level override is involved, Operyn surfaces that explicitly in the UI. The incident view and remediation queue show:
- whether the match came from a service override, global policy, or built-in guardrail
- the affected service when a serviceRules.<service> override matched
- the exact policy rule path that triggered the decision
The policy editor also shows recent matches for each service override so operators can see which rules are actively gating or auto-approving remediation decisions.
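To make the layering concrete, here is a rough sketch of how such an evaluation could work. Only the `serviceRules.<service>` path comes from this page; the `globalRules` key, the `environment.actionType.severity` rule key format, the outcome names, the `builtin.default` path, and the precedence order (service override, then global policy, then a built-in fallback that requires manual approval) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    outcome: str    # hypothetical values: "block", "auto_approve", "manual_approval"
    source: str     # "service override", "global policy", or "built-in guardrail"
    rule_path: str  # path of the rule that matched


def evaluate(policy, service, environment, action_type, severity):
    """Return the first matching decision, checking the most specific layer first."""
    key = f"{environment}.{action_type}.{severity}"  # hypothetical key format

    # 1. Service-specific override, if one exists for this service and key.
    service_rules = policy.get("serviceRules", {}).get(service, {})
    if key in service_rules:
        return Decision(service_rules[key], "service override",
                        f"serviceRules.{service}.{key}")

    # 2. Global rule by environment, action type, and severity.
    global_rules = policy.get("globalRules", {})
    if key in global_rules:
        return Decision(global_rules[key], "global policy", f"globalRules.{key}")

    # 3. Assumed built-in fallback: require a human when nothing else matched.
    return Decision("manual_approval", "built-in guardrail", "builtin.default")


policy = {
    "globalRules": {
        "staging.restart-service.low": "auto_approve",
        "production.restart-service.low": "auto_approve",
    },
    "serviceRules": {
        "billing-service": {"production.restart-service.low": "manual_approval"},
    },
}
decision = evaluate(policy, "billing-service", "production", "restart-service", "low")
```

In this example the service override wins over the global rule, so the same restart that is auto-approved for other services is gated for billing-service in production, and the decision carries the rule path that matched.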
Remediation Schema
Each remediation action contains:
| Field | Description |
|---|---|
| id | Unique identifier (UUID) |
| incidentId | The incident this action addresses |
| actionType | One of the supported action types above |
| parameters | Action-specific parameters (e.g. replica count, deployment name) |
| status | Current approval/execution status |
| requestedBy | Who or what requested the action (e.g. core-platform/ai) |
| approvedBy | Who approved (if applicable) |
| simulationMode | Whether this ran in simulate mode |
| result | Execution output or error message |
| createdAt | When the action was proposed |
| executedAt | When execution completed (if applicable) |
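For readers who prefer to see the schema as a type, the following sketch mirrors the fields in the table. The field names come from the table; the Python types, the default values, the `replicas` parameter key, and the incident ID shown are assumptions rather than a published contract.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

# The five action types listed under Supported Actions.
ACTION_TYPES = {
    "restart-service",
    "scale-pods",
    "rollback-deployment",
    "clear-cache",
    "run-script",
}


@dataclass
class RemediationAction:
    incidentId: str
    actionType: str
    parameters: dict[str, Any]
    requestedBy: str
    status: str = "PENDING_APPROVAL"    # assumed initial status for proposed actions
    approvedBy: Optional[str] = None    # unset until someone approves
    simulationMode: bool = True         # assumed to be a boolean flag
    result: Optional[str] = None        # execution output or error message
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    createdAt: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    executedAt: Optional[datetime] = None

    def __post_init__(self):
        # Reject action types that are not in the documented list.
        if self.actionType not in ACTION_TYPES:
            raise ValueError(f"unsupported actionType: {self.actionType!r}")


action = RemediationAction(
    incidentId="inc-456",               # hypothetical incident ID
    actionType="scale-pods",
    parameters={"replicas": 6},         # hypothetical parameter key
    requestedBy="core-platform/ai",
)
```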
Audit Log
Every remediation action — approved, rejected, simulated, or executed — is recorded in the audit log. The audit entry includes:
- Actor: who initiated or approved the action.
- Action: what was done (e.g. remediation.approved, remediation.executed).
- Resource: which remediation and incident it relates to.
- Metadata: parameters, results, timing.
- Timestamp: when it happened.
Manual approval and rejection entries can also include:
- Decision reason: the responder's written rationale for approving or rejecting the action.
- Policy context: the exact policy rule or service override that originally gated or blocked the remediation.
- Override trail: who made the final decision when a human overrode the automated policy outcome.
This provides a complete, immutable record for post-incident reviews and compliance.
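As an illustration of what one entry might look like, the sketch below builds an audit record with the fields listed above and serialises it to a JSON line. The field names, the ID values, the example actor, and the JSON-lines format are assumptions; the real audit log's storage format and its immutability guarantees are not described on this page.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


# frozen=True blocks attribute reassignment on the record itself; it is only
# an in-process safeguard and says nothing about how the real log is stored.
@dataclass(frozen=True)
class AuditEntry:
    actor: str
    action: str
    resource: dict[str, str]
    metadata: dict[str, Any] = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    decisionReason: Optional[str] = None   # only for manual approve/reject entries
    policyContext: Optional[str] = None    # rule or override that gated the action


entry = AuditEntry(
    actor="sre@example.com",                                   # hypothetical actor
    action="remediation.approved",
    resource={"remediationId": "rem-123", "incidentId": "inc-456"},
    metadata={"actionType": "scale-pods", "parameters": {"replicas": 6}},
    decisionReason="Traffic spike confirmed on api-gateway",
    policyContext="serviceRules.api-gateway",
)

# One JSON object per line is a common shape for append-only logs.
line = json.dumps(asdict(entry), sort_keys=True)
```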
Best Practices
- Start in simulate mode — let the system detect and diagnose for a few days before enabling live remediation.
- Review AI suggestions — use the dashboard to validate that suggested fixes make sense before approving.
- Scope narrowly — begin with low-risk actions (restart a single pod) before enabling broader actions (rollback deployments).
- Monitor after remediation — check that the incident actually resolves after the fix is applied.
- Use the audit log — review past remediations in post-incident reviews to improve detection rules and fix quality.