Remediation

Operyn's remediation module executes fixes against your infrastructure, either automatically or after human approval. Every action is recorded in an immutable audit log.

Remediation Modes

The system operates in one of two modes, controlled by the REMEDIATION_MODE environment variable:

Mode      Behaviour
simulate  Logs what the remediation would do without executing. Safe for testing and initial rollout.
live      Executes the action against real infrastructure (Kubernetes, scripts, etc.).

Start with simulate to build confidence, then switch to live when you trust the detection and diagnosis pipeline.
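
As a rough illustration of the mode switch, the sketch below reads REMEDIATION_MODE and only runs the real action in live mode. The function names, the log format, and the fallback to simulate when the variable is unset are assumptions made for this example, not documented Operyn behaviour.

```python
import os

VALID_MODES = {"simulate", "live"}


def resolve_mode(env=None):
    """Return the remediation mode from the environment.

    Falling back to "simulate" when the variable is unset is an assumption
    chosen here because it is the safer of the two modes.
    """
    env = os.environ if env is None else env
    mode = env.get("REMEDIATION_MODE", "simulate").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported REMEDIATION_MODE: {mode!r}")
    return mode


def run_action(description, execute, env=None):
    """Describe the action in simulate mode; call `execute` only in live mode."""
    if resolve_mode(env) == "simulate":
        return f"[simulate] would run: {description}"
    return execute()
```

Passing the environment as an argument keeps the sketch easy to test without touching the real process environment.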

Read-only diagnosis probes (Kubernetes + AWS) are a separate investigation path and do not mutate infrastructure state.

Supported Actions

Action Type          Description                                Example
restart-service      Restarts pods or containers for a service  Restart billing-service in prod
scale-pods           Adjusts replica count                      Scale api-gateway from 3 to 6 replicas
rollback-deployment  Reverts to a previous deployment revision  Rollback billing-service to revision 4
clear-cache          Purges a cache layer                       Clear Redis cache for session-store
run-script           Executes a custom remediation script       Run fix-payment-gateway.sh
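
To make the action types above concrete, here is a minimal sketch of validating a proposed action before it enters the approval workflow. The action type names come from the table; the parameter names (service, replicas, revision, cache, script) are hypothetical and chosen only to mirror the examples.

```python
# Action types are taken from the table above. The required parameter names
# are illustrative assumptions, not Operyn's documented parameter schema.
REQUIRED_PARAMS = {
    "restart-service": {"service"},
    "scale-pods": {"service", "replicas"},
    "rollback-deployment": {"service", "revision"},
    "clear-cache": {"cache"},
    "run-script": {"script"},
}


def validate_action(action_type, parameters):
    """Reject unknown action types and actions with missing parameters."""
    if action_type not in REQUIRED_PARAMS:
        raise ValueError(f"unsupported action type: {action_type!r}")
    missing = REQUIRED_PARAMS[action_type] - set(parameters)
    if missing:
        raise ValueError(f"{action_type} is missing parameters: {sorted(missing)}")
    return {"actionType": action_type, "parameters": dict(parameters)}


# Example: the "Scale api-gateway from 3 to 6 replicas" row.
scale_request = validate_action("scale-pods", {"service": "api-gateway", "replicas": 6})
```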

Approval Workflow

Remediations follow an approval workflow before execution:

AI suggests fix → PENDING_APPROVAL → SRE approves → QUEUED → EXECUTING → COMPLETED / FAILED
                                    │
                                    └→ SRE rejects → REJECTED

Statuses

Status            Meaning
PENDING_APPROVAL  Action proposed, waiting for human approval
QUEUED            Approved or auto-approved, waiting for worker execution
EXECUTING         Worker is currently executing the remediation
COMPLETED         Execution completed successfully
FAILED            Execution failed
REJECTED          Rejected by an SRE, no action taken

Legacy audit exports may contain APPROVED, RUNNING, SUCCEEDED, or SIMULATED; new remediation writes use the statuses above.
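
The workflow above can be read as a small state machine. The sketch below encodes the documented transitions as a lookup table; treating COMPLETED, FAILED, and REJECTED as terminal is an inference from the diagram, and the function name is hypothetical. Legacy statuses are deliberately left out because their mapping to the current statuses is not specified here.

```python
# Transitions read off the workflow diagram. An auto-approved action would
# start at QUEUED rather than PENDING_APPROVAL (inferred from the QUEUED row).
ALLOWED_TRANSITIONS = {
    "PENDING_APPROVAL": {"QUEUED", "REJECTED"},
    "QUEUED": {"EXECUTING"},
    "EXECUTING": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
    "REJECTED": set(),
}


def transition(current, new):
    """Return the new status, or raise if the workflow does not allow the move."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


# Walk one manually approved remediation through to completion.
status = "PENDING_APPROVAL"
for step in ("QUEUED", "EXECUTING", "COMPLETED"):
    status = transition(status, step)
```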

Approving via the Dashboard

  1. Navigate to the incident detail page.
  2. Review the AI diagnosis and suggested fixes.
  3. Review the policy note to understand whether the action was blocked, auto-approved, or sent for manual approval.
  4. Click Approve or Reject on each proposed action.
  5. Optionally enter a decision reason so the audit trail captures the operator's intent.
  6. Approved actions move to QUEUED; worker execution then updates status to EXECUTING, COMPLETED, or FAILED.

You can also manage the same queue from /console/remediation, which provides a central approval surface across incidents.

Policy Guardrails

Operyn evaluates remediation requests against layered guardrails before an action executes:

  • Global rules by environment, action type, and severity.
  • Service-specific overrides for higher-risk services or stricter production controls.
  • Environment-aware evaluation so the same action can be allowed in staging and gated in production.

Policy outcomes are visible in the dashboard so responders can see not only that approval is required, but also which rule matched.

When a service-level override is involved, Operyn surfaces that explicitly in the UI. The incident view and remediation queue show:

  • whether the match came from a service override, global policy, or built-in guardrail
  • the affected service when a serviceRules.<service> override matched
  • the exact policy rule path that triggered the decision

The policy editor also shows recent matches for each service override so operators can see which rules are actively gating or auto-approving remediation decisions.
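
One way to picture the layered evaluation is a lookup that tries the service override first, then the global rules, then a built-in fallback, and reports where the match came from. Everything in this sketch is an assumption apart from the serviceRules.<service> path form: the globalRules key, the outcome names, the precedence order, and the fallback outcome are illustrative, and severity is omitted for brevity.

```python
# Hypothetical policy document. Only the serviceRules.<service> path form
# comes from the text above; all other keys and outcome names are assumptions.
POLICY = {
    "globalRules": {
        "staging": {"restart-service": "auto-approve"},
        "production": {"restart-service": "require-approval"},
    },
    "serviceRules": {
        "billing-service": {
            "production": {"rollback-deployment": "block"},
        },
    },
}

# Outcome used when no configured rule matches (assumed built-in guardrail).
BUILT_IN_OUTCOME = "require-approval"


def evaluate(policy, service, environment, action_type):
    """Return the outcome, the layer that matched, and the matching rule path."""
    layers = [
        ("service override", f"serviceRules.{service}",
         policy.get("serviceRules", {}).get(service, {})),
        ("global policy", "globalRules", policy.get("globalRules", {})),
    ]
    for source, prefix, rules in layers:
        outcome = rules.get(environment, {}).get(action_type)
        if outcome is not None:
            return {
                "outcome": outcome,
                "source": source,
                "rulePath": f"{prefix}.{environment}.{action_type}",
            }
    return {
        "outcome": BUILT_IN_OUTCOME,
        "source": "built-in guardrail",
        "rulePath": "builtin.default",
    }


decision = evaluate(POLICY, "billing-service", "production", "rollback-deployment")
```

Returning the source and rule path alongside the outcome mirrors what the incident view and remediation queue are described as showing.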

Remediation Schema

Each remediation action contains:

Field           Description
id              Unique identifier (UUID)
incidentId      The incident this action addresses
actionType      One of the supported action types above
parameters      Action-specific parameters (e.g. replica count, deployment name)
status          Current approval/execution status
requestedBy     Who or what requested the action (e.g. core-platform/ai)
approvedBy      Who approved (if applicable)
simulationMode  Whether this ran in simulate mode
result          Execution output or error message
createdAt       When the action was proposed
executedAt      When execution completed (if applicable)
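
For readers who prefer a typed view, the fields above could be modelled roughly as follows. The field names come from the schema; the Python types, the defaults, and the sample values (incident ID, parameter names) are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class RemediationAction:
    """Field names follow the schema above; types and defaults are assumed."""
    incidentId: str
    actionType: str
    parameters: dict[str, Any]
    requestedBy: str
    simulationMode: bool
    status: str = "PENDING_APPROVAL"
    approvedBy: Optional[str] = None       # set once someone approves
    result: Optional[str] = None           # execution output or error message
    executedAt: Optional[datetime] = None  # set when execution completes
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    createdAt: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# A freshly proposed action; "inc-123" is a made-up incident identifier.
action = RemediationAction(
    incidentId="inc-123",
    actionType="scale-pods",
    parameters={"service": "api-gateway", "replicas": 6},
    requestedBy="core-platform/ai",
    simulationMode=True,
)
```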

Audit Log

Every remediation action — approved, rejected, simulated, or executed — is recorded in the audit log. The audit entry includes:

  • Actor: who initiated or approved the action.
  • Action: what was done (e.g. remediation.approved, remediation.executed).
  • Resource: which remediation and incident it relates to.
  • Metadata: parameters, results, timing.
  • Timestamp: when it happened.

Manual approval and rejection entries can also include:

  • Decision reason: the responder's written rationale for approving or rejecting the action.
  • Policy context: the exact policy rule or service override that originally gated or blocked the remediation.
  • Override trail: who made the final decision when a human overrode the automated policy outcome.

This provides a complete, immutable record for post-incident reviews and compliance.
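
The entry shape described above might look like the sketch below. How Operyn actually enforces immutability is not stated here, so this only illustrates the idea with frozen records and a log that exposes append and read operations. The class names, attribute names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class AuditEntry:
    """One audit record; attribute names are illustrative, not Operyn's."""
    actor: str
    action: str
    resource: Mapping[str, str]
    metadata: Mapping[str, Any]
    timestamp: datetime
    decision_reason: Optional[str] = None  # manual approvals and rejections
    policy_context: Optional[str] = None   # rule or override that gated the action


class AuditLog:
    """In-memory log that offers no way to edit or delete existing entries."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)

    def entries(self):
        # Hand back a tuple so callers cannot reorder or drop records.
        return tuple(self._entries)


log = AuditLog()
log.append(AuditEntry(
    actor="sre@example.com",
    action="remediation.approved",
    resource={"remediationId": "rem-1", "incidentId": "inc-123"},
    metadata={"actionType": "restart-service"},
    timestamp=datetime.now(timezone.utc),
    decision_reason="Restart is low risk and matches the diagnosis.",
    policy_context="serviceRules.billing-service",
))
```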

Best Practices

  1. Start in simulate mode — let the system detect and diagnose for a few days before enabling live remediation.
  2. Review AI suggestions — use the dashboard to validate that suggested fixes make sense before approving.
  3. Scope narrowly — begin with low-risk actions (restart a single pod) before enabling broader actions (rollback deployments).
  4. Monitor after remediation — check that the incident actually resolves after the fix is applied.
  5. Use the audit log — review past remediations in post-incident reviews to improve detection rules and fix quality.