Internal Rollout Guide
This guide walks through deploying Operyn within your own AWS infrastructure as a first internal pilot.
Phase 1: Local Validation (Week 1)
Goal
Verify that the end-to-end pipeline works with realistic data before touching production infrastructure.
Steps
- Run Operyn locally: bun run dev
- Send sample CloudWatch-format logs. Use the curl examples from the Quickstart to send logs that match your real services.
- Verify the pipeline:
  - Logs appear in OpenSearch (operyn-logs index).
  - Incident Engine creates incidents from error patterns.
  - AI Engine produces diagnoses (use LLM_PROVIDER=mock or a real key).
  - Dashboard shows incidents and remediation suggestions.
- Adjust detection thresholds. Review the Incident Engine's detection rules and tune thresholds to match your expected event volume.
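One way to approach the threshold-tuning step is to measure the error rate in your sample traffic and set the alerting threshold some margin above it. The sketch below is illustrative only: the event shape (a dict with timestamp and level fields), the 60-second window, and the headroom multiplier are all assumptions, not Operyn's actual detection rule format.

```python
from collections import Counter


def error_rate_per_window(events, window_seconds=60):
    """Return {window_start: error_fraction}.

    Each event is assumed (hypothetically) to carry 'timestamp' in epoch
    seconds and a 'level' string.
    """
    totals, errors = Counter(), Counter()
    for event in events:
        window = int(event["timestamp"]) // window_seconds * window_seconds
        totals[window] += 1
        if event["level"] == "error":
            errors[window] += 1
    return {w: errors[w] / totals[w] for w in totals}


def suggest_threshold(rates, headroom=2.0, floor=0.01):
    """Worst observed baseline rate times headroom, clamped to [floor, 1.0]."""
    baseline = max(rates.values(), default=0.0)
    return min(1.0, max(floor, baseline * headroom))


sample = [
    {"timestamp": 0, "level": "info"},
    {"timestamp": 10, "level": "error"},
    {"timestamp": 20, "level": "info"},
    {"timestamp": 30, "level": "info"},
    {"timestamp": 70, "level": "info"},
    {"timestamp": 80, "level": "info"},
]
rates = error_rate_per_window(sample)
threshold = suggest_threshold(rates)
```

With this sample, the first window has one error in four events, so the suggested threshold is twice that baseline. Treat the result as a starting point to enter into the Incident Engine's rules, then revisit it once real traffic arrives.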
Phase 2: CloudWatch Integration (Week 2)
Goal
Forward logs from one or two non-critical AWS services into Operyn.
Steps
- Deploy the Lambda forwarder. Follow the AWS CloudWatch Integration guide to create the Lambda and subscription filters.
- Start with low-risk log groups. Pick services that are well understood and generate manageable log volumes:
  - A staging environment API.
  - A non-critical internal service.
  - A Lambda function with known error patterns.
- Set OPERYN_API_KEY. Configure the API key on both the Lambda (env var) and the Operyn ingestion service.
- Monitor ingestion:
  - Check the Lambda's CloudWatch Logs for errors.
  - Verify events appear in Operyn's OpenSearch index.
  - Watch the Dashboard for detected incidents.
- Use REMEDIATION_MODE=simulate. Do not enable live remediation yet. Let Operyn detect and diagnose while you validate accuracy.
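When debugging the forwarder, it helps to know what a subscription filter hands to a Lambda: a single awslogs.data field containing base64-encoded, gzip-compressed JSON. The sketch below decodes that envelope and builds a synthetic one for local testing. The log group and message are made-up sample values, and the step that forwards events to Operyn is deliberately left out, since the ingestion endpoint and payload format are defined in the AWS CloudWatch Integration guide rather than here.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs delivers to a subscribed Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def encode_subscription_event(payload):
    """Wrap a payload in the same envelope, for local testing only."""
    raw = json.dumps(payload).encode("utf-8")
    data = base64.b64encode(gzip.compress(raw)).decode("ascii")
    return {"awslogs": {"data": data}}


# Hypothetical sample values; real events also carry 'owner' and
# 'subscriptionFilters' fields.
sample_payload = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/aws/lambda/staging-api",
    "logStream": "2024/01/01/[$LATEST]abc",
    "logEvents": [
        {"id": "1", "timestamp": 1700000000000, "message": "ERROR timeout calling db"}
    ],
}

event = encode_subscription_event(sample_payload)
decoded = decode_subscription_event(event)
messages = [e["message"] for e in decoded["logEvents"]]
```

Running the round trip locally is a quick check that decoding works before you start reading the Lambda's own CloudWatch Logs for errors.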
Phase 3: Expand Coverage (Weeks 3-4)
Goal
Add more services and refine detection quality.
Steps
- Add production log groups. Expand the Lambda's subscription filters to cover:
  - EKS/ECS application logs.
  - API Gateway access logs.
  - RDS error logs.
- Refine service naming. Ensure the SERVICE_MAP in the Lambda's environment correctly maps log groups to meaningful service names that your team recognises.
- Add recommended metadata. Ensure logs include env, region, cluster, and team in the metadata field for better incident correlation.
- Tune detection rules. Based on the first few weeks of real data:
  - Adjust error rate thresholds.
  - Add or remove pattern matching rules.
  - Reduce false positive incidents.
- Evaluate AI diagnosis and probe quality. Review AI-generated diagnoses and diagnostic probe evidence. If using mock, switch to a real LLM provider (openai, anthropic, or gemini) and assess whether outputs are actionable.
  - Validate Kubernetes and AWS probe summaries against known incidents.
  - Ensure incident resource mappings are set so AWS tools are scoped correctly.
  - Confirm no unsafe broad queries are required for responders.
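The service naming and metadata steps above can be checked with a small script before events reach Operyn. This is a sketch under stated assumptions: it treats SERVICE_MAP as a JSON object of log group name to service name, which may not match the forwarder's actual encoding, and the helper names and sample log groups are hypothetical. Only the four metadata keys come from this guide.

```python
import json

# Recommended metadata keys named in this guide.
RECOMMENDED_METADATA = ("env", "region", "cluster", "team")


def parse_service_map(raw):
    """Parse SERVICE_MAP, assumed here to be a JSON object string."""
    mapping = json.loads(raw) if raw else {}
    if not isinstance(mapping, dict):
        raise ValueError("SERVICE_MAP must be a JSON object")
    return mapping


def resolve_service(log_group, service_map):
    """Use the mapped name if present, else the last path segment."""
    mapped = service_map.get(log_group)
    return mapped or log_group.rstrip("/").rsplit("/", 1)[-1]


def missing_metadata(metadata):
    """List recommended metadata keys that are absent or empty."""
    return [key for key in RECOMMENDED_METADATA if not metadata.get(key)]


service_map = parse_service_map('{"/aws/ecs/prod-checkout": "checkout-api"}')
mapped = resolve_service("/aws/ecs/prod-checkout", service_map)
fallback = resolve_service("/aws/lambda/report-builder", service_map)
gaps = missing_metadata({"env": "prod", "region": "eu-west-1"})
```

Here the mapped log group resolves to checkout-api, the unmapped one falls back to report-builder, and the metadata check reports cluster and team as missing. Logging unmapped groups and metadata gaps during the pilot makes it easier to see which names still need attention.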
Phase 4: Enable Remediation (Week 5+)
Goal
Transition from observe-only to active remediation.
Steps
- Start with manual approval only. Keep REMEDIATION_MODE=simulate and use the Dashboard to review what Operyn would do. Ask:
  - Are the suggested actions correct?
  - Would they have resolved the incident?
  - Are there any actions that could be harmful?
- Enable live remediation for safe actions. Switch to REMEDIATION_MODE=live but start with low-risk actions only:
  - restart-service (restart pods).
  - clear-cache (purge cache).
- Require approval for high-risk actions. Keep manual approval required for:
  - rollback-deployment.
  - scale-pods (significant scale changes).
  - run-script (custom scripts).
- Set up notifications. Configure Slack and/or Jira so the team is alerted when:
  - An incident is detected.
  - A remediation is proposed and waiting for approval.
  - A remediation is executed (or fails).
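The approval policy described in these steps can be written down as a small decision table, which is useful when agreeing the rules with your team. The sketch below is not Operyn's implementation: the decide function and its return values are hypothetical, and only the action names and the simulate/live modes come from this guide. Unknown actions are treated as high risk so the policy fails closed.

```python
# Risk tiers taken from the steps above.
ACTION_RISK = {
    "restart-service": "low",
    "clear-cache": "low",
    "rollback-deployment": "high",
    "scale-pods": "high",
    "run-script": "high",
}


def decide(action, mode):
    """Return 'simulate', 'execute', or 'await-approval' for a proposed action."""
    if mode != "live":
        # Any mode other than live is treated as observe-only.
        return "simulate"
    # Actions missing from the table default to high risk.
    if ACTION_RISK.get(action, "high") == "low":
        return "execute"
    return "await-approval"


decisions = {
    (action, mode): decide(action, mode)
    for action in ("clear-cache", "rollback-deployment", "drain-node")
    for mode in ("simulate", "live")
}
```

In this table only clear-cache runs automatically in live mode. The made-up drain-node action waits for approval because it is not listed.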
Feedback Loop
Throughout the pilot:
- Weekly review: look at incidents detected, diagnoses produced, probe runs executed, and actions taken. Are they accurate? Useful?
- Tune thresholds: reduce noise by adjusting detection sensitivity based on your actual event patterns.
- Document learnings: note which service names, metadata fields, and detection rules work best for your team.
- Iterate on docs: update the internal CloudWatch integration docs based on what you learn. These become the template for other teams.
Success Criteria
Before expanding beyond the pilot:
- Operyn correctly detects known incident patterns from your infrastructure.
- AI diagnoses and probe summaries provide useful root cause information (even if not always perfect).
- Remediation suggestions match what an SRE would actually do.
- False positive rate is acceptable (< 10% of incidents are noise).
- Team is comfortable with the approval workflow.
- Notifications arrive reliably on Slack/Jira.
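The false positive criterion is easier to track if each reviewed incident gets a verdict during the weekly review. The sketch below computes the noise rate against the 10% target. The incident records and the verdict field are hypothetical: they stand in for whatever labels your team records, not for an Operyn export format.

```python
def false_positive_rate(incidents):
    """Fraction of reviewed incidents labelled 'noise', or None if none were reviewed."""
    reviewed = [i for i in incidents if i.get("verdict") in ("real", "noise")]
    if not reviewed:
        return None
    noise = sum(1 for i in reviewed if i["verdict"] == "noise")
    return noise / len(reviewed)


def meets_noise_target(incidents, max_rate=0.10):
    """True only when there is reviewed data and the noise rate is below max_rate."""
    rate = false_positive_rate(incidents)
    return rate is not None and rate < max_rate


# Hypothetical week: 19 real incidents and 1 judged to be noise.
week = [{"id": n, "verdict": "real"} for n in range(19)]
week.append({"id": 19, "verdict": "noise"})

rate = false_positive_rate(week)
ok = meets_noise_target(week)
```

Unreviewed incidents are excluded rather than counted as real, so a week with no reviews cannot pass the target by default.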