Internal Rollout Guide

This guide walks through deploying Operyn within your own AWS infrastructure as a first internal pilot.

Phase 1: Local Validation (Week 1)

Goal

Verify that the end-to-end pipeline works with realistic data before touching production infrastructure.

Steps

  1. Run Operyn locally

    bun run dev
    
  2. Send sample CloudWatch-format logs

     Use the curl examples from the Quickstart to send logs that match your real services.

  3. Verify the pipeline

    • Logs appear in OpenSearch (operyn-logs index).
    • Incident Engine creates incidents from error patterns.
    • AI Engine produces diagnoses (set LLM_PROVIDER=mock, or configure a real provider and API key).
    • Dashboard shows incidents and remediation suggestions.
  4. Adjust detection thresholds

     Review the Incident Engine's detection rules and tune thresholds to match your expected event volume.
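Before changing real rules in step 4, it can help to reason about a threshold with a small model. The sketch below is not Operyn's rule format, which this guide does not show; `ErrorRateRule`, `window_seconds`, `max_error_rate`, and `min_events` are hypothetical names chosen for illustration. It only shows how a stricter or looser error-rate threshold changes whether the same events would open an incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRateRule:
    """Hypothetical rule shape; Operyn's real detection rules may differ."""
    window_seconds: int    # how far back to look
    max_error_rate: float  # fraction of ERROR events that triggers an incident
    min_events: int        # skip windows with too little traffic to judge


def should_open_incident(events, now, rule):
    """events is an iterable of (timestamp_seconds, level) tuples."""
    recent = [
        level for ts, level in events
        if now - rule.window_seconds <= ts <= now
    ]
    if len(recent) < rule.min_events:
        return False  # too few events; avoids noise from quiet services
    errors = sum(1 for level in recent if level == "ERROR")
    return errors / len(recent) > rule.max_error_rate


# Ten events in the window, three of them errors (error rate 0.3).
sample = [(100 + i, "ERROR" if i < 3 else "INFO") for i in range(10)]
strict = ErrorRateRule(window_seconds=60, max_error_rate=0.2, min_events=5)
relaxed = ErrorRateRule(window_seconds=60, max_error_rate=0.5, min_events=5)

print(should_open_incident(sample, now=110, rule=strict))
print(should_open_incident(sample, now=110, rule=relaxed))
```

Replaying a day of sample logs through a model like this, with the thresholds you intend to set, gives a rough idea of how many incidents to expect before Phase 2.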

Phase 2: CloudWatch Integration (Week 2)

Goal

Forward logs from one or two non-critical AWS services into Operyn.

Steps

  1. Deploy the Lambda forwarder

     Follow the AWS CloudWatch Integration guide to create the Lambda and subscription filters.

  2. Start with low-risk log groups

     Pick services that are well understood and generate manageable log volumes:

    • A staging environment API.
    • A non-critical internal service.
    • A Lambda function with known error patterns.
  3. Set OPERYN_API_KEY

     Configure the same API key on both the Lambda (as an environment variable) and the Operyn ingestion service.

  4. Monitor ingestion

    • Check the Lambda's CloudWatch Logs for errors.
    • Verify events appear in Operyn's OpenSearch index.
    • Watch the Dashboard for detected incidents.
  5. Use REMEDIATION_MODE=simulate

     Do not enable live remediation yet. Let Operyn detect and diagnose while you validate accuracy.
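When checking the Lambda's logs for errors in step 4, it helps to know what the forwarder receives. CloudWatch Logs delivers subscription data to Lambda as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The sketch below decodes that payload and flattens it. It is not the forwarder from the AWS CloudWatch Integration guide, and the output field names in `to_operyn_events` are assumptions, as is the `encode_for_test` helper used for local dry runs.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs delivers to a subscribed Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_operyn_events(payload):
    """Flatten log events. The output field names here are hypothetical."""
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # CONTROL_MESSAGE payloads are reachability checks, not logs
    return [
        {
            "logGroup": payload["logGroup"],
            "logStream": payload["logStream"],
            "timestamp": entry["timestamp"],  # milliseconds since the epoch
            "message": entry["message"],
        }
        for entry in payload["logEvents"]
    ]


def encode_for_test(payload):
    """Build a fake Lambda event for local testing (illustrative helper)."""
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}
```

Feeding `encode_for_test(...)` output through `decode_subscription_event` locally is a cheap way to confirm the decoding path before attaching a subscription filter to a real log group.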

Phase 3: Expand Coverage (Weeks 3-4)

Goal

Add more services and refine detection quality.

Steps

  1. Add production log groups

     Expand the Lambda's subscription filters to cover:

    • EKS/ECS application logs.
    • API Gateway access logs.
    • RDS error logs.
  2. Refine service naming

     Ensure the SERVICE_MAP in the Lambda's environment maps each log group to a service name that your team recognises.

  3. Add recommended metadata

     Ensure logs include env, region, cluster, and team in the metadata field for better incident correlation.

  4. Tune detection rules

     Based on the first few weeks of real data:

    • Adjust error rate thresholds.
    • Add or remove pattern matching rules.
    • Reduce false positive incidents.
  5. Evaluate AI diagnosis and probe quality

     Review AI-generated diagnoses and diagnostic probe evidence. If you have been using mock, switch to a real LLM provider (openai, anthropic, or gemini) and assess whether the outputs are actionable.

    • Validate Kubernetes and AWS probe summaries against known incidents.
    • Ensure incident resource mappings are set so AWS tools are scoped correctly.
    • Confirm that responders are not required to run unsafe, broad queries.
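Steps 2 and 3 can be checked mechanically before events reach Operyn. The sketch below assumes SERVICE_MAP holds a JSON object that maps log group names to service names; this guide does not specify its format, so treat that as an assumption. The helper names and the fallback behaviour are likewise illustrative. Only the four metadata keys come from this guide.

```python
import json
import os

# The four metadata keys recommended in step 3.
RECOMMENDED_METADATA = ("env", "region", "cluster", "team")


def load_service_map(environ=os.environ):
    """Assumes SERVICE_MAP is JSON like {"<log group>": "<service name>"}."""
    return json.loads(environ.get("SERVICE_MAP", "{}"))


def resolve_service(log_group, service_map):
    """Use the mapped name, else fall back to the last path segment."""
    if log_group in service_map:
        return service_map[log_group]
    return log_group.rstrip("/").split("/")[-1] or "unknown"


def missing_metadata(event):
    """Return the recommended metadata keys that an event does not carry."""
    metadata = event.get("metadata", {})
    return [key for key in RECOMMENDED_METADATA if not metadata.get(key)]


service_map = load_service_map(
    {"SERVICE_MAP": '{"/aws/lambda/billing-prod": "billing"}'}
)
print(resolve_service("/aws/lambda/billing-prod", service_map))
print(resolve_service("/ecs/checkout", service_map))
print(missing_metadata({"metadata": {"env": "prod", "region": "eu-west-1"}}))
```

Logging the result of `missing_metadata` during the pilot shows which services still need their metadata completed before incident correlation can be trusted.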

Phase 4: Enable Remediation (Week 5+)

Goal

Transition from observe-only to active remediation.

Steps

  1. Start with manual approval only

     Keep REMEDIATION_MODE=simulate and use the Dashboard to review what Operyn would do. For each proposed action, ask:

    • Are the suggested actions correct?
    • Would they have resolved the incident?
    • Are there any actions that could be harmful?
  2. Enable live remediation for safe actions

     Switch to REMEDIATION_MODE=live, but start with low-risk actions only:

    • restart-service (restart pods).
    • clear-cache (purge cache).
  3. Require approval for high-risk actions

     Keep manual approval required for:

    • rollback-deployment.
    • scale-pods (significant scale changes).
    • run-script (custom scripts).
  4. Set up notifications

     Configure Slack and/or Jira so the team is alerted when:

    • An incident is detected.
    • A remediation is proposed and waiting for approval.
    • A remediation is executed (or fails).
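The policy described in steps 1 to 3 can be summarised as a small decision function. This is a sketch of the intended behaviour, not Operyn's implementation: the function name and its return values are hypothetical, while the action names and REMEDIATION_MODE values come from this guide. Treating unrecognised actions as approval-required is an assumption made here for safety.

```python
# Low-risk actions from step 2. Anything else waits for a human.
LOW_RISK_ACTIONS = {"restart-service", "clear-cache"}


def decide(action, remediation_mode):
    """Return "simulate", "execute", or "await-approval" for a proposal."""
    if remediation_mode != "live":
        return "simulate"  # observe-only: nothing is executed
    if action in LOW_RISK_ACTIONS:
        return "execute"
    # rollback-deployment, scale-pods, run-script, and any unrecognised
    # action all default to a human decision.
    return "await-approval"


for name in ("restart-service", "rollback-deployment", "run-script"):
    print(name, "->", decide(name, "live"))
```

Defaulting to approval for anything not explicitly listed means that a newly added action type cannot run unattended by accident.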

Feedback Loop

Throughout the pilot:

  • Weekly review: look at incidents detected, diagnoses produced, probe runs executed, and actions taken. Are they accurate? Useful?
  • Tune thresholds: reduce noise by adjusting detection sensitivity based on your actual event patterns.
  • Document learnings: note which service names, metadata fields, and detection rules work best for your team.
  • Iterate on docs: update the internal CloudWatch integration docs based on what you learn. These become the template for other teams.

Success Criteria

Before expanding beyond the pilot:

  • Operyn correctly detects known incident patterns from your infrastructure.
  • AI diagnoses and probe summaries provide useful root cause information (even if not always perfect).
  • Remediation suggestions match what an SRE would actually do.
  • False positive rate is acceptable (< 10% of incidents are noise).
  • Team is comfortable with the approval workflow.
  • Notifications arrive reliably on Slack/Jira.
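The false positive criterion is simple arithmetic over the incidents reviewed in the weekly sessions. A minimal sketch follows, assuming each reviewed incident has been labelled as real or noise by the team. The function names are illustrative, and returning no verdict when there are no incidents is a choice made here, not something this guide prescribes.

```python
def false_positive_rate(total_incidents, noise_incidents):
    """Fraction of reviewed incidents that the team labelled as noise."""
    return noise_incidents / total_incidents


def meets_noise_criterion(total_incidents, noise_incidents, limit=0.10):
    """True when strictly less than `limit` of incidents are noise."""
    if total_incidents <= 0:
        return False  # no incidents reviewed means no evidence either way
    return false_positive_rate(total_incidents, noise_incidents) < limit


# 3 noisy incidents out of 40 is 7.5%, which passes; 4 out of 40 is 10%,
# which does not, because the criterion is "< 10%".
print(meets_noise_criterion(40, 3))
print(meets_noise_criterion(40, 4))
```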