{"openapi":"3.1.0","info":{"title":"Cloud World Model - RL Training API","version":"1.0.0","description":"RESTful API for training reinforcement learning agents on cloud infrastructure autoscaling.\n\nThis API enables external AI agents to learn optimal autoscaling policies through trial and error.\nAgents can create training environments, execute actions, receive rewards, and observe system state.\n\n**Authentication:**\n- **Simulation endpoints** (`/simulations`, `/events`, `/traffic-patterns`, `/failure-injections`): Public, no authentication required\n- **RL Environment endpoints** (`/rl/environments/*`): Require API key authentication via Bearer token\n- **API Key Management** (`/keys`): Public for demonstration purposes (should be secured in production)\n\n**Use Cases:**\n- Train agents for cost-efficient autoscaling\n- Test scaling policies before production deployment\n- Optimize multi-cloud resource allocation\n- Simulate months of production traffic in minutes\n\n**Episode Lifecycle:**\n1. Create a simulation with cloud resources (no auth required)\n2. Generate an API key for RL training (no auth required for demo)\n3. Create an RL environment linked to the simulation (requires API key)\n4. Training loop: observe → select action → step → receive reward (requires API key)\n5. Reset episode when done or max steps reached (requires API key)\n\n## Simulation Lifecycle\n\nThe simulation API provides a complete lifecycle for creating, evolving, and analyzing virtual\ncloud environments without touching real infrastructure. Use this flow for load testing,\narchitecture validation, chaos experiments, and generating realistic training data for RL agents.\n\nThe snippets below use shell variables — set them once and every subsequent command works\nend-to-end. Step 1 accepts an optional API key (omit for a guest demo simulation; include one\nto link it to your account and unlock unlimited steps). 
Step 2 requires auth for owned\nsimulations; demo (keyless) simulations can step without a key, up to 20 steps. Steps 3–5\nand 7–8 require an API key; Step 6 (bottleneck analysis) accepts an optional key. Step 9\n(RL training) is an optional advanced branch that forks off after Step 5 — run it before or\ninstead of cleanup.\n\nResponse samples below are illustrative abbreviations; see each endpoint's schema in this\nspec for the full payload shape. The shell variable `SIM_ID` below holds the simulation\nUUID returned by Step 1; it maps to the `{simulationId}` path parameter in all subsequent\nAPI calls.\n\n```bash\nexport BASE_URL=\"https://your-app.replit.app\"\n\n# One-time bootstrap: mint your first admin key.\n# Requires BOOTSTRAP_SECRET to be set as an environment variable on the server.\n# Only succeeds when no admin key exists yet — subsequent calls return 409.\n# If bootstrap is already consumed, use an existing key or an admin-issued\n# registration token (POST /api/keys/register with a token from POST /api/register-tokens).\nexport API_KEY=$(curl -s -X POST \"$BASE_URL/api/keys/bootstrap-admin\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{ \"bootstrapSecret\": \"your-bootstrap-secret\" }' \\\n  | python3 -c \"import sys,json; print(json.load(sys.stdin)['key'])\")\n\n# Canvas Cloud AI users: exchange your CCA token instead:\n#   export API_KEY=$(curl -s -X POST \"$BASE_URL/api/keys/register\" \\\n#     -H \"Content-Type: application/json\" \\\n#     -d '{ \"token\": \"cca_live_...\" }' \\\n#     | python3 -c \"import sys,json; print(json.load(sys.stdin)['key'])\")\n```\n\n**Step 1 — Create a simulation** (`POST /api/simulations`)\n\nProvision a named virtual environment with cloud resources (compute, database, network, storage)\nand receive a `simulationId`. Include the `resources` array in the request body to configure\nprovider-specific settings (instance type, region, autoscaling bounds). 
Mix AWS, GCP, Azure,\nOCI, and DigitalOcean resources in a single simulation to model multi-cloud topologies.\nNo authentication required; pass an API key to link the simulation to your account.\n\n```bash\n# Authenticated (owned) simulation — unlimited steps:\nSIM_ID=$(curl -s -X POST \"$BASE_URL/api/simulations\" \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"name\": \"my-first-sim\",\n    \"resources\": [\n      {\n        \"id\": \"web-1\",\n        \"name\": \"Web Server\",\n        \"type\": \"compute\",\n        \"provider\": \"aws\",\n        \"characteristics\": {\n          \"instanceType\": \"t3.medium\",\n          \"region\": \"us-east-1\",\n          \"minInstances\": 1,\n          \"maxInstances\": 5\n        }\n      }\n    ]\n  }' | python3 -c \"import sys,json; print(json.load(sys.stdin)['id'])\")\necho \"Simulation ID: $SIM_ID\"\n# Response: { \"id\": \"sim_abc123\", \"name\": \"my-first-sim\", \"resources\": [...], ... }\n\n# Guest/demo mode (omit Authorization header) — no key needed, capped at 20 steps:\n# SIM_ID=$(curl -s -X POST \"$BASE_URL/api/simulations\" \\\n#   -H \"Content-Type: application/json\" \\\n#   -d '{ \"name\": \"demo-sim\", \"resources\": [...] }' \\\n#   | python3 -c \"import sys,json; print(json.load(sys.stdin)['id'])\")\n```\n\n**Step 2 — Advance the simulation** (`POST /api/simulations/{simulationId}/step`)\n\nDrive the simulation forward one timestep. The hybrid prediction engine applies registered\ntraffic patterns, evaluates autoscaling rules, calculates CPU utilization / error-rate /\nthroughput metrics, and returns updated resource states. Call this in a loop to model minutes,\nhours, or months of production traffic in seconds. No request body is needed. 
Auth is required\nfor owned simulations; demo (keyless) simulations can step without a key (limited to 20 steps).\n\n```bash\ncurl -s -X POST \"$BASE_URL/api/simulations/$SIM_ID/step\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: { \"simulation\": { ... }, \"metrics\": { \"cpuUsage\": 42.1, \"latencyP95\": 180,\n#             \"errorRate\": 0.002, \"throughput\": 298 }, \"events\": [] }\n```\n\n**Step 3 — Inspect metrics and events**\n\n- `GET /api/simulations/{simulationId}/metrics` — time-series performance metrics (CPU\n  utilization, error rate, throughput, latency) indexed by simulation step.\n- `GET /api/simulations/{simulationId}/events` — the event log: scale-out decisions, failure\n  triggers, cost spikes, and autoscaling threshold crossings.\n\nUse these to validate that the simulation is behaving as expected before running expensive\nanalysis jobs.\n\n```bash\n# Fetch time-series metrics\ncurl -s \"$BASE_URL/api/simulations/$SIM_ID/metrics\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: [{ \"timestamp\": 1, \"cpuUsage\": 42.1, \"latencyP95\": 180, \"errorRate\": 0.002,\n#              \"throughput\": 298 }, ...]\n\n# Fetch the event log\ncurl -s \"$BASE_URL/api/simulations/$SIM_ID/events\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: [{ \"id\": \"evt_1\", \"type\": \"scale_out\", \"message\": \"Scaled out to 2 instances\",\n#              \"severity\": \"info\", \"timestamp\": \"2026-05-01T12:00:00Z\" }, ...]\n```\n\n**Step 4 — Add and activate traffic patterns**\n\n- `POST /api/simulations/{simulationId}/patterns` — register a named traffic pattern (ramp,\n  burst, step, wave, or custom) that is applied on every subsequent `step` call. 
Multiple\n  patterns compose automatically.\n- `POST /api/simulations/{simulationId}/inject-traffic` — inject an immediate random traffic\n  spike into the simulation, independent of registered patterns.\n\n```bash\n# Register a ramp-up traffic pattern (applied on every /step call)\ncurl -s -X POST \"$BASE_URL/api/simulations/$SIM_ID/patterns\" \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"name\": \"gradual-ramp\",\n    \"type\": \"ramp\",\n    \"startTime\": 0,\n    \"parameters\": { \"startTraffic\": 100, \"endTraffic\": 900, \"duration\": 20 }\n  }'\n# Response: { \"id\": \"pat_xyz\", \"name\": \"gradual-ramp\", \"type\": \"ramp\", ... }\n\n# Inject an immediate one-off traffic spike\ncurl -s -X POST \"$BASE_URL/api/simulations/$SIM_ID/inject-traffic\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: { \"simulation\": { ... }, \"event\": { \"message\": \"Traffic spike injected\", ... } }\n```\n\n**Step 5 — Add and trigger failure injections**\n\n- `POST /api/simulations/{simulationId}/failures` — register a failure scenario (database crash,\n  zone outage, network partition, CPU stress) against the simulation.\n- `POST /api/simulations/{simulationId}/inject-failure` — trigger the failure injection.\n\nPair with `GET /api/simulations/{simulationId}/events` to observe how the simulation detects,\nreacts to, and recovers from each failure.\n\n```bash\n# Register a zone-outage failure scenario\ncurl -s -X POST \"$BASE_URL/api/simulations/$SIM_ID/failures\" \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"name\": \"us-east-1a outage\",\n    \"type\": \"az_outage\",\n    \"targetResourceId\": \"web-1\",\n    \"severity\": \"severe\",\n    \"startTime\": 0,\n    \"parameters\": { \"errorRateIncrease\": 0.4 }\n  }'\n# Response: { \"id\": \"fail_abc\", \"name\": \"us-east-1a outage\", \"type\": \"az_outage\", ... 
}\n\n# Trigger a node failure on a random healthy instance\ncurl -s -X POST \"$BASE_URL/api/simulations/$SIM_ID/inject-failure\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: { \"simulation\": { ... }, \"event\": { \"message\": \"Node failure injected\", ... } }\n```\n\n**Step 6 — Analyze bottlenecks** (`POST /api/simulations/{simulationId}/analyze-bottlenecks`)\n\nRun an AI-backed bottleneck analysis over the current simulation state. The engine identifies\nsaturated resources, latency hotspots, and single points of failure, and returns natural-language\nrecommendations. Pass `beginnerMode: true` to receive simplified explanations suitable for\ndevelopers who are new to cloud architecture. Authentication is optional on this endpoint\n(the API key is accepted but not required).\n\n```bash\ncurl -s -X POST \"$BASE_URL/api/simulations/$SIM_ID/analyze-bottlenecks\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{ \"beginnerMode\": false }'\n# Response: { \"analysis\": \"web-1 is running at 91% CPU. Consider scaling out to 3\n#             instances or upgrading to c5.large before traffic doubles.\",\n#             \"doRecommendation\": \"Add a second c5.large instance to distribute load.\" }\n```\n\n**Step 7 — Optimize the architecture** (`POST /api/analysis/optimize`)\n\nSubmit an asynchronous optimization job. The engine evaluates cost, performance, and reliability\ntrade-offs and returns ranked recommendations with projected savings and risk scores. 
Provide a\n`webhookUrl` to receive the result asynchronously instead of polling\n`GET /api/analysis/jobs/{jobId}`.\n\n```bash\nJOB_ID=$(curl -s -X POST \"$BASE_URL/api/analysis/optimize\" \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"simulationId\": \"'\"$SIM_ID\"'\",\n    \"goals\": {\n      \"primary\": \"balance\",\n      \"weights\": { \"cost\": 0.4, \"performance\": 0.4, \"stability\": 0.2 }\n    },\n    \"testScenario\": {\n      \"traffic_pattern\": \"spike\",\n      \"duration_steps\": 10,\n      \"include_failures\": false\n    }\n  }' | python3 -c \"import sys,json; print(json.load(sys.stdin)['job']['id'])\")\necho \"Optimization job: $JOB_ID\"\n# Poll for status:\ncurl -s \"$BASE_URL/api/analysis/jobs/$JOB_ID\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: { \"id\": \"opt_xyz789\", \"status\": \"completed\", \"variantsGenerated\": 47,\n#             \"variantsCompleted\": 47 }\n# Fetch recommendations once status === \"completed\":\ncurl -s \"$BASE_URL/api/analysis/jobs/$JOB_ID/recommendations\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: { \"recommendations\": [{ \"rank\": 1, \"name\": \"Serverless First\",\n#             \"costSavingsPercent\": 18, \"score\": 0.87 }], \"totalVariants\": 47 }\n```\n\n**Step 8 — Clean up** (`DELETE /api/simulations/{simulationId}`)\n\nDelete the simulation and all its associated resources when the experiment is complete.\n\n```bash\ncurl -s -o /dev/null -w \"%{http_code}\" -X DELETE \"$BASE_URL/api/simulations/$SIM_ID\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: HTTP 204 No Content (empty body)\n```\n\n**Step 9 — RL Training *(optional branch — fork here after Step 5, before or instead of Step 8)***\n\nOnce the simulation is populated and behaving realistically (steps 1–5 above), you can attach\nan RL environment to it and start training your agent. 
Run this before Step 8 (cleanup) or\nskip it entirely if you only need the analysis features. Requires an API key (see preamble).\n\n1. `POST /api/rl/environments` — create an RL environment linked to the simulation\n2. `POST /api/rl/environments/{environmentId}/step` — execute actions and observe rewards in a loop\n3. `POST /api/rl/environments/{environmentId}/reset` — begin a new episode when the current one ends\n\nThis lets you pre-warm a simulation with a realistic traffic baseline before starting RL\ntraining, so your agent begins from a meaningful initial state rather than an empty environment.\n\n```bash\n# Prerequisites: API_KEY and SIM_ID are already set (see preamble + Step 1 above).\n\n# 1. Create an RL environment linked to the simulation\nENV_ID=$(curl -s -X POST \"$BASE_URL/api/rl/environments\" \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"simulationId\": \"'\"$SIM_ID\"'\",\n    \"episodeConfig\": {\n      \"maxSteps\": 100,\n      \"initialTraffic\": 1000,\n      \"targetSLA\": { \"maxLatencyP95\": 200, \"maxErrorRate\": 1 },\n      \"enableFailures\": false\n    }\n  }' | python3 -c \"import sys,json; print(json.load(sys.stdin)['environment']['id'])\")\necho \"RL Environment: $ENV_ID\"\n# Response: { \"environment\": { \"id\": \"env_abc\", \"simulationId\": \"...\", ... },\n#             \"observation\": { \"metrics\": {...}, \"resources\": [...], \"traffic\": 1000 } }\n\n# 2. Training loop — execute an action and receive next obs + reward\ncurl -s -X POST \"$BASE_URL/api/rl/environments/$ENV_ID/step\" \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{ \"action\": { \"type\": \"scale_out\", \"parameters\": {} } }'\n# Response: { \"t\": 1, \"obs\": { \"rps\": 1000, \"cpu_util\": 0.45, \"instances\": 2, ... },\n#             \"metrics\": { \"cost_usd_hr\": 0.38, \"latency_p95\": 112, \"error_rate\": 0.003, ... 
},\n#             \"reward\": 0.72, \"reward_components\": { \"performance\": 0.8, ... },\n#             \"done\": false, \"info\": {} }\n\n# 3. Reset to start a new episode when done=true\ncurl -s -X POST \"$BASE_URL/api/rl/environments/$ENV_ID/reset\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n# Response: { \"environment\": { \"currentStep\": 0, ... }, \"observation\": { \"metrics\": {...} } }\n```\n\n**Python training-loop example**\n\nA self-contained Python script that wires all three steps above into a\nrunnable multi-episode training loop is available at\n[`examples/rl_training_loop.py`](examples/rl_training_loop.py).\nIt uses only the Python standard library (no third-party packages) and\ndemonstrates how to read `obs[\"cpu_util\"]`, `metrics[\"latency_p95\"]`,\n`reward`, and `done` from each step response, reset between episodes, and\nprint per-episode reward totals:\n\n```bash\n# Run against the local dev server (--token must have admin scope):\npython examples/rl_training_loop.py --token $ADMIN_KEY\n\n# Already have a write-scoped key? Skip key minting with --skip-mint:\npython examples/rl_training_loop.py --token $API_KEY --skip-mint\n\n# Run against a deployed instance with custom episode count:\npython examples/rl_training_loop.py \\\n  --base-url https://your-deployment.replit.app \\\n  --token $ADMIN_KEY \\\n  --episodes 5 \\\n  --steps 50\n```\n\n**JavaScript/Node.js training-loop example**\n\nA parallel Node.js script is available at\n[`examples/rl_training_loop.js`](examples/rl_training_loop.js) for\nJS-first developers.  It mirrors the Python script exactly — same episode\nflow, same action set, same printed output — and uses only Node.js\nbuilt-ins (`node:https` / `node:http`), so no `npm install` is required:\n\n```bash\n# Run against the local dev server (--token must have admin scope):\nnode examples/rl_training_loop.js --token $ADMIN_KEY\n\n# Already have a write-scoped key? 
Skip key minting with --skip-mint:\nnode examples/rl_training_loop.js --token $API_KEY --skip-mint\n\n# Run against a deployed instance with custom episode count:\nnode examples/rl_training_loop.js \\\n  --base-url https://your-deployment.replit.app \\\n  --token $ADMIN_KEY \\\n  --episodes 5 \\\n  --steps 50\n```\n\n**Fetch API training-loop example (Node.js 18+ / Deno / Bun)**\n\nA self-contained counterpart to the Node.js script above is available at\n[`examples/rl_training_loop_fetch.mjs`](examples/rl_training_loop_fetch.mjs).\nIt mirrors the Node.js script exactly — same episode flow, same action set,\nsame printed output — but uses the standard Fetch API throughout instead of\n`node:https` built-ins, so it runs unchanged on Node.js 18+, Deno, and Bun\nwith no `npm install` required:\n\n```bash\n# Node.js 18+ (run against the local dev server):\nnode examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY\n\n# Already have a write-scoped key? Skip key minting with --skip-mint:\nnode examples/rl_training_loop_fetch.mjs --token $API_KEY --skip-mint\n\n# Deno (requires --allow-net):\ndeno run --allow-net examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY\n\n# Bun:\nbun examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY\n\n# Run against a deployed instance with custom episode count:\nnode examples/rl_training_loop_fetch.mjs \\\n  --base-url https://your-deployment.replit.app \\\n  --token $ADMIN_KEY \\\n  --episodes 5 \\\n  --steps 50\n```\n\nIf you only need a minimal inline snippet (e.g. 
to embed in a browser\nscript or REPL), here is a compact fetch-based step loop:\n\n```js\n// fetch-based RL step loop — works in browser, Deno, Bun, Node.js 18+\n// (wrapped in an async IIFE so no top-level await is required)\nconst BASE_URL = \"https://your-deployment.replit.app\"; // or http://localhost:5000\nconst API_KEY  = \"your-write-scoped-api-key\";\nconst ENV_ID   = \"your-environment-id\"; // from POST /api/rl/environments\n\n// POST one action object ({ type, parameters }) and return the parsed step response.\nasync function rlStep(action) {\n  const res = await fetch(`${BASE_URL}/api/rl/environments/${ENV_ID}/step`, {\n    method: \"POST\",\n    headers: {\n      \"Content-Type\": \"application/json\",\n      \"Authorization\": `Bearer ${API_KEY}`,\n    },\n    body: JSON.stringify({ action }),\n  });\n  if (!res.ok) throw new Error(`Step failed: ${res.status} ${await res.text()}`);\n  return res.json();\n}\n\n(async () => {\n  for (let i = 0; i < 20; i++) {\n    // The step response carries the observation under `obs` (see the curl example above).\n    const { obs, metrics, reward, done } = await rlStep({ type: \"scale_out\", parameters: {} });\n    console.log(\n      `step ${i + 1} | cpu=${(obs.cpu_util * 100).toFixed(1)}%` +\n      ` | p95=${metrics.latency_p95}ms | reward=${reward.toFixed(3)}`\n    );\n    if (done) { console.log(\"Episode finished.\"); break; }\n  }\n})();\n```\n\n**TypeScript training-loop example**\n\nA typed counterpart is available at\n[`examples/rl_training_loop.ts`](examples/rl_training_loop.ts) for\nTypeScript projects.  It imports the generated SDK types from\n`sdk/typescript/src/openapi-types.ts` so every request body and step\nresponse is fully typed — enabling autocomplete and compile-time safety.\nRun it with `npx tsx` (no separate compile step needed):\n\n```bash\n# Run against the local dev server (--token must have admin scope):\nnpx tsx examples/rl_training_loop.ts --token $ADMIN_KEY\n\n# Already have a write-scoped key? 
Skip key minting with --skip-mint:\nnpx tsx examples/rl_training_loop.ts --token $API_KEY --skip-mint\n\n# Run against a deployed instance with custom episode count:\nnpx tsx examples/rl_training_loop.ts \\\n  --base-url https://your-deployment.replit.app \\\n  --token $ADMIN_KEY \\\n  --episodes 5 \\\n  --steps 50\n```\n\n---\n\n## Webhook Notifications\n\nThe Cloud World Model API supports webhook notifications for asynchronous job completion events.\nInstead of polling job status endpoints, you can provide a webhook URL when creating jobs, and\nthe API will send an HTTP POST request to your endpoint when the job completes.\n\n**Supported Jobs:**\n- Infrastructure Optimization jobs (`POST /api/analysis/optimize`)\n- Chaos Engineering tests (`POST /api/chaos/run`)\n- Batch Chaos Engineering tests (`POST /api/chaos/batch`)\n- Predictive Scaling validation (`POST /api/predictions/validate`)\n- Predictive Scaling threshold optimization (`POST /api/predictions/optimize-thresholds`)\n- Multi-Cloud Strategy exploration (`POST /api/multi-cloud/explore`)\n- RL Environment episode completion (`POST /api/rl/environments`)\n\n**How to Use Webhooks:**\n\nWhen creating a job, include two optional fields in your request:\n- `webhookUrl` (string): HTTPS URL where the webhook should be delivered\n- `webhookSecret` (string): Secret used to sign the webhook payload (for verification)\n\n**Webhook Delivery Mechanism:**\n\n- **Asynchronous**: Webhooks are sent asynchronously when the job completes (status: completed or failed)\n- **Fire-and-forget**: The API does not wait for your webhook endpoint to respond before marking the job complete\n- **Retry Logic**: Up to 3 delivery attempts with exponential backoff (0s, 2s, 8s)\n- **Timeout**: Each delivery attempt has a 10-second timeout\n- **HTTPS Only**: Webhook URLs must use HTTPS (HTTP URLs are rejected for security)\n- **SSRF Protection**: Private IP addresses and localhost are blocked to prevent server-side request 
forgery\n\n**Webhook Payload:**\n\nThe webhook payload is a JSON object containing:\n- `event`: Event type (e.g., \"optimization.completed\", \"chaos.completed\", \"rl_episode.completed\")\n- `jobId`: Unique identifier for the job\n- `status`: Final status (\"completed\" or \"failed\")\n- `data`: Job-specific result data (structure varies by job type)\n- `timestamp`: ISO 8601 timestamp when the webhook was sent\n\n**Security - Signature Verification:**\n\nAll webhooks include an `X-Webhook-Signature` header containing an HMAC-SHA256 signature.\nYou should verify this signature to ensure the webhook came from the Cloud World Model API:\n\n1. Extract the raw request body as bytes\n2. Compute HMAC-SHA256 using your `webhookSecret` as the key and the raw body as the message\n3. Compare the computed signature with the `X-Webhook-Signature` header value\n4. Use constant-time comparison to prevent timing attacks\n\nExample verification (Python):\n```python\nimport hmac\nimport hashlib\n\ndef verify_webhook_signature(payload: bytes, signature: str, secret: str) -> bool:\n    expected_signature = hmac.new(\n        secret.encode(),\n        payload,\n        hashlib.sha256\n    ).hexdigest()\n    return hmac.compare_digest(signature, expected_signature)\n```\n\n**Webhook Delivery Status:**\n\nAll job response objects include webhook delivery tracking fields:\n- `webhookDeliveryStatus`: \"pending\", \"delivered\", or \"failed\"\n- `webhookDeliveryAttempts`: Number of delivery attempts made\n- `webhookDeliveryError`: Error message if delivery failed (e.g., timeout, connection refused)\n- `webhookDeliveredAt`: ISO 8601 timestamp of successful delivery\n\n**Best Practices:**\n\n- Use a unique `webhookSecret` for each job or use a rotating secret system\n- Always verify the webhook signature before processing the payload\n- Return a 2xx status code from your webhook endpoint to acknowledge receipt\n- Process webhooks asynchronously to avoid blocking the delivery request\n- Store 
webhook payloads for debugging and audit trails\n- Implement idempotency using the `jobId` (webhooks may be delivered multiple times)\n\n---\n\n## Handling Failures\n\nWhen a webhook arrives with `\"status\": \"failed\"`, the `data.error` field contains a\nhuman-readable message that explains why the job could not complete. Agent\nrecovery logic should inspect this message and classify the failure before deciding\nwhether to retry the same request, fix the input and resubmit, or escalate.\n\n### Two Categories of Job Failure\n\n| Category | How to recognize it | Agent action |\n|---|---|---|\n| **Invalid input** | The error message describes a specific problem with the request parameters (missing resource, malformed data, etc.) | Fix the input; do **not** retry the same request |\n| **Transient / engine error** | The error message mentions an internal error or an unexpected computation result, or does not identify a user-correctable input problem | Wait, then retry the same request up to 3 times with exponential backoff (2 s, 8 s, 30 s) |\n\nA useful rule of thumb: if the error message ends with \"resubmit\" after listing corrective\nsteps, it is an **invalid-input failure**. If the message does not provide corrective steps\nor mentions internal state, treat it as **transient** and retry.\n\n### Error Types and Recovery Steps\n\n**1. Simulation not found**\n\nExample error text: `\"Simulation 'sim_abc123' not found…\"`\n\n- **Category**: Invalid input (non-retryable as-is)\n- **Cause**: The `simulationId` supplied when creating the job references a simulation\n  that no longer exists (deleted between job submission and execution) or was never created.\n- **Recovery**:\n  1. Call `POST /api/simulations` to create a new simulation with the same resource\n     configuration.\n  2. Re-submit the job using the new `simulationId`.\n- **Do not** retry the original job request; it will fail again with the same error.\n\n**2. 
Simulation contains no resources**\n\nExample error text: `\"Simulation '…' contains no resources and cannot be validated…\"`\n\n- **Category**: Invalid input (non-retryable as-is)\n- **Cause**: The referenced simulation exists but has no compute or database resources attached.\n- **Recovery**:\n  1. Add at least one compute resource and one database resource to the simulation\n     via `POST /api/simulations/{simulationId}/resources`.\n  2. Re-submit the job.\n\n**3. simulationId belongs to a different API key scope**\n\nExample error text: `\"simulationId '…' does not exist or belongs to a different API key scope…\"`\n\n- **Category**: Invalid input / authorization (non-retryable as-is)\n- **Cause**: The API key used to submit the job does not have `read` access to the\n  simulation referenced by `simulationId`.\n- **Recovery**:\n  1. Verify that you are using the correct API key for the target simulation.\n  2. If multiple keys are in use, ensure the key used to create the simulation is the\n     same key (or a key with the same scope) used to submit the job.\n  3. Re-submit with the correct key.\n\n**4. Traffic forecast malformed**\n\nExample error text: `\"Traffic forecast '…' is malformed: timestamps are not strictly increasing…\"`\n\n- **Category**: Invalid input (non-retryable as-is)\n- **Cause**: The traffic forecast data provided in the request is structurally invalid\n  (e.g., non-monotonic timestamps, missing fields, duplicate step numbers).\n- **Recovery**:\n  1. Inspect the forecast array and sort steps so timestamps are strictly increasing.\n  2. Remove any duplicate step entries.\n  3. Re-submit the job with the corrected forecast.\n\n**5. 
Traffic forecast has insufficient data points**\n\nExample error text: `\"…contains only N data points spanning M simulation steps…requires at least 5 data points covering a minimum of 60 steps…\"`\n\n- **Category**: Invalid input (non-retryable as-is)\n- **Cause**: The traffic forecast is too short for the engine to evaluate scale-out and\n  scale-in thresholds across a complete ramp-and-drain traffic cycle.\n- **Recovery**:\n  1. Extend the forecast to at least 5 distinct load-level steps covering a minimum of\n     60 simulation steps.\n  2. Make sure the forecast includes a clear ramp-up phase, a sustained peak, and a\n     ramp-down (drain) phase.\n  3. Re-submit the job.\n\n**6. No valid threshold combination found**\n\nExample error text: `\"No valid threshold combination found…all N candidate combinations…produced peak error rates above the SLA limit…\"`\n\n- **Category**: Infrastructure constraint (non-retryable without parameter changes)\n- **Cause**: Every threshold combination the optimizer tested exceeded the SLA error-rate\n  limit for the given traffic pattern. 
This means the current infrastructure configuration\n  (instance sizes, instance counts, or both) cannot handle the forecast load regardless of\n  autoscaling thresholds.\n- **Recovery** (choose one or more):\n  - Increase `maxInstances` in the simulation's autoscaling config so the optimizer has\n    more headroom to test higher-scale configurations.\n  - Raise the minimum instance count (`minInstances`) so the pool can absorb the initial\n    traffic burst before autoscaling adds capacity.\n  - Upgrade the node or instance SKU to a larger size in the simulation resource definition.\n  - If the spike is extremely sudden (viral traffic), increase `minInstances` first since\n    autoscaling provisioning time may exceed the ramp duration.\n  - After making any of the above changes, re-submit the optimization job.\n- **Do not** retry without changing the infrastructure parameters; the optimizer will\n  produce the same result.\n\n**7. Validation engine internal error**\n\nExample error text: `\"Validation engine encountered an internal error…capacity model returned a negative throughput value at step N…\"`\n\n- **Category**: Likely invalid input (resource misconfiguration), occasionally transient\n- **Cause**: The simulation engine encountered an inconsistency it could not recover from.\n  This is usually caused by a resource configuration that produces a logically impossible\n  state (e.g., zero or negative instance counts, throughput capacity below zero).\n- **Recovery**:\n  1. Check that all resources in the simulation have positive, non-zero values for\n     instance counts, vCPU allocations, and memory.\n  2. Verify that replica counts and node pool sizes are set correctly.\n  3. Re-submit the job.\n  4. If the error persists after verifying the configuration, treat it as transient and\n     retry up to 3 times total with exponential backoff (2 s, 8 s, 30 s).\n  5. 
If all retries fail, escalate by recording the `jobId` and full error payload for\n     support investigation.\n\n### Webhook Delivery Failures vs. Job Failures\n\nJob failures (described above) are different from webhook *delivery* failures. The platform\nmakes up to 3 delivery attempts with exponential backoff (0 s, 2 s, 8 s). If all\ndelivery attempts fail, the job response object reflects this:\n\n- `webhookDeliveryStatus`: `\"failed\"`\n- `webhookDeliveryAttempts`: `3`\n- `webhookDeliveryError`: description of the network error (e.g., \"connection refused\", \"timeout\")\n\nIn this case the **job itself may have completed successfully**; only the notification\nfailed to reach your endpoint. Agent recovery steps:\n\n1. Poll the job status endpoint (e.g., `GET /api/predictions/optimize-thresholds/{jobId}`)\n   to retrieve the final result directly.\n2. Inspect `status` in the polled response:\n   - `\"completed\"` → process the result as you would a successful webhook payload.\n   - `\"failed\"` → apply the job-failure recovery steps above.\n3. Fix your webhook endpoint (connectivity, TLS certificate, response code) so future\n   deliveries succeed.\n\nThe `jobId` included in every webhook payload is stable, and polling is idempotent: you can safely\npoll the same `jobId` multiple times without triggering side effects.\n\n### Retryable vs. Non-Retryable Quick Reference\n\n| Error pattern in `data.error` | Category | Retry same request? 
|\n|---|---|---|\n| `\"…not found…\"` (simulation or resource) | Invalid input | No — fix `simulationId` first |\n| `\"…no resources…\"` | Invalid input | No — add resources first |\n| `\"…different API key scope…\"` | Authorization | No — fix key/scope first |\n| `\"…malformed…\"` or `\"…not strictly increasing…\"` | Invalid input | No — fix forecast first |\n| `\"…insufficient data points…\"` or `\"…too short…\"` | Invalid input | No — extend forecast first |\n| `\"No valid threshold combination found…\"` | Infrastructure constraint | No — change infra params first |\n| `\"…internal error…\"` or unexpected computation message | Likely invalid input, occasionally transient | Yes — verify the resource configuration, then retry up to 3× with exponential backoff (2 s, 8 s, 30 s) |\n| Any other unrecognized error | Unknown | Retry once; escalate if it recurs |\n","contact":{"name":"Cloud World Model API Support","url":"https://github.com/your-org/cloud-world-model"},"license":{"name":"MIT","url":"https://opensource.org/licenses/MIT"}},"servers":[{"url":"http://localhost:5000/api","description":"Local development server"},{"url":"https://your-production-domain.com/api","description":"Production server"}],"tags":[{"name":"RL Environments","description":"Manage reinforcement learning training environments.\n\n\nThis walkthrough shows the complete episode lifecycle for a **DigitalOcean** simulation:\n**create → reset → step → observe**. DigitalOcean simulations use Droplet-based compute\nand Managed Database resources. 
The hybrid prediction engine automatically models\nprovider-specific behaviour — Droplet cold-start overhead (~30 s provisioning latency on\nfirst request after a scale-out) and shared-tenant network jitter — so your agent learns\nrealistic scaling dynamics without spending real cloud budget.\n\n**Step 1 — Create a DigitalOcean simulation** *(no auth required)*\n\n`POST /simulations` with resources that include at least one Droplet\n(`provider: \"digitalocean\"`, `type: \"compute\"`) and, optionally, a Managed PostgreSQL\nnode and a Load Balancer. Record the returned `id` — this is your `simulationId`.\n\n**Step 2 — Mint an API key** *(no auth required for demo)*\n\n`POST /keys` → copy the `key` field from the response.\n\n**Step 3 — Create the RL environment** *(Bearer auth required)*\n\n```http\nPOST /rl/environments\nAuthorization: Bearer <your-key>\nContent-Type: application/json\n\n{\n  \"simulationId\": \"<id-from-step-1>\",\n  \"episodeConfig\": {\n    \"maxSteps\": 200,\n    \"targetTrafficPattern\": \"wave\",\n    \"initialTraffic\": 1500,\n    \"targetSLA\": { \"maxLatencyP95\": 180, \"maxErrorRate\": 1.0 },\n    \"costBudgetPerHour\": 3.50\n  }\n}\n```\n\nThe response contains the environment `id` and the initial `observation` — your agent's\nfirst view of the Droplet cluster state.\n\n**Step 4 — Training loop** *(Bearer auth required)*\n\n```\nrepeat until done == true:\n  POST /rl/environments/{environmentId}/step\n  { \"action\": { \"type\": \"adjust_threshold\",\n                \"parameters\": { \"cpuThreshold\": 65, \"throughputThreshold\": 70 } } }\n  ← { t, obs, metrics, reward, reward_components, done, info }\n```\n\n**Step 5 — Reset for the next episode**\n\n```\nPOST /rl/environments/{environmentId}/reset\n```\n\n**DigitalOcean-specific notes:**\n- `costPerHour` in observations reflects Droplet + Managed Database pricing from the\n  `nyc3` region benchmark data.\n- `action.type: \"scale_out\"` provisions a new Droplet replica; the first 
observation\n  after scaling models the ~30 s cold-start latency overhead automatically.\n- `action.type: \"adjust_threshold\"` tunes CPU/throughput triggers on the DigitalOcean\n  autoscaling profile (default: CPU-weighted scoring, 180 s cooldown).\n- Droplet `s-2vcpu-4gb` (`maxThroughput: 1800 req/s`) is the recommended starting\n  instance type for moderate workloads. Use `s-2vcpu-4gb-amd` for the Premium AMD NVMe\n  variant (same vCPU/RAM, NVMe storage, slightly higher hourly rate). Upgrade to\n  `c-4 CPU-Optimized` when your agent consistently saturates CPU.\n\n\nThis walkthrough shows the complete episode lifecycle for a **GCP** simulation:\n**create → reset → step → observe**. GCP simulations use GCE compute instances and\nCloud SQL for managed databases in the `us-central1` region. The hybrid prediction engine\nmodels GCP-specific behaviour including managed instance group warm-up latency (~45 s for\nthe first scale-out) and Cloud SQL connection pooling characteristics.\n\n**Step 1 — Create a GCP simulation** *(no auth required)*\n\n`POST /simulations` with resources using `provider: \"gcp\"`. Include at least one GCE\ninstance (`serviceFamily: \"gce\"`) and optionally a Cloud SQL node and Cloud Load Balancing\nfrontend. 
Record the returned `id` — this is your `simulationId`.\n\n**Step 2 — Mint an API key** *(no auth required for demo)*\n\n`POST /keys` → copy the `key` field from the response.\n\n**Step 3 — Create the RL environment** *(Bearer auth required)*\n\n```http\nPOST /rl/environments\nAuthorization: Bearer <your-key>\nContent-Type: application/json\n\n{\n  \"simulationId\": \"<id-from-step-1>\",\n  \"episodeConfig\": {\n    \"maxSteps\": 150,\n    \"targetTrafficPattern\": \"ramp\",\n    \"initialTraffic\": 4000,\n    \"targetSLA\": { \"maxLatencyP95\": 180, \"maxErrorRate\": 1.0 },\n    \"costBudgetPerHour\": 6.0\n  }\n}\n```\n\n**Step 4 — Training loop** *(Bearer auth required)*\n\n```\nrepeat until done == true:\n  POST /rl/environments/{environmentId}/step\n  { \"action\": { \"type\": \"adjust_threshold\",\n                \"parameters\": { \"cpuThreshold\": 68, \"throughputThreshold\": 72 } } }\n  ← { t, obs, metrics, reward, reward_components, done, info }\n```\n\n**Step 5 — Reset for the next episode**\n\n```\nPOST /rl/environments/{environmentId}/reset\n```\n\n**GCP-specific notes:**\n- `costPerHour` in observations reflects GCE `e2-standard-4` + Cloud SQL `db-standard-4`\n  pricing from the `us-central1` region benchmark.\n- `action.type: \"scale_out\"` provisions a new GCE instance; managed instance group warm-up\n  adds ~45 s latency overhead to the first observation after scaling.\n- `action.type: \"adjust_threshold\"` tunes the GCP autoscaling profile (default:\n  CPU-weighted scoring, 120 s cooldown on Compute Engine autoscaler).\n- Upgrade GCE instances from `e2-standard-4` to `n2-standard-8` in the resource definition\n  when your agent consistently saturates CPU.\n\n\nThis walkthrough shows the complete episode lifecycle for an **Azure** simulation:\n**create → reset → step → observe**. Azure simulations use Azure VM compute and\nAzure SQL Database in the `East US` region. 
The hybrid prediction engine models\nAzure-specific behaviour including VM Scale Set provisioning latency (~60 s for the\nfirst scale-out) and Azure SQL DTU burst characteristics.\n\n**Step 1 — Create an Azure simulation** *(no auth required)*\n\n`POST /simulations` with resources using `provider: \"azure\"`. Include at least one Azure VM\n(`serviceFamily: \"azure_vm\"`, `size: \"Standard_D4s_v3\"`) and optionally an Azure SQL\nDatabase node and Azure Load Balancer. Record the returned `id`.\n\n**Step 2 — Mint an API key** *(no auth required for demo)*\n\n`POST /keys` → copy the `key` field from the response.\n\n**Step 3 — Create the RL environment** *(Bearer auth required)*\n\n```http\nPOST /rl/environments\nAuthorization: Bearer <your-key>\nContent-Type: application/json\n\n{\n  \"simulationId\": \"<id-from-step-1>\",\n  \"episodeConfig\": {\n    \"maxSteps\": 150,\n    \"targetTrafficPattern\": \"wave\",\n    \"initialTraffic\": 4500,\n    \"targetSLA\": { \"maxLatencyP95\": 200, \"maxErrorRate\": 1.0 },\n    \"costBudgetPerHour\": 7.0\n  }\n}\n```\n\n**Step 4 — Training loop** *(Bearer auth required)*\n\n```\nrepeat until done == true:\n  POST /rl/environments/{environmentId}/step\n  { \"action\": { \"type\": \"adjust_threshold\",\n                \"parameters\": { \"cpuThreshold\": 70, \"throughputThreshold\": 75 } } }\n  ← { t, obs, metrics, reward, reward_components, done, info }\n```\n\n**Step 5 — Reset for the next episode**\n\n```\nPOST /rl/environments/{environmentId}/reset\n```\n\n**Azure-specific notes:**\n- `costPerHour` in observations reflects Azure VM `Standard_D4s_v3` + Azure SQL\n  `General Purpose 4 vCores` pricing from the `East US` region benchmark.\n- `action.type: \"scale_out\"` provisions a new `Standard_D4s_v3` VM via VM Scale Set;\n  the first observation after scaling models the ~60 s warm-up latency.\n- `action.type: \"adjust_threshold\"` tunes the Azure autoscaling profile (default:\n  CPU-weighted scoring, 300 s cooldown on Azure 
Monitor autoscale).\n- Consider `Standard_F8s_v2` (compute-optimized) when your agent consistently saturates\n  CPU on `Standard_D4s_v3`.\n\n\nThis walkthrough shows the complete episode lifecycle for an **OCI** simulation:\n**create → reset → step → observe**. OCI simulations use OCI Compute (VM.Standard3.Flex)\nand Autonomous Database in the `us-ashburn-1` region. The hybrid prediction engine models\nOCI-specific behaviour including Flex OCPU scaling dynamics and Autonomous Database\nauto-scaling characteristics.\n\n**Step 1 — Create an OCI simulation** *(no auth required)*\n\n`POST /simulations` with resources using `provider: \"oci\"`. Include at least one OCI VM\n(`serviceFamily: \"oci_vm\"`, `size: \"VM.Standard3.Flex\"`) and optionally an Autonomous\nDatabase node and OCI Load Balancer. Record the returned `id`.\n\n**Step 2 — Mint an API key** *(no auth required for demo)*\n\n`POST /keys` → copy the `key` field from the response.\n\n**Step 3 — Create the RL environment** *(Bearer auth required)*\n\n```http\nPOST /rl/environments\nAuthorization: Bearer <your-key>\nContent-Type: application/json\n\n{\n  \"simulationId\": \"<id-from-step-1>\",\n  \"episodeConfig\": {\n    \"maxSteps\": 150,\n    \"targetTrafficPattern\": \"burst\",\n    \"initialTraffic\": 5000,\n    \"targetSLA\": { \"maxLatencyP95\": 160, \"maxErrorRate\": 0.5 },\n    \"costBudgetPerHour\": 4.0\n  }\n}\n```\n\n**Step 4 — Training loop** *(Bearer auth required)*\n\n```\nrepeat until done == true:\n  POST /rl/environments/{environmentId}/step\n  { \"action\": { \"type\": \"adjust_threshold\",\n                \"parameters\": { \"cpuThreshold\": 65, \"throughputThreshold\": 70 } } }\n  ← { t, obs, metrics, reward, reward_components, done, info }\n```\n\n**Step 5 — Reset for the next episode**\n\n```\nPOST /rl/environments/{environmentId}/reset\n```\n\n**OCI-specific notes:**\n- `costPerHour` in observations reflects OCI `VM.Standard3.Flex` + Autonomous Database\n  pricing from the `us-ashburn-1` 
region benchmark.\n- OCI VM.Standard3.Flex uses flexible OCPU/memory allocation; the simulation models\n  4 OCPUs / 64 GB RAM per instance by default.\n- Autonomous Database auto-scales OCPU and storage independently, so database cost varies\n  with query load rather than staying fixed.\n- OCI typically offers the lowest per-OCPU compute cost among the five providers, making\n  it attractive for cost-optimization agents.\n","externalDocs":{"description":"Live RL Environment Status viewer","url":"/admin/rl-environments"}},{"name":"Simulations","description":"Create and manage cloud infrastructure simulations"},{"name":"Scenarios","description":"Browse pre-built infrastructure scenario templates"},{"name":"Pricing History","description":"Cloud provider pricing history and trends"},{"name":"API Keys","description":"Manage API keys for authentication"},{"name":"Infrastructure Optimization","description":"Automated infrastructure analysis and optimization"},{"name":"Predictive Scaling","description":"Test infrastructure against traffic forecasts and optimize autoscaling thresholds"},{"name":"Chaos Engineering","description":"Test infrastructure resilience by injecting failures and analyzing recovery"},{"name":"Multi-Cloud Strategy","description":"Explore and compare multi-cloud deployment strategies for optimal cost, performance, and vendor independence"},{"name":"Discovery","description":"Machine-readable description of the simulator for AI agents and onboarding tooling"},{"name":"API Key Self-Service","description":"Bootstrap and self-service path for obtaining API keys without direct database access.\n\n\n1. **Platform operator** runs the bootstrap script once:\n   ```\n   npx tsx scripts/bootstrap-admin-key.ts\n   ```\n   This prints a one-time admin key. Store it as a secret immediately.\n\n2. 
**Admin** mints a scoped, time-limited registration token for the external client:\n   ```http\n   POST /register-tokens\n   Authorization: Bearer <admin-key>\n   Content-Type: application/json\n   { \"name\": \"canvas-cloud-ai\", \"scopes\": [\"read\",\"write\"], \"expiresAt\": \"2026-06-01T00:00:00Z\" }\n   ```\n\n3. **External client** exchanges the token once (manually or via a one-off script):\n   ```http\n   POST /keys/register\n   Content-Type: application/json\n   { \"token\": \"<registration-token>\", \"name\": \"canvas-cloud-ai-prod\" }\n   ```\n   The response contains a permanent API key. Store it as an environment secret.\n\n4. The token is **burned on use** and can never be reused.\n   All subsequent API calls use the permanent key directly.\n"}],"paths":{"/keys/register":{"x-stability":"stable","post":{"tags":["API Key Self-Service"],"summary":"Exchange a registration token for a permanent API key","description":"Public endpoint — no authentication required.\n\nAccepts a single-use registration token minted by an admin via `POST /register-tokens`\nand returns a permanent API key pre-scoped to whatever the admin specified when the\ntoken was created.\n\n**The token is burned immediately on success.** It cannot be reused even if the caller\nloses the returned key — in that case the admin must mint a new token.\n\n**Intended usage pattern (Canvas Cloud AI / external agents):**\n1. Admin mints a registration token (expiry set for safety).\n2. Client calls this endpoint once to receive its permanent key.\n3. Client stores the key as an environment secret (`CWM_API_KEY`).\n4. 
All subsequent API calls use that key directly — no runtime token exchange ever occurs.\n","operationId":"registerApiKey","requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["token","name"],"properties":{"token":{"type":"string","description":"The registration token received from the platform admin (starts with `cwm_reg_`)","example":"cwm_reg_a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2"},"name":{"type":"string","description":"A descriptive name for the API key that will be created","example":"canvas-cloud-ai-prod"}}},"example":{"token":"cwm_reg_a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2","name":"canvas-cloud-ai-prod"}}}},"responses":{"201":{"description":"API key created successfully. Store the `key` field immediately — it will not be shown again.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ApiKeyCreatedResponse"},"example":{"id":"3f9a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c","key":"cwm_live_a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2","keyPrefix":"cwm_live_a1b2c3d4e5f6a1b2...","name":"canvas-cloud-ai-prod","scopes":["read","write"],"rateLimit":1000,"createdAt":"2026-05-10T17:00:00.000Z","message":"Store this API key securely. You won't be able to see it again."}}}},"400":{"description":"Invalid request body or token format","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"401":{"description":"Registration token not found or HMAC mismatch","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"410":{"description":"Registration token is expired, revoked, or already used","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/register-tokens":{"x-stability":"stable","post":{"tags":["API Key Self-Service"],"summary":"Mint a new registration token (admin)","description":"Requires `admin` scope. 
Creates a named, scoped, time-limited single-use invite token.\nThe plain token is returned **once** in the response and must be shared with the intended\nclient immediately — it cannot be retrieved again.\n\nThe client uses the token at `POST /keys/register` to exchange it for a permanent API key.\n","operationId":"createRegistrationToken","security":[{"BearerAuth":[]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["name"],"properties":{"name":{"type":"string","description":"Human-readable label identifying the intended token recipient","example":"canvas-cloud-ai"},"scopes":{"type":"array","items":{"type":"string","enum":["read","write","admin"]},"default":["read","write"],"description":"Scopes that the resulting API key will have"},"rateLimit":{"type":"integer","default":1000,"description":"Requests-per-hour limit on the resulting API key"},"expiresAt":{"type":"string","format":"date-time","description":"Optional ISO 8601 expiry for the registration token. If omitted, the token never expires.","example":"2026-06-01T00:00:00Z"}}},"example":{"name":"canvas-cloud-ai","scopes":["read","write"],"rateLimit":1000,"expiresAt":"2026-06-01T00:00:00Z"}}}},"responses":{"201":{"description":"Registration token created. The `token` field is shown once — share it with the recipient now.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/RegistrationTokenCreatedResponse"},"example":{"id":"7a8b9c0d-1e2f-3a4b-5c6d-7e8f9a0b1c2d","token":"cwm_reg_a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2","tokenPrefix":"cwm_reg_a1b2c3d4e5f6a1b2...","name":"canvas-cloud-ai","scopes":["read","write"],"rateLimit":1000,"expiresAt":"2026-06-01T00:00:00Z","createdAt":"2026-05-10T17:00:00.000Z","message":"Share this token with the client. 
It can only be used once and will not be shown again."}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Insufficient scope — admin required","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}},"get":{"tags":["API Key Self-Service"],"summary":"List all registration tokens (admin)","description":"Requires `admin` scope. Returns all registration tokens including their status:\n`pending` (unused and not expired), `used`, `expired`, or `revoked`.\nThe plain token value is never returned in list responses.\n","operationId":"listRegistrationTokens","security":[{"BearerAuth":[]}],"responses":{"200":{"description":"List of all registration tokens","content":{"application/json":{"schema":{"type":"array","items":{"$ref":"#/components/schemas/RegistrationTokenSummary"}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Insufficient scope — admin required","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/register-tokens/{id}":{"x-stability":"stable","delete":{"tags":["API Key Self-Service"],"summary":"Revoke an unused registration token (admin)","description":"Requires `admin` scope. Marks an unused, pending token as revoked so it can no longer\nbe exchanged. 
Returns `409 Conflict` if the token has already been used (the resulting\nAPI key must be revoked separately via `DELETE /keys/{id}`).\n","operationId":"revokeRegistrationToken","security":[{"BearerAuth":[]}],"parameters":[{"name":"id","in":"path","required":true,"schema":{"type":"string"},"description":"The registration token ID returned when the token was created","example":"7a8b9c0d-1e2f-3a4b-5c6d-7e8f9a0b1c2d"}],"responses":{"200":{"description":"Token revoked successfully","content":{"application/json":{"schema":{"type":"object","properties":{"message":{"type":"string","example":"Registration token revoked successfully"}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Insufficient scope — admin required","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"409":{"description":"Token has already been used and cannot be revoked","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"},"example":{"error":"Cannot revoke a token that has already been used"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/description":{"x-stability":"experimental","get":{"tags":["Discovery"],"summary":"Get simulator description","description":"Returns a structured JSON document describing the Cloud World Model simulator —\nits purpose, core concepts, workflow, capabilities, quick-start call sequence,\nauthentication requirements, entry points, intended use cases, and a pointer\nto the OpenAPI spec. 
Intended to be consumed by AI agents before calling any\nother endpoint so they can reason about available functionality.\n","operationId":"getDescription","responses":{"200":{"description":"Simulator description document","content":{"application/json":{"schema":{"type":"object","properties":{"name":{"type":"string","example":"Cloud World Model Simulator"},"version":{"type":"string","example":"1.0.0"},"purpose":{"type":"string","description":"High-level description of what the simulator does"},"core_concepts":{"type":"object","description":"Key/value map of domain term definitions","additionalProperties":{"type":"string"}},"workflow":{"type":"array","description":"Ordered list of steps describing the typical usage flow","items":{"type":"string"}},"capabilities":{"type":"array","description":"Feature areas exposed by the simulator","items":{"type":"object","properties":{"id":{"type":"string","example":"simulations"},"name":{"type":"string","example":"Simulations"},"description":{"type":"string"}}}},"quick_start":{"type":"array","description":"Step-by-step quick-start call sequence","items":{"type":"object","properties":{"step":{"type":"integer","example":1},"label":{"type":"string","example":"Create a simulation"},"method":{"type":"string","example":"POST"},"path":{"type":"string","example":"/api/simulations"},"description":{"type":"string"}}}},"authentication":{"type":"object","properties":{"type":{"type":"string","example":"Bearer token (API key)"},"required_for":{"type":"array","items":{"type":"string"}},"notes":{"type":"string"}}},"entry_points":{"type":"object","description":"Named map of core interaction entry 
points","additionalProperties":{"type":"object","properties":{"method":{"type":"string","example":"POST"},"path":{"type":"string","example":"/api/simulations"}}}},"intended_use_cases":{"type":"array","items":{"type":"string"}},"openapi_spec_url":{"type":"string","example":"/api-docs"}}}}}}}}},"/simulations":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"Create a new cloud infrastructure simulation","description":"Creates a simulation with cloud resources (compute, network, database, storage).\nThe simulation serves as the environment for RL training.\n\n**Ownership:** If an `Authorization: Bearer <key>` header is provided, the\nnew simulation is immediately assigned to that API key and is fully write-accessible\nwithout any additional steps. Without authentication the simulation is created\nas a \"demo\" simulation (`apiKeyId: null`), subject to a 20-step limit; use\n`POST /simulations/{simulationId}/claim` to adopt it once you have a key.\n","operationId":"createSimulation","requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["name","resources"],"properties":{"name":{"type":"string","description":"Human-readable name for the simulation","example":"Production Web App Autoscaling"},"description":{"type":"string","description":"Optional description of the simulation","example":"Multi-tier web application with load balancer and database"},"resources":{"type":"array","description":"Cloud resources in the simulation","items":{"$ref":"#/components/schemas/Resource"}},"connections":{"type":"array","description":"Network connections between resources","default":[],"items":{"$ref":"#/components/schemas/Connection"}},"traffic":{"type":"number","description":"Initial traffic load (requests per second)","default":1000,"example":5000}}},"examples":{"awsWebApp":{"summary":"AWS multi-tier web application","value":{"name":"AWS Production Web App","description":"Multi-tier web application on AWS with load balancer 
and RDS","traffic":5000,"resources":[{"id":"r1","name":"ALB","type":"network","provider":"aws","serviceFamily":"elb","region":"us-east-1","config":{"tier":"standard","targetCapacity":10000}},{"id":"r2","name":"App Server","type":"compute","provider":"aws","serviceFamily":"ec2","region":"us-east-1","config":{"instanceType":"t3.large","instances":4,"autoScaling":true,"minInstances":2,"maxInstances":12}},{"id":"r3","name":"Primary DB","type":"database","provider":"aws","serviceFamily":"rds","region":"us-east-1","config":{"instanceType":"db.r5.large","multiAZ":true}},{"id":"r4","name":"Static Assets","type":"storage","provider":"aws","serviceFamily":"s3","region":"us-east-1","config":{"storageGB":500}}]}},"digitalOceanWebApp":{"summary":"DigitalOcean web application (Droplets + Managed PostgreSQL)","value":{"name":"DO Production Web App","description":"Multi-tier web application on DigitalOcean with Droplets, Managed PostgreSQL, Spaces, and Load Balancer","traffic":3000,"resources":[{"id":"r1","name":"DO Load Balancer","type":"network","provider":"digitalocean","serviceFamily":"load_balancer","region":"nyc3","config":{"tier":"standard","targetCapacity":5000}},{"id":"r2","name":"App Droplets","type":"compute","provider":"digitalocean","serviceFamily":"droplets","region":"nyc3","config":{"instanceType":"s-4vcpu-8gb","instances":3,"autoScaling":true,"minInstances":2,"maxInstances":8}},{"id":"r3","name":"Managed PostgreSQL","type":"database","provider":"digitalocean","serviceFamily":"managed_postgresql","region":"nyc3","config":{"instanceType":"db-s-2vcpu-4gb","multiAZ":true}},{"id":"r4","name":"Spaces Object Storage","type":"storage","provider":"digitalocean","serviceFamily":"spaces","region":"nyc3","config":{"storageGB":500}}]}},"digitalOceanTrafficSpike":{"summary":"DigitalOcean traffic spike — triggers DOKS HPA Droplet scale-out","value":{"name":"DO Traffic Spike Autoscaling","description":"Demonstrates DigitalOcean DOKS Horizontal Pod Autoscaler (HPA) 
scaling\nDroplets during a sudden traffic surge. Starting traffic (80 000 RPS) is\ndeliberately above the capacity of a single s-2vcpu-4gb Droplet to\nguarantee that CPU breaches the 75 % HPA threshold and triggers automatic\nDroplet provisioning within one simulation step (150 s cooldown window).\nRun the simulation and observe DOKS HPA scale-out events in the event log.\n","traffic":80000,"resources":[{"id":"r1","name":"DO Load Balancer","type":"network","provider":"digitalocean","serviceFamily":"load_balancer","region":"nyc3","config":{"tier":"standard","targetCapacity":100000}},{"id":"r2","name":"droplet-01","type":"compute","provider":"digitalocean","serviceFamily":"droplets","region":"nyc3","config":{"instanceType":"s-2vcpu-4gb","instances":1,"autoScaling":true,"minInstances":1,"maxInstances":10}},{"id":"r3","name":"Managed PostgreSQL","type":"database","provider":"digitalocean","serviceFamily":"managed_postgresql","region":"nyc3","config":{"instanceType":"db-s-1vcpu-1gb","multiAZ":false}}]}}}}}},"responses":{"201":{"description":"Simulation created successfully","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Simulation"}}}},"400":{"$ref":"#/components/responses/BadRequest"},"500":{"$ref":"#/components/responses/InternalError"}}},"get":{"tags":["Simulations"],"summary":"List all simulations for the authenticated API key","description":"Returns all simulations owned by the authenticated API key.\nAdmin-scoped keys receive all simulations across all keys.\nRequires `read` scope.\n","operationId":"listSimulations","security":[{"BearerAuth":[]}],"responses":{"200":{"description":"Array of 
simulations","content":{"application/json":{"schema":{"type":"array","items":{"$ref":"#/components/schemas/Simulation"}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}":{"x-stability":"stable","parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID","example":"550e8400-e29b-41d4-a716-446655440000"}],"get":{"tags":["Simulations"],"summary":"Get a simulation by ID","description":"Returns the full state of a simulation including its resources,\ncurrent time step, traffic load, and autoscaling history.\nRequires `read` scope and ownership.\n","operationId":"getSimulation","security":[{"BearerAuth":[]}],"responses":{"200":{"description":"Simulation state","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Simulation"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied — caller does not own this simulation","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}},"patch":{"tags":["Simulations"],"summary":"Update a simulation","description":"Partially updates a simulation's properties (e.g. name, description,\ntraffic, resources). 
Only the fields provided are changed.\nRequires `write` scope and ownership.\n","operationId":"updateSimulation","security":[{"BearerAuth":[]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","properties":{"name":{"type":"string","example":"Updated Simulation Name"},"description":{"type":"string"},"traffic":{"type":"number","description":"Current traffic load (RPS)","example":8000},"resources":{"type":"array","items":{"$ref":"#/components/schemas/Resource"}}}},"examples":{"updateTraffic":{"summary":"Scale traffic up for a load test","value":{"name":"AWS Production Web App — Load Test","traffic":12000}},"addResource":{"summary":"Append a new compute node to an existing simulation","value":{"resources":[{"id":"r5","name":"Overflow Server","type":"compute","provider":"aws","serviceFamily":"ec2","region":"us-east-1","config":{"instanceType":"t3.large","instances":2,"autoScaling":true,"minInstances":1,"maxInstances":6}}]}}}}}},"responses":{"200":{"description":"Updated simulation","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Simulation"}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"tags":["Simulations"],"summary":"Delete a simulation","description":"Permanently deletes a simulation and all associated data (metrics,\nevents, patterns, failure injections).\nRequires `write` scope and ownership.\n","operationId":"deleteSimulation","security":[{"BearerAuth":[]}],"responses":{"204":{"description":"Simulation deleted successfully"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access 
denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/claim":{"post":{"tags":["Simulations"],"summary":"Claim ownership of an unowned simulation","description":"Assigns ownership of a simulation that was created without an API key\n(a \"demo\" simulation, `apiKeyId: null`) to the authenticated caller.\n\nThis endpoint fixes the natural workflow of: create (no auth) →\nexperiment → start mutating with a key.  Once claimed, the simulation\nbehaves exactly like one that was created with a Bearer token from the\nstart — it is no longer subject to the demo step limit and all\nwrite-scoped endpoints respect the new owner.\n\nThis call is **idempotent**: if the caller already owns the simulation,\nit returns `200` with the current simulation state.\n\n**Tip:** Pass an `Authorization: Bearer <key>` header when calling\n`POST /simulations` to assign ownership immediately and skip this step.\n\n**Status codes:**\n- `200` — claim succeeded (or the caller already owns the simulation — idempotent)\n- `409` — simulation is already owned by a different API key\n- `404` — simulation not found\n","operationId":"claimSimulation","security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"Simulation UUID","example":"550e8400-e29b-41d4-a716-446655440000"}],"responses":{"200":{"description":"Simulation claimed (or already owned by this key — idempotent)","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Simulation"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"},"409":{"description":"Simulation is already owned by a different API 
key","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/step":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"Advance a simulation by one time step","description":"Executes one simulation step: applies active traffic patterns, runs the\ncapacity model, applies active failure injections, triggers autoscaling,\nand saves the resulting metrics and events.\n\n**Authentication:** Optional. Unauthenticated callers may only step\n\"demo\" simulations (those created without an API key) up to 20 times.\nAuthenticated callers with `write` scope may step any simulation they own\nwithout limit.\n\nReturns the updated simulation state, the step metrics, and any events\ngenerated during the step (autoscale events, cost spikes, failure\nrecoveries, etc.).\n","operationId":"stepSimulation","parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"security":[],"requestBody":{"required":false,"content":{"application/json":{"schema":{"type":"object","properties":{"steps":{"type":"integer","description":"Number of time steps to advance in a single call (1–500, default 1).\nEach sub-step evaluates traffic patterns at the correct elapsed time,\nso a ramp or wave pattern reaches the same final value whether driven\nas N×1 individual calls or a single batch call of N.\n","minimum":1,"maximum":500,"default":1,"example":5}}}}}},"responses":{"200":{"description":"Step result","content":{"application/json":{"schema":{"type":"object","properties":{"simulation":{"$ref":"#/components/schemas/Simulation"},"metrics":{"type":"object","description":"Metrics snapshot for this step","properties":{"simulationId":{"type":"string"},"cpuUsage":{"type":"number","description":"CPU utilization (%)","example":72.4},"latencyP50":{"type":"number","description":"Median latency 
(ms)","example":45},"latencyP95":{"type":"number","description":"P95 latency (ms)","example":120},"errorRate":{"type":"number","description":"Error rate (%)","example":0.3},"throughput":{"type":"number","description":"Effective throughput (RPS)","example":4850},"costPerHour":{"type":"number","description":"Estimated infrastructure cost per hour (USD)","example":2.4},"cacheHitRate":{"type":"number","description":"Cache hit rate (%) — only present when cache resources exist"},"queueDepth":{"type":"number","description":"Queue depth (messages) — only present when queue resources exist"},"k8sNodeUtilization":{"type":"number","description":"Kubernetes node CPU utilization (%) — only present when Kubernetes resources exist"},"storageIopsUtilization":{"type":"number","description":"OCI Block Volume IOPS utilization (%) — only present when OCI Block Volume storage resources exist; 80% triggers a warning status"},"connectionPressure":{"type":"number","description":"DB connection-pool pressure ratio (activeConnections / maxConnections), capped at 3.0. Only present when the simulation contains database resources. Values > 1.0 indicate pool exhaustion; values > 1.5 indicate severe saturation.\n","example":1.24}}},"events":{"type":"array","description":"Events generated during this step","items":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"timestamp":{"type":"string","format":"date-time"},"severity":{"type":"string","enum":["info","warning","error","success"]},"message":{"type":"string"},"resource":{"type":"string","description":"Name of the affected resource (if applicable)"},"metadata":{"type":"object","description":"Structured data for events that carry additional context. 
For database_overload events this object contains: activeConnections (int), maxConnections (int), connectionUtilizationPct (float).\n","properties":{"activeConnections":{"type":"integer","description":"Estimated active connections at the time of overload"},"maxConnections":{"type":"integer","description":"Maximum connections allowed by the database resource"},"connectionUtilizationPct":{"type":"number","description":"Percentage of the connection limit in use (e.g. 200.0 means 2× the limit)"}}}}}}}}}}},"403":{"description":"Access denied or demo step limit reached","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"429":{"description":"Demo step limit reached (20 steps for unauthenticated simulations)","headers":{"Retry-After":{"description":"Seconds until the rate-limit window resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string","example":"Demo step limit reached"},"message":{"type":"string"},"limit":{"type":"integer","example":20}}}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/step-hybrid":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"Advance a simulation using hybrid (rule + ML) decision logic","description":"Steps the simulation using the Hybrid Prediction Engine, which blends a\ndeterministic rule-based simulation with a simulated ML-based prediction.\nThe engine uses a confidence threshold to decide how much weight to give\neach path, and falls back to pure rules when ML confidence is low.\n\nReturns the updated simulation state, step metrics, events, the hybrid\ndecision record (including ML confidence and blending rationale), and the\nupdated cumulative hybrid result object.\n\nRequires `write` scope and ownership.\n","operationId":"stepSimulationHybrid","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST 
https://your-production-domain.com/api/simulations/sim-abc123/step-hybrid \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"config\": {\"mlWeight\": 0.6, \"confidenceThreshold\": 0.5}}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/step-hybrid\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\"config\": {\"mlWeight\": 0.6, \"confidenceThreshold\": 0.5}},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(\"ML confidence:\", data[\"hybridDecision\"][\"mlPrediction\"][\"overallConfidence\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/step-hybrid`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({ config: { mlWeight: 0.6, confidenceThreshold: 0.5 } }),\n});\n// fetch resolves on HTTP error statuses, so check resp.ok before parsing\nif (!resp.ok) throw new Error(`Request failed with status ${resp.status}`);\nconst data = await resp.json();\nconsole.log(\"ML confidence:\", data.hybridDecision.mlPrediction.overallConfidence);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"requestBody":{"required":false,"content":{"application/json":{"schema":{"type":"object","properties":{"steps":{"type":"integer","description":"Number of time steps to advance in a single call (1–500, default 1).\nEach sub-step evaluates traffic patterns at the correct elapsed time,\nso a ramp or wave pattern reaches the same final value whether driven\nas N×1 individual calls or a single batch call of N.\n","minimum":1,"maximum":500,"default":1,"example":5},"config":{"type":"object","description":"Optional hybrid engine 
configuration overrides","properties":{"mlWeight":{"type":"number","minimum":0,"maximum":1,"description":"Weight given to ML prediction vs rule-based (0 = pure rules, 1 = pure ML)","example":0.6},"confidenceThreshold":{"type":"number","minimum":0,"maximum":1,"description":"Minimum ML confidence to apply blending; below this threshold pure rules are used","example":0.5}}}}},"examples":{"defaultBlend":{"summary":"Default hybrid blend (60% ML, 40% rules)","value":{"config":{"mlWeight":0.6,"confidenceThreshold":0.5}}},"rulesHeavy":{"summary":"Rules-heavy blend — prefer deterministic path","value":{"config":{"mlWeight":0.2,"confidenceThreshold":0.7}}}}}}},"responses":{"200":{"description":"Hybrid step result","content":{"application/json":{"schema":{"type":"object","properties":{"simulation":{"$ref":"#/components/schemas/Simulation"},"metrics":{"type":"object","description":"Metrics snapshot for this step"},"events":{"type":"array","items":{"type":"object"}},"hybridDecision":{"type":"object","description":"Hybrid decision metadata for this step","properties":{"blendingApplied":{"type":"boolean"},"fallbackUsed":{"type":"boolean"},"mlPrediction":{"type":"object","properties":{"overallConfidence":{"type":"number","example":0.78},"bottlenecks":{"type":"array","items":{"type":"string"}},"eventLikelihoods":{"type":"array","items":{"type":"object"}}}}}},"hybridResult":{"type":"object","description":"Cumulative hybrid result history"}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/hybrid-result":{"x-stability":"stable","get":{"tags":["Simulations"],"summary":"Get the hybrid simulation result history","description":"Returns the cumulative hybrid decision history and summary statistics for\na simulation that has been 
stepped via `POST /simulations/{simulationId}/step-hybrid`.\n\nIncludes the full list of per-step decisions (ML confidence, blending applied,\nfallback used, bottlenecks, event likelihoods) and aggregate summary metrics\n(average ML confidence, total blended steps, total fallback steps).\n\nRequires `read` scope and ownership.\n","operationId":"getHybridResult","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/simulations/sim-abc123/hybrid-result \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.get(\n    f\"{BASE_URL}/simulations/sim-abc123/hybrid-result\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(\"Average ML confidence:\", data[\"summary\"][\"mlConfidenceAvg\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/hybrid-result`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\n// fetch resolves on HTTP error statuses, so check resp.ok before parsing\nif (!resp.ok) throw new Error(`Request failed with status ${resp.status}`);\nconst data = await resp.json();\nconsole.log(\"Average ML confidence:\", data.summary.mlConfidenceAvg);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Hybrid result history","content":{"application/json":{"schema":{"type":"object","properties":{"simulationId":{"type":"string"},"config":{"type":"object","description":"Hybrid engine configuration used"},"decisions":{"type":"array","description":"Per-step hybrid decisions","items":{"type":"object"}},"summary":{"type":"object","description":"Aggregate statistics across all 
steps","properties":{"totalSteps":{"type":"integer","example":50},"mlConfidenceAvg":{"type":"number","example":0.74},"blendedSteps":{"type":"integer","example":38},"fallbackSteps":{"type":"integer","example":12},"bottlenecksDetected":{"type":"integer","example":5},"eventsDetected":{"type":"integer","example":3}}}}},"example":{"simulationId":"sim-abc123","config":{"mlWeight":0.6,"confidenceThreshold":0.5},"decisions":[{"step":1,"blendingApplied":true,"fallbackUsed":false,"mlConfidence":0.82,"bottlenecks":[],"eventLikelihoods":[]},{"step":2,"blendingApplied":false,"fallbackUsed":true,"mlConfidence":0.41,"bottlenecks":["cpu_saturation"],"eventLikelihoods":[{"event":"autoscale_triggered","probability":0.78}]}],"summary":{"totalSteps":2,"mlConfidenceAvg":0.615,"blendedSteps":1,"fallbackSteps":1,"bottlenecksDetected":1,"eventsDetected":1}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"description":"No hybrid result found for this simulation (no hybrid steps run yet)","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/inject-traffic":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"Inject a random traffic spike","description":"Immediately injects a sudden traffic spike into the simulation by multiplying\nthe current traffic load by a random factor (typically 2×–5×). The spike is\napplied to the simulation state and recorded as a warning event.\n\nUseful for testing autoscaling responsiveness without configuring a full\ntraffic pattern. 
The effect persists until the next step adjusts traffic\nback via active patterns.\n\nRequires `write` scope and ownership.\n","operationId":"injectTraffic","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/inject-traffic \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/inject-traffic\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(\"Spike event:\", data[\"event\"][\"message\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/inject-traffic`, {\n  method: \"POST\",\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\n// fetch resolves on HTTP error statuses, so check resp.ok before parsing\nif (!resp.ok) throw new Error(`Request failed with status ${resp.status}`);\nconst data = await resp.json();\nconsole.log(\"Spike event:\", data.event.message);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Traffic spike applied","content":{"application/json":{"schema":{"type":"object","properties":{"simulation":{"$ref":"#/components/schemas/Simulation"},"event":{"type":"object","description":"Event recording the traffic spike","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"severity":{"type":"string","example":"warning"},"message":{"type":"string","example":"Traffic spike injected: 1000 → 4200 RPS"}}}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access 
denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/inject-failure":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"Inject a random node failure","description":"Randomly selects a healthy compute node in the simulation and marks it as\nfailed, updating the resource state and recording a warning event.\n\nReturns 400 if no healthy nodes are available to fail.\nFor fine-grained control over failure type, duration, and target, use\n`POST /simulations/{simulationId}/failures` instead.\n\nRequires `write` scope and ownership.\n","operationId":"injectFailure","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/inject-failure \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/inject-failure\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(\"Failure event:\", data[\"event\"][\"message\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/inject-failure`, {\n  method: \"POST\",\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\n// fetch resolves on HTTP error statuses (including the documented 400), so check resp.ok before parsing\nif (!resp.ok) throw new Error(`Request failed with status ${resp.status}`);\nconst data = await resp.json();\nconsole.log(\"Failure event:\", data.event.message);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Node failure 
injected","content":{"application/json":{"schema":{"type":"object","properties":{"simulation":{"$ref":"#/components/schemas/Simulation"},"event":{"type":"object","description":"Event recording the node failure"}}}}}},"400":{"description":"No healthy nodes available to fail","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/patterns":{"x-stability":"stable","get":{"tags":["Simulations"],"summary":"List traffic patterns for a simulation","description":"Returns all traffic patterns configured for the simulation.\nPatterns are applied on every simulation step to vary the traffic load\n(e.g. sine waves, ramps, step functions).\n\nRequires `read` scope and ownership.\n","operationId":"listTrafficPatterns","security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Array of traffic patterns","content":{"application/json":{"schema":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"type":{"type":"string","enum":["constant","ramp","sine","step","spike","custom"],"description":"Pattern shape"},"amplitude":{"type":"number","description":"Traffic multiplier amplitude"},"period":{"type":"number","description":"Pattern period in simulation steps"},"isActive":{"type":"boolean"}}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access 
denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}},"post":{"tags":["Simulations"],"summary":"Create a traffic pattern for a simulation","description":"Adds a new traffic pattern that will be applied on every subsequent\nsimulation step. Multiple patterns can be active simultaneously; their\neffects are composed.\n\nRequires `write` scope and ownership.\n","operationId":"createTrafficPattern","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/patterns \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"type\": \"sine\", \"amplitude\": 0.4, \"period\": 24, \"isActive\": true}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/patterns\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\"type\": \"sine\", \"amplitude\": 0.4, \"period\": 24, \"isActive\": True},\n)\nresp.raise_for_status()\npattern = resp.json()\nprint(\"Created pattern:\", pattern[\"id\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/patterns`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({ type: \"sine\", amplitude: 0.4, period: 24, isActive: true }),\n});\n// fetch resolves on HTTP error statuses, so check resp.ok before parsing\nif (!resp.ok) throw new Error(`Request failed with status ${resp.status}`);\nconst pattern = await resp.json();\nconsole.log(\"Created pattern:\", 
pattern.id);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["type"],"properties":{"type":{"type":"string","enum":["constant","ramp","sine","step","spike","custom"],"description":"Pattern shape","example":"sine"},"amplitude":{"type":"number","description":"Traffic multiplier amplitude (e.g. 0.5 = ±50% variation)","example":0.3},"period":{"type":"number","description":"Pattern period in simulation steps","example":20},"isActive":{"type":"boolean","default":true}}},"examples":{"sineWave":{"summary":"Daily sine-wave traffic pattern (24-step period)","value":{"type":"sine","amplitude":0.4,"period":24,"isActive":true}},"rampUp":{"summary":"Gradual ramp-up over 20 steps","value":{"type":"ramp","amplitude":1.5,"period":20,"isActive":true}}}}}},"responses":{"201":{"description":"Traffic pattern created","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"type":{"type":"string"},"amplitude":{"type":"number"},"period":{"type":"number"},"isActive":{"type":"boolean"}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"}}}},"/patterns/{patternId}":{"x-stability":"stable","parameters":[{"name":"patternId","in":"path","required":true,"schema":{"type":"string"},"description":"Traffic pattern UUID"}],"patch":{"tags":["Simulations"],"summary":"Update a traffic pattern","description":"Partially updates a traffic pattern's fields (type, amplitude, period,\nisActive). 
Ownership of the parent simulation is enforced.\nRequires `write` scope.\n","operationId":"updateTrafficPattern","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X PATCH https://your-production-domain.com/api/patterns/pat-001 \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"isActive\": false}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.patch(\n    f\"{BASE_URL}/patterns/pat-001\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\"isActive\": False},\n)\nresp.raise_for_status()\nprint(\"Pattern updated:\", resp.json())\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/patterns/pat-001`, {\n  method: \"PATCH\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({ isActive: false }),\n});\n// fetch resolves on HTTP error statuses, so check resp.ok before parsing\nif (!resp.ok) throw new Error(`Request failed with status ${resp.status}`);\nconst pattern = await resp.json();\nconsole.log(\"Pattern updated:\", pattern);\n"}],"security":[{"BearerAuth":[]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","properties":{"type":{"type":"string","enum":["constant","ramp","sine","step","spike","custom"]},"amplitude":{"type":"number"},"period":{"type":"number"},"isActive":{"type":"boolean"}}},"examples":{"deactivate":{"summary":"Temporarily deactivate a traffic pattern","value":{"isActive":false}},"adjustAmplitude":{"summary":"Reduce amplitude to calm down traffic variance","value":{"amplitude":0.2,"period":30}}}}}},"responses":{"200":{"description":"Updated traffic 
pattern","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"type":{"type":"string"},"amplitude":{"type":"number"},"period":{"type":"number"},"isActive":{"type":"boolean"}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"tags":["Simulations"],"summary":"Delete a traffic pattern","description":"Permanently removes a traffic pattern. The pattern will no longer be\napplied on subsequent simulation steps. Ownership of the parent simulation\nis enforced.\nRequires `write` scope.\n","operationId":"deleteTrafficPattern","security":[{"BearerAuth":[]}],"responses":{"204":{"description":"Pattern deleted"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/failures":{"x-stability":"stable","get":{"tags":["Simulations"],"summary":"List failure injections for a simulation","description":"Returns all scheduled failure injections for the simulation, including\nboth active and inactive ones.\n\nFailure injections are applied automatically on each simulation step when\ntheir time window is active.\n\nRequires `read` scope and ownership.\n","operationId":"listFailureInjections","security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Array of failure 
injections","content":{"application/json":{"schema":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"name":{"type":"string","example":"AZ Outage - us-east-1a"},"type":{"type":"string","enum":["instance_kill","az_outage","database_overload","network_latency"]},"isActive":{"type":"boolean"},"startTime":{"type":"integer","description":"Simulation time step at which the failure begins"},"endTime":{"type":"integer","nullable":true,"description":"Simulation time step at which the failure ends (null = permanent until removed)"},"targetResourceId":{"type":"string","nullable":true},"targetZone":{"type":"string","nullable":true}}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}},"post":{"tags":["Simulations"],"summary":"Schedule a failure injection","description":"Creates a new failure injection that will be applied to the simulation\nduring the specified time window. 
The injection is applied immediately\nto the resource state (the affected resource's status changes) and an\nevent is recorded.\n\n**Failure types:**\n- `instance_kill` — kills a specific compute instance\n- `az_outage` — simulates an availability zone outage\n- `database_overload` — spikes database latency and error rate\n- `network_latency` — adds latency to all network resources\n\nRequires `write` scope and ownership.\n","operationId":"createFailureInjection","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/failures \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"name\": \"AZ Outage — us-east-1a\", \"type\": \"az_outage\", \"startTime\": 5, \"endTime\": 25, \"targetZone\": \"us-east-1a\", \"isActive\": true}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/failures\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\n        \"name\": \"AZ Outage — us-east-1a\",\n        \"type\": \"az_outage\",\n        \"startTime\": 5,\n        \"endTime\": 25,\n        \"targetZone\": \"us-east-1a\",\n        \"isActive\": True,\n    },\n)\nresp.raise_for_status()\nfailure = resp.json()\nprint(\"Created failure injection:\", failure[\"id\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/failures`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({\n    name: \"AZ Outage — us-east-1a\",\n    type: \"az_outage\",\n    startTime: 5,\n    endTime: 25,\n    targetZone: \"us-east-1a\",\n    
isActive: true,\n  }),\n});\n// fetch resolves on HTTP error statuses, so check resp.ok before parsing\nif (!resp.ok) throw new Error(`Request failed with status ${resp.status}`);\nconst failure = await resp.json();\nconsole.log(\"Created failure injection:\", failure.id);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["name","type","startTime"],"properties":{"name":{"type":"string","description":"Human-readable label for this failure","example":"Simulate AZ outage"},"type":{"type":"string","enum":["instance_kill","az_outage","database_overload","network_latency"],"description":"Type of failure to inject","example":"az_outage"},"startTime":{"type":"integer","description":"Simulation time step at which to begin the failure","example":10},"endTime":{"type":"integer","nullable":true,"description":"Simulation time step at which the failure resolves (null = permanent)","example":30},"targetResourceId":{"type":"string","nullable":true,"description":"ID of the specific resource to target (optional)"},"targetZone":{"type":"string","nullable":true,"description":"Availability zone to target for az_outage (optional)","example":"us-east-1a"},"isActive":{"type":"boolean","default":true}}},"examples":{"azOutage":{"summary":"Simulate a brief AZ outage on us-east-1a (steps 5–25)","value":{"name":"AZ Outage — us-east-1a","type":"az_outage","startTime":5,"endTime":25,"targetZone":"us-east-1a","isActive":true}},"dbOverload":{"summary":"Permanent database overload starting at step 10","value":{"name":"DB Overload Injection","type":"database_overload","startTime":10,"endTime":null,"targetResourceId":"r3","isActive":true}}}}}},"responses":{"201":{"description":"Failure injection 
created","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"name":{"type":"string"},"type":{"type":"string"},"isActive":{"type":"boolean"},"startTime":{"type":"integer"},"endTime":{"type":"integer","nullable":true}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"}}}},"/failures/{failureId}":{"x-stability":"stable","parameters":[{"name":"failureId","in":"path","required":true,"schema":{"type":"string"},"description":"Failure injection UUID"}],"patch":{"tags":["Simulations"],"summary":"Update a failure injection","description":"Partially updates a failure injection's fields (e.g. deactivate it by\nsetting `isActive: false`, or change the end time). Ownership of the\nparent simulation is enforced.\nRequires `write` scope.\n","operationId":"updateFailureInjection","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X PATCH https://your-production-domain.com/api/failures/fail-001 \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"isActive\": false}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.patch(\n    f\"{BASE_URL}/failures/fail-001\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\"isActive\": False},\n)\nresp.raise_for_status()\nprint(\"Failure injection updated:\", resp.json())\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/failures/fail-001`, {\n  method: \"PATCH\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n 
   \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({ isActive: false }),\n});\nconst failure = await resp.json();\nconsole.log(\"Failure injection updated:\", failure);\n"}],"security":[{"BearerAuth":[]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","properties":{"name":{"type":"string"},"isActive":{"type":"boolean","description":"Set to false to deactivate the failure early"},"endTime":{"type":"integer","nullable":true}}},"examples":{"deactivate":{"summary":"Deactivate a failure injection early","value":{"isActive":false}},"extendWindow":{"summary":"Extend the failure window by 10 more steps","value":{"endTime":40}}}}}},"responses":{"200":{"description":"Updated failure injection","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"name":{"type":"string"},"isActive":{"type":"boolean"}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"tags":["Simulations"],"summary":"Delete a failure injection","description":"Permanently removes a failure injection. The failure will no longer be\napplied on subsequent simulation steps. 
Ownership of the parent simulation\nis enforced.\nRequires `write` scope.\n","operationId":"deleteFailureInjection","security":[{"BearerAuth":[]}],"responses":{"204":{"description":"Failure injection deleted"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/metrics":{"x-stability":"stable","get":{"tags":["Simulations"],"summary":"Get simulation metrics history","description":"Returns the full time-series metrics history for the simulation. Each\nentry corresponds to one simulation step and includes CPU utilization,\nlatency percentiles, error rate, throughput, and cost per hour.\n\nRequires `read` scope and ownership.\n","operationId":"getSimulationMetrics","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/simulations/sim-abc123/metrics \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.get(\n    f\"{BASE_URL}/simulations/sim-abc123/metrics\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\nmetrics = resp.json()\nif metrics:\n    print(f\"Latest CPU: {metrics[-1]['cpuUsage']}%\")\n    print(f\"Latest p95 latency: {metrics[-1]['latencyP95']} ms\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/metrics`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst metrics = await resp.json();\nif (metrics.length > 0) {\n  const latest = metrics[metrics.length - 1];\n  console.log(`Latest CPU: 
${latest.cpuUsage}%`);\n  console.log(`Latest p95 latency: ${latest.latencyP95} ms`);\n}\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Array of metrics snapshots ordered by time step","content":{"application/json":{"schema":{"type":"array","items":{"type":"object","properties":{"simulationId":{"type":"string"},"cpuUsage":{"type":"number","description":"CPU utilization (%)"},"latencyP50":{"type":"number","description":"Median latency (ms)"},"latencyP95":{"type":"number","description":"P95 latency (ms)"},"latencyP99":{"type":"number","description":"P99 latency (ms)"},"errorRate":{"type":"number","description":"Error rate (%)"},"throughput":{"type":"number","description":"Effective throughput (RPS)"},"costPerHour":{"type":"number","description":"Estimated cost per hour (USD)"},"cacheHitRate":{"type":"number","description":"Cache hit rate (%) — only present when cache resources exist"},"queueDepth":{"type":"number","description":"Queue depth (messages) — only present when queue resources exist"},"k8sNodeUtilization":{"type":"number","description":"Kubernetes node CPU utilization (%) — only present when Kubernetes resources exist"},"storageIopsUtilization":{"type":"number","description":"OCI Block Volume IOPS utilization (%) — only present when OCI Block Volume storage resources exist; utilization reaching 80% triggers a warning status"},"connectionPressure":{"type":"number","description":"DB connection-pool pressure ratio (activeConnections / maxConnections), capped at 3.0. Only present when the simulation contains database resources. 
Values > 1.0 indicate pool exhaustion; values > 1.5 indicate severe saturation.\n"},"timestamp":{"type":"string","format":"date-time"}}}},"example":[{"simulationId":"sim-abc123","cpuUsage":62.3,"latencyP50":38,"latencyP95":112,"latencyP99":198,"errorRate":0.4,"throughput":4820,"costPerHour":2.38,"timestamp":"2024-01-15T10:00:00Z"},{"simulationId":"sim-abc123","cpuUsage":78.9,"latencyP50":55,"latencyP95":145,"latencyP99":260,"errorRate":1.2,"throughput":5100,"costPerHour":2.38,"timestamp":"2024-01-15T10:01:00Z"},{"simulationId":"sim-abc123","cpuUsage":45.1,"latencyP50":29,"latencyP95":88,"latencyP99":142,"errorRate":0.1,"throughput":4950,"costPerHour":2.85,"timestamp":"2024-01-15T10:02:00Z"}]}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/events":{"x-stability":"stable","get":{"tags":["Simulations"],"summary":"Get simulation event log","description":"Returns all events recorded for the simulation in chronological order.\nEvents include autoscaling decisions, failure injections, cost spikes,\nfailure recoveries, and manually injected entries.\n\nRequires `read` scope and ownership.\n","operationId":"getSimulationEvents","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/simulations/sim-abc123/events \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.get(\n    f\"{BASE_URL}/simulations/sim-abc123/events\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\nevents = resp.json()\nfor evt in events:\n    print(f\"[{evt['severity'].upper()}] 
{evt['message']}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/events`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst events = await resp.json();\nfor (const evt of events) {\n  console.log(`[${evt.severity.toUpperCase()}] ${evt.message}`);\n}\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Array of simulation events","content":{"application/json":{"schema":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"timestamp":{"type":"string","format":"date-time"},"severity":{"type":"string","enum":["info","warning","error","success"]},"message":{"type":"string"},"resource":{"type":"string","nullable":true,"description":"Name of the affected resource (if applicable)"}}}},"example":[{"id":"evt-001","simulationId":"sim-abc123","timestamp":"2024-01-15T10:01:15Z","severity":"warning","message":"CPU utilization crossed 75% threshold — autoscaler cooldown started","resource":"App Server"},{"id":"evt-002","simulationId":"sim-abc123","timestamp":"2024-01-15T10:02:00Z","severity":"success","message":"Autoscaler scaled out: 3 → 4 instances","resource":"App Server"},{"id":"evt-003","simulationId":"sim-abc123","timestamp":"2024-01-15T10:02:45Z","severity":"info","message":"Canary deployment v2.3.1 started","resource":null}]}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}},"post":{"tags":["Simulations"],"summary":"Manually add an event to a 
simulation","description":"Injects a custom event into the simulation's event log. Useful for\nannotating the timeline with external milestones (e.g. deployment\nmarkers, manual interventions).\n\nRequires `write` scope and ownership.\n","operationId":"createSimulationEvent","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/events \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"message\": \"Canary deployment v2.3.1 started\", \"severity\": \"info\", \"resource\": \"App Server\"}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/events\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\n        \"message\": \"Canary deployment v2.3.1 started\",\n        \"severity\": \"info\",\n        \"resource\": \"App Server\",\n    },\n)\nresp.raise_for_status()\nevent = resp.json()\nprint(\"Created event:\", event[\"id\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/events`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({\n    message: \"Canary deployment v2.3.1 started\",\n    severity: \"info\",\n    resource: \"App Server\",\n  }),\n});\nconst event = await resp.json();\nconsole.log(\"Created event:\", event.id);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation 
UUID"}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["message","severity"],"properties":{"message":{"type":"string","description":"Human-readable event description","example":"Canary deployment v2.3.1 started"},"severity":{"type":"string","enum":["info","warning","error","success"],"default":"info","example":"info"},"resource":{"type":"string","nullable":true,"description":"Name of the resource this event is associated with"},"timestamp":{"type":"string","format":"date-time","description":"Event timestamp (defaults to current time if omitted)"}}},"examples":{"deploymentMarker":{"summary":"Annotate a canary deployment start","value":{"message":"Canary deployment v2.3.1 started","severity":"info","resource":"App Server"}},"manualWarning":{"summary":"Record a manual warning about an observed anomaly","value":{"message":"Manual: observed unusual CPU spike on primary node","severity":"warning","resource":"Primary DB"}}}}}},"responses":{"201":{"description":"Event added","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"simulationId":{"type":"string"},"timestamp":{"type":"string","format":"date-time"},"severity":{"type":"string"},"message":{"type":"string"}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"tags":["Simulations"],"summary":"Clear all events for a simulation","description":"Permanently removes all events from the simulation's event log. This is\na destructive operation and cannot be undone. 
Useful for resetting the\nlog before starting a new experiment.\n\nRequires `write` scope and ownership.\n","operationId":"clearSimulationEvents","security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"204":{"description":"Events cleared"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/explain":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"AI-generated explanation of simulation behaviour","description":"Uses GPT-5 to generate a natural-language explanation of what is currently\nhappening in the simulation: why latency is high, why autoscaling is or\nisn't triggering, what the dominant cost drivers are, etc.\n\nSet `beginnerMode: true` to receive a simplified, jargon-free explanation\nsuitable for cloud newcomers.\n\n**Authentication:** Optional. 
Unauthenticated callers are subject to\nstricter rate limits.\n","operationId":"explainSimulation","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/explain \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"beginnerMode\": false}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/explain\",\n    json={\"beginnerMode\": False},\n)\nresp.raise_for_status()\nprint(resp.json()[\"explanation\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/explain`, {\n  method: \"POST\",\n  headers: { \"Content-Type\": \"application/json\" },\n  body: JSON.stringify({ beginnerMode: false }),\n});\nconst data = await resp.json();\nconsole.log(data.explanation);\n"}],"security":[],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"requestBody":{"required":false,"content":{"application/json":{"schema":{"type":"object","properties":{"beginnerMode":{"type":"boolean","default":false,"description":"Return a simplified, jargon-free explanation","example":false}}},"examples":{"expertMode":{"summary":"Request an expert-level explanation","value":{"beginnerMode":false}},"beginnerMode":{"summary":"Request a beginner-friendly, jargon-free explanation","value":{"beginnerMode":true}}}}}},"responses":{"200":{"description":"AI explanation","content":{"application/json":{"schema":{"type":"object","properties":{"explanation":{"type":"string","description":"Natural-language explanation of simulation behaviour","example":"Your simulation is experiencing high CPU utilization (82%) because the traffic load of 8 000 RPS exceeds the capacity of the current 3-node cluster. 
The autoscaler has not yet triggered because CPU has not been above the 75% threshold for the required 2-step cooldown window."}}}}}},"404":{"$ref":"#/components/responses/NotFound"},"429":{"description":"Rate limit exceeded","headers":{"Retry-After":{"description":"Seconds until the rate-limit window resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/optimize":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"AI-generated infrastructure optimisation suggestions","description":"Uses GPT-5 to analyse the simulation's current resource configuration and\nmetric history and return prioritised suggestions for reducing cost,\nimproving reliability, or increasing performance.\n\nSet `beginnerMode: true` for simplified language.\n\n**Authentication:** Optional. Unauthenticated callers are subject to\nstricter rate limits.\n","operationId":"optimizeSimulation","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/optimize \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"beginnerMode\": false}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/optimize\",\n    json={\"beginnerMode\": False},\n)\nresp.raise_for_status()\nprint(resp.json()[\"suggestions\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/optimize`, {\n  method: \"POST\",\n  headers: { \"Content-Type\": \"application/json\" },\n  body: JSON.stringify({ beginnerMode: false }),\n});\nconst data = await 
resp.json();\nconsole.log(data.suggestions);\n"}],"security":[],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"requestBody":{"required":false,"content":{"application/json":{"schema":{"type":"object","properties":{"beginnerMode":{"type":"boolean","default":false,"description":"Return simplified suggestions"}}},"examples":{"expertOptimize":{"summary":"Request expert-level optimization suggestions","value":{"beginnerMode":false}},"beginnerOptimize":{"summary":"Request beginner-friendly optimization suggestions","value":{"beginnerMode":true}}}}}},"responses":{"200":{"description":"AI optimisation suggestions","content":{"application/json":{"schema":{"type":"object","properties":{"suggestions":{"type":"string","description":"Prioritised optimisation recommendations"}}}}}},"404":{"$ref":"#/components/responses/NotFound"},"429":{"description":"Rate limit exceeded","headers":{"Retry-After":{"description":"Seconds until the rate-limit window resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/troubleshoot":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"AI-guided troubleshooting for a specific issue","description":"Accepts a plain-text description of a problem (e.g. \"latency spikes every\n30 seconds\") and uses GPT-5 to analyse the simulation's current state and\nevent history to produce step-by-step troubleshooting guidance.\n\n**Authentication:** Optional. 
Unauthenticated callers are subject to\nstricter rate limits.\n","operationId":"troubleshootSimulation","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/troubleshoot \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"issue\": \"Latency spikes every 30 seconds and error rate climbs to 5% during spikes\"}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/troubleshoot\",\n    json={\"issue\": \"Latency spikes every 30 seconds and error rate climbs to 5% during spikes\"},\n)\nresp.raise_for_status()\nprint(resp.json()[\"guidance\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/troubleshoot`, {\n  method: \"POST\",\n  headers: { \"Content-Type\": \"application/json\" },\n  body: JSON.stringify({\n    issue: \"Latency spikes every 30 seconds and error rate climbs to 5% during spikes\",\n  }),\n});\nconst data = await resp.json();\nconsole.log(data.guidance);\n"}],"security":[],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["issue"],"properties":{"issue":{"type":"string","description":"Plain-text description of the problem to investigate","example":"Latency spikes every 30 seconds and error rate climbs to 5% during spikes"}}},"examples":{"latencySpike":{"summary":"Investigate periodic latency spikes","value":{"issue":"Latency spikes every 30 seconds and error rate climbs to 5% during spikes"}},"autoscalingStuck":{"summary":"Debug why autoscaling is not triggering","value":{"issue":"CPU is consistently above 80% but autoscaler has not added 
instances in the last 10 steps"}}}}}},"responses":{"200":{"description":"Troubleshooting guidance","content":{"application/json":{"schema":{"type":"object","properties":{"guidance":{"type":"string","description":"Step-by-step troubleshooting guidance"}}}}}},"400":{"description":"Missing issue description","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"429":{"description":"Rate limit exceeded","headers":{"Retry-After":{"description":"Seconds until the rate-limit window resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/bulk-resize":{"x-stability":"experimental","post":{"tags":["Simulations"],"summary":"Resize all compute resources to a new DigitalOcean Droplet size","description":"Resizes every compute resource in the simulation to the specified\nDigitalOcean Droplet size tier (e.g. `s-2vcpu-4gb`, `s-4vcpu-8gb`,\n`s-8vcpu-16gb`). Useful for right-sizing experiments where you want\nto evaluate the cost/performance trade-off of a uniform resize.\n\nThe size must be a valid DigitalOcean Droplet slug. 
Use\n`GET /api/description` to discover available sizes.\n\nRequires `write` scope and ownership.\n","operationId":"bulkResizeSimulation","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/bulk-resize \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"dropletSize\": \"s-4vcpu-8gb\"}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/bulk-resize\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\"dropletSize\": \"s-4vcpu-8gb\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(f\"Resized {data['resizedCount']} compute resource(s) to s-4vcpu-8gb\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/bulk-resize`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({ dropletSize: \"s-4vcpu-8gb\" }),\n});\nconst data = await resp.json();\nconsole.log(`Resized ${data.resizedCount} compute resource(s) to s-4vcpu-8gb`);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["dropletSize"],"properties":{"dropletSize":{"type":"string","description":"DigitalOcean Droplet size slug to apply to all compute resources","example":"s-4vcpu-8gb"}}},"examples":{"scaleUp":{"summary":"Scale all compute nodes up to 4-vCPU Droplets","value":{"dropletSize":"s-4vcpu-8gb"}},"scaleDown":{"summary":"Right-size to 
2-vCPU Droplets after a load test","value":{"dropletSize":"s-2vcpu-4gb"}}}}}},"responses":{"200":{"description":"All compute resources resized","content":{"application/json":{"schema":{"type":"object","properties":{"simulation":{"$ref":"#/components/schemas/Simulation"},"resizedCount":{"type":"integer","description":"Number of compute resources that were resized","example":3}}}}}},"400":{"description":"Unknown Droplet size slug","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"}}}},"/simulations/{simulationId}/analyze-bottlenecks":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"AI-powered bottleneck analysis","description":"Uses GPT-5 to identify performance bottlenecks in the simulation based on\nthe current resource configuration, metric history, and event log.\nReturns a detailed analysis and, where applicable, a DigitalOcean-specific\nmigration recommendation.\n\nSet `beginnerMode: true` for simplified language.\n\n**Authentication:** Optional. 
Unauthenticated callers are subject to\nstricter rate limits.\n","operationId":"analyzeBottlenecks","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/analyze-bottlenecks \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"beginnerMode\": false}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/analyze-bottlenecks\",\n    json={\"beginnerMode\": False},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(data[\"analysis\"])\nif data.get(\"doRecommendation\"):\n    print(\"DO recommendation:\", data[\"doRecommendation\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/analyze-bottlenecks`, {\n  method: \"POST\",\n  headers: { \"Content-Type\": \"application/json\" },\n  body: JSON.stringify({ beginnerMode: false }),\n});\nconst data = await resp.json();\nconsole.log(data.analysis);\nif (data.doRecommendation) {\n  console.log(\"DO recommendation:\", data.doRecommendation);\n}\n"}],"security":[],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"requestBody":{"required":false,"content":{"application/json":{"schema":{"type":"object","properties":{"beginnerMode":{"type":"boolean","default":false,"description":"Return simplified analysis"}}},"examples":{"expertAnalysis":{"summary":"Request expert-level bottleneck analysis","value":{"beginnerMode":false}},"beginnerAnalysis":{"summary":"Request beginner-friendly bottleneck analysis","value":{"beginnerMode":true}}}}}},"responses":{"200":{"description":"Bottleneck analysis","content":{"application/json":{"schema":{"type":"object","properties":{"analysis":{"type":"string","description":"Detailed 
bottleneck analysis"},"doRecommendation":{"type":"string","nullable":true,"description":"DigitalOcean-specific migration recommendation (if applicable)"}}}}}},"404":{"$ref":"#/components/responses/NotFound"},"429":{"description":"Rate limit exceeded","headers":{"Retry-After":{"description":"Seconds until the rate-limit window resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/explain-autoscaling":{"x-stability":"stable","post":{"tags":["Simulations"],"summary":"AI explanation of autoscaling decisions","description":"Uses GPT-5 to explain why specific autoscaling actions were taken (or not\ntaken) during the simulation. Analyses the scaling history, recent metrics,\nand events to produce a narrative that helps operators understand the\nautoscaler's behaviour.\n\nSet `beginnerMode: true` for simplified language.\n\n**Authentication:** Optional. Unauthenticated callers are subject to\nstricter rate limits.\n","operationId":"explainAutoscaling","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/simulations/sim-abc123/explain-autoscaling \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"beginnerMode\": false}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\n\nresp = requests.post(\n    f\"{BASE_URL}/simulations/sim-abc123/explain-autoscaling\",\n    json={\"beginnerMode\": False},\n)\nresp.raise_for_status()\nprint(resp.json()[\"explanation\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/explain-autoscaling`, {\n  method: \"POST\",\n  headers: { \"Content-Type\": \"application/json\" },\n  body: JSON.stringify({ beginnerMode: false }),\n});\nconst data = await resp.json();\nconsole.log(data.explanation);\n"}],"security":[],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation 
UUID"}],"requestBody":{"required":false,"content":{"application/json":{"schema":{"type":"object","properties":{"beginnerMode":{"type":"boolean","default":false,"description":"Return simplified explanation"}}},"examples":{"expertExplanation":{"summary":"Request expert-level autoscaling explanation","value":{"beginnerMode":false}},"beginnerExplanation":{"summary":"Request a beginner-friendly autoscaling explanation","value":{"beginnerMode":true}}}}}},"responses":{"200":{"description":"Autoscaling explanation","content":{"application/json":{"schema":{"type":"object","properties":{"explanation":{"type":"string","description":"Natural-language explanation of autoscaling decisions"}}}}}},"404":{"$ref":"#/components/responses/NotFound"},"429":{"description":"Rate limit exceeded","headers":{"Retry-After":{"description":"Seconds until the rate-limit window resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/right-sizing-hint":{"x-stability":"experimental","get":{"tags":["Simulations"],"summary":"Get right-sizing recommendations","description":"Analyses the simulation's recent metrics and resource configuration to\nidentify over-provisioned resources. 
Returns actionable hints with the\nrecommended smaller size slug, estimated hourly rate, estimated savings\npercentage, and a trade-off note explaining the operational impact.\n\nReturns `{ hasHint: false }` when the simulation is appropriately sized\nand no downsizing is recommended.\n\nRequires `read` scope and ownership.\n","operationId":"getRightSizingHint","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/simulations/sim-abc123/right-sizing-hint \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.get(\n    f\"{BASE_URL}/simulations/sim-abc123/right-sizing-hint\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nif data.get(\"hasHint\"):\n    for hint in data[\"hints\"]:\n        print(f\"Resize {hint['affectedResourceName']} → {hint['recommendedSlug']} (save ~{hint['estimatedSavingsPct']}%)\")\nelse:\n    print(\"Simulation is appropriately sized — no downsizing recommended.\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/simulations/sim-abc123/right-sizing-hint`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst data = await resp.json();\nif (data.hasHint) {\n  for (const hint of data.hints) {\n    console.log(`Resize ${hint.affectedResourceName} → ${hint.recommendedSlug} (save ~${hint.estimatedSavingsPct}%)`);\n  }\n} else {\n  console.log(\"Simulation is appropriately sized — no downsizing recommended.\");\n}\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Right-sizing hints (or empty 
result if no hints)","content":{"application/json":{"schema":{"oneOf":[{"type":"object","properties":{"hasHint":{"type":"boolean","example":false}}},{"type":"object","properties":{"hasHint":{"type":"boolean","example":true},"hints":{"type":"array","items":{"type":"object","properties":{"resourceType":{"type":"string","enum":["compute","database","network","storage"]},"reason":{"type":"string","enum":["low_cpu","scale_in","both","low_connection_util","low_throughput","low_throughput_util","high_throughput_util"]},"affectedResourceId":{"type":"string","nullable":true},"affectedResourceName":{"type":"string","nullable":true},"recommendedSlug":{"type":"string","description":"Recommended smaller size slug","example":"s-2vcpu-4gb"},"hourlyRate":{"type":"number","description":"Estimated hourly cost of the recommended size (USD)","example":0.036},"estimatedSavingsPct":{"type":"integer","description":"Estimated percentage cost saving","example":35},"tradeOffNote":{"type":"string","description":"Operational trade-off to consider before resizing"},"currentUtilization":{"type":"number","nullable":true,"description":"Current utilization percentage driving the recommendation"}}}}}}]},"examples":{"hasHint":{"summary":"Simulation is over-provisioned — downsize recommended","value":{"hasHint":true,"hints":[{"resourceType":"compute","reason":"low_cpu","affectedResourceId":"r2","affectedResourceName":"App Server","recommendedSlug":"s-2vcpu-4gb","hourlyRate":0.036,"estimatedSavingsPct":35,"tradeOffNote":"s-2vcpu-4gb has half the CPU capacity — monitor p95 latency closely after resize","currentUtilization":22.4}]}},"noHint":{"summary":"Simulation is appropriately sized — no downsize recommended","value":{"hasHint":false}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access 
denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/validate-cost-accuracy":{"x-stability":"experimental","get":{"tags":["Simulations"],"summary":"Validate simulation cost accuracy against provider benchmarks","description":"Compares the simulation's cost estimates against known real-world provider\npricing benchmarks. Returns a validation result indicating whether the\nsimulated cost per hour is within the acceptable tolerance (±10%).\n\nUseful for confirming the simulation is faithfully modelling provider\npricing before using it for cost-optimisation decisions.\n\nRequires `read` scope and ownership.\n","operationId":"validateCostAccuracy","security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Cost accuracy validation result","content":{"application/json":{"schema":{"type":"object","properties":{"valid":{"type":"boolean","description":"Whether simulated cost is within tolerance","example":true},"simulatedCostPerHour":{"type":"number","description":"Simulated cost per hour (USD)","example":2.38},"benchmarkCostPerHour":{"type":"number","description":"Expected cost per hour from provider benchmark (USD)","example":2.4},"deviationPct":{"type":"number","description":"Percentage deviation from benchmark","example":0.83},"tolerance":{"type":"string","description":"Acceptable tolerance threshold","example":"±10%"}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access 
denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/validate-performance-accuracy":{"x-stability":"experimental","get":{"tags":["Simulations"],"summary":"Validate simulation performance accuracy against provider benchmarks","description":"Compares the simulation's throughput and latency estimates against known\nreal-world provider benchmarks. Returns a validation result indicating\nwhether the simulated performance metrics are within the acceptable\ntolerance (±15%).\n\nRequires `read` scope and ownership.\n","operationId":"validatePerformanceAccuracy","security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Performance accuracy validation result","content":{"application/json":{"schema":{"type":"object","properties":{"valid":{"type":"boolean","description":"Whether simulated performance is within tolerance","example":true},"simulatedThroughput":{"type":"number","description":"Simulated throughput (RPS)"},"benchmarkThroughput":{"type":"number","description":"Expected throughput from provider benchmark (RPS)"},"deviationPct":{"type":"number","description":"Percentage deviation from benchmark"},"tolerance":{"type":"string","example":"±15%"}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/simulations/{simulationId}/validate-accuracy":{"x-stability":"experimental","get":{"tags":["Simulations"],"summary":"Validate both cost and performance accuracy","description":"Convenience endpoint that runs both cost and performance 
accuracy\nvalidations in a single call. Returns a combined result with an\n`overallValid` flag that is `true` only if both checks pass.\n\nRequires `read` scope and ownership.\n","operationId":"validateAccuracy","security":[{"BearerAuth":[]}],"parameters":[{"name":"simulationId","in":"path","required":true,"schema":{"type":"string"},"description":"Simulation UUID"}],"responses":{"200":{"description":"Combined accuracy validation result","content":{"application/json":{"schema":{"type":"object","properties":{"cost":{"type":"object","description":"Cost accuracy validation result","properties":{"valid":{"type":"boolean"},"deviationPct":{"type":"number"}}},"performance":{"type":"object","description":"Performance accuracy validation result","properties":{"valid":{"type":"boolean"},"deviationPct":{"type":"number"}}},"overallValid":{"type":"boolean","description":"True only when both cost and performance validations pass","example":true},"thresholds":{"type":"object","properties":{"cost":{"type":"string","example":"±10%"},"performance":{"type":"string","example":"±15%"}}}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Access denied","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/scenarios":{"x-stability":"stable","get":{"tags":["Scenarios"],"summary":"List all scenario templates","description":"Returns the list of pre-built infrastructure scenario templates. 
Scenarios\ndefine a complete simulation configuration (resources, traffic, connections)\nthat can be loaded directly into a new simulation via the browser UI or the\nAPI.\n\n**No authentication required.**\n","operationId":"listScenarios","responses":{"200":{"description":"Array of scenario templates","content":{"application/json":{"schema":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string","description":"Unique scenario identifier","example":"aws-web-app"},"name":{"type":"string","example":"AWS Multi-Tier Web Application"},"description":{"type":"string"},"category":{"type":"string","description":"Scenario category (e.g. web, data, ml)","example":"web"},"provider":{"type":"string","enum":["aws","gcp","azure","oci","digitalocean","multi"],"example":"aws"},"resources":{"type":"array","items":{"$ref":"#/components/schemas/Resource"}}}}}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/scenarios/{scenarioId}":{"x-stability":"stable","get":{"tags":["Scenarios"],"summary":"Get a scenario template by ID","description":"Returns the full details of a single scenario template including all\nresource definitions, connections, and default traffic settings.\n\n**No authentication required.**\n","operationId":"getScenario","parameters":[{"name":"scenarioId","in":"path","required":true,"schema":{"type":"string"},"description":"Scenario identifier","example":"aws-web-app"}],"responses":{"200":{"description":"Scenario 
template","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"name":{"type":"string"},"description":{"type":"string"},"category":{"type":"string"},"provider":{"type":"string"},"resources":{"type":"array","items":{"$ref":"#/components/schemas/Resource"}},"connections":{"type":"array","items":{"$ref":"#/components/schemas/Connection"}}}}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/fidelity":{"x-stability":"experimental","get":{"tags":["Simulation Fidelity"],"summary":"Get simulation fidelity benchmarks and accuracy thresholds","description":"Returns per-provider benchmark data used to calibrate the simulation engine,\nalong with the stated accuracy thresholds for cost and performance estimates.\n\nResponse fields:\n- **lastUpdated** — date of the most recent benchmark refresh\n- **accuracyThresholds** — `costPct` (±%) and `performancePct` (±%) accuracy guarantees\n- **providers** — array of provider objects, each containing:\n  - `id`, `name`, `region` — provider identity and reference region\n  - `pricingBenchmarks` — official hourly rates by service family / SKU with source attribution\n  - `performanceBenchmarks` — throughput, latency, IOPS, and connection metrics with source attribution\n\n**No authentication required.**\n","operationId":"getFidelity","responses":{"200":{"description":"Simulation fidelity document","content":{"application/json":{"schema":{"type":"object","properties":{"lastUpdated":{"type":"string","description":"Date of the most recent benchmark refresh","example":"June 2026"},"accuracyThresholds":{"type":"object","properties":{"costPct":{"type":"integer","description":"Cost accuracy tolerance as a percentage (±)","example":10},"performancePct":{"type":"integer","description":"Performance accuracy tolerance as a percentage (±)","example":15}}},"providers":{"type":"array","description":"Per-provider benchmark 
data","items":{"type":"object","properties":{"id":{"type":"string","description":"Provider identifier","example":"aws"},"name":{"type":"string","description":"Provider display name","example":"AWS"},"region":{"type":"string","description":"Reference region for listed prices","example":"us-east-1"},"pricingBenchmarks":{"type":"array","description":"Official hourly rates by service family and SKU","items":{"type":"object","properties":{"serviceFamily":{"type":"string","example":"ec2"},"size":{"type":"string","example":"t3.medium"},"hourlyRate":{"type":"number","format":"float","example":0.0416},"source":{"type":"string","example":"AWS Pricing Calculator"},"region":{"type":"string","example":"us-east-1"}}}},"performanceBenchmarks":{"type":"array","description":"Throughput and latency specs by service family and SKU","items":{"type":"object","properties":{"serviceFamily":{"type":"string","example":"ec2"},"size":{"type":"string","example":"t3.medium"},"metric":{"type":"string","example":"network_throughput"},"value":{"type":"number","example":5000},"unit":{"type":"string","example":"Mbps"},"source":{"type":"string","example":"AWS EC2 Instance Types"}}}}}}}}}}}}}}},"/accuracy-benchmark":{"x-stability":"experimental","get":{"tags":["Simulation Fidelity"],"summary":"Simulation accuracy benchmark — simulated vs. AWS reference metrics","description":"Runs the Cloud World Model simulation engine against a canonical three-tier\nAWS architecture (ALB → 2× m5.large EC2 → db.r5.large RDS MySQL, us-east-1)\nand returns simulated vs. 
reference metrics for four traffic scenarios:\nIdle (10 req/s), Normal (100 req/s), Peak (500 req/s), and Burst (1,000 req/s).\n\nFor each scenario, the response includes:\n- Per-metric comparison: reference value, simulated value, delta %, accuracy %\n- Composite accuracy score (weighted average across six metrics)\n- Overall accuracy score averaged across all four scenarios\n\nThe simulation is fully deterministic: seed `20240601` is always used, so\nthe results are reproducible across calls and engine versions.\n\n**No authentication required.** Rate-limited with the standard public rate limit.\n","operationId":"getAccuracyBenchmark","responses":{"200":{"description":"Accuracy benchmark results","content":{"application/json":{"schema":{"type":"object","properties":{"architecture":{"type":"object","description":"Canonical architecture spec used for the benchmark","properties":{"name":{"type":"string"},"description":{"type":"string"},"provider":{"type":"string"},"region":{"type":"string"},"components":{"type":"array","items":{"type":"object","properties":{"role":{"type":"string"},"instanceType":{"type":"string"},"count":{"type":"integer"},"officialHourlyCost":{"type":"number"},"notes":{"type":"string"}}}}}},"seed":{"type":"integer","description":"Fixed seed used for the deterministic simulation run","example":20240601},"generatedAt":{"type":"string","format":"date-time","description":"ISO 8601 timestamp when the benchmark was computed"},"overallScore":{"type":"number","description":"Overall accuracy score (0–100), unweighted average across four scenarios","example":88.5},"scoreDriverNote":{"type":"string","description":"One-sentence explanation of the main factor driving this provider's overall accuracy score (e.g. 
connection-pool saturation, calibration reference, in-memory acceleration)","example":"AWS is the calibration reference — the simulator's formulas are tuned to m5.large EC2 + db.r5.large RDS specs, so its score reflects how closely the engine reproduces its own training baseline."},"scenarios":{"type":"array","description":"Per-scenario comparison results","items":{"type":"object","properties":{"scenario":{"type":"object","properties":{"id":{"type":"string","enum":["idle","normal","peak","burst"]},"label":{"type":"string"},"description":{"type":"string"},"trafficRps":{"type":"number"}}},"compositeScore":{"type":"number","description":"Weighted composite accuracy score for this scenario (0–100)"},"metrics":{"type":"array","items":{"type":"object","properties":{"metric":{"type":"string"},"label":{"type":"string"},"unit":{"type":"string"},"reference":{"type":"number","description":"Curated reference value from AWS docs / load-test studies"},"simulated":{"type":"number","description":"Value produced by the simulation engine"},"deltaPct":{"type":"number","description":"Percentage deviation of simulated from reference"},"accuracyPct":{"type":"number","description":"Accuracy percentage (max(0, 100 - |deltaPct|))"}}}}}}}}}}}},"500":{"description":"Benchmark execution error","content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string"}}}}}}}},"post":{"tags":["Simulation Fidelity"],"summary":"Custom architecture accuracy benchmark","description":"Runs the accuracy benchmark with a user-supplied architecture config instead of\nthe canonical 2× m5.large + db.r5.large setup. 
The same four reference scenarios\n(Idle, Normal, Peak, Burst) and the same fixed seed (`20240601`) are used; only\nthe simulated-side resource list changes.\n\nSupply a `provider` (`aws`, `gcp`, `azure`, `oci`, or `digitalocean`) along with a\n`computeCount` (1–10), a provider-specific `computeType`, and a provider-specific\n`dbType` to explore how simulation accuracy shifts as the architecture scales or\nchanges provider.\n\nValid values by provider:\n\n| provider | computeType | dbType |\n|---|---|---|\n| `aws` | `t2.micro`, `t3.micro`, `t3.small`, `m5.large`, `m5.xlarge`, `m5.2xlarge`, `m6i.large`, `m6i.xlarge`, `c5.large`, `c5.xlarge`, `c6i.large`, `c6i.xlarge`, `r5.large`, `r5.xlarge`, `r6i.large`, `r6i.xlarge` | `db.t3.small`, `db.r5.large` |\n| `gcp` | `e2-medium`, `n2-standard-4` | `db-f1-micro`, `db-n1-standard-2` |\n| `azure` | `Standard_B2s`, `Standard_D4s_v3` | `sql-basic`, `sql-s2` |\n| `oci` | `VM.Standard.E4.Flex`, `VM.Standard3.Flex` | `mysql-heatwave`, `autonomous-db-std` |\n| `digitalocean` | `s-2vcpu-4gb`, `s-2vcpu-4gb-amd`, `c-4` | `db-s-1vcpu-1gb`, `db-s-2vcpu-4gb` |\n\n**No authentication required.** Rate-limited with the standard public rate limit.\n","operationId":"runCustomAccuracyBenchmark","requestBody":{"required":true,"content":{"application/json":{"schema":{"oneOf":[{"title":"AWS architecture","type":"object","required":["provider","computeCount","computeType","dbType"],"properties":{"provider":{"type":"string","enum":["aws"],"description":"Cloud provider"},"computeCount":{"type":"integer","minimum":1,"maximum":10,"description":"Number of EC2 instances to include in the simulated architecture","example":2},"computeType":{"type":"string","enum":["t2.micro","t3.micro","t3.small","m5.large","m5.xlarge","m5.2xlarge","m6i.large","m6i.xlarge","c5.large","c5.xlarge","c6i.large","c6i.xlarge","r5.large","r5.xlarge","r6i.large","r6i.xlarge"],"description":"EC2 instance type for all web/app 
servers","example":"m5.large"},"dbType":{"type":"string","enum":["db.t3.small","db.r5.large"],"description":"RDS MySQL instance type","example":"db.r5.large"}}},{"title":"GCP architecture","type":"object","required":["provider","computeCount","computeType","dbType"],"properties":{"provider":{"type":"string","enum":["gcp"],"description":"Cloud provider"},"computeCount":{"type":"integer","minimum":1,"maximum":10,"description":"Number of Compute Engine instances to include in the simulated architecture","example":2},"computeType":{"type":"string","enum":["e2-medium","n2-standard-4"],"description":"Compute Engine machine type for all web/app servers","example":"n2-standard-4"},"dbType":{"type":"string","enum":["db-f1-micro","db-n1-standard-2"],"description":"Cloud SQL for MySQL instance type","example":"db-n1-standard-2"}}},{"title":"Azure architecture","type":"object","required":["provider","computeCount","computeType","dbType"],"properties":{"provider":{"type":"string","enum":["azure"],"description":"Cloud provider"},"computeCount":{"type":"integer","minimum":1,"maximum":10,"description":"Number of Virtual Machine instances to include in the simulated architecture","example":2},"computeType":{"type":"string","enum":["Standard_B2s","Standard_D4s_v3"],"description":"Azure VM size for all web/app servers","example":"Standard_D4s_v3"},"dbType":{"type":"string","enum":["sql-basic","sql-s2"],"description":"Azure SQL Database tier","example":"sql-s2"}}},{"title":"OCI architecture","type":"object","required":["provider","computeCount","computeType","dbType"],"properties":{"provider":{"type":"string","enum":["oci"],"description":"Cloud provider"},"computeCount":{"type":"integer","minimum":1,"maximum":10,"description":"Number of VM instances to include in the simulated architecture","example":2},"computeType":{"type":"string","enum":["VM.Standard.E4.Flex","VM.Standard3.Flex"],"description":"OCI VM shape for all web/app 
servers","example":"VM.Standard3.Flex"},"dbType":{"type":"string","enum":["mysql-heatwave","autonomous-db-std"],"description":"OCI database type","example":"mysql-heatwave"}}},{"title":"DigitalOcean architecture","type":"object","required":["provider","computeCount","computeType","dbType"],"properties":{"provider":{"type":"string","enum":["digitalocean"],"description":"Cloud provider"},"computeCount":{"type":"integer","minimum":1,"maximum":10,"description":"Number of Droplets to include in the simulated architecture","example":2},"computeType":{"type":"string","enum":["s-2vcpu-4gb","s-2vcpu-4gb-amd","c-4"],"description":"Droplet size slug for all web/app servers","example":"s-2vcpu-4gb"},"dbType":{"type":"string","enum":["db-s-1vcpu-1gb","db-s-2vcpu-4gb"],"description":"Managed Database node size","example":"db-s-2vcpu-4gb"}}}],"discriminator":{"propertyName":"provider"}}}}},"responses":{"200":{"description":"Custom accuracy benchmark results (same shape as GET /accuracy-benchmark plus isCustomConfig: true)","content":{"application/json":{"schema":{"type":"object","properties":{"architecture":{"type":"object","description":"Custom architecture spec used for this benchmark run","properties":{"name":{"type":"string"},"description":{"type":"string"},"provider":{"type":"string"},"region":{"type":"string"},"components":{"type":"array","items":{"type":"object","properties":{"role":{"type":"string"},"instanceType":{"type":"string"},"count":{"type":"integer"},"officialHourlyCost":{"type":"number"},"notes":{"type":"string"}}}}}},"seed":{"type":"integer","example":20240601},"generatedAt":{"type":"string","format":"date-time"},"overallScore":{"type":"number","description":"Overall accuracy score (0–100)"},"scoreDriverNote":{"type":"string","description":"One-sentence explanation of the main factor driving this provider's overall accuracy score","example":"db-s-1vcpu-1gb's 75-connection limit is fully saturated at Burst (1000 RPS), causing ~15% error 
rates."},"isCustomConfig":{"type":"boolean","description":"Always `true` for POST responses","example":true},"scenarios":{"type":"array","items":{"type":"object","properties":{"scenario":{"type":"object","properties":{"id":{"type":"string"},"label":{"type":"string"},"trafficRps":{"type":"number"}}},"compositeScore":{"type":"number"},"metrics":{"type":"array","items":{"type":"object","properties":{"metric":{"type":"string"},"label":{"type":"string"},"unit":{"type":"string"},"reference":{"type":"number"},"simulated":{"type":"number"},"deltaPct":{"type":"number"},"accuracyPct":{"type":"number"}}}}}}}}}}}},"400":{"description":"Invalid config: `computeCount` out of range (1–10), or unknown `provider`, `computeType`, or `dbType`","content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string"}}}}}},"500":{"description":"Benchmark execution error","content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string"}}}}}}}}},"/benchmark":{"x-stability":"stable","get":{"tags":["Simulation Fidelity"],"summary":"Multi-cloud pricing and performance benchmark report","description":"Serves the static HTML benchmark report comparing AWS, GCP, Azure, OCI,\nand DigitalOcean across compute, database, storage, and networking tiers.\n\nThe report is a shareable, self-contained HTML file. Use this URL in blog\nposts, HN submissions, or customer-facing comparisons.\n\n**No authentication required.**\n","operationId":"getBenchmarkReport","responses":{"200":{"description":"HTML benchmark report","content":{"text/html":{"schema":{"type":"string","description":"Self-contained HTML document with benchmark data"}}}}}}},"/admin/pricing-check-status":{"x-stability":"experimental","get":{"tags":["Pricing History"],"summary":"Get the latest automated pricing-check status","description":"Returns the result of the most recent weekly pricing-drift check run by\nthe `check-regions` workflow (`scripts/check-regions-scheduled.ts`). 
Each\nentry maps a check identifier (e.g. `aws-ec2-m5large`, `oci-e4flex-vm`)\nto its outcome:\n- **pass** — the committed constant matches the live provider price within tolerance\n- **drift** — the live provider price drifted beyond tolerance; the constant needs updating\n- **error** — the provider pricing source was unreachable; no action required\n\nIf no check has run yet (or the status file is missing), `checkedAt` is\n`null` and `results` is an empty object.\n\n**No authentication required.**\n","operationId":"getPricingCheckStatus","responses":{"200":{"description":"Latest pricing-check status","content":{"application/json":{"schema":{"type":"object","required":["checkedAt","results"],"properties":{"checkedAt":{"type":"string","format":"date-time","nullable":true,"description":"ISO-8601 timestamp of the last completed check run, or null if none has run yet","example":"2026-06-12T22:48:05.359Z"},"results":{"type":"object","description":"Map of check identifier to outcome","additionalProperties":{"type":"string","enum":["pass","drift","error"]},"example":{"aws-ec2-m5large":"pass","oci-e4flex-vm":"pass","do-managed-mongodb-1gb":"error"}}}}}}}}}},"/pricing-history":{"x-stability":"experimental","get":{"tags":["Pricing History"],"summary":"Get cloud provider pricing history","description":"Returns a structured pricing history document containing:\n- **snapshots** — a list of dated pricing snapshots with entry counts\n- **changes** — the most recent price changes detected across all providers\n- **providerTrends** — per-provider price trend direction (up / down / stable)\n- **resourceIds** — the full list of trackable resource identifiers\n\nUse `GET /pricing-history/trend/{resourceId}` to retrieve the full\nprice trend for a specific resource type.\n\n**No authentication required.**\n","operationId":"getPricingHistory","responses":{"200":{"description":"Pricing history 
document","content":{"application/json":{"schema":{"type":"object","properties":{"snapshots":{"type":"array","description":"List of available pricing snapshots","items":{"type":"object","properties":{"date":{"type":"string","format":"date","description":"Snapshot date","example":"2026-01-01"},"label":{"type":"string","description":"Human-readable snapshot label","example":"Jan 2026"},"entryCount":{"type":"integer","description":"Number of price entries in this snapshot","example":42}}}},"changes":{"type":"array","description":"Most recent price changes","items":{"type":"object","properties":{"resourceId":{"type":"string","example":"aws-ec2-m5xlarge"},"provider":{"type":"string","example":"aws"},"previousPrice":{"type":"number","example":0.192},"currentPrice":{"type":"number","example":0.19},"changePct":{"type":"number","example":-1.04},"date":{"type":"string","format":"date"}}}},"providerTrends":{"type":"object","description":"Per-provider price trend direction","additionalProperties":{"type":"string","enum":["up","down","stable"]},"example":{"aws":"stable","gcp":"down","azure":"stable","oci":"down","digitalocean":"stable"}},"resourceIds":{"type":"array","description":"All trackable resource identifiers","items":{"type":"string"},"example":["aws-ec2-m5xlarge","gcp-n2-standard-4","azure-d4s-v3"]}}}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/pricing-history/trend/{resourceId}":{"x-stability":"experimental","get":{"tags":["Pricing History"],"summary":"Get price trend for a specific resource","description":"Returns the full price trend time-series for a single resource type,\nshowing how its hourly rate has changed across all available pricing\nsnapshots.\n\nUse `GET /pricing-history` to retrieve the full list of valid\n`resourceId` values.\n\n**No authentication required.**\n","operationId":"getPricingTrend","parameters":[{"name":"resourceId","in":"path","required":true,"schema":{"type":"string"},"description":"Resource identifier (from the 
resourceIds list in GET /pricing-history)","example":"aws-ec2-m5xlarge"}],"responses":{"200":{"description":"Price trend for the requested resource","content":{"application/json":{"schema":{"type":"object","properties":{"resourceId":{"type":"string","example":"aws-ec2-m5xlarge"},"trend":{"type":"array","description":"Ordered list of price data points","items":{"type":"object","properties":{"date":{"type":"string","format":"date","example":"2026-01-01"},"label":{"type":"string","example":"Jan 2026"},"price":{"type":"number","description":"Hourly rate (USD) at this snapshot date","example":0.192}}}}}}}}},"404":{"description":"Resource ID not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/keys":{"x-stability":"stable","post":{"tags":["API Keys"],"summary":"Create a new API key","description":"Creates a new API key for authenticating to the RL Training API.\nThe full key is returned only once - store it securely.\n","operationId":"createApiKey","requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["name"],"properties":{"name":{"type":"string","description":"Human-readable name for the API key","example":"Production Agent Key"},"scopes":{"type":"array","items":{"type":"string","enum":["read","write","admin"]},"default":["read","write"],"description":"Permission scopes for the key"},"rateLimit":{"type":"integer","default":1000,"description":"Requests per hour limit","example":1000},"expiresAt":{"type":"string","format":"date-time","description":"Optional expiration date","example":"2025-12-31T23:59:59Z"},"userId":{"type":"string","description":"Optional user ID to associate with the key"}}}}}},"responses":{"201":{"description":"API key created successfully","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"key":{"type":"string","description":"The full API key (shown 
only once)","example":"cwm_live_a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6"},"keyPrefix":{"type":"string","description":"Key prefix for identification","example":"cwm_live_a1b2c3d4..."},"name":{"type":"string"},"scopes":{"type":"array","items":{"type":"string"}},"rateLimit":{"type":"integer"},"createdAt":{"type":"string","format":"date-time"},"expiresAt":{"type":"string","format":"date-time"},"message":{"type":"string","example":"Store this API key securely. You won't be able to see it again."}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"500":{"$ref":"#/components/responses/InternalError"}}},"get":{"tags":["API Keys"],"summary":"List API keys","description":"Lists all API keys. Only shows key prefixes for security.\n","operationId":"listApiKeys","parameters":[{"name":"userId","in":"query","schema":{"type":"string"},"description":"Optional filter by user ID"}],"responses":{"200":{"description":"List of API keys","content":{"application/json":{"schema":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"keyPrefix":{"type":"string","example":"cwm_live_a1b2c3d4..."},"name":{"type":"string"},"scopes":{"type":"array","items":{"type":"string"}},"rateLimit":{"type":"integer"},"isActive":{"type":"boolean"},"createdAt":{"type":"string","format":"date-time"},"lastUsedAt":{"type":"string","format":"date-time"},"expiresAt":{"type":"string","format":"date-time"}}}}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/keys/{keyId}":{"x-stability":"stable","delete":{"tags":["API Keys"],"summary":"Revoke an API key","description":"Revokes an API key, preventing further use","operationId":"revokeApiKey","parameters":[{"name":"keyId","in":"path","required":true,"schema":{"type":"string","format":"uuid"}}],"responses":{"200":{"description":"API key revoked successfully","content":{"application/json":{"schema":{"type":"object","properties":{"message":{"type":"string","example":"API key revoked 
successfully"}}}}}},"404":{"$ref":"#/components/responses/NotFound"},"500":{"$ref":"#/components/responses/InternalError"}}}},"/rl/environments":{"x-stability":"stable","post":{"tags":["RL Environments"],"summary":"Create a new RL training environment","description":"Creates a reinforcement learning environment for training agents.\nLinks to an existing simulation and configures episode parameters.\n\nThis endpoint works with simulations built on **any supported provider**,\nincluding AWS, GCP, Azure, OCI, and **DigitalOcean**. Training an agent against\na DigitalOcean simulation lets you learn optimal autoscaling strategies for\nDroplet-based workloads, Managed Database failover handling, and\nmulti-datacenter traffic routing — all without incurring real cloud costs.\nThe observation space, action space, and reward function are identical\nregardless of provider.\n\n**Idle TTL:** Authenticated RL environments expire after **2 hours of inactivity**\n(no `step` or `reset` call received). Expired environments and their linked\nsimulations are removed automatically; subsequent requests return `404`. Reset the\nidle timer by calling `POST /rl/environments/{environmentId}/step` or\n`POST /rl/environments/{environmentId}/reset` at least once every 2 hours during\nlong training runs.\n\n**Note:** In-memory deployments (the default) do not persist RL environments\nacross server restarts. 
Reconnect or recreate the environment after a restart.\n\n**UI Status Viewer:** Once an environment is running you can monitor its episodes,\ncumulative rewards, and health without writing code — open the\n[RL Environment Status viewer](/admin/rl-environments) in the platform UI and enter\nyour API key to see all active environments at a glance.\n","operationId":"createRLEnvironment","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/rl/environments \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"simulationId\": \"a638caad-7423-40a3-bb09-f91235d9392d\",\n    \"episodeConfig\": {\n      \"maxSteps\": 200,\n      \"targetTrafficPattern\": \"wave\",\n      \"initialTraffic\": 1500,\n      \"targetSLA\": { \"maxLatencyP95\": 180, \"maxErrorRate\": 1.0 },\n      \"costBudgetPerHour\": 3.50,\n      \"enableFailures\": false\n    }\n  }'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/rl/environments\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\n        \"simulationId\": \"a638caad-7423-40a3-bb09-f91235d9392d\",\n        \"episodeConfig\": {\n            \"maxSteps\": 200,\n            \"targetTrafficPattern\": \"wave\",\n            \"initialTraffic\": 1500,\n            \"targetSLA\": {\"maxLatencyP95\": 180, \"maxErrorRate\": 1.0},\n            \"costBudgetPerHour\": 3.50,\n            \"enableFailures\": False,\n        },\n    },\n)\nresp.raise_for_status()\ndata = resp.json()\nenv_id = data[\"environment\"][\"id\"]\nprint(f\"Created RL environment: {env_id}\")\nprint(f\"Initial CPU: {data['observation']['metrics']['cpuUsage']}%\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst 
resp = await fetch(`${BASE_URL}/rl/environments`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({\n    simulationId: \"a638caad-7423-40a3-bb09-f91235d9392d\",\n    episodeConfig: {\n      maxSteps: 200,\n      targetTrafficPattern: \"wave\",\n      initialTraffic: 1500,\n      targetSLA: { maxLatencyP95: 180, maxErrorRate: 1.0 },\n      costBudgetPerHour: 3.50,\n      enableFailures: false,\n    },\n  }),\n});\nconst data = await resp.json();\nconst envId = data.environment.id;\nconsole.log(`Created RL environment: ${envId}`);\nconsole.log(`Initial CPU: ${data.observation.metrics.cpuUsage}%`);\n"}],"security":[{"BearerAuth":[]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["simulationId","episodeConfig"],"properties":{"simulationId":{"type":"string","format":"uuid","description":"ID of the simulation to train on","example":"a638caad-7423-40a3-bb09-f91235d9392d"},"episodeConfig":{"$ref":"#/components/schemas/EpisodeConfig"},"webhookUrl":{"type":"string","format":"uri","description":"Optional HTTPS URL to receive webhook notification when episode completes","example":"https://your-app.com/webhooks/rl-episode"},"webhookSecret":{"type":"string","description":"Optional secret for HMAC-SHA256 webhook signature verification","example":"your-secret-key-here"}}},"examples":{"digitalOceanWaveTraffic":{"summary":"DigitalOcean Droplet cluster — wave traffic pattern","description":"Train an agent on a DigitalOcean simulation using a wave traffic pattern.\nThe simulationId must reference an existing simulation that contains\nDigitalOcean resources (Droplets, Managed Database, Load Balancer).\nThe cost budget of $3.50/hr reflects typical pricing for two\ns-2vcpu-4gb Droplets plus a single Managed PostgreSQL basic node in 
nyc3.\n","value":{"simulationId":"b7e1c2d4-3f8a-4b5e-9c0d-1e2f3a4b5c6d","episodeConfig":{"maxSteps":200,"targetTrafficPattern":"wave","initialTraffic":1500,"targetSLA":{"maxLatencyP95":180,"maxErrorRate":1},"costBudgetPerHour":3.5,"enableFailures":false}}},"digitalOceanBurstTraffic":{"summary":"DigitalOcean Droplet cluster — burst traffic with failure injection","description":"More challenging episode: burst traffic spikes combined with random failure\ninjection (e.g., Managed Database failover). Useful for training a robust\nagent that can handle both scaling pressure and partial outages.\n","value":{"simulationId":"b7e1c2d4-3f8a-4b5e-9c0d-1e2f3a4b5c6d","episodeConfig":{"maxSteps":150,"targetTrafficPattern":"burst","initialTraffic":800,"targetSLA":{"maxLatencyP95":200,"maxErrorRate":2},"costBudgetPerHour":5,"enableFailures":true}}},"digitalOceanAMDNVMe":{"summary":"DigitalOcean AMD NVMe Droplet cluster — sustained ramp traffic","description":"Train an agent on a DigitalOcean simulation using AMD NVMe Droplets\n(s-2vcpu-4gb-amd). The AMD variant uses NVMe-backed local storage and\nAMD EPYC processors, offering lower per-hour cost than equivalent\nIntel Droplets while delivering comparable CPU performance for\ncompute-bound workloads.\n\nThe cost budget of $2.80/hr reflects typical pricing for two\ns-2vcpu-4gb-amd Droplets plus a single Managed PostgreSQL basic node\nin nyc3. 
Use this example to benchmark agent policies across Intel vs.\nAMD Droplet fleets under a sustained ramp traffic pattern.\n","value":{"simulationId":"c9f2d3e5-4a7b-4c6f-8d1e-2f3a4b5c6d7e","episodeConfig":{"maxSteps":200,"targetTrafficPattern":"ramp","initialTraffic":1000,"targetSLA":{"maxLatencyP95":175,"maxErrorRate":1},"costBudgetPerHour":2.8,"enableFailures":false,"computeType":"s-2vcpu-4gb-amd"}}},"awsSpot":{"summary":"AWS EC2 Spot — cost-optimized batch RL training using Spot Instances","description":"Train an agent on a cost-sensitive AWS workload backed by EC2 Spot Instances.\nThe simulationId must reference an existing simulation containing Spot-eligible\nresources (e.g. t3.xlarge or c5.2xlarge instances). The cost budget of $1.80/hr\nreflects typical Spot pricing for a small fleet of t3.large instances in us-east-1\n(roughly 70% below on-demand list price). Latency SLA is relaxed to 500 ms p95\nto accommodate the occasional Spot interruption and re-scheduling delay. Use this\nexample to benchmark agent policies that prioritise cost efficiency over strict\nlatency guarantees — a common requirement for ML training, data pipelines, and\nother fault-tolerant batch workloads.\n","value":{"simulationId":"d4e5f6a7-1b2c-4d3e-8f9a-0b1c2d3e4f5a","episodeConfig":{"maxSteps":200,"targetTrafficPattern":"ramp","initialTraffic":1000,"targetSLA":{"maxLatencyP95":500,"maxErrorRate":2},"costBudgetPerHour":1.8,"enableFailures":true},"webhookUrl":"https://your-app.example.com/webhooks/rl-episode","webhookSecret":"aws-spot-rl-secret"}},"gcpSpot":{"summary":"GCP Spot VM — cost-optimized batch RL training using preemptible compute","description":"Train an agent on a cost-sensitive GCP workload backed by Spot VMs (formerly\npreemptible). The simulationId must reference an existing simulation containing\nSpot-eligible resources (e.g. n2-standard-4 or c2-standard-8 Spot instances in\nus-central1). 
The cost budget of $1.60/hr reflects typical Spot pricing for a\nsmall fleet of n2-standard-4 instances (roughly 70% below on-demand list price).\nLatency SLA is relaxed to 500 ms p95 to accommodate the occasional preemption\nand re-scheduling delay. Use this example to benchmark agent policies that\nprioritise cost efficiency over strict latency guarantees — ideal for ML\ntraining, data processing, and other fault-tolerant batch workloads on GCP.\n","value":{"simulationId":"e5f6a7b8-2c3d-4e5f-9a0b-1c2d3e4f5a6b","episodeConfig":{"maxSteps":200,"targetTrafficPattern":"ramp","initialTraffic":1000,"targetSLA":{"maxLatencyP95":500,"maxErrorRate":2},"costBudgetPerHour":1.6,"enableFailures":true},"webhookUrl":"https://your-app.example.com/webhooks/rl-episode","webhookSecret":"gcp-spot-rl-secret"}},"azureSpot":{"summary":"Azure Spot VM — cost-optimized batch RL training using Azure Spot instances","description":"Train an agent on a cost-sensitive Azure workload backed by Azure Spot VMs.\nThe simulationId must reference an existing simulation containing Spot-eligible\nresources (e.g. Standard_D4s_v3 or Standard_F8s_v2 Spot instances in eastus).\nThe cost budget of $1.50/hr reflects typical Spot pricing for a small fleet of\nStandard_D4s_v3 instances (roughly 70% below pay-as-you-go list price). Latency\nSLA is relaxed to 400 ms p95 to accommodate Azure Spot eviction and re-deployment\ndelays. 
Use this example to benchmark agent policies that prioritise cost\nefficiency over strict latency guarantees — well-suited for batch inference,\ndata pipelines, and other interruption-tolerant workloads on Azure.\n","value":{"simulationId":"f6a7b8c9-3d4e-4f5a-0b1c-2d3e4f5a6b7c","episodeConfig":{"maxSteps":200,"targetTrafficPattern":"ramp","initialTraffic":1000,"targetSLA":{"maxLatencyP95":400,"maxErrorRate":2},"costBudgetPerHour":1.5,"enableFailures":true},"webhookUrl":"https://your-app.example.com/webhooks/rl-episode","webhookSecret":"azure-spot-rl-secret"}}}}}},"responses":{"201":{"description":"RL environment created successfully","content":{"application/json":{"schema":{"type":"object","properties":{"environment":{"$ref":"#/components/schemas/RLEnvironment"},"observation":{"$ref":"#/components/schemas/Observation"}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"404":{"description":"Simulation not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/rl/environments/{environmentId}/reset":{"x-stability":"stable","post":{"tags":["RL Environments"],"summary":"Reset an RL environment to start a new episode","description":"Resets the environment to its initial state, clearing all scaling history and events.\nUse this to start a new training episode after the previous one completes.\n","operationId":"resetRLEnvironment","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/rl/environments/env-aws-001/reset \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nENV_ID = \"env-aws-001\"\n\nresp = requests.post(\n    f\"{BASE_URL}/rl/environments/{ENV_ID}/reset\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(f\"Environment reset — step: 
{data['environment']['currentStep']}\")\nprint(f\"Initial CPU: {data['observation']['metrics']['cpuUsage']}%\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst ENV_ID = \"env-aws-001\";\n\nconst resp = await fetch(`${BASE_URL}/rl/environments/${ENV_ID}/reset`, {\n  method: \"POST\",\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst data = await resp.json();\nconsole.log(`Environment reset — step: ${data.environment.currentStep}`);\nconsole.log(`Initial CPU: ${data.observation.metrics.cpuUsage}%`);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"environmentId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the RL environment to reset"}],"responses":{"200":{"description":"Environment reset successfully","content":{"application/json":{"schema":{"type":"object","properties":{"environment":{"$ref":"#/components/schemas/RLEnvironment"},"observation":{"$ref":"#/components/schemas/Observation"},"info":{"type":"object","description":"Metadata about the initial environment state","properties":{"sim_time_human":{"type":"string","description":"Human-readable simulation time at episode start (always \"0s\")","example":"0s"}}}}},"examples":{"awsReset":{"summary":"AWS — environment reset (EC2 m5.large cluster, us-east-1)","description":"Episode reset on an AWS simulation. Resources return to their initial\nstate: 2 × m5.large EC2 instances behind an ALB with RDS Multi-AZ.\nScaling history and events are cleared. 
connectionPressure reflects\nthe RDS connection-pool ratio.\n","value":{"environment":{"id":"env-aws-001","simulationId":"sim-aws-001","isActive":true,"currentStep":0,"maxSteps":100},"observation":{"metrics":{"cpuUsage":41,"latencyP50":38,"latencyP95":82,"errorRate":0.2,"throughput":4800,"costPerHour":0.38,"connectionPressure":0.3},"resources":[{"id":"res-alb-001","name":"ALB","type":"network","provider":"aws","instances":1},{"id":"res-ec2-001","name":"EC2 m5.large","type":"compute","provider":"aws","instances":2},{"id":"res-rds-001","name":"RDS db.r5.large","type":"database","provider":"aws","instances":1}],"traffic":4800,"currentTime":0,"autoscalingConfig":{"scaleOutCpuThreshold":70,"scaleInCpuThreshold":30,"maxInstances":12,"minInstances":2},"scalingHistory":[],"recentEvents":[]},"info":{"sim_time_human":"0s"}}},"gcpReset":{"summary":"GCP — environment reset (GCE e2-standard-4 cluster, us-central1)","description":"Episode reset on a GCP simulation. Resources return to their initial\nstate: 2 × e2-standard-4 GCE instances behind Cloud Load Balancing\nwith Cloud SQL db-standard-4. 
Scaling history and events are cleared.\nconnectionPressure reflects the Cloud SQL connection-pool ratio.\n","value":{"environment":{"id":"env-gcp-001","simulationId":"sim-gcp-001","isActive":true,"currentStep":0,"maxSteps":150},"observation":{"metrics":{"cpuUsage":42,"latencyP50":40,"latencyP95":88,"errorRate":0.2,"throughput":3900,"costPerHour":0.44,"connectionPressure":0.28},"resources":[{"id":"res-lb-001","name":"Cloud Load Balancing","type":"network","provider":"gcp","instances":1},{"id":"res-gce-001","name":"GCE e2-standard-4","type":"compute","provider":"gcp","instances":2},{"id":"res-csql-001","name":"Cloud SQL db-standard-4","type":"database","provider":"gcp","instances":1}],"traffic":3900,"currentTime":0,"autoscalingConfig":{"scaleOutCpuThreshold":68,"scaleInCpuThreshold":30,"maxInstances":10,"minInstances":2},"scalingHistory":[],"recentEvents":[]},"info":{"sim_time_human":"0s"}}},"azureReset":{"summary":"Azure — environment reset (Standard_D4s_v3 VM Scale Set, East US)","description":"Episode reset on an Azure simulation. Resources return to their initial\nstate: 2 × Standard_D4s_v3 VMs behind an Azure Load Balancer\nwith Azure SQL General Purpose 4 vCores. 
Scaling history is cleared.\nconnectionPressure reflects the Azure SQL connection-pool ratio.\n","value":{"environment":{"id":"env-azure-001","simulationId":"sim-azure-001","isActive":true,"currentStep":0,"maxSteps":150},"observation":{"metrics":{"cpuUsage":45,"latencyP50":44,"latencyP95":96,"errorRate":0.3,"throughput":4400,"costPerHour":0.52,"connectionPressure":0.33},"resources":[{"id":"res-alb-001","name":"Azure Load Balancer","type":"network","provider":"azure","instances":1},{"id":"res-vm-001","name":"Standard_D4s_v3","type":"compute","provider":"azure","instances":2},{"id":"res-sql-001","name":"Azure SQL General Purpose","type":"database","provider":"azure","instances":1}],"traffic":4400,"currentTime":0,"autoscalingConfig":{"scaleOutCpuThreshold":70,"scaleInCpuThreshold":30,"maxInstances":10,"minInstances":2},"scalingHistory":[],"recentEvents":[]},"info":{"sim_time_human":"0s"}}},"ociReset":{"summary":"OCI — environment reset (VM.Standard3.Flex + Autonomous DB, us-ashburn-1)","description":"Episode reset on an OCI simulation. Resources return to their initial\nstate: 2 × VM.Standard3.Flex instances behind OCI Load Balancer\nwith Autonomous Database (2 OCPU). 
Scaling history is cleared.\nconnectionPressure reflects the Autonomous DB connection-pool ratio.\n","value":{"environment":{"id":"env-oci-001","simulationId":"sim-oci-001","isActive":true,"currentStep":0,"maxSteps":150},"observation":{"metrics":{"cpuUsage":40,"latencyP50":36,"latencyP95":76,"errorRate":0.1,"throughput":4900,"costPerHour":0.31,"connectionPressure":0.22},"resources":[{"id":"res-lb-001","name":"OCI Load Balancer","type":"network","provider":"oci","instances":1},{"id":"res-vm-001","name":"VM.Standard3.Flex","type":"compute","provider":"oci","instances":2},{"id":"res-adb-001","name":"Autonomous Database 2 OCPU","type":"database","provider":"oci","instances":1}],"traffic":4900,"currentTime":0,"autoscalingConfig":{"scaleOutCpuThreshold":65,"scaleInCpuThreshold":28,"maxInstances":10,"minInstances":2},"scalingHistory":[],"recentEvents":[]},"info":{"sim_time_human":"0s"}}},"digitaloceanReset":{"summary":"DigitalOcean — environment reset (s-2vcpu-4gb Droplets, nyc3)","description":"Episode reset on a DigitalOcean simulation. Resources return to their\ninitial state: 2 × s-2vcpu-4gb Droplets behind a DO Load Balancer\nwith Managed PostgreSQL db-s-2vcpu-4gb. 
Scaling history is cleared.\nconnectionPressure reflects the Managed PostgreSQL connection-pool ratio.\n","value":{"environment":{"id":"env-do-001","simulationId":"sim-do-001","isActive":true,"currentStep":0,"maxSteps":200},"observation":{"metrics":{"cpuUsage":38,"latencyP50":48,"latencyP95":98,"errorRate":0.3,"throughput":1480,"costPerHour":0.72,"connectionPressure":0.4},"resources":[{"id":"res-lb-001","name":"DO Load Balancer","type":"network","provider":"digitalocean","instances":1},{"id":"res-droplet-001","name":"Droplet s-2vcpu-4gb","type":"compute","provider":"digitalocean","instances":2},{"id":"res-pg-001","name":"Managed PostgreSQL","type":"database","provider":"digitalocean","instances":1}],"traffic":1480,"currentTime":0,"autoscalingConfig":{"scaleOutCpuThreshold":65,"scaleInCpuThreshold":35,"maxInstances":8,"minInstances":1},"scalingHistory":[],"recentEvents":[]},"info":{"sim_time_human":"0s"}}},"digitaloceanAMDNVMeReset":{"summary":"DigitalOcean — environment reset (s-2vcpu-4gb-amd AMD NVMe Droplets, nyc3)","description":"Episode reset on a DigitalOcean simulation backed by AMD NVMe Droplets\n(s-2vcpu-4gb-amd) at $0.038/hr per instance. The AMD EPYC variant\nprovides NVMe-backed local storage and lower P95 latency than the\nIntel equivalent under the same traffic load. Resources return to their\ninitial state: 2 × s-2vcpu-4gb-amd Droplets behind a DO Load Balancer\nwith Managed PostgreSQL db-s-2vcpu-4gb. 
Scaling history is cleared.\nconnectionPressure reflects the Managed PostgreSQL connection-pool ratio.\nUse this alongside the Intel example to compare agent policy performance\nacross Droplet variants with identical topology.\n","value":{"environment":{"id":"env-do-amd-001","simulationId":"c9f2d3e5-4a7b-4c6f-8d1e-2f3a4b5c6d7e","isActive":true,"currentStep":0,"maxSteps":200},"observation":{"metrics":{"cpuUsage":35.5,"latencyP50":44,"latencyP95":92,"errorRate":0.2,"throughput":995,"costPerHour":0.62,"connectionPressure":0.37},"resources":[{"id":"res-lb-amd-001","name":"DO Load Balancer","type":"network","provider":"digitalocean","instances":1},{"id":"res-droplet-amd-001","name":"Droplet s-2vcpu-4gb-amd","type":"compute","provider":"digitalocean","instances":2},{"id":"res-pg-amd-001","name":"Managed PostgreSQL","type":"database","provider":"digitalocean","instances":1}],"traffic":995,"currentTime":0,"autoscalingConfig":{"scaleOutCpuThreshold":65,"scaleInCpuThreshold":35,"maxInstances":8,"minInstances":1},"scalingHistory":[],"recentEvents":[]},"info":{"sim_time_human":"0s"}}}}}}},"404":{"$ref":"#/components/responses/NotFound"}}}},"/rl/environments/{environmentId}/step":{"x-stability":"stable","post":{"tags":["RL Environments"],"summary":"Execute an action and advance the simulation by one step","description":"Executes an agent action, simulates one time step, and returns the next observation,\nreward, and episode completion status. This is the core training loop interaction.\n\n**Idle TTL:** Each successful step call resets the environment's 2-hour idle timer.\nEnvironments that receive no `step` or `reset` calls for 2 hours are automatically\ndeactivated and their linked simulation artifacts removed. Subsequent requests to a\ndeactivated environment return `404`. 
Call `step` or `reset` at least once every\n2 hours during long training runs to keep the environment alive.\n","operationId":"stepRLEnvironment","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/rl/environments/env-aws-001/step \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action\": {\"type\": \"scale_out\", \"parameters\": {\"instanceCount\": 1}}}'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nENV_ID = \"env-aws-001\"\n\nresp = requests.post(\n    f\"{BASE_URL}/rl/environments/{ENV_ID}/step\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\"action\": {\"type\": \"scale_out\", \"parameters\": {\"instanceCount\": 1}}},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(f\"t={data['t']}  reward={data['reward']:.3f}  done={data['done']}\")\nprint(f\"CPU: {data['obs']['cpu_util']:.1%}  P95: {data['metrics']['latency_p95']} ms  \"\n      f\"cost: ${data['metrics']['cost_usd_hr']:.2f}/hr\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst ENV_ID = \"env-aws-001\";\n\nconst resp = await fetch(`${BASE_URL}/rl/environments/${ENV_ID}/step`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({ action: { type: \"scale_out\", parameters: { instanceCount: 1 } } }),\n});\nconst data = await resp.json();\nconsole.log(`t=${data.t}  reward=${data.reward.toFixed(3)}  done=${data.done}`);\nconsole.log(`CPU: ${(data.obs.cpu_util * 100).toFixed(1)}%  P95: ${data.metrics.latency_p95} ms  cost: $${data.metrics.cost_usd_hr.toFixed(2)}/hr`);\n"},{"lang":"Python","label":"Python – warm-up/training","source":"\"\"\"\nTwo-phase training loop: fast 
warm-up followed by fine-grained training.\n\nPhase 1 – Warm-up (300 s ticks)\n  Use large tick_seconds to fast-forward through startup noise before the\n  agent starts making meaningful autoscaling decisions. Each step advances\n  the simulation clock by 5 minutes, so 20 warm-up steps cover ~1.7 hours\n  of simulated time in seconds of wall time.\n\nPhase 2 – Training (60 s ticks)\n  Switch to 1-minute ticks for precise autoscaling control. The agent now\n  observes and reacts to traffic on a per-minute basis, matching the\n  granularity of real autoscaling cooldown windows (e.g. AWS default: 300 s).\n\"\"\"\nimport requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nENV_ID = \"env-aws-001\"\n\nsession = requests.Session()\nsession.headers.update({\"Authorization\": f\"Bearer {API_KEY}\"})\n\ndef step(action: dict, tick_seconds: int) -> dict:\n    resp = session.post(\n        f\"{BASE_URL}/rl/environments/{ENV_ID}/step\",\n        json={\"action\": action, \"tick_seconds\": tick_seconds},\n    )\n    resp.raise_for_status()\n    return resp.json()\n\nNO_OP = {\"type\": \"no_op\", \"parameters\": {}}\n\nWARMUP_STEPS = 20\nprint(\"=== Warm-up phase (300 s ticks) ===\")\nfor i in range(WARMUP_STEPS):\n    data = step(NO_OP, tick_seconds=300)\n    print(\n        f\"  warmup {i+1:2d}/{WARMUP_STEPS}  \"\n        f\"sim={data['sim_time_human']:>8s}  \"\n        f\"cpu={data['obs']['cpu_util']:.1%}  \"\n        f\"instances={data['obs']['instances']}\"\n    )\n    if data[\"done\"]:\n        print(\"  Episode ended during warm-up — reset and retry.\")\n        break\n\nprint(\"\\n=== Training phase (60 s ticks) ===\")\n# Skip training if the episode already ended during warm-up.\ndone = data[\"done\"]\nwhile not done:\n    cpu = data[\"obs\"][\"cpu_util\"]\n    if cpu > 0.75:\n        action = {\"type\": \"scale_out\", \"parameters\": {\"instanceCount\": 1}}\n    elif cpu < 0.30 and data[\"obs\"][\"instances\"] > 1:\n        action = {\"type\": \"scale_in\", \"parameters\": {\"instanceCount\": 
1}}\n    else:\n        action = NO_OP\n\n    data = step(action, tick_seconds=60)\n    done = data[\"done\"]\n    print(\n        f\"  t={data['t']:4d}  sim={data['sim_time_human']:>8s}  \"\n        f\"reward={data['reward']:+.3f}  \"\n        f\"cpu={data['obs']['cpu_util']:.1%}  \"\n        f\"p95={data['metrics']['latency_p95']} ms  \"\n        f\"cost=${data['metrics']['cost_usd_hr']:.2f}/hr  \"\n        f\"done={done}\"\n    )\n"},{"lang":"Node.js","label":"Node.js – warm-up/training","source":"/**\n * Two-phase training loop: fast warm-up followed by fine-grained training.\n *\n * Phase 1 – Warm-up (300 s ticks)\n *   Use large tick_seconds to fast-forward through startup noise before the\n *   agent starts making meaningful autoscaling decisions. Each step advances\n *   the simulation clock by 5 minutes, so 20 warm-up steps cover ~1.7 hours\n *   of simulated time in seconds of wall time.\n *\n * Phase 2 – Training (60 s ticks)\n *   Switch to 1-minute ticks for precise autoscaling control. The agent now\n *   observes and reacts to traffic on a per-minute basis, matching the\n *   granularity of real autoscaling cooldown windows (e.g. 
AWS default: 300 s).\n */\nconst BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst ENV_ID = \"env-aws-001\";\n\nasync function step(action, tickSeconds) {\n  const resp = await fetch(`${BASE_URL}/rl/environments/${ENV_ID}/step`, {\n    method: \"POST\",\n    headers: {\n      \"Authorization\": `Bearer ${API_KEY}`,\n      \"Content-Type\": \"application/json\",\n    },\n    body: JSON.stringify({ action, tick_seconds: tickSeconds }),\n  });\n  if (!resp.ok) throw new Error(`step failed: ${resp.status}`);\n  return resp.json();\n}\n\nconst NO_OP = { type: \"no_op\", parameters: {} };\n\n// --- Phase 1: Warm-up (300 s / 5-minute ticks) ---\nconst WARMUP_STEPS = 20;\nconsole.log(\"=== Warm-up phase (300 s ticks) ===\");\nlet data;\nfor (let i = 0; i < WARMUP_STEPS; i++) {\n  data = await step(NO_OP, 300);\n  console.log(\n    `  warmup ${String(i + 1).padStart(2)}/${WARMUP_STEPS}` +\n    `  sim=${data.sim_time_human.padStart(8)}` +\n    `  cpu=${(data.obs.cpu_util * 100).toFixed(1)}%` +\n    `  instances=${data.obs.instances}`\n  );\n  if (data.done) {\n    console.log(\"  Episode ended during warm-up — reset and retry.\");\n    break;\n  }\n}\n\n// --- Phase 2: Training (60 s / 1-minute ticks) ---\nconsole.log(\"\\n=== Training phase (60 s ticks) ===\");\n// Skip training if the episode already ended during warm-up.\nlet done = data.done;\nwhile (!done) {\n  const cpu = data.obs.cpu_util;\n  let action;\n  if (cpu > 0.75) {\n    action = { type: \"scale_out\", parameters: { instanceCount: 1 } };\n  } else if (cpu < 0.30 && data.obs.instances > 1) {\n    action = { type: \"scale_in\", parameters: { instanceCount: 1 } };\n  } else {\n    action = NO_OP;\n  }\n\n  data = await step(action, 60);\n  done = data.done;\n  console.log(\n    `  t=${String(data.t).padStart(4)}  sim=${data.sim_time_human.padStart(8)}` +\n    `  reward=${data.reward >= 0 ? 
\"+\" : \"\"}${data.reward.toFixed(3)}` +\n    `  cpu=${(data.obs.cpu_util * 100).toFixed(1)}%` +\n    `  p95=${data.metrics.latency_p95} ms` +\n    `  cost=$${data.metrics.cost_usd_hr.toFixed(2)}/hr` +\n    `  done=${done}`\n  );\n}\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"environmentId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the RL environment"}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["action"],"properties":{"action":{"$ref":"#/components/schemas/Action"},"tick_seconds":{"type":"integer","minimum":1,"maximum":3600,"description":"Per-step override for the simulation clock advancement. When provided,\nthis value overrides the episode-level `tick_seconds` set in `episodeConfig`\nfor this step only. Useful for agents that want to fast-forward through\nwarm-up phases (e.g. 300 s ticks) and then switch to finer-grained steps\n(e.g. 60 s ticks) for precise autoscaling decisions. The actual value\nused is reflected in `observation.tick_seconds` in the response.\n","example":300}}},"examples":{"doAdjustThreshold":{"summary":"DO: tune Droplet autoscaling thresholds","description":"Lower the CPU scale-out trigger from the default 70 % to 65 % so the\nDigitalOcean autoscaler scales out earlier, reducing latency during wave\ntraffic peaks. Also tightens the throughput threshold to 70 %.\n","value":{"action":{"type":"adjust_threshold","parameters":{"cpuThreshold":65,"throughputThreshold":70,"latencyThreshold":160}}}},"doScaleOut":{"summary":"DO: proactively add a Droplet replica","description":"Force-provision one additional s-2vcpu-4gb Droplet replica ahead of a\npredicted traffic spike. 
The first observation after this action will\nreflect the ~30 s cold-start latency overhead modelled for DigitalOcean.\n","value":{"action":{"type":"scale_out","parameters":{"instanceCount":1}}}},"doScaleIn":{"summary":"DO: remove an idle Droplet replica","description":"Remove one Droplet replica when traffic has subsided. The 180 s cooldown\nin the DigitalOcean autoscaling profile prevents thrashing between\nscale-in and scale-out actions.\n","value":{"action":{"type":"scale_in","parameters":{"instanceCount":1}}}},"digitaloceanAMDNVMeScaleOut":{"summary":"DO AMD NVMe: add an s-2vcpu-4gb-amd Droplet replica","description":"Force-provision one additional AMD NVMe Droplet (s-2vcpu-4gb-amd,\n$0.038/hr per instance) ahead of a predicted traffic spike. The AMD\nEPYC variant's NVMe-backed local storage delivers higher I/O throughput\nthan the standard Intel Droplet at the same price point. After the ~30 s\ncold-start overhead the cluster moves from 2 to 3 instances, bringing\nCPU utilisation down from ~71 % to ~56 % and P95 latency from ~148 ms\nto ~118 ms. This action triggers the `digitaloceanAMDNVMeStep` response\nshape, where resources are named \"Droplet s-2vcpu-4gb-amd\" to\ndistinguish AMD NVMe Droplets from standard Intel Droplets.\n","value":{"action":{"type":"scale_out","parameters":{"instanceCount":1}}}}}}}},"responses":{"200":{"description":"Step executed successfully","content":{"application/json":{"schema":{"type":"object","required":["t","obs","metrics","reward","reward_components","done","sim_time_human","info"],"properties":{"t":{"type":"integer","description":"Current simulation time step (incremented after each action)","example":15},"obs":{"$ref":"#/components/schemas/RLObs"},"metrics":{"$ref":"#/components/schemas/RLMetrics"},"resources":{"type":"array","description":"Full resource list with per-resource `recoveryPolicy`. 
Resources that have never had `set_recovery_policy` applied carry the global defaults (criticalCpuThreshold: 80, criticalSteps: 4, warningCpuThreshold: 70, warningSteps: 3). Use this to confirm a `set_recovery_policy` action took effect or to compare healing configurations across resources.\n","items":{"$ref":"#/components/schemas/Resource"}},"reward":{"type":"number","description":"Scalar total reward for this step (weighted sum of reward_components)","example":0.481},"reward_components":{"type":"object","description":"Individual reward sub-scores before weighting","properties":{"performance":{"type":"number","description":"Performance score (0–1, based on latency and errors)","example":0.812},"cost":{"type":"number","description":"Cost efficiency score (0–1, based on budget)","example":0.924},"stability":{"type":"number","description":"Stability score (−1 to 1, penalizes excessive changes)","example":-0.1},"sla":{"type":"number","description":"SLA compliance score (−1 to 0, penalizes violations)","example":0},"connection_pressure":{"type":"number","description":"DB connection-pool saturation penalty. Only present when the simulation contains database resources. 0 when pool pressure ≤ 1.0 (healthy); decreases with slope −1 per unit of pressure from 1.0 to 1.5 (reaching −0.5), then drops with slope −2 per unit above 1.5 — twice as steep — flooring at −1.0 at pressure ≥ 1.75. Added directly to the weighted sum of the other four components so agents are penalised for driving pools into exhaustion even before latency rises. NOTE: This component is absent until the episode first contains a database resource. When a DB is added mid-episode (via add_resource), the penalty is ramped in linearly over 5 steps (starting at 1/5 of its full magnitude on the first step it appears, reaching full strength after 5 steps) so its introduction does not cause a sudden step-to-step reward discontinuity. 
Agents may still treat the first appearance as a near-zero baseline.\n","example":-0.3}}},"done":{"type":"boolean","description":"Whether the episode is complete","example":false},"sim_time_human":{"type":"string","description":"Human-readable representation of the elapsed simulated time.\nFormat is `Xh Ym` when at least one hour has elapsed,\n`Xm Ys` when at least one minute (but less than one hour)\nhas elapsed, and `Xs` for less than one minute.\nMirrors the value inside `info.sim_time_human` for convenient\ntop-level access without unpacking the info object.\n","example":"15s"},"info":{"type":"object","description":"Additional diagnostic information","properties":{"stepMetrics":{"type":"object","description":"Raw metrics from this step"},"eventsGenerated":{"type":"integer","description":"Number of events generated this step"},"currentCost":{"type":"number","description":"Current cost per hour"},"sim_time_human":{"type":"string","description":"Human-readable representation of the elapsed simulated time.\nFormat is `Xh Ym` when at least one hour has elapsed,\n`Xm Ys` when at least one minute (but less than one hour)\nhas elapsed, and `Xs` for less than one minute.\n","example":"1h 0m"}}}}},"examples":{"awsStepWithDb":{"summary":"AWS — step response after scale-out (EC2 m5.large + RDS, us-east-1)","description":"The agent scaled out by 1 EC2 instance (now 3 × m5.large). CPU dropped\nfrom 69 % to 53 %, P95 latency is well within the 200 ms SLA, and cost\nrose to $0.57/hr. 
connection_pressure reflects the RDS Multi-AZ\nconnection-pool ratio; 0.42 is healthy (well below pool exhaustion).\n","value":{"t":42,"obs":{"rps":4750,"cpu_util":0.534,"instances":3,"traffic":4750,"currentTime":42},"metrics":{"cost_usd_hr":0.57,"latency_p95":98,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.42},"reward":0.531,"reward_components":{"performance":0.841,"cost":0.91,"stability":-0.1,"sla":0},"done":false,"sim_time_human":"42s","info":{"stepMetrics":{"cpuUsage":53.4,"throughput":4750},"eventsGenerated":1,"currentCost":0.57,"sim_time_human":"42s"}}},"gcpStepWithDb":{"summary":"GCP — step response after no-op (GCE e2-standard-4 + Cloud SQL, us-central1)","description":"The agent issued a no-op while the cluster ran at 2 × e2-standard-4 GCE\ninstances with Cloud Load Balancing and Cloud SQL. CPU is stable at 48 %,\nP95 latency is 91 ms, and cost is $0.44/hr. connection_pressure reflects\nthe Cloud SQL connection-pool ratio; 0.38 indicates plenty of headroom.\n","value":{"t":30,"obs":{"rps":3820,"cpu_util":0.478,"instances":2,"traffic":3820,"currentTime":30},"metrics":{"cost_usd_hr":0.44,"latency_p95":91,"error_rate":0.002,"uptime":0.998,"sla_violations":0,"connection_pressure":0.38},"reward":0.612,"reward_components":{"performance":0.873,"cost":0.951,"stability":0,"sla":0},"done":false,"sim_time_human":"30s","info":{"stepMetrics":{"cpuUsage":47.8,"throughput":3820},"eventsGenerated":0,"currentCost":0.44,"sim_time_human":"30s"}}},"azureStepWithDb":{"summary":"Azure — step response after no-op (Standard_D4s_v3 + Azure SQL, East US)","description":"The agent issued a no-op while the cluster ran at 2 × Standard_D4s_v3\nVMs behind Azure Load Balancer with Azure SQL. CPU is at 55 %, P95\nlatency is 104 ms, and cost is $0.52/hr. 
connection_pressure reflects\nthe Azure SQL connection-pool ratio; 0.51 is moderate but healthy.\n","value":{"t":28,"obs":{"rps":4300,"cpu_util":0.551,"instances":2,"traffic":4300,"currentTime":28},"metrics":{"cost_usd_hr":0.52,"latency_p95":104,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.51},"reward":0.487,"reward_components":{"performance":0.796,"cost":0.938,"stability":0,"sla":0},"done":false,"sim_time_human":"28s","info":{"stepMetrics":{"cpuUsage":55.1,"throughput":4300},"eventsGenerated":0,"currentCost":0.52,"sim_time_human":"28s"}}},"ociStepWithDb":{"summary":"OCI — step response after scale-in (VM.Standard3.Flex + Autonomous DB, us-ashburn-1)","description":"The agent scaled in by 1 instance (now 2 × VM.Standard3.Flex) after\ntraffic subsided. CPU is low at 34 %, P95 latency is 72 ms, and cost\nis $0.31/hr. connection_pressure reflects the Autonomous Database\nconnection-pool ratio; 0.29 is well within healthy bounds.\n","value":{"t":25,"obs":{"rps":4820,"cpu_util":0.342,"instances":2,"traffic":4820,"currentTime":25},"metrics":{"cost_usd_hr":0.31,"latency_p95":72,"error_rate":0.001,"uptime":0.999,"sla_violations":0,"connection_pressure":0.29},"reward":0.703,"reward_components":{"performance":0.921,"cost":0.985,"stability":-0.1,"sla":0},"done":false,"sim_time_human":"25s","info":{"stepMetrics":{"cpuUsage":34.2,"throughput":4820},"eventsGenerated":1,"currentCost":0.31,"sim_time_human":"25s"}}},"doStepAfterScaleOut":{"summary":"DO: step response after scaling out to 3 Droplet replicas (with Managed PostgreSQL)","description":"The agent scaled out by 1 Droplet (now 3 × s-2vcpu-4gb). CPU dropped\nfrom 72 % to 58 %, P95 latency improved to 104 ms, and cost rose to\n$1.08/hr (within the $3.50/hr budget). Reward is positive because the\nSLA is satisfied and cost efficiency is high. 
connection_pressure\nreflects the Managed PostgreSQL connection-pool ratio; 0.44 is healthy.\n","value":{"t":15,"obs":{"rps":1620,"cpu_util":0.582,"instances":3,"traffic":1620,"currentTime":15},"metrics":{"cost_usd_hr":1.08,"latency_p95":104,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.44},"reward":0.481,"reward_components":{"performance":0.812,"cost":0.924,"stability":-0.1,"sla":0},"done":false,"sim_time_human":"15s","info":{"stepMetrics":{"cpuUsage":58.2,"throughput":1620},"eventsGenerated":1,"currentCost":1.08,"sim_time_human":"15s"}}},"doStepThresholdTune":{"summary":"DO: step response after tuning CPU/throughput thresholds","description":"The agent lowered the CPU threshold to 65 % without changing instance\ncount. The cluster runs at 2 Droplets. Metrics are stable; the small\nstability penalty (−0.05) reflects the configuration change itself.\nThis scenario has no database resource, so connection_pressure is absent.\n","value":{"t":8,"obs":{"rps":1480,"cpu_util":0.614,"instances":2,"traffic":1480,"currentTime":8},"metrics":{"cost_usd_hr":0.72,"latency_p95":118,"error_rate":0.004,"uptime":0.996,"sla_violations":0},"reward":0.392,"reward_components":{"performance":0.743,"cost":0.96,"stability":-0.05,"sla":0},"done":false,"sim_time_human":"8s","info":{"stepMetrics":{"cpuUsage":61.4,"throughput":1480},"eventsGenerated":0,"currentCost":0.72,"sim_time_human":"8s"}}},"digitaloceanAMDNVMeStep":{"summary":"DigitalOcean — step response after scale-out (s-2vcpu-4gb-amd AMD NVMe Droplets, nyc3)","description":"The agent scaled out by 1 AMD NVMe Droplet (now 3 × s-2vcpu-4gb-amd,\n$0.038/hr per instance). 
CPU dropped from 71 % to 56 %, P95 latency\nimproved to 98 ms, and cost rose to $0.93/hr (within the $3.50/hr budget).\nThe per-instance compute cost is $0.038/hr, so 3 instances total $0.114/hr\nfor compute alone; `metrics.cost_usd_hr` (0.93) is the full simulated stack\ncost including the DO Load Balancer, Managed PostgreSQL, and modelled\noverhead. The AMD EPYC variant's NVMe-backed local storage yields slightly\nlower P95 latency than the Intel s-2vcpu-4gb equivalent at the same traffic\nlevel — 98 ms vs ~104 ms after scale-out. connection_pressure reflects the\nManaged PostgreSQL db-s-2vcpu-4gb connection-pool ratio; 0.41 is healthy.\nResources are named \"Droplet s-2vcpu-4gb-amd\" to distinguish AMD NVMe\nDroplets from standard Intel Droplets.\n","value":{"t":20,"obs":{"rps":1620,"cpu_util":0.561,"instances":3,"traffic":1620,"currentTime":20},"metrics":{"cost_usd_hr":0.93,"latency_p95":98,"error_rate":0.002,"uptime":0.998,"sla_violations":0,"connection_pressure":0.41},"reward":0.514,"reward_components":{"performance":0.834,"cost":0.918,"stability":-0.1,"sla":0},"done":false,"sim_time_human":"20s","info":{"stepMetrics":{"cpuUsage":56.1,"throughput":1620},"eventsGenerated":1,"currentCost":0.93,"sim_time_human":"20s"}}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/rl/environments/{environmentId}/batch-step":{"x-stability":"stable","post":{"tags":["RL Environments"],"summary":"Execute multiple actions in a single request","description":"Executes up to 30 actions sequentially in a single HTTP round-trip, advancing the\nsimulation by one step per action. Stops early and returns partial results if the\nepisode ends (`done: true`) before all actions are processed.\n\n**Rate limiting:** Each action in the batch counts as one call against the\n5 000 req/hr RL training quota. 
A batch of 30 actions consumes 30 quota units.\nIf the quota is exhausted mid-batch, the endpoint returns 429 with a `Retry-After`\nheader indicating how many seconds remain until the window resets.\n\n**Idle TTL:** Each successful batch-step call resets the environment's 2-hour\nidle timer, the same as a single `step` call.\n","operationId":"batchStepRLEnvironment","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/rl/environments/env-aws-001/batch-step \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"steps\": [\n      {\"action\": {\"type\": \"scale_out\", \"parameters\": {\"instanceCount\": 1}}},\n      {\"action\": {\"type\": \"no_op\",     \"parameters\": {}}},\n      {\"action\": {\"type\": \"no_op\",     \"parameters\": {}}}\n    ]\n  }'\n"},{"lang":"Python","label":"Python","source":"import time, requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nENV_ID = \"env-aws-001\"\n\nsession = requests.Session()\nsession.headers.update({\"Authorization\": f\"Bearer {API_KEY}\"})\n\ndef batch_step(steps: list[dict], max_retries: int = 5) -> dict:\n    for attempt in range(max_retries):\n        resp = session.post(\n            f\"{BASE_URL}/rl/environments/{ENV_ID}/batch-step\",\n            json={\"steps\": steps},\n        )\n        if resp.status_code == 429:\n            retry_after = int(resp.headers.get(\"Retry-After\", 60))\n            print(f\"Rate limited — waiting {retry_after}s\")\n            time.sleep(retry_after)\n            continue\n        resp.raise_for_status()\n        return resp.json()\n    raise RuntimeError(\"Exceeded max retries\")\n\nNO_OP = {\"action\": {\"type\": \"no_op\", \"parameters\": {}}}\nSCALE_OUT = {\"action\": {\"type\": \"scale_out\", \"parameters\": {\"instanceCount\": 1}}}\n\nwarmup = batch_step([{**NO_OP, \"tick_seconds\": 300}] * 30)\nprint(f\"Warm-up 
complete: {len(warmup['results'])} steps\")\n\ntotal_reward = 0.0\nwhile True:\n    result = batch_step([SCALE_OUT] + [NO_OP] * 9)\n    for step_result in result[\"results\"]:\n        total_reward += step_result[\"reward\"]\n        if step_result[\"done\"]:\n            print(f\"Episode complete — total reward: {total_reward:.2f}\")\n            break\n    else:\n        continue\n    break\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst ENV_ID = \"env-aws-001\";\n\nasync function batchStep(steps, maxRetries = 5) {\n  for (let attempt = 0; attempt < maxRetries; attempt++) {\n    const resp = await fetch(`${BASE_URL}/rl/environments/${ENV_ID}/batch-step`, {\n      method: \"POST\",\n      headers: { \"Authorization\": `Bearer ${API_KEY}`, \"Content-Type\": \"application/json\" },\n      body: JSON.stringify({ steps }),\n    });\n    if (resp.status === 429) {\n      const retryAfter = parseInt(resp.headers.get(\"Retry-After\") ?? 
\"60\", 10);\n      console.log(`Rate limited — waiting ${retryAfter}s`);\n      await new Promise(r => setTimeout(r, retryAfter * 1000));\n      continue;\n    }\n    if (!resp.ok) throw new Error(`HTTP ${resp.status}`);\n    return resp.json();\n  }\n  throw new Error(\"Exceeded max retries\");\n}\n\nconst NO_OP   = { action: { type: \"no_op\",     parameters: {} } };\nconst SCALE_OUT = { action: { type: \"scale_out\", parameters: { instanceCount: 1 } } };\n\n// 30-step warm-up at 300 s ticks\nconst warmup = await batchStep(Array(30).fill({ ...NO_OP, tick_seconds: 300 }));\nconsole.log(`Warm-up complete: ${warmup.results.length} steps`);\n\n// Training loop using 10-step batches\nlet totalReward = 0;\nlet done = false;\nwhile (!done) {\n  const { results } = await batchStep([SCALE_OUT, ...Array(9).fill(NO_OP)]);\n  for (const r of results) {\n    totalReward += r.reward;\n    if (r.done) { done = true; break; }\n  }\n}\nconsole.log(`Episode complete — total reward: ${totalReward.toFixed(2)}`);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"environmentId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the RL environment"}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["steps"],"properties":{"steps":{"type":"array","minItems":1,"maxItems":30,"description":"Ordered list of step actions to execute (max 30)","items":{"type":"object","required":["action"],"properties":{"action":{"$ref":"#/components/schemas/Action"},"tick_seconds":{"type":"integer","minimum":1,"maximum":3600,"description":"Simulated seconds to advance per step (overrides environment default)"}}}}}},"examples":{"threeStepBatch":{"summary":"Scale out then observe","value":{"steps":[{"action":{"type":"scale_out","parameters":{"instanceCount":1}}},{"action":{"type":"no_op","parameters":{}}},{"action":{"type":"no_op","parameters":{}}}]}},"warmupBatch":{"summary":"30-step warm-up at 300 s 
ticks (one representative item shown; send 30 such items for the full warm-up)","value":{"steps":[{"action":{"type":"no_op","parameters":{}},"tick_seconds":300}]}}}}}},"responses":{"200":{"description":"Batch steps executed successfully","content":{"application/json":{"schema":{"type":"object","required":["results"],"properties":{"results":{"type":"array","description":"Step results in the same order as the request `steps` array.\nMay be shorter than the request array if the episode ended early\n(`done: true` in the last result).\n","items":{"type":"object","required":["t","obs","metrics","reward","reward_components","done","sim_time_human","info"],"properties":{"t":{"type":"integer","description":"Current step index"},"obs":{"$ref":"#/components/schemas/RLObs"},"metrics":{"$ref":"#/components/schemas/RLMetrics"},"resources":{"type":"array","description":"Full resource list with per-resource recoveryPolicy","items":{"$ref":"#/components/schemas/Resource"}},"reward":{"type":"number","description":"Scalar total reward for this step"},"reward_components":{"type":"object","properties":{"performance":{"type":"number"},"cost":{"type":"number"},"stability":{"type":"number"},"sla":{"type":"number"},"connection_pressure":{"type":"number","description":"DB connection-pool saturation penalty. Only present when the simulation contains database resources. 0 when pool pressure ≤ 1.0 (healthy); negative above that, flooring at −1.0 at pressure ≥ 1.75. When a DB is added mid-episode the penalty is ramped in linearly over 5 steps so it does not cause a sudden reward jump. 
See step response for full formula.\n"}}},"done":{"type":"boolean","description":"True when the episode has ended"},"sim_time_human":{"type":"string","description":"Human-readable simulation time"},"info":{"type":"object","description":"Additional episode metadata"}}}}}},"examples":{"threeStepResult":{"summary":"Three-step batch — scale out then two no-ops (abbreviated to the first two results)","value":{"results":[{"t":1,"obs":{"rps":4900,"cpu_util":0.58,"instances":3,"traffic":4900,"currentTime":60,"tick_seconds":60},"metrics":{"cost_usd_hr":0.41,"latency_p95":112,"error_rate":0.004,"uptime":0.996,"sla_violations":0},"reward":0.734,"reward_components":{"performance":0.812,"cost":0.901,"stability":0,"sla":0},"done":false,"sim_time_human":"1m 0s","info":{"sim_time_human":"1m 0s"},"resources":[]},{"t":2,"obs":{"rps":5100,"cpu_util":0.44,"instances":3,"traffic":5100,"currentTime":120,"tick_seconds":60},"metrics":{"cost_usd_hr":0.41,"latency_p95":89,"error_rate":0.001,"uptime":0.999,"sla_violations":0},"reward":0.819,"reward_components":{"performance":0.889,"cost":0.901,"stability":0,"sla":0},"done":false,"sim_time_human":"2m 0s","info":{"sim_time_human":"2m 0s"},"resources":[]}]}},"awsBatchWithDb":{"summary":"AWS — batch-step (EC2 m5.large + RDS Multi-AZ, us-east-1)","description":"Two-step batch: scale out by 1 EC2 instance, then observe with a\nno-op. 
Each step's metrics include connection_pressure, which\nreflects the RDS Multi-AZ connection-pool ratio; values near 0.4\nare healthy (well below pool exhaustion).\n","value":{"results":[{"t":1,"obs":{"rps":4750,"cpu_util":0.61,"instances":3,"traffic":4750,"currentTime":60,"tick_seconds":60},"metrics":{"cost_usd_hr":0.57,"latency_p95":108,"error_rate":0.004,"uptime":0.996,"sla_violations":0,"connection_pressure":0.45},"reward":0.531,"reward_components":{"performance":0.812,"cost":0.901,"stability":-0.1,"sla":0},"done":false,"sim_time_human":"1m 0s","info":{"sim_time_human":"1m 0s"},"resources":[]},{"t":2,"obs":{"rps":4820,"cpu_util":0.53,"instances":3,"traffic":4820,"currentTime":120,"tick_seconds":60},"metrics":{"cost_usd_hr":0.57,"latency_p95":98,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.42},"reward":0.612,"reward_components":{"performance":0.841,"cost":0.91,"stability":0,"sla":0},"done":false,"sim_time_human":"2m 0s","info":{"sim_time_human":"2m 0s"},"resources":[]}]}},"gcpBatchWithDb":{"summary":"GCP — batch-step (GCE e2-standard-4 + Cloud SQL, us-central1)","description":"Two-step batch of no-ops while the cluster runs at 2 × e2-standard-4\nwith Cloud Load Balancing and Cloud SQL. 
Each step's metrics include\nconnection_pressure, which reflects the Cloud SQL connection-pool\nratio; 0.38 indicates plenty of headroom.\n","value":{"results":[{"t":1,"obs":{"rps":3820,"cpu_util":0.48,"instances":2,"traffic":3820,"currentTime":60,"tick_seconds":60},"metrics":{"cost_usd_hr":0.44,"latency_p95":91,"error_rate":0.002,"uptime":0.998,"sla_violations":0,"connection_pressure":0.38},"reward":0.612,"reward_components":{"performance":0.873,"cost":0.951,"stability":0,"sla":0},"done":false,"sim_time_human":"1m 0s","info":{"sim_time_human":"1m 0s"},"resources":[]},{"t":2,"obs":{"rps":3760,"cpu_util":0.46,"instances":2,"traffic":3760,"currentTime":120,"tick_seconds":60},"metrics":{"cost_usd_hr":0.44,"latency_p95":88,"error_rate":0.002,"uptime":0.998,"sla_violations":0,"connection_pressure":0.36},"reward":0.628,"reward_components":{"performance":0.884,"cost":0.951,"stability":0,"sla":0},"done":false,"sim_time_human":"2m 0s","info":{"sim_time_human":"2m 0s"},"resources":[]}]}},"azureBatchWithDb":{"summary":"Azure — batch-step (Standard_D4s_v3 + Azure SQL, East US)","description":"Two-step batch of no-ops while the cluster runs at 2 × Standard_D4s_v3\nVMs behind Azure Load Balancer with Azure SQL. 
Each step's metrics\ninclude connection_pressure, which reflects the Azure SQL\nconnection-pool ratio; 0.51 is moderate but healthy.\n","value":{"results":[{"t":1,"obs":{"rps":4300,"cpu_util":0.55,"instances":2,"traffic":4300,"currentTime":60,"tick_seconds":60},"metrics":{"cost_usd_hr":0.52,"latency_p95":104,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.51},"reward":0.487,"reward_components":{"performance":0.796,"cost":0.938,"stability":0,"sla":0},"done":false,"sim_time_human":"1m 0s","info":{"sim_time_human":"1m 0s"},"resources":[]},{"t":2,"obs":{"rps":4360,"cpu_util":0.56,"instances":2,"traffic":4360,"currentTime":120,"tick_seconds":60},"metrics":{"cost_usd_hr":0.52,"latency_p95":106,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.49},"reward":0.482,"reward_components":{"performance":0.79,"cost":0.938,"stability":0,"sla":0},"done":false,"sim_time_human":"2m 0s","info":{"sim_time_human":"2m 0s"},"resources":[]}]}},"ociBatchWithDb":{"summary":"OCI — batch-step (VM.Standard3.Flex + Autonomous DB, us-ashburn-1)","description":"Two-step batch: scale in by 1 instance, then observe with a no-op\n(now 2 × VM.Standard3.Flex). 
Each step's metrics include\nconnection_pressure, which reflects the Autonomous Database\nconnection-pool ratio; 0.29 is well within healthy bounds.\n","value":{"results":[{"t":1,"obs":{"rps":4820,"cpu_util":0.34,"instances":2,"traffic":4820,"currentTime":60,"tick_seconds":60},"metrics":{"cost_usd_hr":0.31,"latency_p95":72,"error_rate":0.001,"uptime":0.999,"sla_violations":0,"connection_pressure":0.31},"reward":0.703,"reward_components":{"performance":0.921,"cost":0.985,"stability":-0.1,"sla":0},"done":false,"sim_time_human":"1m 0s","info":{"sim_time_human":"1m 0s"},"resources":[]},{"t":2,"obs":{"rps":4780,"cpu_util":0.33,"instances":2,"traffic":4780,"currentTime":120,"tick_seconds":60},"metrics":{"cost_usd_hr":0.31,"latency_p95":70,"error_rate":0.001,"uptime":0.999,"sla_violations":0,"connection_pressure":0.29},"reward":0.812,"reward_components":{"performance":0.934,"cost":0.985,"stability":0,"sla":0},"done":false,"sim_time_human":"2m 0s","info":{"sim_time_human":"2m 0s"},"resources":[]}]}},"doBatchWithDb":{"summary":"DO — batch-step (Droplet s-2vcpu-4gb + Managed PostgreSQL)","description":"Two-step batch: scale out by 1 Droplet, then observe with a no-op\n(now 3 × s-2vcpu-4gb). 
Each step's metrics include\nconnection_pressure, which reflects the Managed PostgreSQL\nconnection-pool ratio; 0.44 is healthy.\n","value":{"results":[{"t":1,"obs":{"rps":1620,"cpu_util":0.58,"instances":3,"traffic":1620,"currentTime":60,"tick_seconds":60},"metrics":{"cost_usd_hr":1.08,"latency_p95":104,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.46},"reward":0.481,"reward_components":{"performance":0.812,"cost":0.924,"stability":-0.1,"sla":0},"done":false,"sim_time_human":"1m 0s","info":{"sim_time_human":"1m 0s"},"resources":[]},{"t":2,"obs":{"rps":1580,"cpu_util":0.55,"instances":3,"traffic":1580,"currentTime":120,"tick_seconds":60},"metrics":{"cost_usd_hr":1.08,"latency_p95":99,"error_rate":0.002,"uptime":0.998,"sla_violations":0,"connection_pressure":0.44},"reward":0.503,"reward_components":{"performance":0.831,"cost":0.924,"stability":0,"sla":0},"done":false,"sim_time_human":"2m 0s","info":{"sim_time_human":"2m 0s"},"resources":[]}]}},"digitaloceanAMDNVMeBatchStep":{"summary":"DO AMD NVMe — batch-step (s-2vcpu-4gb-amd Droplets + Managed PostgreSQL, nyc3)","description":"Two-step batch on a DigitalOcean simulation backed by AMD NVMe Droplets\n(s-2vcpu-4gb-amd, $0.038/hr per instance). Step 1 scales out by 1 Droplet\n(now 3 × s-2vcpu-4gb-amd), bringing CPU down from 71 % to 56 %. The\nper-instance compute cost is $0.038/hr, so 3 instances total $0.114/hr for\ncompute alone; `metrics.cost_usd_hr` (0.93) is the full simulated stack\ncost including the DO Load Balancer, Managed PostgreSQL, and modelled\noverhead. Step 2 is a no-op that confirms the cluster has stabilised. The\nAMD EPYC variant's NVMe-backed local storage yields slightly lower P95\nlatency than the Intel s-2vcpu-4gb equivalent at the same traffic level —\n98 ms vs ~104 ms after scale-out. connection_pressure reflects the Managed\nPostgreSQL db-s-2vcpu-4gb connection-pool ratio; 0.41 is healthy. 
Resources\nare named \"Droplet s-2vcpu-4gb-amd\" to distinguish AMD NVMe Droplets from\nstandard Intel Droplets.\n","value":{"results":[{"t":1,"obs":{"rps":1620,"cpu_util":0.561,"instances":3,"traffic":1620,"currentTime":60,"tick_seconds":60},"metrics":{"cost_usd_hr":0.93,"latency_p95":98,"error_rate":0.002,"uptime":0.998,"sla_violations":0,"connection_pressure":0.41},"reward":0.514,"reward_components":{"performance":0.834,"cost":0.918,"stability":-0.1,"sla":0},"done":false,"sim_time_human":"1m 0s","info":{"sim_time_human":"1m 0s"},"resources":[{"id":"res-lb-amd-001","name":"DO Load Balancer","type":"network","provider":"digitalocean","instances":1},{"id":"res-droplet-amd-001","name":"Droplet s-2vcpu-4gb-amd","type":"compute","provider":"digitalocean","instances":3},{"id":"res-pg-amd-001","name":"Managed PostgreSQL","type":"database","provider":"digitalocean","instances":1}]},{"t":2,"obs":{"rps":1580,"cpu_util":0.541,"instances":3,"traffic":1580,"currentTime":120,"tick_seconds":60},"metrics":{"cost_usd_hr":0.93,"latency_p95":94,"error_rate":0.002,"uptime":0.998,"sla_violations":0,"connection_pressure":0.4},"reward":0.531,"reward_components":{"performance":0.848,"cost":0.918,"stability":0,"sla":0},"done":false,"sim_time_human":"2m 0s","info":{"sim_time_human":"2m 0s"},"resources":[{"id":"res-lb-amd-001","name":"DO Load Balancer","type":"network","provider":"digitalocean","instances":1},{"id":"res-droplet-amd-001","name":"Droplet s-2vcpu-4gb-amd","type":"compute","provider":"digitalocean","instances":3},{"id":"res-pg-amd-001","name":"Managed PostgreSQL","type":"database","provider":"digitalocean","instances":1}]}]}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"404":{"$ref":"#/components/responses/NotFound"},"429":{"description":"RL rate limit exceeded (5 000 req/hr)","headers":{"Retry-After":{"description":"Seconds until the rate-limit window 
resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/rl/environments/{environmentId}/observation":{"x-stability":"stable","get":{"tags":["RL Environments"],"summary":"Get the current observation without executing an action","description":"Returns the current state observation without advancing the simulation.\nUseful for initial state inspection or debugging.\n","operationId":"getRLObservation","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/rl/environments/env-aws-001/observation \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nENV_ID = \"env-aws-001\"\n\nresp = requests.get(\n    f\"{BASE_URL}/rl/environments/{ENV_ID}/observation\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nobs, metrics = data[\"obs\"], data[\"metrics\"]\nprint(f\"Step {obs['currentTime']}  CPU: {obs['cpu_util']:.1%}  \"\n      f\"P95: {metrics['latency_p95']} ms  cost: ${metrics['cost_usd_hr']:.2f}/hr\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst ENV_ID = \"env-aws-001\";\n\nconst resp = await fetch(`${BASE_URL}/rl/environments/${ENV_ID}/observation`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst { obs, metrics } = await resp.json();\nconsole.log(`Step ${obs.currentTime}  CPU: ${(obs.cpu_util * 100).toFixed(1)}%  ` +\n            `P95: ${metrics.latency_p95} ms  cost: $${metrics.cost_usd_hr.toFixed(2)}/hr`);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"environmentId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the RL environment"}],"responses":{"200":{"description":"Current 
observation","content":{"application/json":{"schema":{"type":"object","required":["obs","metrics","resources"],"properties":{"obs":{"$ref":"#/components/schemas/RLObs"},"metrics":{"$ref":"#/components/schemas/RLMetrics"},"resources":{"type":"array","description":"Full resource list with per-resource `recoveryPolicy`. Resources that have never had `set_recovery_policy` applied carry the global defaults (criticalCpuThreshold: 80, criticalSteps: 4, warningCpuThreshold: 70, warningSteps: 3). Use this to confirm a `set_recovery_policy` action took effect or to compare healing configurations across resources.\n","items":{"$ref":"#/components/schemas/Resource"}}}},"examples":{"awsObservation":{"summary":"AWS — observation at step 42 (EC2 m5.large, us-east-1)","description":"Mid-episode observation for an AWS simulation. The cluster is running\n3 × m5.large EC2 instances behind an ALB with RDS Multi-AZ. CPU is\nmoderate, P95 latency is within the 200 ms SLA, and cost is on-budget.\nconnection_pressure reflects the RDS connection-pool ratio.\n","value":{"obs":{"rps":4750,"cpu_util":0.534,"instances":3,"traffic":4750,"currentTime":42},"metrics":{"cost_usd_hr":0.57,"latency_p95":98,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.42}}},"gcpObservation":{"summary":"GCP — observation at step 30 (GCE e2-standard-4, us-central1)","description":"Mid-episode observation for a GCP simulation. The cluster is running\n2 × e2-standard-4 GCE instances behind Cloud Load Balancing with\nCloud SQL. 
CPU is stable and well within the SLA.\nconnection_pressure reflects the Cloud SQL connection-pool ratio.\n","value":{"obs":{"rps":3820,"cpu_util":0.478,"instances":2,"traffic":3820,"currentTime":30},"metrics":{"cost_usd_hr":0.44,"latency_p95":91,"error_rate":0.002,"uptime":0.998,"sla_violations":0,"connection_pressure":0.38}}},"azureObservation":{"summary":"Azure — observation at step 28 (Standard_D4s_v3, East US)","description":"Mid-episode observation for an Azure simulation. The cluster is running\n2 × Standard_D4s_v3 VMs behind Azure Load Balancer with Azure SQL.\nCPU is moderate; the agent has not yet triggered a scale-out.\nconnection_pressure reflects the Azure SQL connection-pool ratio.\n","value":{"obs":{"rps":4300,"cpu_util":0.551,"instances":2,"traffic":4300,"currentTime":28},"metrics":{"cost_usd_hr":0.52,"latency_p95":104,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.51}}},"ociObservation":{"summary":"OCI — observation at step 25 (VM.Standard3.Flex, us-ashburn-1)","description":"Mid-episode observation for an OCI simulation. The cluster is running\n2 × VM.Standard3.Flex instances behind OCI Load Balancer with\nAutonomous Database. CPU is low; the agent may consider scaling in.\nconnection_pressure reflects the Autonomous DB connection-pool ratio.\n","value":{"obs":{"rps":4820,"cpu_util":0.342,"instances":2,"traffic":4820,"currentTime":25},"metrics":{"cost_usd_hr":0.31,"latency_p95":72,"error_rate":0.001,"uptime":0.999,"sla_violations":0,"connection_pressure":0.29}}},"digitaloceanObservation":{"summary":"DigitalOcean — observation at step 15 (s-2vcpu-4gb Droplets, nyc3)","description":"Mid-episode observation for a DigitalOcean simulation. The cluster is\nrunning 2 × s-2vcpu-4gb Droplets behind a DO Load Balancer with\nManaged PostgreSQL. 
CPU is stable within the target range.\nconnection_pressure reflects the Managed PostgreSQL connection-pool ratio.\n","value":{"obs":{"rps":1480,"cpu_util":0.614,"instances":2,"traffic":1480,"currentTime":15},"metrics":{"cost_usd_hr":0.72,"latency_p95":118,"error_rate":0.004,"uptime":0.996,"sla_violations":0,"connection_pressure":0.63}}},"digitaloceanAMDNVMeObservation":{"summary":"DigitalOcean — observation at step 20 (s-2vcpu-4gb-amd AMD NVMe Droplets, nyc3)","description":"Mid-episode observation for a DigitalOcean simulation backed by AMD NVMe\nDroplets (s-2vcpu-4gb-amd) at $0.038/hr per instance. The cluster is\nrunning 2 × s-2vcpu-4gb-amd Droplets behind a DO Load Balancer with\nManaged PostgreSQL db-s-2vcpu-4gb. CPU is moderate at 58 %; the agent\nmay consider scaling out before the 65 % threshold is crossed. The AMD\nEPYC variant's NVMe-backed storage contributes to slightly lower P95\nlatency than the Intel equivalent under the same traffic load.\nconnection_pressure reflects the Managed PostgreSQL connection-pool\nratio; 0.48 is healthy with headroom remaining.\n","value":{"obs":{"rps":1580,"cpu_util":0.578,"instances":2,"traffic":1580,"currentTime":20},"metrics":{"cost_usd_hr":0.62,"latency_p95":108,"error_rate":0.003,"uptime":0.997,"sla_violations":0,"connection_pressure":0.48}}}}}}},"404":{"$ref":"#/components/responses/NotFound"}}}},"/rl/environments/{environmentId}":{"x-stability":"stable","get":{"tags":["RL Environments"],"summary":"Get RL environment details","description":"Retrieve the current state and configuration of an RL environment","operationId":"getRLEnvironment","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/rl/environments/env-aws-001 \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nENV_ID = \"env-aws-001\"\n\nresp = requests.get(\n    
f\"{BASE_URL}/rl/environments/{ENV_ID}\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\nenv = resp.json()\nprint(f\"Environment {env['id']}  isActive={env['isActive']}  \"\n      f\"step={env['currentStep']}/{env['maxSteps']}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst ENV_ID = \"env-aws-001\";\n\nconst resp = await fetch(`${BASE_URL}/rl/environments/${ENV_ID}`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst env = await resp.json();\nconsole.log(`Environment ${env.id}  isActive=${env.isActive}  step=${env.currentStep}/${env.maxSteps}`);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"environmentId","in":"path","required":true,"schema":{"type":"string","format":"uuid"}}],"responses":{"200":{"description":"Environment details","content":{"application/json":{"schema":{"$ref":"#/components/schemas/RLEnvironment"}}}},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"tags":["RL Environments"],"summary":"Cancel RL training episode","description":"Cancel a running RL training episode. 
This endpoint is idempotent - calling it multiple times\non the same episode will return success without error.\n\n**Cancellation Rules:**\n- Episodes with isActive=true will be cancelled\n- Episodes already cancelled (isActive=false) will return success (idempotent behavior)\n- Cancelled episodes will have isActive set to false and a cancelledAt timestamp\n","operationId":"cancelRLEnvironment","security":[{"BearerAuth":[]}],"parameters":[{"name":"environmentId","in":"path","required":true,"schema":{"type":"string","format":"uuid"}}],"responses":{"200":{"description":"Episode cancelled successfully or was already cancelled.\nReturns the same response whether cancelling for the first time or if already cancelled\n(idempotent operation).\n","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"isActive":{"type":"boolean","enum":[false]},"cancelledAt":{"type":"string","format":"date-time"},"message":{"type":"string","description":"Message indicating if episode was just cancelled or already cancelled"}}},"examples":{"newlyCancelled":{"value":{"id":"env_abc123","isActive":false,"cancelledAt":"2024-01-15T10:30:00Z","message":"RL training episode cancelled successfully"}},"alreadyCancelled":{"value":{"id":"env_abc123","isActive":false,"cancelledAt":"2024-01-15T09:45:00Z","message":"Episode already cancelled"}}}}}},"404":{"$ref":"#/components/responses/NotFound"}}}},"/analysis/optimize":{"x-stability":"stable","post":{"tags":["Infrastructure Optimization"],"summary":"Submit infrastructure optimization job","description":"Analyzes your current architecture and generates 50+ tested variations\nwith ranked recommendations for cost/performance optimization.\n\n**Ownership requirement:** The `simulationId` must refer to a simulation\nthat was created with, or claimed by, the same API key used in this\nrequest. 
Simulations created via the public browser workspace (i.e.\n`POST /api/simulations` without an `Authorization` header) are unowned\nand will return `403` here until claimed. To associate ownership, either:\n- Create the simulation with a Bearer token from the start:\n  `POST /api/simulations` with `Authorization: Bearer <key>`, or\n- Claim an existing unowned simulation before calling this endpoint:\n  `POST /api/simulations/{simulationId}/claim`.\n","operationId":"submitOptimization","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/analysis/optimize \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"simulationId\": \"sim-abc123\",\n    \"goals\": {\n      \"primary\": \"minimize_cost\",\n      \"constraints\": {\n        \"max_cost_per_hour\": 10.0,\n        \"min_throughput\": 5000,\n        \"max_latency_p95\": 200\n      }\n    },\n    \"testScenario\": {\n      \"traffic_pattern\": \"spike\",\n      \"duration_steps\": 100,\n      \"include_failures\": true\n    }\n  }'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/analysis/optimize\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\n        \"simulationId\": \"sim-abc123\",\n        \"goals\": {\n            \"primary\": \"minimize_cost\",\n            \"constraints\": {\n                \"max_cost_per_hour\": 10.0,\n                \"min_throughput\": 5000,\n                \"max_latency_p95\": 200,\n            },\n        },\n        \"testScenario\": {\n            \"traffic_pattern\": \"spike\",\n            \"duration_steps\": 100,\n            \"include_failures\": True,\n        },\n    },\n)\nresp.raise_for_status()\njob = resp.json()[\"job\"]\nprint(\"Job ID:\", job[\"id\"], \"Status:\", 
job[\"status\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/analysis/optimize`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({\n    simulationId: \"sim-abc123\",\n    goals: {\n      primary: \"minimize_cost\",\n      constraints: {\n        max_cost_per_hour: 10.0,\n        min_throughput: 5000,\n        max_latency_p95: 200,\n      },\n    },\n    testScenario: {\n      traffic_pattern: \"spike\",\n      duration_steps: 100,\n      include_failures: true,\n    },\n  }),\n});\nconst { job } = await resp.json();\nconsole.log(\"Job ID:\", job.id, \"Status:\", job.status);\n"}],"security":[{"BearerAuth":[]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["simulationId","goals"],"properties":{"simulationId":{"type":"string","description":"ID of simulation to optimize"},"goals":{"$ref":"#/components/schemas/OptimizationGoals"},"testScenario":{"type":"object","properties":{"traffic_pattern":{"type":"string"},"duration_steps":{"type":"integer","default":100},"include_failures":{"type":"boolean"}}},"webhookUrl":{"type":"string","format":"uri","description":"Optional HTTPS URL to receive webhook notification when job completes","example":"https://your-app.com/webhooks/optimization"},"webhookSecret":{"type":"string","description":"Optional secret for HMAC-SHA256 webhook signature verification","example":"your-secret-key-here"}}}}}},"responses":{"202":{"description":"Optimization job 
accepted","content":{"application/json":{"schema":{"type":"object","properties":{"job":{"type":"object","properties":{"id":{"type":"string"},"status":{"type":"string"},"createdAt":{"type":"string"}}},"message":{"type":"string"}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"403":{"description":"Simulation not owned by this API key","content":{"application/json":{"schema":{"type":"object","required":["error","reason","remedy"],"properties":{"error":{"type":"string","example":"Simulation not owned by this API key"},"reason":{"type":"string","example":"The simulationId refers to a simulation that was not created with, or claimed by, the API key used in this request. Simulations created via the public browser workspace (without an Authorization header) have no owner and cannot be used here until claimed."},"remedy":{"type":"string","example":"Either create an owned simulation via POST /api/simulations with a write-scoped Bearer token, or claim an existing unowned simulation via POST /api/simulations/{simulationId}/claim before calling this endpoint."}}}}}},"404":{"$ref":"#/components/responses/NotFound"}}}},"/analysis/jobs/{id}":{"x-stability":"stable","get":{"tags":["Infrastructure Optimization"],"summary":"Get optimization job status","description":"Check the progress and status of an optimization job","operationId":"getOptimizationJob","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/analysis/jobs/job_abc123 \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"job_abc123\"\n\nresp = requests.get(\n    f\"{BASE_URL}/analysis/jobs/{JOB_ID}\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\njob = resp.json()\nprint(\"Status:\", 
job[\"status\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"job_abc123\";\n\nconst resp = await fetch(`${BASE_URL}/analysis/jobs/${JOB_ID}`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst job = await resp.json();\nconsole.log(\"Status:\", job.status);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"id","in":"path","required":true,"schema":{"type":"string"}}],"responses":{"200":{"description":"Job status","content":{"application/json":{"schema":{"$ref":"#/components/schemas/OptimizationJob"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"summary":"Cancel infrastructure optimization job","description":"Cancel a running infrastructure optimization job. This endpoint is idempotent - calling it multiple times\non the same job will return success without error.\n\n**Cancellation Rules:**\n- Jobs with status \"pending\" or \"running\" will be cancelled\n- Jobs already \"cancelled\" will return success (idempotent behavior)\n- Jobs with status \"completed\" or \"failed\" cannot be cancelled (returns 409)\n- Cancelled jobs will have status set to \"cancelled\" and a cancelledAt timestamp\n","operationId":"cancelOptimizationJob","tags":["Infrastructure Optimization"],"security":[{"BearerAuth":[]}],"parameters":[{"name":"id","in":"path","required":true,"schema":{"type":"string"},"description":"Job ID"}],"responses":{"200":{"description":"Job cancelled successfully or was already cancelled.\nReturns the same response whether cancelling for the first time or if already cancelled\n(idempotent operation).\n","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"status":{"type":"string","enum":["cancelled"]},"cancelledAt":{"type":"string","format":"date-time"},"message":{"type":"string","description":"Message 
indicating if job was just cancelled or already cancelled"}}},"examples":{"newlyCancelled":{"value":{"id":"job_abc123","status":"cancelled","cancelledAt":"2024-01-15T10:30:00Z","message":"Job cancelled successfully"}},"alreadyCancelled":{"value":{"id":"job_abc123","status":"cancelled","cancelledAt":"2024-01-15T09:45:00Z","message":"Job already cancelled"}}}}}},"404":{"description":"Job not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"409":{"description":"Cannot cancel job that is already completed or failed","content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string"},"status":{"type":"string"}}},"example":{"error":"Cannot cancel job that is already completed or failed","status":"completed"}}}},"500":{"description":"Failed to cancel job","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/analysis/jobs/{id}/recommendations":{"x-stability":"stable","get":{"tags":["Infrastructure Optimization"],"summary":"Get optimization recommendations","description":"Retrieve ranked recommendations from a completed optimization job","operationId":"getRecommendations","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/analysis/jobs/job_abc123/recommendations \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"job_abc123\"\n\nresp = requests.get(\n    f\"{BASE_URL}/analysis/jobs/{JOB_ID}/recommendations\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nfor rec in data[\"recommendations\"]:\n    print(rec[\"title\"], \"— savings:\", rec.get(\"estimatedSavings\"))\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst 
JOB_ID = \"job_abc123\";\n\nconst resp = await fetch(`${BASE_URL}/analysis/jobs/${JOB_ID}/recommendations`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst { recommendations } = await resp.json();\nrecommendations.forEach(rec =>\n  console.log(rec.title, \"— savings:\", rec.estimatedSavings)\n);\n"}],"security":[{"BearerAuth":[]}],"parameters":[{"name":"id","in":"path","required":true,"schema":{"type":"string"}}],"responses":{"200":{"description":"Ranked recommendations","content":{"application/json":{"schema":{"type":"object","properties":{"recommendations":{"type":"array","items":{"$ref":"#/components/schemas/OptimizationRecommendation"}},"totalVariants":{"type":"integer"},"goals":{"$ref":"#/components/schemas/OptimizationGoals"}}}}}},"202":{"description":"Job still in progress — poll the job status endpoint until status is 'completed'","content":{"application/json":{"schema":{"type":"object","properties":{"status":{"type":"string","enum":["pending","running"],"description":"Current job status"},"recommendations":{"type":"null","description":"Always null while the job is in progress"}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/predictions/validate":{"x-stability":"stable","post":{"tags":["Predictive Scaling"],"summary":"Validate infrastructure against traffic forecast","description":"Tests whether current infrastructure can handle a predicted traffic pattern.\nReturns validation results with bottlenecks and recommendations.\n\nWhen a `webhookUrl` is provided, the following payload is POSTed to that URL\nwhen the job completes (example shown for the AWS EC2 validation request above):\n\n```json\n{\n  \"event\": \"prediction.completed\",\n  \"jobId\": \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\",\n  \"jobType\": \"prediction_validation\",\n  \"status\": \"completed\",\n  \"completedAt\": \"2025-11-23T10:35:00Z\",\n  \"data\": {\n    \"validationResult\": {\n      \"passed\": 
false,\n      \"summary\": \"Infrastructure will fail under peak load due to CPU saturation\",\n      \"peakMetrics\": {\n        \"timestamp\": 60,\n        \"traffic\": 12000,\n        \"cpuUsage\": 98,\n        \"latencyP95\": 820,\n        \"errorRate\": 8.4,\n        \"costPerHour\": 3.20\n      },\n      \"bottlenecksDetected\": [\n        \"CPU saturation at 98%\",\n        \"Error rate exceeds 5%\"\n      ],\n      \"failurePoints\": [\n        { \"timestamp\": 55, \"traffic\": 10500, \"reason\": \"CPU saturation\" }\n      ],\n      \"recommendations\": [\n        \"Scale out to 5 instances before peak\",\n        \"Increase CPU threshold to 75%\"\n      ]\n    }\n  }\n}\n```\n\nThe request is signed with HMAC-SHA256; verify the `X-Webhook-Signature` header\nagainst your `webhookSecret` before processing the payload.\n","operationId":"validateTrafficForecast","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/predictions/validate \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"simulationId\": \"sim-aws-ec2-prod\",\n    \"trafficForecast\": {\n      \"name\": \"Black Friday 2025\",\n      \"dataPoints\": [\n        {\"timestamp\": 0, \"rps\": 2000, \"label\": \"Baseline\"},\n        {\"timestamp\": 60, \"rps\": 12000, \"label\": \"Peak\"},\n        {\"timestamp\": 100, \"rps\": 2500, \"label\": \"Return to baseline\"}\n      ]\n    },\n    \"testSteps\": 100\n  }'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/predictions/validate\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\n        \"simulationId\": \"sim-aws-ec2-prod\",\n        \"trafficForecast\": {\n            \"name\": \"Black Friday 2025\",\n            \"dataPoints\": [\n                {\"timestamp\": 0, 
\"rps\": 2000, \"label\": \"Baseline\"},\n                {\"timestamp\": 60, \"rps\": 12000, \"label\": \"Peak\"},\n                {\"timestamp\": 100, \"rps\": 2500, \"label\": \"Return to baseline\"},\n            ],\n        },\n        \"testSteps\": 100,\n    },\n)\nresp.raise_for_status()\njob = resp.json()[\"job\"]\nprint(\"Job ID:\", job[\"id\"], \"Status:\", job[\"status\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/predictions/validate`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({\n    simulationId: \"sim-aws-ec2-prod\",\n    trafficForecast: {\n      name: \"Black Friday 2025\",\n      dataPoints: [\n        { timestamp: 0, rps: 2000, label: \"Baseline\" },\n        { timestamp: 60, rps: 12000, label: \"Peak\" },\n        { timestamp: 100, rps: 2500, label: \"Return to baseline\" },\n      ],\n    },\n    testSteps: 100,\n  }),\n});\nconst { job } = await resp.json();\nconsole.log(\"Job ID:\", job.id, \"Status:\", job.status);\n"}],"security":[{"BearerAuth":["write"]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["simulationId","trafficForecast"],"properties":{"simulationId":{"type":"string","description":"ID of simulation to test","example":"sim-abc123"},"trafficForecast":{"$ref":"#/components/schemas/TrafficForecast"},"testSteps":{"type":"integer","default":100,"description":"Number of simulation steps to run","example":100},"webhookUrl":{"type":"string","format":"uri","description":"Optional HTTPS URL to receive webhook notification when job completes","example":"https://your-app.com/webhooks/validation"},"webhookSecret":{"type":"string","description":"Optional secret for HMAC-SHA256 webhook signature 
verification","example":"your-secret-key-here"}}},"examples":{"awsValidation":{"summary":"AWS — validate EC2 m5.xlarge cluster against a Black Friday traffic spike","value":{"simulationId":"sim-aws-ec2-prod","trafficForecast":{"name":"Black Friday 2025 — AWS Production","description":"Predicted 6x traffic spike starting at step 30, peaking at step 60","dataPoints":[{"timestamp":0,"rps":2000,"label":"Baseline"},{"timestamp":30,"rps":6000,"label":"Pre-peak ramp"},{"timestamp":60,"rps":12000,"label":"Peak — Black Friday midnight"},{"timestamp":90,"rps":8000,"label":"Post-peak decline"},{"timestamp":100,"rps":2500,"label":"Return to baseline"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"aws-prediction-secret"}},"gcpValidation":{"summary":"GCP — validate Cloud Run service against a seasonal holiday burst","value":{"simulationId":"sim-gcp-cloudrun-prod","trafficForecast":{"name":"GCP Seasonal Spike — Holiday 2025","description":"Gradual ramp over 70 steps peaking at 4x baseline","dataPoints":[{"timestamp":0,"rps":3000,"label":"Baseline"},{"timestamp":40,"rps":8000,"label":"Holiday ramp"},{"timestamp":70,"rps":12000,"label":"Peak — Holiday noon"},{"timestamp":90,"rps":6000,"label":"Post-holiday wind-down"},{"timestamp":100,"rps":3200,"label":"Return to baseline"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"gcp-prediction-secret"}},"azureValidation":{"summary":"Azure — validate AKS cluster against a sudden product launch spike","value":{"simulationId":"sim-azure-aks-prod","trafficForecast":{"name":"Azure Product Launch Traffic","description":"Sudden 10x spike from launch announcement, sustained for 50 steps","dataPoints":[{"timestamp":0,"rps":1000,"label":"Pre-launch baseline"},{"timestamp":20,"rps":10000,"label":"Launch announcement — spike"},{"timestamp":50,"rps":8000,"label":"Sustained high traffic"},{"timestamp":80,"rps":3000,"label":"Gradual 
decline"},{"timestamp":100,"rps":1500,"label":"New elevated baseline"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"azure-prediction-secret"}},"ociValidation":{"summary":"OCI — validate Compute + Autonomous Database against a month-end batch processing surge","value":{"simulationId":"sim-oci-compute-prod","trafficForecast":{"name":"OCI Month-End Batch Surge","description":"Recurring month-end reporting job — 3x query load for steps 25 through 80","dataPoints":[{"timestamp":0,"rps":800,"label":"Normal operations"},{"timestamp":25,"rps":2400,"label":"Month-end batch start"},{"timestamp":60,"rps":2600,"label":"Peak batch load"},{"timestamp":80,"rps":1200,"label":"Batch wind-down"},{"timestamp":100,"rps":850,"label":"Return to normal"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"oci-prediction-secret"}},"digitalOceanValidation":{"summary":"DigitalOcean — validate Droplet cluster against an unpredictable viral content spike","value":{"simulationId":"sim-do-droplets-prod","trafficForecast":{"name":"DigitalOcean Viral Traffic Event","description":"Sudden 5x baseline spike within 15 steps from a viral post","dataPoints":[{"timestamp":0,"rps":500,"label":"Normal baseline"},{"timestamp":15,"rps":2500,"label":"Viral spike onset"},{"timestamp":40,"rps":3000,"label":"Peak viral traffic"},{"timestamp":70,"rps":1500,"label":"Declining viral effect"},{"timestamp":100,"rps":700,"label":"New elevated baseline"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"do-prediction-secret"}},"awsSpotValidation":{"summary":"AWS EC2 Spot — validate fault-tolerant batch cluster against a cost-optimized ramp forecast","value":{"simulationId":"sim-aws-spot-batch-prod","trafficForecast":{"name":"AWS Spot Batch Ramp — Cost-Optimized Workload","description":"Gradual ramp to 3x baseline reflecting a nightly batch pipeline; relaxed latency 
acceptable given Spot pricing savings","dataPoints":[{"timestamp":0,"rps":500,"label":"Idle baseline"},{"timestamp":20,"rps":1000,"label":"Batch job start"},{"timestamp":50,"rps":1500,"label":"Peak batch throughput"},{"timestamp":75,"rps":1000,"label":"Batch wind-down"},{"timestamp":100,"rps":500,"label":"Return to idle"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"aws-spot-prediction-secret"}},"gcpSpotValidation":{"summary":"GCP Spot VM — validate fault-tolerant batch cluster against a cost-optimized ramp forecast","value":{"simulationId":"sim-gcp-spot-batch-prod","trafficForecast":{"name":"GCP Spot Batch Ramp — Cost-Optimized Workload","description":"Gradual ramp to 3x baseline reflecting a nightly batch pipeline on GCP Spot VMs; relaxed latency acceptable given ~70% preemptible pricing savings","dataPoints":[{"timestamp":0,"rps":400,"label":"Idle baseline"},{"timestamp":20,"rps":800,"label":"Batch job start"},{"timestamp":50,"rps":1200,"label":"Peak batch throughput"},{"timestamp":75,"rps":800,"label":"Batch wind-down"},{"timestamp":100,"rps":400,"label":"Return to idle"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"gcp-spot-prediction-secret"}},"azureSpotValidation":{"summary":"Azure Spot VM — validate fault-tolerant batch cluster against a cost-optimized ramp forecast","value":{"simulationId":"sim-azure-spot-batch-prod","trafficForecast":{"name":"Azure Spot Batch Ramp — Cost-Optimized Workload","description":"Gradual ramp to 3x baseline reflecting a nightly batch pipeline on Azure Spot VMs; relaxed latency acceptable given ~70% pay-as-you-go pricing savings","dataPoints":[{"timestamp":0,"rps":300,"label":"Idle baseline"},{"timestamp":20,"rps":600,"label":"Batch job start"},{"timestamp":50,"rps":900,"label":"Peak batch throughput"},{"timestamp":75,"rps":600,"label":"Batch wind-down"},{"timestamp":100,"rps":300,"label":"Return to 
idle"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"azure-spot-prediction-secret"}}}}}},"responses":{"202":{"description":"Validation job accepted","content":{"application/json":{"schema":{"type":"object","properties":{"job":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"status":{"type":"string","enum":["pending","running"]},"type":{"type":"string","example":"validation"},"createdAt":{"type":"string","format":"date-time"}}},"message":{"type":"string","example":"Validation job started. Poll /predictions/jobs/{id} for status."}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/predictions/optimize-thresholds":{"x-stability":"stable","post":{"tags":["Predictive Scaling"],"summary":"Find optimal autoscaling thresholds for traffic forecast","description":"Tests multiple threshold combinations to find the best autoscaling configuration\nfor the predicted traffic pattern.\n\nWhen a `webhookUrl` is provided, the following payload is POSTed to that URL\nwhen the job completes (example shown for the AWS EC2 Auto Scaling request above):\n\n```json\n{\n  \"event\": \"prediction.completed\",\n  \"jobId\": \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\",\n  \"jobType\": \"prediction_optimization\",\n  \"status\": \"completed\",\n  \"completedAt\": \"2025-11-23T10:35:00Z\",\n  \"data\": {\n    \"bestThresholds\": {\n      \"scaleOutCpuThreshold\": 70,\n      \"scaleInCpuThreshold\": 30,\n      \"scaleOutThroughputThreshold\": 75,\n      \"scaleInThroughputThreshold\": 35,\n      \"scaleOutLatencyThreshold\": 120,\n      \"cooldownSeconds\": 180,\n      \"minInstances\": 3,\n      \"maxInstances\": 15\n    }\n  }\n}\n```\n\nThe request is signed with HMAC-SHA256; verify the `X-Webhook-Signature` header\nagainst your `webhookSecret` before processing the 
payload.\n","operationId":"optimizeThresholds","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/predictions/optimize-thresholds \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"simulationId\": \"sim-aws-ec2-prod\",\n    \"trafficForecast\": {\n      \"name\": \"Black Friday 2025\",\n      \"dataPoints\": [\n        {\"timestamp\": 0, \"rps\": 2000, \"label\": \"Baseline\"},\n        {\"timestamp\": 60, \"rps\": 12000, \"label\": \"Peak\"},\n        {\"timestamp\": 100, \"rps\": 2500, \"label\": \"Return to baseline\"}\n      ]\n    },\n    \"testSteps\": 100\n  }'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/predictions/optimize-thresholds\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\n        \"simulationId\": \"sim-aws-ec2-prod\",\n        \"trafficForecast\": {\n            \"name\": \"Black Friday 2025\",\n            \"dataPoints\": [\n                {\"timestamp\": 0, \"rps\": 2000, \"label\": \"Baseline\"},\n                {\"timestamp\": 60, \"rps\": 12000, \"label\": \"Peak\"},\n                {\"timestamp\": 100, \"rps\": 2500, \"label\": \"Return to baseline\"},\n            ],\n        },\n        \"testSteps\": 100,\n    },\n)\nresp.raise_for_status()\njob = resp.json()[\"job\"]\nprint(\"Job ID:\", job[\"id\"], \"Status:\", job[\"status\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/predictions/optimize-thresholds`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({\n    simulationId: \"sim-aws-ec2-prod\",\n    
trafficForecast: {\n      name: \"Black Friday 2025\",\n      dataPoints: [\n        { timestamp: 0, rps: 2000, label: \"Baseline\" },\n        { timestamp: 60, rps: 12000, label: \"Peak\" },\n        { timestamp: 100, rps: 2500, label: \"Return to baseline\" },\n      ],\n    },\n    testSteps: 100,\n  }),\n});\nconst { job } = await resp.json();\nconsole.log(\"Job ID:\", job.id, \"Status:\", job.status);\n"}],"security":[{"BearerAuth":["write"]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["simulationId","trafficForecast"],"properties":{"simulationId":{"type":"string","description":"ID of simulation to optimize","example":"sim-abc123"},"trafficForecast":{"$ref":"#/components/schemas/TrafficForecast"},"testSteps":{"type":"integer","default":100,"description":"Number of simulation steps to run per test","example":100},"webhookUrl":{"type":"string","format":"uri","description":"Optional HTTPS URL to receive webhook notification when job completes","example":"https://your-app.com/webhooks/threshold-optimization"},"webhookSecret":{"type":"string","description":"Optional secret for HMAC-SHA256 webhook signature verification","example":"your-secret-key-here"}}},"examples":{"awsOptimization":{"summary":"AWS — find optimal EC2 Auto Scaling thresholds for Black Friday traffic","value":{"simulationId":"sim-aws-ec2-prod","trafficForecast":{"name":"Black Friday 2025 — AWS Production","description":"Predicted 6x traffic spike starting at step 30, peaking at step 60","dataPoints":[{"timestamp":0,"rps":2000,"label":"Baseline"},{"timestamp":30,"rps":6000,"label":"Pre-peak ramp"},{"timestamp":60,"rps":12000,"label":"Peak — Black Friday midnight"},{"timestamp":90,"rps":8000,"label":"Post-peak decline"},{"timestamp":100,"rps":2500,"label":"Return to baseline"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"aws-optimization-secret"}},"gcpOptimization":{"summary":"GCP — find 
optimal Cloud Run thresholds for a seasonal holiday burst","value":{"simulationId":"sim-gcp-cloudrun-prod","trafficForecast":{"name":"GCP Seasonal Spike — Holiday 2025","description":"Gradual ramp over 70 steps peaking at 4x baseline","dataPoints":[{"timestamp":0,"rps":3000,"label":"Baseline"},{"timestamp":40,"rps":8000,"label":"Holiday ramp"},{"timestamp":70,"rps":12000,"label":"Peak — Holiday noon"},{"timestamp":90,"rps":6000,"label":"Post-holiday wind-down"},{"timestamp":100,"rps":3200,"label":"Return to baseline"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"gcp-optimization-secret"}},"azureOptimization":{"summary":"Azure — find optimal AKS HPA thresholds for a product launch spike","value":{"simulationId":"sim-azure-aks-prod","trafficForecast":{"name":"Azure Product Launch Traffic","description":"Sudden 10x spike from launch announcement, sustained for 50 steps","dataPoints":[{"timestamp":0,"rps":1000,"label":"Pre-launch baseline"},{"timestamp":20,"rps":10000,"label":"Launch announcement — spike"},{"timestamp":50,"rps":8000,"label":"Sustained high traffic"},{"timestamp":80,"rps":3000,"label":"Gradual decline"},{"timestamp":100,"rps":1500,"label":"New elevated baseline"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"azure-optimization-secret"}},"ociOptimization":{"summary":"OCI — find optimal autoscaling thresholds for a month-end batch processing surge","value":{"simulationId":"sim-oci-compute-prod","trafficForecast":{"name":"OCI Month-End Batch Surge","description":"Recurring month-end reporting job — 3x query load for steps 25 through 80","dataPoints":[{"timestamp":0,"rps":800,"label":"Normal operations"},{"timestamp":25,"rps":2400,"label":"Month-end batch start"},{"timestamp":60,"rps":2600,"label":"Peak batch load"},{"timestamp":80,"rps":1200,"label":"Batch wind-down"},{"timestamp":100,"rps":850,"label":"Return to 
normal"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"oci-optimization-secret"}},"digitalOceanOptimization":{"summary":"DigitalOcean — find optimal Droplet autoscaling thresholds for a viral content spike","value":{"simulationId":"sim-do-droplets-prod","trafficForecast":{"name":"DigitalOcean Viral Traffic Event","description":"Sudden 5x baseline spike within 15 steps from a viral post","dataPoints":[{"timestamp":0,"rps":500,"label":"Normal baseline"},{"timestamp":15,"rps":2500,"label":"Viral spike onset"},{"timestamp":40,"rps":3000,"label":"Peak viral traffic"},{"timestamp":70,"rps":1500,"label":"Declining viral effect"},{"timestamp":100,"rps":700,"label":"New elevated baseline"}]},"testSteps":100,"webhookUrl":"https://your-app.example.com/webhooks/predictions","webhookSecret":"do-optimization-secret"}}}}}},"responses":{"202":{"description":"Optimization job accepted","content":{"application/json":{"schema":{"type":"object","properties":{"job":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"status":{"type":"string","enum":["pending","running"]},"type":{"type":"string","example":"threshold_optimization"},"createdAt":{"type":"string","format":"date-time"}}},"message":{"type":"string","example":"Threshold optimization job started. Poll /predictions/jobs/{id} for status."}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/predictions/jobs/{jobId}":{"x-stability":"stable","get":{"tags":["Predictive Scaling"],"summary":"Get prediction job status","description":"Check the status and progress of a prediction job.\n\nUse this endpoint when your agent cannot hold a long-lived SSE connection\n(e.g. serverless functions, short-lived scripts, environments that block\nstreaming). 
Poll until the job reaches a terminal state, then fetch results\nfrom `GET /api/predictions/jobs/{jobId}/results`.\n\n**Terminal states:** `completed`, `failed`, `cancelled`\n\n**Client Code Sample (Python — polling with exponential back-off):**\n\nInstall with `pip install requests`.\n\n```python\nimport os\nimport time\nimport requests\n\n\ndef poll_prediction_job(job_id: str, api_token: str) -> dict:\n    \"\"\"Poll GET /predictions/jobs/{jobId} until a terminal state is reached.\"\"\"\n    url = f\"https://your-host/api/predictions/jobs/{job_id}\"\n    headers = {\"Authorization\": f\"Bearer {api_token}\"}\n\n    delay = 2       # initial poll interval in seconds\n    max_delay = 30  # cap backoff at 30 seconds\n\n    while True:\n        response = requests.get(url, headers=headers, timeout=30)\n        response.raise_for_status()\n        job = response.json()\n\n        status = job.get(\"status\", \"unknown\")\n        job_type = job.get(\"type\", \"\")\n        print(f\"[{job_type}] status={status}\")\n\n        if status == \"completed\":\n            print(\"Job completed successfully.\")\n            return job\n\n        elif status == \"failed\":\n            error = job.get(\"error\", \"unknown error\")\n            raise RuntimeError(f\"Prediction job failed: {error}\")\n\n        elif status == \"cancelled\":\n            print(\"Job was cancelled.\")\n            return job\n\n        time.sleep(delay)\n        delay = min(delay * 2, max_delay)\n\n\nif __name__ == \"__main__\":\n    result = poll_prediction_job(\n        job_id=\"pred_aws_val001\",\n        api_token=os.environ[\"API_KEY\"],\n    )\n    print(\"Final job record:\", result)\n```\n","operationId":"getPredictionJob","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/predictions/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890 \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import 
requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\"\n\nresp = requests.get(\n    f\"{BASE_URL}/predictions/jobs/{JOB_ID}\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\njob = resp.json()\nprint(\"Type:\", job[\"type\"], \"Status:\", job[\"status\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\";\n\nconst resp = await fetch(`${BASE_URL}/predictions/jobs/${JOB_ID}`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst job = await resp.json();\nconsole.log(\"Type:\", job.type, \"Status:\", job.status);\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the prediction job"}],"responses":{"200":{"description":"Job status","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"type":{"type":"string","enum":["validation","threshold_optimization"]},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled"]},"createdAt":{"type":"string","format":"date-time"},"completedAt":{"type":"string","format":"date-time"},"error":{"type":"string"}}},"examples":{"awsValidationFailed":{"summary":"AWS — validation job failed; simulationId not found","value":{"id":"pred_aws_val_fail001","type":"validation","status":"failed","createdAt":"2024-01-17T08:14:32Z","completedAt":"2024-01-17T08:14:35Z","error":"Simulation 'sim_aws_7c4e1b2d-9f83-4a11-bc45-aef987654321' not found. The simulation may have been deleted before the validation job was processed. 
Re-create the simulation and resubmit the validation job."}},"gcpValidationFailed":{"summary":"GCP — validation job failed; traffic forecast malformed","value":{"id":"pred_gcp_val_fail002","type":"validation","status":"failed","createdAt":"2024-01-18T11:22:10Z","completedAt":"2024-01-18T11:22:13Z","error":"Traffic forecast 'gcp-holiday-burst-v3' is malformed: timestamps are not strictly increasing (step 45 appears before step 38). Validation requires a monotonically increasing timestamp sequence. Correct the forecast data and resubmit."}},"azureValidationFailed":{"summary":"Azure — validation job failed; validation engine internal error","value":{"id":"pred_az_val_fail003","type":"validation","status":"failed","createdAt":"2024-01-19T14:05:47Z","completedAt":"2024-01-19T14:08:31Z","error":"Validation engine encountered an internal error while replaying simulation 'sim_azure_aks_prod_westeurope' against the 'product-launch-forecast' traffic pattern: capacity model returned a negative throughput value at step 22 (traffic=10000 RPS, instances=3). This indicates an inconsistent resource configuration. Verify that all AKS node sizes and replica counts are set to positive non-zero values, then resubmit."}},"ociValidationFailed":{"summary":"OCI — validation job failed; simulation contains no resources","value":{"id":"pred_oci_val_fail004","type":"validation","status":"failed","createdAt":"2024-01-20T09:30:15Z","completedAt":"2024-01-20T09:30:16Z","error":"Simulation 'sim_oci_3b9d72f1' contains no resources and cannot be validated. Add at least one compute resource (e.g. VM.Standard.E4.Flex instance pool) and one database resource (e.g. 
Autonomous Database ATP) before submitting a validation job."}},"digitalOceanValidationFailed":{"summary":"DigitalOcean — validation job failed; traffic forecast has insufficient data points","value":{"id":"pred_do_val_fail005","type":"validation","status":"failed","createdAt":"2024-01-21T16:44:02Z","completedAt":"2024-01-21T16:44:05Z","error":"Traffic forecast 'do-viral-traffic-short' contains only 2 data points spanning 15 simulation steps. Validation requires a forecast with at least 5 data points covering a minimum of 60 steps to accurately replay a ramp-and-drain traffic cycle. Extend the forecast to cover the full event window and resubmit."}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"summary":"Cancel prediction job","description":"Cancel a running prediction job. This endpoint is idempotent - calling it multiple times\non the same job will return success without error.\n\n**Cancellation Rules:**\n- Jobs with status \"pending\" or \"running\" will be cancelled\n- Jobs already \"cancelled\" will return success (idempotent behavior)\n- Jobs with status \"completed\" or \"failed\" cannot be cancelled (returns 409)\n- Cancelled jobs will have status set to \"cancelled\" and a cancelledAt timestamp\n","operationId":"cancelPredictionJob","tags":["Predictive Scaling"],"security":[{"BearerAuth":[]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string"},"description":"Job ID"}],"responses":{"200":{"description":"Job cancelled successfully or was already cancelled.\nReturns the same response whether cancelling for the first time or if already cancelled\n(idempotent operation).\n","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"status":{"type":"string","enum":["cancelled"]},"cancelledAt":{"type":"string","format":"date-time"},"message":{"type":"string","description":"Message indicating if job was just cancelled or 
already cancelled"}}},"examples":{"newlyCancelled":{"value":{"id":"job_abc123","status":"cancelled","cancelledAt":"2024-01-15T10:30:00Z","message":"Job cancelled successfully"}},"alreadyCancelled":{"value":{"id":"job_abc123","status":"cancelled","cancelledAt":"2024-01-15T09:45:00Z","message":"Job already cancelled"}}}}}},"404":{"description":"Job not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"409":{"description":"Cannot cancel a job that has already completed or failed","content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string"},"status":{"type":"string"}}},"example":{"error":"Cannot cancel job that is already completed or failed","status":"completed"}}}},"500":{"description":"Failed to cancel job","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/predictions/jobs/{jobId}/results":{"x-stability":"stable","get":{"tags":["Predictive Scaling"],"summary":"Get prediction job results","description":"Retrieve the results of a prediction job that has reached a terminal state. Completed jobs return validation or threshold-optimization results; failed jobs return `error` and, where available, `suggestions`.","operationId":"getPredictionResults","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/predictions/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890/results \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\"\n\nresp = requests.get(\n    f\"{BASE_URL}/predictions/jobs/{JOB_ID}/results\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    timeout=30,\n)\nresp.raise_for_status()\ndata = resp.json()\nif \"validationResult\" in data:\n    print(\"Passed:\", data[\"validationResult\"][\"passed\"])\nif \"bestThresholds\" in data:\n    print(\"Best CPU threshold:\", data[\"bestThresholds\"][\"scaleOutCpuThreshold\"])\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = 
\"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\";\n\nconst resp = await fetch(`${BASE_URL}/predictions/jobs/${JOB_ID}/results`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst data = await resp.json();\nif (data.validationResult) {\n  console.log(\"Passed:\", data.validationResult.passed);\n}\nif (data.bestThresholds) {\n  console.log(\"Best CPU threshold:\", data.bestThresholds.scaleOutCpuThreshold);\n}\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the prediction job"}],"responses":{"200":{"description":"Job results","content":{"application/json":{"schema":{"type":"object","properties":{"validationResult":{"$ref":"#/components/schemas/ValidationResult"},"thresholdTests":{"type":"array","items":{"$ref":"#/components/schemas/ThresholdTestResult"}},"bestThresholds":{"$ref":"#/components/schemas/AutoscalingConfig"},"recommendations":{"type":"array","items":{"$ref":"#/components/schemas/PredictionRecommendation"}},"trafficForecast":{"$ref":"#/components/schemas/TrafficForecast"},"bottlenecks":{"type":"array","description":"Performance bottlenecks identified during the prediction run. For simulations that include an `aurora-serverless` resource, entries will note the **2–4 step ACU ramp delay** that occurs after a load increase — during those steps the database operates at reduced capacity, causing upstream latency to exceed steady-state projections. Use this field to set appropriate scale-out lead time in autoscaling policies.","items":{"type":"string"},"example":["aurora-serverless ACU ramp delay: up to 4 steps after load increase before full capacity is available","Upstream API latency elevated during ACU scale-out window"]},"costProjection":{"type":"object","description":"Incremental cost projection for the traffic forecast window. 
For simulations that include an `aurora-serverless` resource, `costProjection` reflects the **3–6 step incremental ACU cost delta** as Aurora Capacity Units increase one tier at a time toward the new steady state — meaning projected cost in early ramp steps is lower than the provisioned-equivalent level and rises incrementally until ACU scaling completes. Account for this lag when comparing `aurora-serverless` cost estimates against fixed-instance baselines.","properties":{"totalUsd":{"type":"number","description":"Total projected cost in USD for the forecast window","example":14.72},"peakCostPerHour":{"type":"number","description":"Peak cost per hour at maximum load","example":2.4},"averageCostPerHour":{"type":"number","description":"Average cost per hour across the forecast window (reflects blended ACU cost during ramp and steady state)","example":1.85},"costByStep":{"type":"array","description":"Per-step cost breakdown, useful for observing the incremental ACU cost delta on `aurora-serverless` resources","items":{"type":"object","properties":{"step":{"type":"integer","description":"Simulation step index"},"costPerHour":{"type":"number","description":"Cost per hour at this step"}}}}}},"error":{"type":"string","description":"Present only when the job status is 'failed'. Describes why threshold optimization could not complete (e.g. simulation not found, traffic forecast too short, no valid threshold combination found).","example":"No valid threshold combination found — all candidate configurations exceeded the SLA error-rate limit."},"suggestions":{"type":"array","items":{"type":"string"},"description":"Present only when the job status is 'failed'. 
Short, actionable steps the caller can take to resolve the failure and resubmit the optimization job.","example":["Increase maxInstances in the simulation's autoscaling config to give the optimizer more headroom","Upgrade to a larger instance size so individual instances handle more load before triggering scale-out"]}}},"examples":{"awsResults":{"summary":"AWS — Black Friday validation passed; EC2 Auto Scaling thresholds tuned","value":{"validationResult":{"passed":true,"summary":"AWS EC2 m5.xlarge Auto Scaling group handles the 12,000 RPS Black Friday peak with CPU at 81% — within the 85% threshold","peakMetrics":{"timestamp":60,"traffic":12000,"cpuUsage":81,"latencyP95":44,"errorRate":0.4,"costPerHour":18.2},"bottlenecksDetected":["CPU approaches threshold at peak (81%) — scale-out buffer is thin","RDS connection pool at 87% utilization during peak"],"failurePoints":[],"recommendations":["Lower scale-out CPU threshold from 80% to 70% to trigger earlier scale-out","Enable RDS Proxy to reduce connection pool pressure","Pre-warm 2 extra instances 30 minutes before projected peak"]},"bestThresholds":{"scaleOutCpuThreshold":70,"scaleInCpuThreshold":30,"scaleOutThroughputThreshold":75,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":120,"cooldownSeconds":180,"minInstances":3,"maxInstances":15},"recommendations":[{"rank":1,"title":"Lower CPU scale-out threshold to 70%","description":"Triggering scale-out at 70% CPU instead of 80% gives 60–90 seconds of lead time before saturation under the Black Friday ramp","priority":"high","action":"Set scaleOutCpuThreshold to 70 in the EC2 Auto Scaling policy","expectedImpact":"Reduces peak CPU from 81% to ~68%, drops error rate from 0.4% to <0.1%"},{"rank":2,"title":"Enable RDS Proxy for connection pooling","description":"RDS Proxy absorbs connection spikes and reduces db.r5.large connection saturation during the Black Friday peak","priority":"medium","action":"Attach an RDS Proxy endpoint to the db.r5.large Multi-AZ 
instance","expectedImpact":"Reduces RDS connection pool utilization from 87% to ~55%"}],"trafficForecast":{"name":"Black Friday 2025 — AWS Production","dataPoints":[{"timestamp":0,"rps":2000,"label":"Baseline"},{"timestamp":60,"rps":12000,"label":"Peak"},{"timestamp":100,"rps":2500,"label":"Return to baseline"}]}}},"gcpResults":{"summary":"GCP — holiday validation failed; Cloud Run concurrency limit too low","value":{"validationResult":{"passed":false,"summary":"Cloud Run service saturates at 10,000 RPS due to per-instance concurrency limit of 80 — error rate spikes to 12% during the holiday peak","peakMetrics":{"timestamp":70,"traffic":12000,"cpuUsage":94,"latencyP95":310,"errorRate":12.1,"costPerHour":9.4},"bottlenecksDetected":["Cloud Run concurrency limit (80) exceeded — request queuing drives latency above 300 ms","CPU saturation at 94% on active instances — scale-out lagging by ~45 seconds"],"failurePoints":[{"timestamp":68,"traffic":10200,"reason":"Cloud Run per-instance concurrency limit reached; new instances not yet warm"},{"timestamp":70,"traffic":12000,"reason":"Error rate exceeds 5% SLA threshold"}],"recommendations":["Increase Cloud Run max-instances from 10 to 25","Set min-instances to 3 to eliminate cold start latency during ramp","Raise per-instance concurrency limit from 80 to 200 to reduce instance count needed at peak"]},"bestThresholds":{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"scaleOutThroughputThreshold":65,"scaleInThroughputThreshold":30,"scaleOutLatencyThreshold":150,"cooldownSeconds":60,"minInstances":3,"maxInstances":25},"recommendations":[{"rank":1,"title":"Increase Cloud Run max-instances to 25","description":"The current limit of 10 instances is the root cause of saturation at peak — raising it to 25 fully absorbs the 12,000 RPS holiday load","priority":"critical","action":"Set --max-instances=25 on the Cloud Run service revision","expectedImpact":"Eliminates error spike; latency p95 drops from 310 ms to ~55 ms at 
peak"},{"rank":2,"title":"Set min-instances to 3 to eliminate cold starts","description":"Keeping 3 warm instances prevents the 45-second scale-out lag that causes early errors during the traffic ramp","priority":"high","action":"Set --min-instances=3 on the Cloud Run service revision","expectedImpact":"Eliminates cold-start latency; scale-out lag drops from 45 s to <5 s"}],"trafficForecast":{"name":"GCP Seasonal Spike — Holiday 2025","dataPoints":[{"timestamp":0,"rps":3000,"label":"Baseline"},{"timestamp":70,"rps":12000,"label":"Peak"},{"timestamp":100,"rps":3200,"label":"Return to baseline"}]}}},"azureResults":{"summary":"Azure — product launch validation passed; AKS HPA thresholds optimized","value":{"validationResult":{"passed":true,"summary":"AKS cluster (Standard_D4s_v3, 3–10 nodes) handles the 10,000 RPS launch spike with CPU at 76% and p95 latency at 62 ms — within the 100 ms SLA","peakMetrics":{"timestamp":20,"traffic":10000,"cpuUsage":76,"latencyP95":62,"errorRate":0.2,"costPerHour":14.8},"bottlenecksDetected":["HPA scale-out takes 90 seconds — brief latency spike to 110 ms at step 21"],"failurePoints":[],"recommendations":["Lower HPA CPU target utilization from 75% to 60% to scale out earlier","Pre-scale to 5 nodes before launch announcement using a scheduled KEDA trigger"]},"bestThresholds":{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"scaleOutThroughputThreshold":70,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":80,"cooldownSeconds":120,"minInstances":3,"maxInstances":12},"recommendations":[{"rank":1,"title":"Lower HPA CPU target to 60%","description":"Scaling out at 60% CPU instead of 75% gives the AKS node pool an extra 45 seconds of lead time for the product launch spike","priority":"high","action":"Update HorizontalPodAutoscaler targetCPUUtilizationPercentage to 60","expectedImpact":"Eliminates the 110 ms latency spike at launch; steady-state p95 drops from 62 ms to 48 ms"},{"rank":2,"title":"Pre-scale to 5 nodes with a scheduled 
KEDA trigger","description":"A time-based KEDA ScaledObject set to fire 10 minutes before launch eliminates the HPA reaction delay entirely","priority":"medium","action":"Create a KEDA CronScaler targeting minReplicaCount=5 at launch time","expectedImpact":"Removes 90-second HPA lag; keeps latency p95 below 60 ms throughout the spike"}],"trafficForecast":{"name":"Azure Product Launch Traffic","dataPoints":[{"timestamp":0,"rps":1000,"label":"Pre-launch baseline"},{"timestamp":20,"rps":10000,"label":"Launch spike"},{"timestamp":100,"rps":1500,"label":"New elevated baseline"}]}}},"ociResults":{"summary":"OCI — month-end batch validation passed; Autonomous Database concurrency tuned","value":{"validationResult":{"passed":true,"summary":"OCI VM.Standard.E4.Flex instances and Autonomous Database (ATP) handle the 2,600 RPS month-end batch surge with CPU at 72% and DB CPU at 68%","peakMetrics":{"timestamp":60,"traffic":2600,"cpuUsage":72,"latencyP95":88,"errorRate":0.1,"costPerHour":6.2},"bottlenecksDetected":["ATP OCPU utilization at 68% during peak — headroom is adequate but narrow","Connection pool pressure at 74% due to long-running reporting queries"],"failurePoints":[],"recommendations":["Increase ATP OCPU count from 4 to 6 to provide 30% more headroom during batch","Schedule reporting queries with lower parallelism to reduce connection pool contention"]},"bestThresholds":{"scaleOutCpuThreshold":65,"scaleInCpuThreshold":30,"scaleOutThroughputThreshold":70,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":200,"cooldownSeconds":300,"minInstances":2,"maxInstances":8},"recommendations":[{"rank":1,"title":"Scale ATP from 4 to 6 OCPU before month-end batch","description":"Increasing ATP OCPU count to 6 drops DB CPU utilization from 68% to ~46%, providing safe headroom for unexpected query spikes","priority":"medium","action":"Resize Autonomous Database to 6 OCPU using OCI Console or CLI before the 25th of each month","expectedImpact":"DB CPU drops to ~46%; 
connection pool pressure drops from 74% to ~52%"},{"rank":2,"title":"Throttle reporting query parallelism during batch window","description":"Limiting parallel query degree on month-end reports reduces connection pool contention without requiring a capacity increase","priority":"low","action":"Set PARALLEL_DEGREE_POLICY=MANUAL and MAX_PARALLEL_DEGREE=4 in the ATP session profile for the reporting user","expectedImpact":"Reduces connection pool utilization from 74% to ~55% during the batch window"}],"trafficForecast":{"name":"OCI Month-End Batch Surge","dataPoints":[{"timestamp":0,"rps":800,"label":"Normal operations"},{"timestamp":60,"rps":2600,"label":"Peak batch load"},{"timestamp":100,"rps":850,"label":"Return to normal"}]}}},"digitalOceanResults":{"summary":"DigitalOcean — viral traffic validation failed; Droplet pool too small to absorb sudden spike","value":{"validationResult":{"passed":false,"summary":"DigitalOcean Droplet cluster (3x s-4vcpu-8gb) saturates at 2,000 RPS — the viral spike reaches 3,000 RPS and drives CPU to 98% with error rate at 18%","peakMetrics":{"timestamp":40,"traffic":3000,"cpuUsage":98,"latencyP95":480,"errorRate":18.3,"costPerHour":1.8},"bottlenecksDetected":["CPU saturation at 98% — all 3 Droplets at capacity","DigitalOcean Load Balancer connection queue full — new connections timing out","Managed Database max_connections limit reached (100/100)"],"failurePoints":[{"timestamp":22,"traffic":2100,"reason":"CPU exceeds 90% on all active Droplets; error rate crosses 5% SLA"},{"timestamp":40,"traffic":3000,"reason":"Full saturation — CPU 98%, error rate 18%, p95 latency 480 ms"}],"recommendations":["Increase Droplet pool minimum from 3 to 6 to handle baseline + spike headroom","Configure DigitalOcean Load Balancer with connection draining (30 s) to reduce timeout errors during scale-out","Upgrade Managed Database to the db-4vcpu-8gb plan to raise max_connections to 
200"]},"bestThresholds":{"scaleOutCpuThreshold":55,"scaleInCpuThreshold":20,"scaleOutThroughputThreshold":60,"scaleInThroughputThreshold":25,"scaleOutLatencyThreshold":100,"cooldownSeconds":90,"minInstances":6,"maxInstances":20},"recommendations":[{"rank":1,"title":"Increase minimum Droplet pool size from 3 to 6","description":"The viral spike rises from 500 to 3,000 RPS within 40 steps — 3 Droplets cannot absorb the surge even with fast scale-out. Starting from 6 instances provides the necessary baseline capacity","priority":"critical","action":"Update DigitalOcean App Platform or Droplet autoscaling min_size to 6","expectedImpact":"CPU peak drops from 98% to ~52%; error rate drops from 18% to <1%"},{"rank":2,"title":"Upgrade Managed Database to db-4vcpu-8gb","description":"The current db-2vcpu-4gb plan caps max_connections at 100, which is exhausted during the viral spike. The next tier doubles the connection limit","priority":"high","action":"Resize DigitalOcean Managed Database cluster to db-4vcpu-8gb (raises max_connections to 200)","expectedImpact":"Eliminates database connection exhaustion; reduces connection-related errors from 18% to <0.5%"}],"trafficForecast":{"name":"DigitalOcean Viral Traffic Event","dataPoints":[{"timestamp":0,"rps":500,"label":"Normal baseline"},{"timestamp":40,"rps":3000,"label":"Peak viral traffic"},{"timestamp":100,"rps":700,"label":"New elevated baseline"}]}}},"auroraServerlessResults":{"summary":"AWS — Aurora Serverless v2 ACU ramp; bottlenecks and cost projection populated","value":{"validationResult":{"passed":true,"summary":"AWS Aurora Serverless v2 (0.5–16 ACU) and EC2 m5.large Auto Scaling group handle the 5,000 RPS flash-sale peak with CPU at 74% — within the 85% threshold. 
ACU ramp delay causes elevated upstream latency for 3 steps post-load-increase.","peakMetrics":{"timestamp":45,"traffic":5000,"cpuUsage":74,"latencyP95":118,"errorRate":0.6,"costPerHour":6.85},"bottlenecksDetected":["Aurora Serverless v2 ACU ramp delay: 3 steps of elevated latency (p95 ~118 ms) after load increase before ACUs scale from 2 to 8","Upstream API latency elevated during ACU scale-out window (steps 43–46)","EC2 connection pool at 79% utilization during ACU ramp window"],"failurePoints":[],"recommendations":["Set Aurora Serverless v2 min ACU to 4 to reduce the ramp window from 3 steps to 1 step","Add a 120-second pre-warm window before flash-sale start using a scheduled Lambda to issue warm-up queries","Lower EC2 Auto Scaling scale-out CPU threshold from 80% to 65% to maintain connection headroom during ACU ramp"]},"bestThresholds":{"scaleOutCpuThreshold":65,"scaleInCpuThreshold":28,"scaleOutThroughputThreshold":70,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":100,"cooldownSeconds":150,"minInstances":2,"maxInstances":12},"recommendations":[{"rank":1,"title":"Raise Aurora Serverless v2 minimum ACU from 0.5 to 4","description":"Starting from 0.5 ACU requires Aurora to traverse multiple ACU tiers during a load spike, causing a 3-step ramp delay. 
Setting min ACU to 4 keeps the database warm enough to absorb flash-sale traffic in a single scaling step.","priority":"high","action":"Update Aurora Serverless v2 cluster min_capacity to 4 ACU in the RDS cluster modification API or Console","expectedImpact":"ACU ramp window shrinks from 3 steps to ≤1 step; p95 latency at peak drops from 118 ms to ~62 ms"},{"rank":2,"title":"Pre-warm Aurora with scheduled Lambda warm-up queries","description":"Issuing lightweight SELECT queries 2 minutes before flash-sale start forces Aurora to pre-allocate ACUs before the actual traffic spike arrives, eliminating the ramp window entirely","priority":"medium","action":"Schedule an EventBridge rule 2 minutes before flash-sale start to invoke a Lambda that runs 10 parallel SELECT 1 queries against the Aurora endpoint","expectedImpact":"Eliminates ACU ramp delay; p95 latency stays below 65 ms throughout the flash-sale window"}],"trafficForecast":{"name":"Flash Sale — Aurora Serverless v2 Ramp Test","dataPoints":[{"timestamp":0,"rps":800,"label":"Pre-sale baseline"},{"timestamp":43,"rps":5000,"label":"Flash-sale spike"},{"timestamp":100,"rps":1200,"label":"Post-sale elevated baseline"}]},"bottlenecks":["aurora-serverless ACU ramp delay: up to 4 steps after load increase before full capacity is available","Upstream API latency elevated during ACU scale-out window"],"costProjection":{"totalUsd":14.72,"peakCostPerHour":2.4,"averageCostPerHour":1.85,"costByStep":[{"step":0,"costPerHour":1.1},{"step":43,"costPerHour":1.4},{"step":44,"costPerHour":1.75},{"step":45,"costPerHour":2.4},{"step":46,"costPerHour":2.38},{"step":100,"costPerHour":1.55}]}}},"awsValidationFailed":{"summary":"AWS — validation job failed; simulationId not found","value":{"error":"Simulation 'sim_aws_7c4e1b2d-9f83-4a11-bc45-aef987654321' not found. The simulation may have been deleted before the validation job was processed. 
Re-create the simulation and resubmit the validation job."}},"gcpValidationFailed":{"summary":"GCP — validation job failed; traffic forecast malformed","value":{"error":"Traffic forecast 'gcp-holiday-burst-v3' is malformed: timestamps are not strictly increasing (step 45 appears before step 38). Validation requires a monotonically increasing timestamp sequence. Correct the forecast data and resubmit."}},"azureValidationFailed":{"summary":"Azure — validation job failed; validation engine internal error","value":{"error":"Validation engine encountered an internal error while replaying simulation 'sim_azure_aks_prod_westeurope' against the 'product-launch-forecast' traffic pattern: capacity model returned a negative throughput value at step 22 (traffic=10000 RPS, instances=3). This indicates an inconsistent resource configuration. Verify that all AKS node sizes and replica counts are set to positive non-zero values, then resubmit."}},"ociValidationFailed":{"summary":"OCI — validation job failed; simulation contains no resources","value":{"error":"Simulation 'sim_oci_3b9d72f1' contains no resources and cannot be validated. Add at least one compute resource (e.g. VM.Standard.E4.Flex instance pool) and one database resource (e.g. Autonomous Database ATP) before submitting a validation job."}},"digitalOceanValidationFailed":{"summary":"DigitalOcean — validation job failed; traffic forecast has insufficient data points","value":{"error":"Traffic forecast 'do-viral-traffic-short' contains only 2 data points spanning 15 simulation steps. Validation requires a forecast with at least 5 data points covering a minimum of 60 steps to accurately replay a ramp-and-drain traffic cycle. 
Extend the forecast to cover the full event window and resubmit."}},"awsThresholdOptResult":{"summary":"AWS — threshold optimization results for EC2 Auto Scaling (Black Friday)","value":{"thresholdTests":[{"scaleOutCpuThreshold":80,"scaleInCpuThreshold":40,"score":62,"peakCpu":81,"peakLatencyP95":44,"peakErrorRate":0.4,"costPerHour":18.2},{"scaleOutCpuThreshold":70,"scaleInCpuThreshold":30,"score":91,"peakCpu":68,"peakLatencyP95":38,"peakErrorRate":0.05,"costPerHour":19.6},{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"score":85,"peakCpu":61,"peakLatencyP95":35,"peakErrorRate":0.02,"costPerHour":21.4}],"bestThresholds":{"scaleOutCpuThreshold":70,"scaleInCpuThreshold":30,"scaleOutThroughputThreshold":75,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":120,"cooldownSeconds":180,"minInstances":3,"maxInstances":15},"recommendations":[{"rank":1,"title":"Lower CPU scale-out threshold to 70%","description":"Triggering scale-out at 70% CPU instead of 80% gives 60–90 seconds of lead time before saturation under the Black Friday ramp","priority":"high","action":"Set scaleOutCpuThreshold to 70 in the EC2 Auto Scaling policy","expectedImpact":"Reduces peak CPU from 81% to ~68%, drops error rate from 0.4% to <0.1%"},{"rank":2,"title":"Keep scale-in threshold at 30% to avoid flapping","description":"A conservative scale-in threshold prevents the Auto Scaling group from terminating instances too quickly after the Black Friday peak, avoiding a secondary spike during wind-down","priority":"medium","action":"Set scaleInCpuThreshold to 30 in the EC2 Auto Scaling policy","expectedImpact":"Eliminates post-peak scale-in flap; saves one unnecessary scale-out cycle during wind-down"}],"trafficForecast":{"name":"Black Friday 2025 — AWS Production","dataPoints":[{"timestamp":0,"rps":2000,"label":"Baseline"},{"timestamp":60,"rps":12000,"label":"Peak"},{"timestamp":100,"rps":2500,"label":"Return to baseline"}]}}},"gcpThresholdOptResult":{"summary":"GCP — threshold optimization 
results for Cloud Run (Holiday seasonal burst)","value":{"thresholdTests":[{"scaleOutCpuThreshold":80,"scaleInCpuThreshold":40,"score":38,"peakCpu":94,"peakLatencyP95":310,"peakErrorRate":12.1,"costPerHour":9.4},{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"score":88,"peakCpu":67,"peakLatencyP95":58,"peakErrorRate":0.3,"costPerHour":11.2},{"scaleOutCpuThreshold":50,"scaleInCpuThreshold":20,"score":81,"peakCpu":58,"peakLatencyP95":52,"peakErrorRate":0.1,"costPerHour":13.8}],"bestThresholds":{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"scaleOutThroughputThreshold":65,"scaleInThroughputThreshold":30,"scaleOutLatencyThreshold":150,"cooldownSeconds":60,"minInstances":3,"maxInstances":25},"recommendations":[{"rank":1,"title":"Set CPU scale-out threshold to 60% to prevent concurrency saturation","description":"Cloud Run saturates when per-instance concurrency fills before CPU-based scale-out fires. Triggering at 60% CPU ensures new instances are warm before the holiday ramp overwhelms the active pool","priority":"critical","action":"Configure Cloud Run --cpu-throttling and set Knative autoscaling target annotation to 60","expectedImpact":"Peak error rate drops from 12.1% to <0.5%; p95 latency drops from 310 ms to ~58 ms"},{"rank":2,"title":"Set minimum instances to 3 to eliminate cold-start lag","description":"Keeping 3 instances warm prevents the 45-second scale-out delay at ramp start that causes early-stage errors before the optimizer thresholds can take effect","priority":"high","action":"Set --min-instances=3 on the Cloud Run service revision","expectedImpact":"Removes cold-start lag; threshold optimizer can respond within 5 s instead of 45 s"}],"trafficForecast":{"name":"GCP Seasonal Spike — Holiday 2025","dataPoints":[{"timestamp":0,"rps":3000,"label":"Baseline"},{"timestamp":70,"rps":12000,"label":"Peak"},{"timestamp":100,"rps":3200,"label":"Return to baseline"}]}}},"azureThresholdOptResult":{"summary":"Azure — threshold optimization results for AKS 
HPA (product launch spike)","value":{"thresholdTests":[{"scaleOutCpuThreshold":75,"scaleInCpuThreshold":35,"score":70,"peakCpu":76,"peakLatencyP95":110,"peakErrorRate":0.8,"costPerHour":14.8},{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"score":93,"peakCpu":61,"peakLatencyP95":48,"peakErrorRate":0.05,"costPerHour":16.2},{"scaleOutCpuThreshold":50,"scaleInCpuThreshold":20,"score":86,"peakCpu":52,"peakLatencyP95":43,"peakErrorRate":0.02,"costPerHour":18.5}],"bestThresholds":{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"scaleOutThroughputThreshold":70,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":80,"cooldownSeconds":120,"minInstances":3,"maxInstances":12},"recommendations":[{"rank":1,"title":"Lower HPA CPU target to 60%","description":"Scaling out at 60% CPU instead of 75% gives the AKS node pool an extra 45 seconds of lead time for the product launch spike, eliminating the brief 110 ms latency overshoot at step 21","priority":"high","action":"Update HorizontalPodAutoscaler targetCPUUtilizationPercentage to 60","expectedImpact":"Eliminates p95 latency spike at launch; steady-state p95 drops from 62 ms to 48 ms"},{"rank":2,"title":"Set cooldown to 120 s to prevent HPA thrashing during sustained load","description":"The product launch pattern holds elevated traffic for 50 steps; a 120-second cooldown keeps the HPA from oscillating between scale-in and scale-out during the sustained period","priority":"medium","action":"Set HorizontalPodAutoscaler spec.behavior.scaleDown.stabilizationWindowSeconds to 120","expectedImpact":"Eliminates 3 unnecessary scale-in/scale-out cycles during the sustained traffic window"}],"trafficForecast":{"name":"Azure Product Launch Traffic","dataPoints":[{"timestamp":0,"rps":1000,"label":"Pre-launch baseline"},{"timestamp":20,"rps":10000,"label":"Launch spike"},{"timestamp":100,"rps":1500,"label":"New elevated baseline"}]}}},"ociThresholdOptResult":{"summary":"OCI — threshold optimization results for VM.Standard.E4.Flex autoscaling 
(month-end batch)","value":{"thresholdTests":[{"scaleOutCpuThreshold":75,"scaleInCpuThreshold":40,"score":74,"peakCpu":72,"peakLatencyP95":88,"peakErrorRate":0.1,"costPerHour":6.2},{"scaleOutCpuThreshold":65,"scaleInCpuThreshold":30,"score":89,"peakCpu":63,"peakLatencyP95":76,"peakErrorRate":0.05,"costPerHour":6.9},{"scaleOutCpuThreshold":55,"scaleInCpuThreshold":25,"score":82,"peakCpu":54,"peakLatencyP95":70,"peakErrorRate":0.02,"costPerHour":7.8}],"bestThresholds":{"scaleOutCpuThreshold":65,"scaleInCpuThreshold":30,"scaleOutThroughputThreshold":70,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":200,"cooldownSeconds":300,"minInstances":2,"maxInstances":8},"recommendations":[{"rank":1,"title":"Set scale-out CPU threshold to 65% for the batch window","description":"Month-end batch load ramps gradually over 35 steps — triggering at 65% CPU provides a 2-instance buffer before peak query load hits, keeping ATP connection pool below 60%","priority":"medium","action":"Update OCI Autoscaling policy CPU threshold to 65% for the VM.Standard.E4.Flex instance pool","expectedImpact":"Peak CPU drops from 72% to ~63%; ATP connection pool pressure drops from 74% to ~58%"},{"rank":2,"title":"Use a 300-second cooldown to prevent premature scale-in mid-batch","description":"Month-end batch jobs run for 55 steps — a short cooldown causes the autoscaler to prematurely scale in between reporting sub-jobs, then immediately scale out again","priority":"low","action":"Set OCI Autoscaling policy cooldown period to 300 seconds","expectedImpact":"Eliminates 2 mid-batch scale-in/out cycles; reduces ATP reconnect overhead"}],"trafficForecast":{"name":"OCI Month-End Batch Surge","dataPoints":[{"timestamp":0,"rps":800,"label":"Normal operations"},{"timestamp":60,"rps":2600,"label":"Peak batch load"},{"timestamp":100,"rps":850,"label":"Return to normal"}]}}},"digitalOceanThresholdOptResult":{"summary":"DigitalOcean — threshold optimization results for Droplet autoscaling (viral traffic 
spike)","value":{"thresholdTests":[{"scaleOutCpuThreshold":75,"scaleInCpuThreshold":35,"score":22,"peakCpu":98,"peakLatencyP95":480,"peakErrorRate":18.3,"costPerHour":1.8},{"scaleOutCpuThreshold":55,"scaleInCpuThreshold":20,"score":87,"peakCpu":58,"peakLatencyP95":72,"peakErrorRate":0.4,"costPerHour":3.2},{"scaleOutCpuThreshold":45,"scaleInCpuThreshold":15,"score":79,"peakCpu":49,"peakLatencyP95":65,"peakErrorRate":0.1,"costPerHour":4.1}],"bestThresholds":{"scaleOutCpuThreshold":55,"scaleInCpuThreshold":20,"scaleOutThroughputThreshold":60,"scaleInThroughputThreshold":25,"scaleOutLatencyThreshold":100,"cooldownSeconds":90,"minInstances":6,"maxInstances":20},"recommendations":[{"rank":1,"title":"Lower scale-out CPU threshold to 55% to react before viral saturation","description":"The viral spike reaches full intensity within 15 steps — the default 75% threshold fires too late for DigitalOcean App Platform to provision new Droplets in time. 55% gives a 10-step head start","priority":"critical","action":"Update DigitalOcean App Platform autoscaling CPU threshold to 55%","expectedImpact":"Peak CPU drops from 98% to ~58%; error rate drops from 18% to <0.5%"},{"rank":2,"title":"Set minimum Droplet pool to 6 as a prerequisite for the threshold to take effect","description":"Even the optimized 55% threshold cannot compensate if the starting pool is too small — the 15-step viral ramp outpaces Droplet provisioning speed from a pool of 3","priority":"critical","action":"Set DigitalOcean App Platform min_instance_count to 6","expectedImpact":"Ensures the threshold optimizer has sufficient baseline capacity; CPU peak drops to ~52% when combined with the 55% scale-out trigger"}],"trafficForecast":{"name":"DigitalOcean Viral Traffic Event","dataPoints":[{"timestamp":0,"rps":500,"label":"Normal baseline"},{"timestamp":40,"rps":3000,"label":"Peak viral traffic"},{"timestamp":100,"rps":700,"label":"New elevated baseline"}]}}},"awsThresholdOptFailed":{"summary":"AWS — threshold 
optimization failed; simulation not found","value":{"error":"Simulation 'sim_b3f2a1c9-4d78-4e02-9f61-aef123456789' not found. The simulation may have been deleted before the threshold optimization job ran. Re-create the simulation and submit a new optimization job.","suggestions":["Re-create the simulation and submit a new optimization job referencing the new simulationId","Verify the simulationId is correct and has not been deleted","List your simulations via GET /api/simulations to confirm the ID before submitting"]}},"gcpThresholdOptFailed":{"summary":"GCP — threshold optimization failed; traffic forecast too short","value":{"error":"Traffic forecast 'gcp-holiday-spike' contains only 3 data points spanning 30 simulation steps. Threshold optimization requires a forecast with at least 5 data points covering a minimum of 60 steps to evaluate scale-out and scale-in behaviour across a full ramp-and-drain cycle. Extend the forecast and resubmit.","suggestions":["Extend the traffic forecast to include at least 5 data points spanning 60 or more simulation steps","Include a full ramp-up and drain cycle in the forecast so the optimizer can evaluate scale-out and scale-in behaviour","Re-submit the optimization job with the updated trafficForecast"]}},"azureThresholdOptFailed":{"summary":"Azure — threshold optimization failed; no valid threshold combination found","value":{"error":"No valid threshold combination found for AKS cluster 'aks-prod-westeurope' under the provided traffic forecast. All 27 candidate combinations (scaleOutCpuThreshold 50–80%, cooldown 60–300 s) produced peak error rates above the 5% SLA limit during the sustained 9,500 RPS window. 
Consider increasing maxInstances beyond the current cap of 6, raising node SKU to Standard_D8s_v3, or splitting traffic across two AKS clusters before rerunning optimization.","suggestions":["Increase maxInstances in the simulation's autoscaling config to give the optimizer more headroom (current cap of 6 is insufficient for the 9,500 RPS sustained window)","Upgrade the node SKU to Standard_D8s_v3 so individual nodes handle more load before triggering scale-out","Split traffic across two AKS clusters to reduce per-cluster peak load below the SLA threshold","Recreate the simulation with relaxed SLA targets and resubmit the optimization job"]}},"ociThresholdOptFailed":{"summary":"OCI — threshold optimization failed; invalid simulation reference","value":{"error":"simulationId 'sim_oci_9a2b3c4d' does not exist or belongs to a different API key scope. Threshold optimization jobs must reference a simulation owned by the requesting key. Verify the simulationId and ensure the API key has the 'read' scope for the target simulation.","suggestions":["Verify the simulationId belongs to the API key used in this request","Re-create the simulation under the current API key and resubmit the job","Ensure the API key has the 'read' scope for the target simulation"]}},"digitalOceanThresholdOptFailed":{"summary":"DigitalOcean — threshold optimization failed; no valid threshold combination found","value":{"error":"No valid threshold combination found for DigitalOcean Droplet pool 'app-prod-nyc3' under the 'viral-traffic-v2' forecast. The viral spike reaches 4,200 RPS within 8 steps, which outpaces Droplet provisioning speed regardless of scale-out threshold. All 18 tested combinations exceeded the 10% error rate SLA. 
Increase the minimum Droplet count to at least 8 (currently 2) so the pool can absorb the initial surge before autoscaling adds capacity, then resubmit the optimization job.","suggestions":["Increase minInstances in the simulation's autoscaling config so the pool can absorb the initial surge (raise the minimum Droplet count from 2 to at least 8)","Upgrade to a larger Droplet size so individual instances handle more load before triggering scale-out","Split traffic across multiple resources or regions to reduce per-resource peak load","Recreate the simulation with relaxed SLA targets and resubmit the optimization job"]}}}}}},"400":{"description":"Job not completed yet","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/predictions/jobs/{jobId}/stream":{"x-stability":"stable","get":{"summary":"Stream prediction job progress in real time","description":"Stream prediction job progress using Server-Sent Events (SSE).\nThis endpoint provides real-time updates as validation or threshold\noptimization steps are executed.\n\n**SSE Event Types:**\n- `init`: Initial job state when connection is established\n- `progressUpdate`: Progress update sent periodically as the job runs\n- `completed`: Job has finished successfully (final event before connection closes)\n- `failed`: Job encountered an unrecoverable error (final event before connection closes)\n- `cancelled`: Job was cancelled via `DELETE /api/predictions/jobs/{jobId}` (final event before connection closes)\n\n**Connection Behavior:**\n- Connection remains open until the job completes, fails, or is cancelled\n- Connection automatically closes when the job reaches a terminal state\n\n**Agent Reconnection and Recovery Guide:**\n\nAfter receiving a terminal SSE event (`completed`, `failed`, or `cancelled`) the server\ncloses the connection. 
Each event requires a different agent response:\n\n- **`completed`**: The job finished successfully. No reconnection is needed. Fetch the\n  full results from `GET /api/predictions/jobs/{jobId}/results` to retrieve the\n  `validationResult`, `bestThresholds`, and `recommendations`.\n\n- **`failed`**: The job encountered an unrecoverable error. Inspect the `error` field in\n  the event payload for the root cause. Transient errors (e.g. a simulation lookup\n  timeout) are safe to retry — submit a new job via `POST /api/predictions/validate` or\n  `POST /api/predictions/optimize-thresholds`. Permanent errors (e.g. an invalid\n  simulation ID or a traffic forecast that is too short) should not be retried without\n  fixing the underlying input first. Do not attempt to reconnect to the same `jobId`;\n  it will not recover.\n\n- **`cancelled`**: Cancellation is terminal and intentional. No retry is needed or\n  recommended. If the cancellation was unintended, submit a new job.\n\n**Handling unexpected connection drops (no terminal event received):**\n\nIf the SSE connection closes without a `completed`, `failed`, or `cancelled` event —\nfor example due to a network interruption, proxy timeout, or server restart — the job\nmay still be running. Use the following fallback strategy:\n\n1. Poll `GET /api/predictions/jobs/{jobId}` to check the current `status` field.\n2. If `status` is `running` or `pending`, reconnect to this stream endpoint.\n3. If `status` is `completed`, `failed`, or `cancelled`, treat it the same as if you\n   had received the corresponding terminal SSE event (see above).\n\nAgents should implement an exponential back-off (e.g. 
1 s, 2 s, 4 s, cap at 30 s)\nbefore each reconnection attempt to avoid hammering the server during an outage.\n","operationId":"streamPredictionJob","tags":["Predictive Scaling"],"security":[{"BearerAuth":[]}],"x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -N -X GET \"https://your-production-domain.com/api/predictions/jobs/pred_aws_val001/stream\" \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Accept: text/event-stream\"\n"},{"lang":"Python","label":"Python","source":"import os\nimport json\nimport requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = os.environ[\"API_KEY\"]\n\njob_id = \"pred_aws_val001\"\nurl = f\"{BASE_URL}/predictions/jobs/{job_id}/stream\"\nheaders = {\n    \"Authorization\": f\"Bearer {API_KEY}\",\n    \"Accept\": \"text/event-stream\",\n}\n\nwith requests.get(url, headers=headers, stream=True, timeout=120) as resp:\n    resp.raise_for_status()\n    for line in resp.iter_lines():\n        if not line:\n            continue\n        text = line.decode(\"utf-8\")\n        if text.startswith(\"data:\"):\n            payload = json.loads(text[5:].strip())\n            # Terminal events carry a \"status\" field; \"type\" is the job type.\n            status = payload.get(\"status\")\n            print(f\"status={status} progress={payload.get('progress')}\")\n            if status == \"completed\":\n                print(\"Job complete — fetch results via /results\")\n                break\n            elif status == \"failed\":\n                print(\"Job failed:\", payload.get(\"error\"))\n                break\n            elif status == \"cancelled\":\n                print(\"Job cancelled\")\n                break\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = process.env.API_KEY;\n\nconst jobId = \"pred_aws_val001\";\nconst resp = await fetch(`${BASE_URL}/predictions/jobs/${jobId}/stream`, {\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Accept\": \"text/event-stream\",\n  },\n});\n\nconst reader = resp.body.getReader();\nconst decoder = new TextDecoder();\nlet buffer = \"\";\n\nwhile (true) {\n  const { 
done, value } = await reader.read();\n  if (done) break;\n  buffer += decoder.decode(value, { stream: true });\n  const lines = buffer.split(\"\\n\");\n  buffer = lines.pop();\n  for (const line of lines) {\n    if (!line.startsWith(\"data:\")) continue;\n    const payload = JSON.parse(line.slice(5).trim());\n    // Terminal events carry a \"status\" field; \"type\" is the job type.\n    console.log(\"Status:\", payload.status, \"progress:\", payload.progress);\n    if (payload.status === \"completed\") {\n      console.log(\"Job complete — fetch results via /results\");\n    } else if (payload.status === \"failed\") {\n      console.error(\"Job failed:\", payload.error);\n    } else if (payload.status === \"cancelled\") {\n      console.log(\"Job cancelled\");\n    } else {\n      continue;\n    }\n    // Top-level return is invalid in an ES module; cancel the reader so the\n    // next read() reports done and the outer loop exits.\n    await reader.cancel();\n    break;\n  }\n}\n"}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"Prediction job ID"}],"responses":{"200":{"description":"SSE stream of prediction job progress","content":{"text/event-stream":{"schema":{"type":"string","description":"Server-Sent Events stream"},"examples":{"predictionStream":{"summary":"Validation job — progress then completion","value":"event: init\ndata: {\"jobId\":\"pred_abc123\",\"status\":\"running\",\"progress\":0,\"type\":\"validation\"}\n\nevent: progressUpdate\ndata: {\"jobId\":\"pred_abc123\",\"progress\":40,\"message\":\"Simulating peak traffic window\"}\n\nevent: progressUpdate\ndata: {\"jobId\":\"pred_abc123\",\"progress\":80,\"message\":\"Evaluating SLA thresholds\"}\n\nevent: completed\ndata: {\"jobId\":\"pred_abc123\",\"status\":\"completed\",\"progress\":100}\n"},"predictionFailedStream":{"summary":"Threshold optimization job — failed due to invalid simulation","value":"event: init\ndata: {\"jobId\":\"pred_xyz789\",\"status\":\"running\",\"progress\":0,\"type\":\"threshold_optimization\"}\n\nevent: failed\ndata: {\"jobId\":\"pred_xyz789\",\"status\":\"failed\",\"error\":\"Simulation 'sim_404' not found. 
Re-create the simulation and submit a new optimization job.\"}\n"},"predictionCancelledStream":{"summary":"Validation job — cancelled by agent","value":"event: init\ndata: {\"jobId\":\"pred_can456\",\"status\":\"running\",\"progress\":0,\"type\":\"validation\"}\n\nevent: cancelled\ndata: {\"jobId\":\"pred_can456\",\"status\":\"cancelled\"}\n"}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/chaos/scenarios":{"x-stability":"stable","get":{"tags":["Chaos Engineering"],"summary":"List all available chaos scenarios","description":"Returns a list of pre-built chaos engineering scenarios that can be used\nto test infrastructure resilience.\n","operationId":"listChaosScenarios","responses":{"200":{"description":"List of chaos scenarios","content":{"application/json":{"schema":{"type":"array","items":{"$ref":"#/components/schemas/ChaosScenario"}}}}},"500":{"$ref":"#/components/responses/InternalError"}}}},"/chaos/run":{"x-stability":"stable","post":{"tags":["Chaos Engineering"],"summary":"Run a chaos engineering test","description":"Execute a chaos engineering test by injecting failures into a simulation.\nCan use pre-built scenarios or custom failure injections.\n\nThis endpoint works with simulations built on **any supported provider**,\nincluding AWS, GCP, Azure, OCI, and **DigitalOcean**. 
When targeting a\nDigitalOcean-based simulation, failure types such as `kill_instance`\naffect Droplets, `zone_failure` targets DigitalOcean datacenter regions,\nand `database_crash` targets Managed Database clusters.\n","operationId":"runChaosTest","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/chaos/run \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"simulationId\": \"a638caad-7423-40a3-bb09-f91235d9392d\",\n    \"scenarioId\": \"zone_failure\",\n    \"duration\": 300,\n    \"webhookUrl\": \"https://your-app.com/webhooks/chaos\",\n    \"webhookSecret\": \"your-secret-key-here\"\n  }'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/chaos/run\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\n        \"simulationId\": \"a638caad-7423-40a3-bb09-f91235d9392d\",\n        \"scenarioId\": \"zone_failure\",\n        \"duration\": 300,\n        \"webhookUrl\": \"https://your-app.com/webhooks/chaos\",\n        \"webhookSecret\": \"your-secret-key-here\",\n    },\n)\nresp.raise_for_status()\njob = resp.json()[\"job\"]\nprint(f\"Chaos job started: {job['id']}  status={job['status']}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/chaos/run`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({\n    simulationId: \"a638caad-7423-40a3-bb09-f91235d9392d\",\n    scenarioId: \"zone_failure\",\n    duration: 300,\n    webhookUrl: \"https://your-app.com/webhooks/chaos\",\n    webhookSecret: \"your-secret-key-here\",\n  }),\n});\nconst { job } = await 
resp.json();\nconsole.log(`Chaos job started: ${job.id}  status=${job.status}`);\n"}],"security":[{"BearerAuth":["write"]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["simulationId","duration"],"properties":{"simulationId":{"type":"string","format":"uuid","description":"ID of the base simulation to test","example":"a638caad-7423-40a3-bb09-f91235d9392d"},"scenarioId":{"type":"string","description":"Pre-built scenario ID (optional)","example":"zone_failure","enum":["zone_failure","database_crash","network_partition","cascading_failure","random_instance_failure","database_slowdown","database_overload"]},"customInjections":{"type":"array","description":"Custom failure injections (optional)","items":{"$ref":"#/components/schemas/ChaosInjectionConfig"}},"duration":{"type":"integer","description":"Test duration in simulation steps","default":300,"example":300},"webhookUrl":{"type":"string","format":"uri","description":"Optional HTTPS URL to receive webhook notification when job completes","example":"https://your-app.com/webhooks/chaos"},"webhookSecret":{"type":"string","description":"Optional secret for HMAC-SHA256 webhook signature verification","example":"your-secret-key-here"}}},"examples":{"prebuiltScenario":{"summary":"Pre-built zone failure scenario (any provider)","value":{"simulationId":"a638caad-7423-40a3-bb09-f91235d9392d","scenarioId":"zone_failure","duration":300,"webhookUrl":"https://your-app.com/webhooks/chaos","webhookSecret":"your-secret-key-here"}},"digitalOceanDropletCrash":{"summary":"DigitalOcean — crash a Droplet with custom injection","value":{"simulationId":"d1234abc-0000-40a3-bb09-d091235d9392","duration":180,"customInjections":[{"type":"kill_instance","targetId":"droplet-web-1","injectionTime":30,"duration":90}],"webhookUrl":"https://your-app.com/webhooks/chaos","webhookSecret":"your-secret-key-here"}},"digitalOceanDatabaseCrash":{"summary":"DigitalOcean — crash a Managed Database 
cluster","value":{"simulationId":"d1234abc-0000-40a3-bb09-d091235d9392","scenarioId":"database_crash","duration":240,"webhookUrl":"https://your-app.com/webhooks/chaos","webhookSecret":"your-secret-key-here"}}}}}},"responses":{"202":{"description":"Chaos test job accepted and started","content":{"application/json":{"schema":{"type":"object","properties":{"jobId":{"type":"string","format":"uuid","description":"Top-level chaos job ID (mirrors `job.id`) for convenient access.","example":"job-abc123"},"job":{"type":"object","properties":{"id":{"type":"string","format":"uuid","example":"job-abc123"},"type":{"type":"string","enum":["chaos_test"],"example":"chaos_test"},"status":{"type":"string","enum":["pending","running"],"example":"running"},"simulationId":{"type":"string","format":"uuid"},"createdAt":{"type":"string","format":"date-time"}}},"message":{"type":"string","example":"Chaos test job started. Use GET /chaos/jobs/{id} to check status."}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"description":"Simulation not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/chaos/jobs/{jobId}":{"x-stability":"stable","get":{"tags":["Chaos Engineering"],"summary":"Get chaos job status","description":"Retrieve the current status of a single chaos engineering test job including\nresilience score, vulnerabilities detected, and remediation recommendations.\n\n**Client Code Sample (JavaScript / Node.js):**\n\n```javascript\nasync function pollChaosJobStatus(jobId, apiToken) {\n  const headers = {\n    Authorization: `Bearer ${apiToken}`,\n    Accept: 'application/json',\n  };\n\n  const terminal = new Set(['completed', 'failed', 'cancelled']);\n\n  while (true) {\n    const resp = await fetch(\n      `https://your-host/api/chaos/jobs/${jobId}`,\n      { headers }\n    );\n    if (!resp.ok) {\n      throw new Error(`HTTP ${resp.status}: ${await resp.text()}`);\n    }\n   
 const job = await resp.json();\n    console.log(`Job ${job.id}  status=${job.status}`);\n    if (terminal.has(job.status)) {\n      if (job.resilienceScore) {\n        const s = job.resilienceScore;\n        console.log(`Resilience score: ${s.overall} (Grade: ${s.grade})`);\n      }\n      if (job.vulnerabilities && job.vulnerabilities.length > 0) {\n        console.log(`Vulnerabilities found: ${job.vulnerabilities.length}`);\n        for (const v of job.vulnerabilities) {\n          console.log(`  [${v.severity}] ${v.title}`);\n        }\n      }\n      return job;\n    }\n    await new Promise((r) => setTimeout(r, 5000));\n  }\n}\n```\n\n**Client Code Sample (Python / httpx):**\n\n```python\nimport time\nimport httpx\n\ndef poll_chaos_job_status(job_id: str, api_token: str) -> dict:\n    headers = {\n        \"Authorization\": f\"Bearer {api_token}\",\n        \"Accept\": \"application/json\",\n    }\n    terminal = {\"completed\", \"failed\", \"cancelled\"}\n\n    with httpx.Client() as client:\n        while True:\n            resp = client.get(\n                f\"https://your-host/api/chaos/jobs/{job_id}\",\n                headers=headers,\n            )\n            resp.raise_for_status()\n            job = resp.json()\n            print(f\"Job {job['id']}  status={job['status']}\")\n            if job[\"status\"] in terminal:\n                score = job.get(\"resilienceScore\")\n                if score:\n                    print(f\"Resilience score: {score['overall']} (Grade: {score['grade']})\")\n                vulns = job.get(\"vulnerabilities\") or []\n                if vulns:\n                    print(f\"Vulnerabilities found: {len(vulns)}\")\n                    for v in vulns:\n                        print(f\"  [{v['severity']}] {v['title']}\")\n                return job\n            time.sleep(5)\n```\n","operationId":"getChaosJob","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl 
https://your-production-domain.com/api/chaos/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890 \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\"\n\nresp = requests.get(\n    f\"{BASE_URL}/chaos/jobs/{JOB_ID}\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\njob = resp.json()\nprint(f\"Job {job['id']}  status={job['status']}\")\nif job[\"status\"] == \"completed\" and job.get(\"resilienceScore\"):\n    score = job[\"resilienceScore\"]\n    print(f\"Resilience score: {score['overall']} (Grade: {score['grade']})\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\";\n\nconst resp = await fetch(`${BASE_URL}/chaos/jobs/${JOB_ID}`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst job = await resp.json();\nconsole.log(`Job ${job.id}  status=${job.status}`);\nif (job.status === \"completed\" && job.resilienceScore) {\n  console.log(`Resilience score: ${job.resilienceScore.overall} (Grade: ${job.resilienceScore.grade})`);\n}\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the chaos job"}],"responses":{"200":{"description":"Chaos job 
status","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"type":{"type":"string","enum":["chaos_test"]},"status":{"type":"string","enum":["pending","running","completed","failed"]},"simulationId":{"type":"string","format":"uuid"},"scenarioId":{"type":"string"},"createdAt":{"type":"string","format":"date-time"},"completedAt":{"type":"string","format":"date-time","nullable":true},"error":{"type":"string","nullable":true},"resilienceScore":{"nullable":true,"description":"Resilience score for the job (null while still running)","$ref":"#/components/schemas/ResilienceScore"},"vulnerabilities":{"type":"array","nullable":true,"description":"Vulnerabilities detected so far (may be partial while running)","items":{"$ref":"#/components/schemas/Vulnerability"}},"recommendations":{"type":"array","nullable":true,"description":"Remediation recommendations (null while still running)","items":{"type":"string"}}}},"examples":{"digitalocean_zone_failure_running":{"summary":"Running zone_failure job targeting a DigitalOcean NYC3 datacenter","value":{"id":"a1b2c3d4-e5f6-7890-abcd-ef1234567890","type":"chaos_test","status":"running","simulationId":"c3d4e5f6-a7b8-9012-cdef-123456789012","scenarioId":"zone_failure_do_nyc3","createdAt":"2024-06-12T14:00:00Z","completedAt":null,"error":null,"resilienceScore":null,"vulnerabilities":null,"recommendations":null}},"digitalocean_database_crash_completed":{"summary":"Completed database_crash job on a DigitalOcean Managed Database (SFO3)","value":{"id":"b2c3d4e5-f6a7-8901-bcde-f12345678901","type":"chaos_test","status":"completed","simulationId":"d4e5f6a7-b8c9-0123-defa-234567890123","scenarioId":"database_crash_do_sfo3","createdAt":"2024-06-12T13:45:00Z","completedAt":"2024-06-12T13:47:32Z","error":null,"resilienceScore":null,"vulnerabilities":null,"recommendations":null}},"digitalOceanInProgress":{"summary":"DigitalOcean — Droplet crash job running against droplet-web-1 in 
nyc3","value":{"id":"c3d4e5f6-a7b8-9012-cdef-234567890123","type":"chaos_test","status":"running","simulationId":"e5f6a7b8-c9d0-1234-efab-345678901234","scenarioId":"droplet_crash_do_nyc3","createdAt":"2025-11-23T11:00:00Z","completedAt":null,"error":null,"resilienceScore":null,"vulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency (nyc3)","description":"droplet-web-1 has no replica in a second datacenter. An nyc3 outage would take down the service entirely."}],"recommendations":null}},"digitalOcean":{"summary":"DigitalOcean — completed zone failure job for droplet-web-1 across nyc3 and sfo3","value":{"id":"d4e5f6a7-b8c9-0123-defa-345678901234","type":"chaos_test","status":"completed","simulationId":"f6a7b8c9-d0e1-2345-fabc-456789012345","scenarioId":"zone_failure_do_nyc3_sfo3","createdAt":"2025-11-23T11:05:00Z","completedAt":"2025-11-23T11:08:47Z","error":null,"resilienceScore":{"overall":58.4,"grade":"F","metrics":{"recoveryTimeSeconds":142,"availabilityPercent":81.2,"meanTimeToDetect":6.1,"meanTimeToRecover":142,"errorRateDuringFailure":61.5}},"vulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency (nyc3)","description":"droplet-web-1 and droplet-api-1 are both in nyc3. A datacenter outage causes complete service unavailability."},{"id":"no_managed_db_standby","severity":"medium","title":"Managed Database Has No Standby Node","description":"The Managed PostgreSQL cluster in nyc3 has no standby replica. 
Failover requires manual intervention and causes extended downtime."},{"id":"no_global_lb","severity":"medium","title":"No Global Load Balancer Configured","description":"Without a Global Load Balancer, traffic cannot automatically re-route from nyc3 to sfo3 during a regional outage."}],"recommendations":["Spread droplet-web-1 and droplet-api-1 across nyc3 and sfo3 and enable a DigitalOcean Global Load Balancer for automatic multi-region failover","Enable a standby node on the Managed PostgreSQL cluster to reduce failover time from minutes to seconds","Attach Reserved IPs to droplet-web-1 and droplet-api-1 so traffic re-routes instantly when a Droplet is replaced","Configure Droplet monitoring alerts and auto-remediation workflows via DigitalOcean Functions"]}},"aws":{"summary":"AWS — completed zone failure job for EC2 instances in us-east-1a","value":{"id":"e5f6a7b8-c9d0-1234-efab-456789012345","type":"chaos_test","status":"completed","simulationId":"a7b8c9d0-e1f2-3456-abcd-567890123456","scenarioId":"zone_failure_aws_us_east_1a","createdAt":"2025-11-23T12:00:00Z","completedAt":"2025-11-23T12:04:18Z","error":null,"resilienceScore":{"overall":71.2,"grade":"C","metrics":{"recoveryTimeSeconds":98,"availabilityPercent":86.5,"meanTimeToDetect":4.2,"meanTimeToRecover":98,"errorRateDuringFailure":48.3}},"vulnerabilities":[{"id":"single_az","severity":"high","title":"Single Availability Zone Dependency (us-east-1a)","description":"All EC2 instances are in us-east-1a. A zone outage takes down all compute with no automatic failover to us-east-1b or us-east-1c."},{"id":"no_multi_az_rds","severity":"medium","title":"RDS Multi-AZ Not Enabled","description":"The RDS db.r5.large instance is not configured with Multi-AZ. 
A zone failure forces a manual failover with extended downtime."}],"recommendations":["Deploy EC2 instances across at least two AZs (us-east-1a and us-east-1b) and use an ALB with cross-zone load balancing","Enable RDS Multi-AZ to allow automatic standby promotion within 60 seconds of a primary AZ failure","Use EC2 Auto Scaling groups with AZ rebalancing so replacement instances are provisioned in healthy zones automatically","Configure Route 53 health checks and DNS failover for the application endpoint"]}},"gcp":{"summary":"GCP — completed zone failure job for GCE instances in us-central1-a","value":{"id":"f6a7b8c9-d0e1-2345-fabc-567890123456","type":"chaos_test","status":"completed","simulationId":"b8c9d0e1-f2a3-4567-bcde-678901234567","scenarioId":"zone_failure_gcp_us_central1_a","createdAt":"2025-11-23T12:10:00Z","completedAt":"2025-11-23T12:14:32Z","error":null,"resilienceScore":{"overall":74.8,"grade":"C","metrics":{"recoveryTimeSeconds":82,"availabilityPercent":88.9,"meanTimeToDetect":3.8,"meanTimeToRecover":82,"errorRateDuringFailure":41.2}},"vulnerabilities":[{"id":"single_zone_gce","severity":"high","title":"GCE Instances Concentrated in us-central1-a","description":"All GCE e2-standard-4 instances are in a single zone. A zone outage causes full compute unavailability with no regional failover."},{"id":"no_cloud_sql_ha","severity":"medium","title":"Cloud SQL High Availability Not Enabled","description":"The Cloud SQL instance has no HA standby replica. 
A zone failure on us-central1-a requires manual intervention and causes extended read/write downtime."}],"recommendations":["Use a regional Managed Instance Group (MIG) spanning us-central1-a, us-central1-b, and us-central1-c so GCE instances survive a single-zone outage","Enable Cloud SQL High Availability to provision an automatic standby in a secondary zone with sub-60 s failover","Configure a global HTTP(S) load balancer so traffic is rerouted to healthy backends across zones automatically","Use Cloud Monitoring uptime checks and alerting policies to detect zone-level failures within seconds"]}},"azure":{"summary":"Azure — running zone failure job for VMs in East US","value":{"id":"a7b8c9d0-e1f2-3456-abcd-678901234567","type":"chaos_test","status":"running","simulationId":"c9d0e1f2-a3b4-5678-cdef-789012345678","scenarioId":"zone_failure_azure_east_us_zone1","createdAt":"2025-11-23T12:20:00Z","completedAt":null,"error":null,"resilienceScore":null,"vulnerabilities":[{"id":"no_availability_zones_vm","severity":"high","title":"VM Scale Set Not Zone-Redundant (East US)","description":"The Standard_D4s_v3 VM Scale Set is deployed to a single availability zone. 
An Azure zone outage takes down all VM instances with no automatic failover."}],"recommendations":null}},"oci":{"summary":"OCI — completed zone failure job for VM.Standard3.Flex in us-ashburn-1","value":{"id":"b8c9d0e1-f2a3-4567-bcde-789012345678","type":"chaos_test","status":"completed","simulationId":"d0e1f2a3-b4c5-6789-defa-890123456789","scenarioId":"zone_failure_oci_us_ashburn_ad1","createdAt":"2025-11-23T12:30:00Z","completedAt":"2025-11-23T12:34:55Z","error":null,"resilienceScore":{"overall":68.5,"grade":"D","metrics":{"recoveryTimeSeconds":118,"availabilityPercent":83.2,"meanTimeToDetect":5.1,"meanTimeToRecover":118,"errorRateDuringFailure":55.8}},"vulnerabilities":[{"id":"single_ad","severity":"high","title":"VM Instances Confined to AD-1 (us-ashburn-1)","description":"All VM.Standard3.Flex instances are in Availability Domain AD-1. An AD-level failure takes the entire compute fleet offline with no cross-AD failover."},{"id":"no_adb_cross_ad_replication","severity":"medium","title":"Autonomous Database Has No Cross-AD Data Guard","description":"The Autonomous Database is configured in a single AD. Enabling Cross-AD Data Guard ensures automatic failover to a standby in AD-2 during an AD-level outage."}],"recommendations":["Distribute VM.Standard3.Flex instances across AD-1, AD-2, and AD-3 using an instance pool with OCI Load Balancer health checks","Enable Autonomous Database Cross-AD Data Guard for automatic failover to a standby in AD-2 within seconds","Use OCI Traffic Management Steering Policies to route traffic away from the affected AD during an outage","Configure OCI Monitoring alarms and Notifications to alert on compute instance health failures within 30 seconds"]}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"summary":"Cancel chaos engineering job","description":"Cancel a running chaos engineering job. 
This endpoint is idempotent - calling it multiple times\non the same job will return success without error.\n\n**Cancellation Rules:**\n- Jobs with status \"pending\" or \"running\" will be cancelled\n- Jobs already \"cancelled\" will return success (idempotent behavior)\n- Jobs with status \"completed\" or \"failed\" cannot be cancelled (returns 409)\n- Cancelled jobs will have status set to \"cancelled\" and a cancelledAt timestamp\n","operationId":"cancelChaosJob","tags":["Chaos Engineering"],"security":[{"BearerAuth":[]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string"},"description":"Job ID"}],"responses":{"200":{"description":"Job cancelled successfully or was already cancelled.\nReturns the same response whether cancelling for the first time or if already cancelled\n(idempotent operation).\n","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"status":{"type":"string","enum":["cancelled"]},"cancelledAt":{"type":"string","format":"date-time"},"message":{"type":"string","description":"Message indicating if job was just cancelled or already cancelled"}}},"examples":{"newlyCancelled":{"value":{"id":"job_abc123","status":"cancelled","cancelledAt":"2024-01-15T10:30:00Z","message":"Job cancelled successfully"}},"alreadyCancelled":{"value":{"id":"job_abc123","status":"cancelled","cancelledAt":"2024-01-15T09:45:00Z","message":"Job already cancelled"}},"digitalOceanNewlyCancelled":{"summary":"DigitalOcean — Droplet-based chaos job cancelled mid-run","value":{"id":"job_do_droplet-web-1","status":"cancelled","cancelledAt":"2025-11-23T11:04:17Z","message":"Job cancelled successfully"}},"digitalOceanAlreadyCancelled":{"summary":"DigitalOcean — Droplet-based chaos job was already cancelled (idempotent)","value":{"id":"job_do_droplet-web-1","status":"cancelled","cancelledAt":"2025-11-23T11:03:55Z","message":"Job already cancelled"}},"awsNewlyCancelled":{"summary":"AWS — EC2 zone-failure chaos job 
cancelled mid-run (us-east-1a)","value":{"id":"job-aws-zone-ec2-001","status":"cancelled","cancelledAt":"2025-11-23T12:03:41Z","message":"Job cancelled successfully"}},"awsAlreadyCancelled":{"summary":"AWS — EC2 chaos job already cancelled (idempotent)","value":{"id":"job-aws-zone-ec2-001","status":"cancelled","cancelledAt":"2025-11-23T12:02:18Z","message":"Job already cancelled"}},"gcpNewlyCancelled":{"summary":"GCP — Cloud SQL crash chaos job cancelled mid-run (us-central1)","value":{"id":"job-gcp-csql-crash-001","status":"cancelled","cancelledAt":"2025-11-23T12:13:05Z","message":"Job cancelled successfully"}},"gcpAlreadyCancelled":{"summary":"GCP — Cloud SQL chaos job already cancelled (idempotent)","value":{"id":"job-gcp-csql-crash-001","status":"cancelled","cancelledAt":"2025-11-23T12:11:50Z","message":"Job already cancelled"}},"azureNewlyCancelled":{"summary":"Azure — VM Scale Set zone-failure chaos job cancelled mid-run (East US)","value":{"id":"job-azure-vmss-zone-001","status":"cancelled","cancelledAt":"2025-11-23T12:23:14Z","message":"Job cancelled successfully"}},"azureAlreadyCancelled":{"summary":"Azure — VM Scale Set chaos job already cancelled (idempotent)","value":{"id":"job-azure-vmss-zone-001","status":"cancelled","cancelledAt":"2025-11-23T12:22:30Z","message":"Job already cancelled"}},"ociNewlyCancelled":{"summary":"OCI — Autonomous Database crash chaos job cancelled mid-run (us-ashburn-1)","value":{"id":"job-oci-adb-crash-001","status":"cancelled","cancelledAt":"2025-11-23T12:33:27Z","message":"Job cancelled successfully"}},"ociAlreadyCancelled":{"summary":"OCI — Autonomous Database chaos job already cancelled (idempotent)","value":{"id":"job-oci-adb-crash-001","status":"cancelled","cancelledAt":"2025-11-23T12:32:09Z","message":"Job already cancelled"}}}}}},"404":{"description":"Job not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"409":{"description":"Cannot cancel job that is already completed or 
failed","content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string"},"status":{"type":"string"}}},"example":{"error":"Cannot cancel job that is already completed or failed","status":"completed"}}}},"500":{"description":"Failed to cancel job","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/chaos/jobs/{jobId}/results":{"x-stability":"stable","get":{"tags":["Chaos Engineering"],"summary":"Get chaos test results","description":"Retrieve full results from a completed chaos test including resilience score,\nvulnerabilities detected, and recommendations for improvement.\n\nResults are provider-agnostic and work identically for simulations running\non AWS, GCP, Azure, OCI, and **DigitalOcean**. For DigitalOcean simulations the\n`timeline` events will reference DO resource types (e.g. Droplets, Managed\nDatabases, Spaces), and recommendations will be framed around DO-specific\nmitigations such as enabling Managed Database standby nodes or distributing\nDroplets across multiple datacenters.\n\n**Client Code Sample (JavaScript / Node.js):**\n\n```javascript\nasync function getChaosResults(jobId, apiToken) {\n  const headers = {\n    Authorization: `Bearer ${apiToken}`,\n    Accept: 'application/json',\n  };\n\n  // The results endpoint returns 400 while the job is still pending/running,\n  // so poll the job status endpoint until it reaches a terminal state first.\n  let status;\n  do {\n    const statusResp = await fetch(\n      `https://your-host/api/chaos/jobs/${jobId}`,\n      { headers }\n    );\n    if (!statusResp.ok) {\n      throw new Error(`HTTP ${statusResp.status}: ${await statusResp.text()}`);\n    }\n    ({ status } = await statusResp.json());\n    console.log('Job status:', status);\n    if (['completed', 'failed', 'cancelled'].includes(status)) break;\n    await new Promise((r) => setTimeout(r, 5000));\n  } while (true);\n\n  // Now the job is done — fetch the detailed results.\n  const 
response = await fetch(\n    `https://your-host/api/chaos/jobs/${jobId}/results`,\n    { headers }\n  );\n\n  if (!response.ok) {\n    throw new Error(`HTTP ${response.status}: ${await response.text()}`);\n  }\n\n  const data = await response.json();\n\n  console.log('Resilience score:', data.resilienceScore.overall);\n  console.log('Grade:', data.resilienceScore.grade);\n\n  if (data.vulnerabilities.length > 0) {\n    console.log('Vulnerabilities found:');\n    for (const vuln of data.vulnerabilities) {\n      console.log(`  [${vuln.severity.toUpperCase()}] ${vuln.title}: ${vuln.description}`);\n    }\n  }\n\n  if (data.recommendations.length > 0) {\n    console.log('Recommendations:');\n    data.recommendations.forEach((rec, i) => console.log(`  ${i + 1}. ${rec}`));\n  }\n\n  return data;\n}\n```\n\n**Client Code Sample (Python / httpx):**\n\n```python\nimport time\nimport httpx\n\ndef get_chaos_results(job_id: str, api_token: str) -> dict:\n    headers = {\n        \"Authorization\": f\"Bearer {api_token}\",\n        \"Accept\": \"application/json\",\n    }\n\n    with httpx.Client() as client:\n        while True:\n            status_resp = client.get(\n                f\"https://your-host/api/chaos/jobs/{job_id}\",\n                headers=headers,\n            )\n            status_resp.raise_for_status()\n            status = status_resp.json()[\"status\"]\n            print(f\"Job status: {status}\")\n            if status in (\"completed\", \"failed\", \"cancelled\"):\n                break\n            time.sleep(5)\n\n        response = client.get(\n            f\"https://your-host/api/chaos/jobs/{job_id}/results\",\n            headers=headers,\n        )\n        response.raise_for_status()\n\n    data = response.json()\n\n    resilience = data[\"resilienceScore\"]\n    print(f\"Resilience score: {resilience['overall']} (Grade: {resilience['grade']})\")\n\n    vulnerabilities = data.get(\"vulnerabilities\", [])\n    if vulnerabilities:\n        
print(\"Vulnerabilities found:\")\n        for vuln in vulnerabilities:\n            print(f\"  [{vuln['severity'].upper()}] {vuln['title']}: {vuln['description']}\")\n\n    recommendations = data.get(\"recommendations\", [])\n    if recommendations:\n        print(\"Recommendations:\")\n        for i, rec in enumerate(recommendations, start=1):\n            print(f\"  {i}. {rec}\")\n\n    return data\n```\n","operationId":"getChaosResults","x-codeSamples":[{"lang":"curl","label":"curl","source":"while true; do\n  STATUS=$(curl -s https://your-production-domain.com/api/chaos/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890 \\\n    -H \"Authorization: Bearer $API_KEY\" | jq -r '.status')\n  echo \"job status: $STATUS\"\n  case \"$STATUS\" in\n    completed|failed|cancelled) break ;;\n  esac\n  sleep 5\ndone\n\ncurl https://your-production-domain.com/api/chaos/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890/results \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import time\nimport requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\"\nHEADERS = {\"Authorization\": f\"Bearer {API_KEY}\"}\n\nwhile True:\n    status_resp = requests.get(f\"{BASE_URL}/chaos/jobs/{JOB_ID}\", headers=HEADERS)\n    status_resp.raise_for_status()\n    status = status_resp.json()[\"status\"]\n    print(f\"job status: {status}\")\n    if status in (\"completed\", \"failed\", \"cancelled\"):\n        break\n    time.sleep(5)\n\nresp = requests.get(\n    f\"{BASE_URL}/chaos/jobs/{JOB_ID}/results\",\n    headers=HEADERS,\n)\nresp.raise_for_status()\ndata = resp.json()\nscore = data[\"resilienceScore\"]\nprint(f\"Resilience score: {score['overall']} (Grade: {score['grade']})\")\nfor vuln in data.get(\"vulnerabilities\", []):\n    print(f\"  [{vuln['severity'].upper()}] {vuln['title']}\")\nfor i, rec in enumerate(data.get(\"recommendations\", []), 1):\n    print(f\"  {i}. 
{rec}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\";\nconst headers = { \"Authorization\": `Bearer ${API_KEY}` };\n\n// The results endpoint returns 400 while the job is still pending/running,\n// so poll the job status endpoint until it reaches a terminal state first.\nlet status;\ndo {\n  const statusResp = await fetch(`${BASE_URL}/chaos/jobs/${JOB_ID}`, { headers });\n  if (!statusResp.ok) throw new Error(`HTTP ${statusResp.status}: ${await statusResp.text()}`);\n  ({ status } = await statusResp.json());\n  console.log(`job status: ${status}`);\n  if ([\"completed\", \"failed\", \"cancelled\"].includes(status)) break;\n  await new Promise((r) => setTimeout(r, 5000));\n} while (true);\n\nconst resp = await fetch(`${BASE_URL}/chaos/jobs/${JOB_ID}/results`, { headers });\nif (!resp.ok) throw new Error(`HTTP ${resp.status}: ${await resp.text()}`);\nconst data = await resp.json();\nconsole.log(`Resilience score: ${data.resilienceScore.overall} (Grade: ${data.resilienceScore.grade})`);\nfor (const vuln of data.vulnerabilities ?? [])\n  console.log(`  [${vuln.severity.toUpperCase()}] ${vuln.title}`);\n(data.recommendations ?? []).forEach((rec, i) => console.log(`  ${i + 1}. 
${rec}`));\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the chaos job"}],"responses":{"200":{"description":"Chaos test results","content":{"application/json":{"schema":{"type":"object","properties":{"resilienceScore":{"$ref":"#/components/schemas/ResilienceScore"},"vulnerabilities":{"type":"array","items":{"$ref":"#/components/schemas/Vulnerability"}},"timeline":{"type":"array","description":"Chronological events during the test","items":{"type":"object","properties":{"time":{"type":"integer","description":"Simulation step when event occurred"},"event":{"type":"string","description":"Event description"},"severity":{"type":"string","enum":["info","warning","critical"]}}}},"recommendations":{"type":"array","description":"Recommendations for improving resilience","items":{"type":"string"}},"scenario":{"$ref":"#/components/schemas/ChaosScenario"},"injections":{"type":"array","description":"Failures that were injected","items":{"$ref":"#/components/schemas/ChaosInjectionConfig"}},"timeToRecover":{"type":"number","description":"Estimated number of simulation steps for the system to fully recover after all injected failures end. For `database_crash` failures targeting an `aurora-serverless` resource, this value **includes the 2–4 step ACU warm-up window** after the crash ends — during those steps the resource operates at reduced capacity with elevated latency even though the failure injection has technically completed. Treat recovery as incomplete until this many steps have elapsed after the last active failure event.","example":6},"affectedServices":{"type":"array","description":"IDs of services that remain partially degraded after the primary failure ends. For `database_crash` on an `aurora-serverless` resource, this list includes upstream compute resources (e.g. 
EC2 instances, Lambda functions) that continue to experience elevated error rates or latency until the ACU scaling warm-up completes — typically **2–4 steps** after crash recovery. Use this field to identify which application tiers need circuit-breaker or retry budget adjustments to tolerate the warm-up window.","items":{"type":"string"},"example":["web-1","api-service"]}}},"examples":{"genericResult":{"summary":"Typical completed chaos result","value":{"resilienceScore":{"overall":72,"grade":"C","breakdown":{"availability":68,"recoverability":75,"faultTolerance":73}},"vulnerabilities":[{"id":"single_zone","severity":"high","title":"Single Zone Dependency","description":"All instances are in one datacenter"}],"timeline":[{"time":0,"event":"Chaos test started","severity":"info"},{"time":30,"event":"Failure injection triggered","severity":"warning"},{"time":60,"event":"Service degraded — error rate 42%","severity":"critical"},{"time":180,"event":"System recovered","severity":"info"}],"recommendations":["Add a secondary replica in a different zone","Implement automatic failover"],"injections":[{"type":"kill_instance","targetId":"web-1","injectionTime":30,"duration":90}]}},"digitalOceanDropletCrash":{"summary":"DigitalOcean — Droplet crash result","value":{"resilienceScore":{"overall":65,"grade":"D","breakdown":{"availability":60,"recoverability":70,"faultTolerance":65}},"vulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency","description":"All Droplets are in the NYC3 datacenter. 
A regional outage would take down the entire service."},{"id":"no_managed_db_standby","severity":"medium","title":"Managed Database Has No Standby Node","description":"The DigitalOcean Managed Database cluster db-primary has no standby node enabled, increasing recovery time after a failure."}],"timeline":[{"time":0,"event":"Chaos test started against DigitalOcean simulation","severity":"info"},{"time":30,"event":"Droplet 'droplet-web-1' (NYC3) crashed — instance killed","severity":"critical"},{"time":35,"event":"Load balancer detected unhealthy Droplet; traffic redistributed to remaining Droplets","severity":"warning"},{"time":40,"event":"Error rate spiked to 38% — remaining Droplets over capacity","severity":"critical"},{"time":90,"event":"Droplet 'droplet-web-1' restored; recovery time 60 steps","severity":"info"},{"time":95,"event":"Error rate returned to baseline","severity":"info"}],"recommendations":["Distribute Droplets across multiple DigitalOcean datacenters (e.g. NYC3 + SFO3) and use a Global Load Balancer","Enable standby node on the Managed Database cluster to reduce failover time","Configure Droplet monitoring alerts to trigger auto-remediation workflows","Use Reserved IPs to ensure fast traffic re-routing when a Droplet is replaced"],"injections":[{"type":"kill_instance","targetId":"droplet-web-1","injectionTime":30,"duration":60}]}},"digitalOceanZoneFailure":{"summary":"DigitalOcean — zone failure result","value":{"resilienceScore":{"overall":54,"grade":"F","breakdown":{"availability":48,"recoverability":58,"faultTolerance":56}},"vulnerabilities":[{"id":"single_datacenter","severity":"critical","title":"All Droplets Concentrated in One Datacenter","description":"All web and API Droplets are deployed exclusively in NYC3. 
A datacenter-level outage takes the entire fleet offline with no automatic failover."},{"id":"no_cross_region_lb","severity":"high","title":"No Cross-Region Load Balancing","description":"The DigitalOcean Load Balancer is scoped to a single datacenter. Traffic cannot be rerouted to an unaffected region during a zone outage."},{"id":"spaces_region_locked","severity":"medium","title":"Object Storage Bucket Region-Locked","description":"The Spaces bucket is located in NYC3. Assets stored there are inaccessible while the datacenter is offline."}],"timeline":[{"time":0,"event":"Chaos test started against DigitalOcean simulation","severity":"info"},{"time":15,"event":"Zone failure injected — NYC3 datacenter marked unavailable","severity":"critical"},{"time":16,"event":"All Droplets in NYC3 (droplet-web-1, droplet-web-2, droplet-api-1) became unreachable","severity":"critical"},{"time":17,"event":"Load balancer health checks failing — no healthy backend targets","severity":"critical"},{"time":20,"event":"Error rate reached 100% — service fully unavailable","severity":"critical"},{"time":90,"event":"Managed Database standby node promoted in NYC3 (partially restored)","severity":"warning"},{"time":135,"event":"NYC3 datacenter restored; Droplets restarted","severity":"info"},{"time":140,"event":"Error rate returned to baseline; total outage duration 125 steps","severity":"info"}],"recommendations":["Deploy Droplets across at least two DigitalOcean datacenters (e.g. NYC3 + SFO3) and use a Global Load Balancer to route traffic automatically","Enable Managed Database standby nodes in a secondary datacenter so failover is automatic during a regional outage","Replicate the Spaces bucket to a second region or use a CDN (e.g. 
DigitalOcean CDN) to serve cached assets during an origin outage","Use Reserved IPs with datacenter-level failover scripting so the public endpoint can be re-pointed without a DNS TTL delay"],"injections":[{"type":"kill_zone","targetId":"nyc3","injectionTime":15,"duration":120}]}},"digitalOceanDatabaseCrash":{"summary":"DigitalOcean — Managed Database crash result","value":{"resilienceScore":{"overall":61,"grade":"D","breakdown":{"availability":55,"recoverability":63,"faultTolerance":65}},"vulnerabilities":[{"id":"no_managed_db_standby","severity":"critical","title":"Managed Database Has No Standby Node","description":"The DigitalOcean Managed PostgreSQL cluster db-primary has no standby node configured. A primary crash requires manual failover, causing extended downtime."},{"id":"no_connection_pooling","severity":"high","title":"Connection Pooling Not Enabled","description":"PgBouncer connection pooling is not enabled on the Managed Database cluster. Under high reconnect load after recovery, the database exhausts its connection limit quickly."},{"id":"no_read_replica","severity":"medium","title":"No Read Replica Configured","description":"All queries route to the primary node. 
A read replica would allow read traffic to continue serving during a primary failure."}],"timeline":[{"time":0,"event":"Chaos test started against DigitalOcean simulation","severity":"info"},{"time":20,"event":"Managed Database crash injected — db-primary marked unavailable","severity":"critical"},{"time":21,"event":"Application Droplets began reporting database connection errors","severity":"critical"},{"time":25,"event":"Error rate spiked to 87% — write and read paths both failed","severity":"critical"},{"time":30,"event":"Connection pool on application Droplets exhausted — cascading timeouts","severity":"critical"},{"time":75,"event":"DigitalOcean automated recovery detected; new primary provisioning started","severity":"warning"},{"time":110,"event":"Managed Database primary restored; applications reconnecting","severity":"info"},{"time":120,"event":"Error rate returned to baseline; recovery time 100 steps","severity":"info"}],"recommendations":["Enable a standby node on the Managed Database cluster so DigitalOcean can perform automatic failover in under 60 seconds","Enable PgBouncer connection pooling on the Managed Database cluster to absorb reconnection bursts after recovery","Add a read replica and route SELECT queries there so read traffic continues serving during a primary-only failure","Implement application-level retry logic with exponential back-off to handle transient database unavailability gracefully"],"injections":[{"type":"database_crash","targetId":"db-primary","injectionTime":20,"duration":90}]}},"digitalOceanNetworkPartition":{"summary":"DigitalOcean — network partition result","value":{"resilienceScore":{"overall":58,"grade":"F","breakdown":{"availability":52,"recoverability":61,"faultTolerance":60}},"vulnerabilities":[{"id":"no_inter_datacenter_traffic_shaping","severity":"critical","title":"No Inter-Datacenter Traffic Shaping","description":"There is no traffic shaping or circuit-breaking configured between the NYC3 and SFO3 datacenters. 
When the inter-datacenter link is partitioned, services that depend on cross-region calls hang until TCP timeout, causing cascading latency across the entire request path."},{"id":"no_vpc_peering","severity":"high","title":"VPC Peering Not Configured","description":"Droplets in NYC3 and SFO3 communicate over public IP addresses rather than a DigitalOcean VPC peering connection. During a network partition, traffic traverses the public internet with no guaranteed routing, increasing packet loss and preventing private-network fallback."},{"id":"no_private_networking_for_db","severity":"high","title":"Managed Database Accessible Only via Public Endpoint","description":"The Managed PostgreSQL cluster db-primary is accessed through its public connection string. Application Droplets have no private-network route to the database, so a partition that affects public routing also severs all database connectivity."},{"id":"no_circuit_breaker","severity":"medium","title":"No Circuit Breaker on Cross-Service Calls","description":"API Droplets retry failed calls to backend services without a circuit breaker. 
During the partition, retries amplify load on already-degraded services and exhaust connection pools faster."}],"timeline":[{"time":0,"event":"Chaos test started against DigitalOcean simulation","severity":"info"},{"time":25,"event":"Network partition injected — inter-datacenter link between NYC3 and SFO3 severed","severity":"critical"},{"time":26,"event":"Cross-datacenter calls from droplet-api-1 (NYC3) to droplet-worker-1 (SFO3) began timing out","severity":"critical"},{"time":30,"event":"Error rate climbed to 54% — requests depending on SFO3 workers failed","severity":"critical"},{"time":35,"event":"Connection pool on droplet-api-1 exhausted due to hanging cross-datacenter retries","severity":"critical"},{"time":40,"event":"Managed Database public endpoint intermittently unreachable from SFO3 Droplets","severity":"critical"},{"time":45,"event":"Error rate peaked at 71% — NYC3 services partially degraded; SFO3 services fully unavailable","severity":"critical"},{"time":100,"event":"Network partition lifted — inter-datacenter link restored","severity":"info"},{"time":105,"event":"Cross-datacenter calls resumed; connection pools draining backlog","severity":"warning"},{"time":120,"event":"Error rate returned to baseline; total degradation window 95 steps","severity":"info"}],"recommendations":["Configure VPC peering between NYC3 and SFO3 so Droplets communicate over DigitalOcean's private network, bypassing public-internet routing failures","Enable private networking on all Droplets and update the Managed Database connection string to use the private-network endpoint, ensuring database connectivity survives public-routing partitions","Implement a circuit breaker (e.g. 
using a service mesh or application-level library) on all cross-datacenter calls to fast-fail instead of hanging until TCP timeout","Add traffic shaping and timeout budgets on inter-datacenter links so a partition causes immediate, bounded failures rather than cascading latency across the entire request path","Use DigitalOcean's Managed Database connection pools (PgBouncer) to limit the blast radius of reconnection storms after partition recovery"],"injections":[{"type":"network_partition","targetId":"nyc3-sfo3-link","injectionTime":25,"duration":75}]}},"awsZoneFailure":{"summary":"AWS — zone failure result (EC2 in us-east-1a)","value":{"resilienceScore":{"overall":71,"grade":"C","breakdown":{"availability":66,"recoverability":74,"faultTolerance":73}},"vulnerabilities":[{"id":"single_az","severity":"high","title":"Single Availability Zone Dependency (us-east-1a)","description":"All EC2 m5.large instances are deployed exclusively in us-east-1a. A zone outage takes the entire fleet offline with no automatic cross-AZ failover."},{"id":"no_multi_az_rds","severity":"medium","title":"RDS Multi-AZ Not Enabled","description":"The RDS db.r5.large instance is in us-east-1a without a Multi-AZ standby. 
Zone failure requires manual promotion to the backup, extending downtime."}],"timeline":[{"time":0,"event":"Chaos test started against AWS simulation","severity":"info"},{"time":12,"event":"Zone failure injected — us-east-1a marked unavailable","severity":"critical"},{"time":13,"event":"All EC2 instances in us-east-1a (web-1, web-2, api-1) became unreachable","severity":"critical"},{"time":14,"event":"ALB health checks failing — no healthy backend targets","severity":"critical"},{"time":18,"event":"Error rate reached 100% — service fully unavailable","severity":"critical"},{"time":85,"event":"us-east-1a restored; EC2 instances restarting","severity":"warning"},{"time":92,"event":"ALB health checks passing; traffic resumed","severity":"info"},{"time":98,"event":"Error rate returned to baseline; total outage 86 steps","severity":"info"}],"recommendations":["Deploy EC2 Auto Scaling groups across us-east-1a, us-east-1b, and us-east-1c with the ALB configured for cross-zone load balancing","Enable RDS Multi-AZ to allow automatic standby promotion within 60 seconds of a primary AZ failure","Use Route 53 health checks and DNS failover to re-route traffic to a secondary region if the primary region becomes unavailable","Configure EC2 Auto Scaling AZ rebalancing so replacement instances are provisioned in healthy zones automatically"],"injections":[{"type":"kill_zone","targetId":"us-east-1a","injectionTime":12,"duration":73}]}},"awsDatabaseCrashAuroraServerless":{"summary":"AWS — Aurora Serverless v2 database_crash result with ACU warm-up (us-east-1)","value":{"resilienceScore":{"overall":67,"grade":"D","breakdown":{"availability":61,"recoverability":70,"faultTolerance":70}},"vulnerabilities":[{"id":"no_aurora_serverless_multi_az_reader","severity":"high","title":"Aurora Serverless v2 Has No Multi-AZ Reader","description":"The aurora-serverless cluster has no read replica in a secondary AZ. 
A primary crash requires ACU scaling to ramp up from minimum capacity, extending the degradation window beyond the raw crash duration by 2–4 steps."},{"id":"no_circuit_breaker_on_acu_warmup","severity":"medium","title":"No Circuit Breaker Configured for ACU Warm-Up Window","description":"EC2 instances and Lambda functions continue sending requests during ACU scale-up after crash recovery. Without a circuit breaker, retries amplify latency and sustain elevated error rates throughout the warm-up window."}],"timeline":[{"time":0,"event":"Chaos test started against AWS simulation","severity":"info"},{"time":15,"event":"Aurora Serverless v2 crash injected — aurora-serverless cluster marked unavailable","severity":"critical"},{"time":16,"event":"EC2 instances web-1 and api-1 and Lambda function lambda-processor began reporting database connection errors","severity":"critical"},{"time":20,"event":"Error rate spiked to 89% — all Aurora-dependent paths failing","severity":"critical"},{"time":75,"event":"Aurora Serverless v2 crash ended — cluster restored, ACU ramp-up beginning from minimum capacity (0.5 ACU)","severity":"warning"},{"time":76,"event":"ACU scaling in progress — capacity at 1 ACU; elevated write-path latency (p99 ~850 ms); error rate 31%","severity":"warning"},{"time":78,"event":"ACU scaling in progress — capacity at 4 ACU; error rate declining but still above baseline (12%)","severity":"warning"},{"time":81,"event":"ACU warm-up complete — cluster at full capacity; error rate returned to baseline","severity":"info"}],"recommendations":["Add an Aurora Serverless v2 reader instance in a secondary AZ to absorb read traffic during primary recovery and reduce the ACU warm-up blast radius","Set the Aurora Serverless v2 minimum ACU to match expected steady-state load so ramp-up after a crash starts from a higher baseline capacity","Implement a circuit breaker on EC2 and Lambda database clients to shed load during the ACU warm-up window instead of amplifying latency 
with retries","Use RDS Proxy in front of Aurora Serverless to pool and hold connections during warm-up, preventing reconnection storms that exhaust ACU budget"],"injections":[{"type":"database_crash","targetId":"aurora-serverless","injectionTime":15,"duration":60}],"timeToRecover":6,"affectedServices":["web-1","api-1","lambda-processor"]}},"gcpDatabaseCrash":{"summary":"GCP — Cloud SQL crash result (us-central1)","value":{"resilienceScore":{"overall":69,"grade":"D","breakdown":{"availability":63,"recoverability":72,"faultTolerance":72}},"vulnerabilities":[{"id":"no_cloud_sql_ha","severity":"critical","title":"Cloud SQL High Availability Not Enabled","description":"The Cloud SQL db-standard-4 instance has no HA standby replica. A zone failure on us-central1-a requires manual intervention with extended downtime."},{"id":"no_cloud_sql_read_replica","severity":"medium","title":"No Cloud SQL Read Replica","description":"All queries route to the primary Cloud SQL instance. A read replica would allow read traffic to continue during a primary failure."}],"timeline":[{"time":0,"event":"Chaos test started against GCP simulation","severity":"info"},{"time":18,"event":"Cloud SQL instance crash injected — db-primary marked unavailable","severity":"critical"},{"time":20,"event":"GCE instances began reporting Cloud SQL connection errors","severity":"critical"},{"time":24,"event":"Error rate spiked to 91% — all DB-dependent paths failing","severity":"critical"},{"time":28,"event":"Connection pool exhausted on gce-web-2 and gce-api-1","severity":"critical"},{"time":80,"event":"Cloud SQL automated recovery initiated — new primary provisioning","severity":"warning"},{"time":105,"event":"Cloud SQL primary restored; GCE instances reconnecting","severity":"info"},{"time":115,"event":"Error rate returned to baseline; recovery time 97 steps","severity":"info"}],"recommendations":["Enable Cloud SQL High Availability to provision a standby replica in a secondary zone with automatic failover 
in under 60 seconds","Add a Cloud SQL read replica to allow read traffic to continue serving during a primary-only failure","Pair the Cloud SQL Auth Proxy with a client-side connection pool to absorb reconnection bursts after recovery (the Auth Proxy itself does not pool connections)","Configure Cloud Monitoring alerts on Cloud SQL connection count and error rate to detect failures within seconds"],"injections":[{"type":"database_crash","targetId":"db-primary","injectionTime":18,"duration":87}]}},"azureZoneFailure":{"summary":"Azure — VM Scale Set zone failure result (East US)","value":{"resilienceScore":{"overall":66,"grade":"D","breakdown":{"availability":60,"recoverability":70,"faultTolerance":68}},"vulnerabilities":[{"id":"no_zone_redundant_vmss","severity":"high","title":"VM Scale Set Not Zone-Redundant (East US)","description":"The Standard_D4s_v3 VM Scale Set is pinned to a single availability zone. An Azure infrastructure event in that zone takes down all application VMs."},{"id":"no_zone_redundant_sql","severity":"high","title":"Azure SQL Not Zone-Redundant","description":"Azure SQL Database General Purpose tier is not configured with zone-redundant backup. 
A zone failure may require a point-in-time restore rather than automatic failover."},{"id":"no_traffic_manager","severity":"medium","title":"No Azure Traffic Manager Configured","description":"Without Traffic Manager, there is no DNS-level failover to a secondary region if the East US deployment becomes unavailable."}],"timeline":[{"time":0,"event":"Chaos test started against Azure simulation","severity":"info"},{"time":14,"event":"Zone failure injected — East US availability zone 1 marked unavailable","severity":"critical"},{"time":15,"event":"All Standard_D4s_v3 VM instances in zone 1 became unreachable","severity":"critical"},{"time":17,"event":"Azure Load Balancer health probes failing — no healthy backend VMs","severity":"critical"},{"time":22,"event":"Error rate reached 100% — service fully unavailable","severity":"critical"},{"time":95,"event":"East US zone 1 restored; VM Scale Set instances returning","severity":"warning"},{"time":108,"event":"Load balancer health probes passing; traffic resumed","severity":"info"},{"time":118,"event":"Error rate returned to baseline; total outage 104 steps","severity":"info"}],"recommendations":["Configure the VM Scale Set to use zone redundancy across all three East US availability zones for automatic rebalancing after a zone failure","Enable zone-redundant backup for Azure SQL Database to avoid point-in-time restore delays during zone failures","Add an Azure Traffic Manager profile with priority routing to a West US 2 standby deployment for regional failover","Use Azure Monitor alerts and Action Groups to notify on-call staff within 60 seconds of a zone-level health event"],"injections":[{"type":"kill_zone","targetId":"east-us-zone-1","injectionTime":14,"duration":81}]}},"ociDatabaseCrash":{"summary":"OCI — Autonomous Database crash result 
(us-ashburn-1)","value":{"resilienceScore":{"overall":73,"grade":"C","breakdown":{"availability":68,"recoverability":76,"faultTolerance":75}},"vulnerabilities":[{"id":"no_adb_cross_ad_data_guard","severity":"high","title":"Autonomous Database Has No Cross-AD Data Guard","description":"The Autonomous Database is deployed in AD-1 only. Enabling Cross-AD Data Guard provides automatic failover to a standby in AD-2 with a recovery time under 30 seconds."},{"id":"no_connection_pooling_oci","severity":"medium","title":"No Connection Pool Configured (JDBC/UCP)","description":"VM instances connect directly to Autonomous Database without a connection pool. After a failover, reconnection storms can temporarily exhaust available OCPU capacity."}],"timeline":[{"time":0,"event":"Chaos test started against OCI simulation","severity":"info"},{"time":16,"event":"Autonomous Database crash injected — primary in AD-1 marked unavailable","severity":"critical"},{"time":18,"event":"VM.Standard3.Flex instances began reporting database connection errors","severity":"critical"},{"time":22,"event":"Error rate spiked to 84% — all database-dependent endpoints failing","severity":"critical"},{"time":65,"event":"OCI automated recovery detected; Autonomous Database provisioning in AD-1","severity":"warning"},{"time":90,"event":"Autonomous Database primary restored; VM instances reconnecting","severity":"info"},{"time":98,"event":"Error rate returned to baseline; recovery time 82 steps","severity":"info"}],"recommendations":["Enable Autonomous Database Cross-AD Data Guard for automatic failover to a standby in AD-2 within 30 seconds of a primary failure","Use Universal Connection Pool (UCP) on VM instances to absorb reconnection bursts and reduce OCPU pressure after a database failover","Configure OCI Monitoring alarms on Autonomous Database CPU utilization and connection count to detect anomalies within 30 seconds","Enable Data Safe activity auditing to track connection-level events and assist 
with post-incident analysis"],"injections":[{"type":"database_crash","targetId":"autonomous-db-primary","injectionTime":16,"duration":74}]}}}}}},"400":{"description":"Job not completed yet","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/chaos/jobs/{jobId}/stream":{"x-stability":"stable","get":{"summary":"Stream chaos test job progress in real-time","description":"Stream chaos engineering job progress using Server-Sent Events (SSE).\nThis endpoint provides real-time updates as failure injections are applied\nand resilience analysis progresses.\n\n**SSE Event Types:**\n- `init`: Initial job state when connection is established\n- `progressUpdate`: Progress update sent periodically as the job runs\n- `completed`: Job has finished successfully (final event before connection closes)\n- `failed`: Job encountered an unrecoverable error (final event before connection closes)\n- `cancelled`: Job was cancelled via `DELETE /api/chaos/jobs/{jobId}` (final event before connection closes)\n\n**Connection Behavior:**\n- Connection remains open until the job completes, fails, or is cancelled\n- Connection automatically closes when the job reaches a terminal state\n\n**Agent Reconnection and Recovery Guide:**\n\nAfter receiving a terminal SSE event (`completed`, `failed`, or `cancelled`) the server\ncloses the connection. Each event requires a different agent response:\n\n- **`completed`**: The job finished successfully. No reconnection is needed. Fetch the\n  full results from `GET /api/chaos/jobs/{jobId}/results` to retrieve the\n  `resilienceScore`, `vulnerabilities`, `recommendations`, and `timeline`.\n\n- **`failed`**: The job encountered an unrecoverable error. Inspect the `error` field in\n  the event payload for the root cause. Transient errors (e.g. 
a simulation state\n  read timeout) are safe to retry — submit a new job via `POST /api/chaos/run`.\n  Permanent errors (e.g. an invalid simulation ID or an unresolvable scenario) should\n  not be retried without fixing the underlying input first. Do not attempt to reconnect\n  to the same `jobId`; it will not recover.\n\n- **`cancelled`**: Cancellation is terminal and intentional. No retry is needed or\n  recommended. If the cancellation was unintended, submit a new job.\n\n**Handling unexpected connection drops (no terminal event received):**\n\nIf the SSE connection closes without a `completed`, `failed`, or `cancelled` event —\nfor example due to a network interruption, proxy timeout, or server restart — the job\nmay still be running. Use the following fallback strategy:\n\n1. Poll `GET /api/chaos/jobs/{jobId}` to check the current `status` field.\n2. If `status` is `running` or `pending`, reconnect to this stream endpoint.\n3. If `status` is `completed`, `failed`, or `cancelled`, treat it the same as if you\n   had received the corresponding terminal SSE event (see above).\n\nAgents should implement an exponential back-off (e.g. 
1 s, 2 s, 4 s, cap at 30 s)\nbefore each reconnection attempt to avoid hammering the server during an outage.\n","operationId":"streamChaosJob","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -N https://your-production-domain.com/api/chaos/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890/stream \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Accept: text/event-stream\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\"\n\nwith requests.get(\n    f\"{BASE_URL}/chaos/jobs/{JOB_ID}/stream\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\", \"Accept\": \"text/event-stream\"},\n    stream=True,\n) as resp:\n    resp.raise_for_status()\n    for line in resp.iter_lines():\n        if line:\n            print(line.decode())\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"a1b2c3d4-e5f6-7890-abcd-ef1234567890\";\n\nconst resp = await fetch(`${BASE_URL}/chaos/jobs/${JOB_ID}/stream`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}`, \"Accept\": \"text/event-stream\" },\n});\nconst reader = resp.body.getReader();\nconst decoder = new TextDecoder();\nwhile (true) {\n  const { done, value } = await reader.read();\n  if (done) break;\n  process.stdout.write(decoder.decode(value));\n}\n"}],"tags":["Chaos Engineering"],"security":[{"BearerAuth":[]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"Chaos job ID"}],"responses":{"200":{"description":"SSE stream of chaos job progress","content":{"text/event-stream":{"schema":{"type":"string","description":"Server-Sent Events stream"},"examples":{"chaosStream":{"summary":"Zone failure test — progress then completion","value":"event: init\ndata: 
{\"jobId\":\"chaos_abc123\",\"status\":\"running\",\"progress\":0,\"scenarioId\":\"zone_failure\"}\n\nevent: progressUpdate\ndata: {\"jobId\":\"chaos_abc123\",\"progress\":35,\"message\":\"Injecting zone failure into us-east-1a\"}\n\nevent: progressUpdate\ndata: {\"jobId\":\"chaos_abc123\",\"progress\":70,\"message\":\"Analyzing resilience and detecting vulnerabilities\"}\n\nevent: completed\ndata: {\"jobId\":\"chaos_abc123\",\"status\":\"completed\",\"progress\":100}\n"},"chaosFailedStream":{"summary":"Chaos test — failed due to invalid simulation","value":"event: init\ndata: {\"jobId\":\"chaos_xyz789\",\"status\":\"running\",\"progress\":0,\"scenarioId\":\"database_crash\"}\n\nevent: failed\ndata: {\"jobId\":\"chaos_xyz789\",\"status\":\"failed\",\"error\":\"Simulation 'sim_404' not found. Re-create the simulation and submit a new chaos run.\"}\n"},"chaosCancelledStream":{"summary":"Chaos test — cancelled by agent","value":"event: init\ndata: {\"jobId\":\"chaos_can456\",\"status\":\"running\",\"progress\":0,\"scenarioId\":\"network_partition\"}\n\nevent: cancelled\ndata: {\"jobId\":\"chaos_can456\",\"status\":\"cancelled\"}\n"}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/chaos/batch":{"x-stability":"stable","post":{"tags":["Chaos Engineering"],"summary":"Create a batch of chaos test scenarios","description":"Execute multiple chaos engineering tests in parallel. Each scenario can use either\na pre-built scenario or custom failure injections. Results are aggregated across all tests.\n\n**DigitalOcean compatibility:** All pre-built scenarios (`zone_failure`, `database_crash`,\n`network_partition`, etc.) and custom injection types (`kill_instance`, `kill_zone`,\n`database_slowdown`, etc.) work identically on DigitalOcean simulations.\nUse Droplet-specific resource IDs (e.g. `droplet-web-1`) and DO datacenter region names\n(e.g. 
`nyc3`, `sfo3`) when targeting a DigitalOcean simulation.\n","operationId":"createBatchChaosTest","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/chaos/batch \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"simulationId\": \"a638caad-7423-40a3-bb09-f91235d9392d\",\n    \"scenarios\": [\n      {\"scenarioId\": \"zone_failure\", \"duration\": 120},\n      {\"scenarioId\": \"database_crash\", \"duration\": 90}\n    ],\n    \"webhookUrl\": \"https://your-app.com/webhooks/chaos\",\n    \"webhookSecret\": \"your-secret-key-here\"\n  }'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\nresp = requests.post(\n    f\"{BASE_URL}/chaos/batch\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n    json={\n        \"simulationId\": \"a638caad-7423-40a3-bb09-f91235d9392d\",\n        \"scenarios\": [\n            {\"scenarioId\": \"zone_failure\", \"duration\": 120},\n            {\"scenarioId\": \"database_crash\", \"duration\": 90},\n        ],\n        \"webhookUrl\": \"https://your-app.com/webhooks/chaos\",\n        \"webhookSecret\": \"your-secret-key-here\",\n    },\n)\nresp.raise_for_status()\njob = resp.json()[\"job\"]\nprint(f\"Batch job started: {job['id']}  totalJobs={job['totalJobs']}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst resp = await fetch(`${BASE_URL}/chaos/batch`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify({\n    simulationId: \"a638caad-7423-40a3-bb09-f91235d9392d\",\n    scenarios: [\n      { scenarioId: \"zone_failure\", duration: 120 },\n      { scenarioId: \"database_crash\", duration: 90 },\n    ],\n    
webhookUrl: \"https://your-app.com/webhooks/chaos\",\n    webhookSecret: \"your-secret-key-here\",\n  }),\n});\n// Fail fast on non-2xx responses before reading the job payload.\nif (!resp.ok) {\n  throw new Error(`HTTP ${resp.status}: ${await resp.text()}`);\n}\nconst { job } = await resp.json();\nconsole.log(`Batch job started: ${job.id}  totalJobs=${job.totalJobs}`);\n"}],"security":[{"BearerAuth":["write"]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"$ref":"#/components/schemas/BatchChaosRequest"},"examples":{"generic":{"summary":"Generic batch — zone failure + DB crash + custom kill","value":{"simulationId":"sim_abc123","scenarios":[{"scenarioId":"zone_failure","duration":120},{"scenarioId":"database_crash","duration":90},{"customInjections":[{"type":"kill_instance","targetId":"web-1","injectionTime":30,"duration":60}],"duration":150}],"webhookUrl":"https://example.com/webhook","webhookSecret":"secret123"}},"digitalOcean":{"summary":"DigitalOcean — nyc3 zone failure + sfo3 database crash + Droplet API server kill in parallel","value":{"simulationId":"sim_do_droplets","scenarios":[{"scenarioId":"zone_failure","targetId":"nyc3","duration":120},{"scenarioId":"database_crash","targetId":"sfo3","duration":90},{"customInjections":[{"type":"kill_instance","targetId":"droplet-web-1","injectionTime":10,"duration":60},{"type":"kill_instance","targetId":"droplet-api-1","injectionTime":20,"duration":60}],"duration":90}],"webhookUrl":"https://your-app.example.com/webhooks/chaos","webhookSecret":"do-webhook-secret"}}}}}},"responses":{"202":{"description":"Batch chaos test job accepted and started","content":{"application/json":{"schema":{"type":"object","properties":{"job":{"type":"object","properties":{"id":{"type":"string","format":"uuid","example":"batch_xyz789"},"type":{"type":"string","enum":["batch_chaos_test"],"example":"batch_chaos_test"},"status":{"type":"string","enum":["pending","running"],"example":"running"},"totalJobs":{"type":"integer","description":"Number of child chaos tests in this 
batch","example":3},"createdAt":{"type":"string","format":"date-time"}}},"message":{"type":"string","example":"Batch chaos test started. Use GET /chaos/batch/{id} to check status."}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"description":"Simulation not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"429":{"$ref":"#/components/responses/TooManyRequests"}}}},"/chaos/batch/{batchId}":{"x-stability":"stable","get":{"tags":["Chaos Engineering"],"summary":"Get batch chaos job status and aggregated results","description":"Retrieve the current status of a batch chaos test including aggregated\nresilience scores, vulnerabilities, and recommendations across all child jobs.\n\n**Client Code Sample (JavaScript / Node.js):**\n\n```javascript\nasync function pollBatchChaosJobStatus(batchId, apiToken) {\n  const headers = {\n    Authorization: `Bearer ${apiToken}`,\n    Accept: 'application/json',\n  };\n\n  const terminal = new Set(['completed', 'failed', 'cancelled', 'partial_failed']);\n\n  while (true) {\n    const resp = await fetch(\n      `https://your-host/api/chaos/batch/${batchId}`,\n      { headers }\n    );\n    if (!resp.ok) {\n      throw new Error(`HTTP ${resp.status}: ${await resp.text()}`);\n    }\n    const batch = await resp.json();\n    console.log(\n      `Batch ${batch.id}  status=${batch.status}  ` +\n      `${batch.completedJobs}/${batch.totalJobs} completed  ` +\n      `failed=${batch.failedJobs}`\n    );\n    if (terminal.has(batch.status)) {\n      if (batch.aggregatedResilienceScore) {\n        const s = batch.aggregatedResilienceScore;\n        console.log(`Aggregated resilience: ${s.overall} (Grade: ${s.grade})`);\n      }\n      return batch;\n    }\n    await new Promise((r) => setTimeout(r, 5000));\n  }\n}\n```\n\n**Client Code Sample (Python / httpx):**\n\n```python\nimport time\nimport httpx\n\ndef 
poll_batch_chaos_job_status(batch_id: str, api_token: str) -> dict:\n    headers = {\n        \"Authorization\": f\"Bearer {api_token}\",\n        \"Accept\": \"application/json\",\n    }\n    terminal = {\"completed\", \"failed\", \"cancelled\", \"partial_failed\"}\n\n    with httpx.Client() as client:\n        while True:\n            resp = client.get(\n                f\"https://your-host/api/chaos/batch/{batch_id}\",\n                headers=headers,\n            )\n            resp.raise_for_status()\n            batch = resp.json()\n            print(\n                f\"Batch {batch['id']}  status={batch['status']}  \"\n                f\"{batch['completedJobs']}/{batch['totalJobs']} completed  \"\n                f\"failed={batch['failedJobs']}\"\n            )\n            if batch[\"status\"] in terminal:\n                score = batch.get(\"aggregatedResilienceScore\")\n                if score:\n                    print(f\"Aggregated resilience: {score['overall']} (Grade: {score['grade']})\")\n                return batch\n            time.sleep(5)\n```\n","operationId":"getBatchChaosJob","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/chaos/batch/batch_xyz789 \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nBATCH_ID = \"batch_xyz789\"\n\nresp = requests.get(\n    f\"{BASE_URL}/chaos/batch/{BATCH_ID}\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\nbatch = resp.json()\nprint(f\"Batch {batch['id']}  status={batch['status']}  \"\n      f\"{batch['completedJobs']}/{batch['totalJobs']} completed\")\nif batch[\"status\"] == \"completed\" and batch.get(\"aggregatedResilienceScore\"):\n    score = batch[\"aggregatedResilienceScore\"]\n    print(f\"Aggregated resilience: {score['overall']} (Grade: 
{score['grade']})\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst BATCH_ID = \"batch_xyz789\";\n\nconst resp = await fetch(`${BASE_URL}/chaos/batch/${BATCH_ID}`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\n// Fail fast on non-2xx responses before reading the batch payload.\nif (!resp.ok) {\n  throw new Error(`HTTP ${resp.status}: ${await resp.text()}`);\n}\nconst batch = await resp.json();\nconsole.log(`Batch ${batch.id}  status=${batch.status}  ${batch.completedJobs}/${batch.totalJobs} completed`);\nif (batch.status === \"completed\" && batch.aggregatedResilienceScore) {\n  const s = batch.aggregatedResilienceScore;\n  console.log(`Aggregated resilience: ${s.overall} (Grade: ${s.grade})`);\n}\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"batchId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the batch chaos job"}],"responses":{"200":{"description":"Batch chaos job status and aggregated results","content":{"application/json":{"schema":{"$ref":"#/components/schemas/BatchChaosJob"},"examples":{"generic":{"summary":"Generic completed batch result","value":{"id":"batch_xyz789","status":"completed","childJobIds":["job_1","job_2","job_3"],"totalJobs":3,"completedJobs":2,"failedJobs":1,"cancelledJobs":0,"aggregatedResilienceScore":{"overall":78.5,"grade":"C"},"aggregatedVulnerabilities":[{"id":"zone_dependency","severity":"high","title":"Single Availability Zone Dependency","occurrences":2}],"aggregatedRecommendations":["Distribute resources across multiple availability zones","Implement database connection pooling"],"createdAt":"2025-11-23T10:00:00Z","updatedAt":"2025-11-23T10:15:00Z","completedAt":"2025-11-23T10:15:00Z"}},"digitalOceanInProgress":{"summary":"DigitalOcean — batch running (2 of 3 jobs 
complete)","value":{"id":"batch_do_abc456","status":"running","childJobIds":["job_do_droplet-web-1","job_do_droplet-api-1","job_do_droplet-worker-1"],"totalJobs":3,"completedJobs":2,"failedJobs":0,"cancelledJobs":0,"aggregatedResilienceScore":null,"aggregatedVulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency (nyc3) — droplet-web-1, droplet-api-1","occurrences":2}],"aggregatedRecommendations":["Spread Droplets across nyc3 and sfo3 and enable Global Load Balancer for multi-region failover"],"createdAt":"2025-11-23T11:00:00Z","updatedAt":"2025-11-23T11:09:00Z","completedAt":null}},"digitalOcean":{"summary":"DigitalOcean — completed batch (zone failure + Droplet crash across nyc3 and sfo3)","value":{"id":"batch_do_def789","status":"completed","childJobIds":["job_do_droplet-web-1","job_do_droplet-api-1","job_do_droplet-worker-1"],"totalJobs":3,"completedJobs":3,"failedJobs":0,"cancelledJobs":0,"aggregatedResilienceScore":{"overall":61.3,"grade":"D"},"aggregatedVulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency (nyc3) — droplet-web-1, droplet-api-1, droplet-worker-1","occurrences":3},{"id":"no_managed_db_standby","severity":"medium","title":"Managed Database Has No Standby Node","occurrences":2},{"id":"no_global_lb","severity":"medium","title":"No Global Load Balancer Configured","occurrences":2}],"aggregatedRecommendations":["Spread Droplets (droplet-web-1, droplet-api-1, droplet-worker-1) across nyc3 and sfo3 and enable Global Load Balancer","Enable standby node on the Managed Database cluster to reduce failover time","Attach Reserved IPs to droplet-web-1 and droplet-api-1 so traffic re-routes instantly when a Droplet is replaced","Configure Droplet monitoring alerts and auto-remediation workflows via DigitalOcean Functions"],"createdAt":"2025-11-23T11:00:00Z","updatedAt":"2025-11-23T11:18:00Z","completedAt":"2025-11-23T11:18:00Z"}},"aws":{"summary":"AWS — completed batch 
(AZ outage + DB crash + EC2 kill in us-east-1)","value":{"id":"batch-aws-001","status":"completed","childJobIds":["job-aws-zone-001","job-aws-db-001","job-aws-ec2-001"],"totalJobs":3,"completedJobs":3,"failedJobs":0,"cancelledJobs":0,"aggregatedResilienceScore":{"overall":70.8,"grade":"C"},"aggregatedVulnerabilities":[{"id":"single_az","severity":"high","title":"Single Availability Zone Dependency (us-east-1a)","occurrences":3},{"id":"no_multi_az_rds","severity":"medium","title":"RDS Multi-AZ Not Enabled","occurrences":2}],"aggregatedRecommendations":["Deploy EC2 Auto Scaling groups across us-east-1a and us-east-1b with cross-zone ALB load balancing","Enable RDS Multi-AZ for automatic standby promotion within 60 seconds of a primary AZ failure","Configure Route 53 health checks and DNS failover for regional redundancy"],"createdAt":"2025-11-23T12:00:00Z","updatedAt":"2025-11-23T12:22:00Z","completedAt":"2025-11-23T12:22:00Z"}},"gcp":{"summary":"GCP — running batch (zone failure + Cloud SQL crash in us-central1)","value":{"id":"batch-gcp-001","status":"running","childJobIds":["job-gcp-zone-001","job-gcp-csql-001"],"totalJobs":2,"completedJobs":1,"failedJobs":0,"cancelledJobs":0,"aggregatedResilienceScore":null,"aggregatedVulnerabilities":[{"id":"single_zone_gce","severity":"high","title":"GCE Instances Concentrated in us-central1-a — gce-web-1, gce-api-1","occurrences":1}],"aggregatedRecommendations":["Use a regional MIG spanning us-central1-a, us-central1-b, and us-central1-c"],"createdAt":"2025-11-23T12:10:00Z","updatedAt":"2025-11-23T12:18:00Z","completedAt":null}},"azure":{"summary":"Azure — completed batch (zone failure + SQL crash in East US)","value":{"id":"batch-azure-001","status":"completed","childJobIds":["job-azure-zone-001","job-azure-sql-001"],"totalJobs":2,"completedJobs":2,"failedJobs":0,"cancelledJobs":0,"aggregatedResilienceScore":{"overall":67.5,"grade":"D"},"aggregatedVulnerabilities":[{"id":"no_zone_redundant_vmss","severity":"high","title":"VM 
Scale Set Not Zone-Redundant (East US)","occurrences":2},{"id":"no_zone_redundant_sql","severity":"high","title":"Azure SQL Not Zone-Redundant","occurrences":1}],"aggregatedRecommendations":["Configure the VM Scale Set with zone redundancy across all three East US availability zones","Enable zone-redundant backup for Azure SQL Database to avoid point-in-time restore delays","Add an Azure Traffic Manager profile for regional DNS failover to West US 2"],"createdAt":"2025-11-23T12:20:00Z","updatedAt":"2025-11-23T12:38:00Z","completedAt":"2025-11-23T12:38:00Z"}},"oci":{"summary":"OCI — completed batch (AD-1 failure + ADB crash in us-ashburn-1)","value":{"id":"batch-oci-001","status":"completed","childJobIds":["job-oci-ad-001","job-oci-adb-001"],"totalJobs":2,"completedJobs":2,"failedJobs":0,"cancelledJobs":0,"aggregatedResilienceScore":{"overall":69.2,"grade":"D"},"aggregatedVulnerabilities":[{"id":"single_ad","severity":"high","title":"VM Instances Confined to AD-1 (us-ashburn-1)","occurrences":2},{"id":"no_adb_cross_ad_data_guard","severity":"medium","title":"Autonomous Database Has No Cross-AD Data Guard","occurrences":2}],"aggregatedRecommendations":["Distribute VM.Standard3.Flex instances across AD-1, AD-2, and AD-3 using an OCI instance pool","Enable Autonomous Database Cross-AD Data Guard for automatic failover to AD-2 within 30 seconds","Use OCI Traffic Management Steering Policies for load balancing and health-based routing across ADs"],"createdAt":"2025-11-23T12:30:00Z","updatedAt":"2025-11-23T12:46:00Z","completedAt":"2025-11-23T12:46:00Z"}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"tags":["Chaos Engineering"],"summary":"Cancel batch chaos test and all child jobs","description":"Cancel a running batch chaos test. 
This cancels all child jobs that are still running.\nThis endpoint is idempotent: calling it multiple times on the same batch returns success each time.\n\n**Cancellation Rules:**\n- Batches with status \"pending\" or \"running\" will be cancelled\n- All child jobs with status \"pending\" or \"running\" will also be cancelled\n- Batches already \"cancelled\" will return success (idempotent behavior)\n- Batches with status \"completed\" or \"failed\" cannot be cancelled (returns 409)\n","operationId":"cancelBatchChaosJob","security":[{"BearerAuth":["write"]}],"parameters":[{"name":"batchId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the batch chaos job"}],"responses":{"200":{"description":"Batch cancelled successfully or was already cancelled.\nReturns the same 200 response shape whether the batch is cancelled for the first time or was already cancelled.\n","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"status":{"type":"string","enum":["cancelled"]},"cancelledAt":{"type":"string","format":"date-time"},"message":{"type":"string"}}},"examples":{"newlyCancelled":{"value":{"id":"batch_xyz789","status":"cancelled","cancelledAt":"2025-11-23T10:30:00Z","message":"Batch chaos job cancelled successfully"}},"alreadyCancelled":{"value":{"id":"batch_xyz789","status":"cancelled","cancelledAt":"2025-11-23T10:15:00Z","message":"Batch chaos job already cancelled"}},"digitalOceanNewlyCancelled":{"summary":"DigitalOcean — batch (zone failure + DB crash in nyc3) cancelled mid-run","value":{"id":"batch-do-nyc3-001","status":"cancelled","cancelledAt":"2025-11-23T11:07:33Z","message":"Batch chaos job cancelled successfully"}},"digitalOceanAlreadyCancelled":{"summary":"DigitalOcean — batch already cancelled (idempotent)","value":{"id":"batch-do-nyc3-001","status":"cancelled","cancelledAt":"2025-11-23T11:06:10Z","message":"Batch chaos job already cancelled"}},"awsNewlyCancelled":{"summary":"AWS — batch (AZ 
outage + DB crash in us-east-1) cancelled mid-run","value":{"id":"batch-aws-us-east-001","status":"cancelled","cancelledAt":"2025-11-23T12:05:22Z","message":"Batch chaos job cancelled successfully"}},"awsAlreadyCancelled":{"summary":"AWS — batch already cancelled (idempotent)","value":{"id":"batch-aws-us-east-001","status":"cancelled","cancelledAt":"2025-11-23T12:04:08Z","message":"Batch chaos job already cancelled"}},"gcpNewlyCancelled":{"summary":"GCP — batch (zone failure + Cloud SQL crash in us-central1) cancelled mid-run","value":{"id":"batch-gcp-us-central-001","status":"cancelled","cancelledAt":"2025-11-23T12:15:47Z","message":"Batch chaos job cancelled successfully"}},"gcpAlreadyCancelled":{"summary":"GCP — batch already cancelled (idempotent)","value":{"id":"batch-gcp-us-central-001","status":"cancelled","cancelledAt":"2025-11-23T12:14:30Z","message":"Batch chaos job already cancelled"}},"azureNewlyCancelled":{"summary":"Azure — batch (zone failure + SQL crash in East US) cancelled mid-run","value":{"id":"batch-azure-east-us-001","status":"cancelled","cancelledAt":"2025-11-23T12:25:11Z","message":"Batch chaos job cancelled successfully"}},"azureAlreadyCancelled":{"summary":"Azure — batch already cancelled (idempotent)","value":{"id":"batch-azure-east-us-001","status":"cancelled","cancelledAt":"2025-11-23T12:23:58Z","message":"Batch chaos job already cancelled"}},"ociNewlyCancelled":{"summary":"OCI — batch (AD-1 failure + ADB crash in us-ashburn-1) cancelled mid-run","value":{"id":"batch-oci-ashburn-001","status":"cancelled","cancelledAt":"2025-11-23T12:35:44Z","message":"Batch chaos job cancelled successfully"}},"ociAlreadyCancelled":{"summary":"OCI — batch already cancelled (idempotent)","value":{"id":"batch-oci-ashburn-001","status":"cancelled","cancelledAt":"2025-11-23T12:34:20Z","message":"Batch chaos job already 
cancelled"}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"},"409":{"description":"Cannot cancel batch that is already completed or failed","content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string"},"status":{"type":"string"}}},"example":{"error":"Cannot cancel batch that is already completed or failed","status":"completed"}}}},"429":{"$ref":"#/components/responses/TooManyRequests"}}}},"/chaos/batch/{batchId}/results":{"x-stability":"stable","get":{"tags":["Chaos Engineering"],"summary":"Get detailed results for all child jobs in a batch","description":"Retrieve detailed results for all chaos tests in a batch, including full\nchild job details with resilience scores, vulnerabilities, timelines, and recommendations.\n\n**DigitalOcean compatibility:** When the batch was submitted against a DigitalOcean\nsimulation, child job timelines and recommendations reference DO resource types and\ndatacenter region names (e.g. `nyc3`, `sfo3`). Droplet resource IDs (e.g. 
`droplet-web-1`)\nappear in failure event records and vulnerability details exactly as supplied in the\noriginal request.\n\n**Client Code Sample (JavaScript / Node.js):**\n\n```javascript\nasync function getBatchChaosResults(batchId, apiToken) {\n  const headers = {\n    Authorization: `Bearer ${apiToken}`,\n    Accept: 'application/json',\n  };\n\n  // The results endpoint returns 400 while the batch is still pending/running,\n  // so poll the batch status endpoint until it reaches a terminal state first.\n  let status;\n  do {\n    const statusResp = await fetch(\n      `https://your-host/api/chaos/batch/${batchId}`,\n      { headers }\n    );\n    if (!statusResp.ok) {\n      throw new Error(`HTTP ${statusResp.status}: ${await statusResp.text()}`);\n    }\n    ({ status } = await statusResp.json());\n    console.log('Batch status:', status);\n    if (['completed', 'failed', 'cancelled', 'partial_failed'].includes(status)) break;\n    await new Promise((r) => setTimeout(r, 5000));\n  } while (true);\n\n  // Now the batch is done — fetch the detailed results.\n  const response = await fetch(\n    `https://your-host/api/chaos/batch/${batchId}/results`,\n    { headers }\n  );\n\n  if (!response.ok) {\n    throw new Error(`HTTP ${response.status}: ${await response.text()}`);\n  }\n\n  const data = await response.json();\n\n  console.log(`Batch ${batchId}  status=${data.status}  ${data.completedJobs}/${data.totalJobs} completed`);\n  if (data.aggregatedResilienceScore) {\n    const s = data.aggregatedResilienceScore;\n    console.log(`Aggregated resilience: ${s.overall} (Grade: ${s.grade})`);\n  }\n\n  for (const child of data.childJobs ?? []) {\n    const grade = child.resilienceScore?.grade ?? 'N/A';\n    console.log(`  child ${child.id}  scenario=${child.scenarioId ?? 
'custom'}  status=${child.status}  grade=${grade}`);\n  }\n\n  return data;\n}\n```\n\n**Client Code Sample (Python / httpx):**\n\n```python\nimport time\nimport httpx\n\ndef get_batch_chaos_results(batch_id: str, api_token: str) -> dict:\n    headers = {\n        \"Authorization\": f\"Bearer {api_token}\",\n        \"Accept\": \"application/json\",\n    }\n\n    terminal = {\"completed\", \"failed\", \"cancelled\", \"partial_failed\"}\n    with httpx.Client() as client:\n        while True:\n            status_resp = client.get(\n                f\"https://your-host/api/chaos/batch/{batch_id}\",\n                headers=headers,\n            )\n            status_resp.raise_for_status()\n            status = status_resp.json()[\"status\"]\n            print(f\"Batch status: {status}\")\n            if status in terminal:\n                break\n            time.sleep(5)\n\n        response = client.get(\n            f\"https://your-host/api/chaos/batch/{batch_id}/results\",\n            headers=headers,\n        )\n        response.raise_for_status()\n\n    data = response.json()\n\n    print(f\"Batch {batch_id}  status={data['status']}  \"\n          f\"{data['completedJobs']}/{data['totalJobs']} completed\")\n    score = data.get(\"aggregatedResilienceScore\")\n    if score:\n        print(f\"Aggregated resilience: {score['overall']} (Grade: {score['grade']})\")\n\n    for child in data.get(\"childJobs\", []):\n        child_score = child.get(\"resilienceScore\")\n        grade = child_score[\"grade\"] if child_score else \"N/A\"\n        print(f\"  child {child['id']}  scenario={child.get('scenarioId', 'custom')}  \"\n              f\"status={child['status']}  grade={grade}\")\n\n    return data\n```\n","operationId":"getBatchChaosResults","x-codeSamples":[{"lang":"curl","label":"curl","source":"while true; do\n  STATUS=$(curl -s https://your-production-domain.com/api/chaos/batch/batch_xyz789 \\\n    -H \"Authorization: Bearer $API_KEY\" | jq -r '.status')\n  
echo \"batch status: $STATUS\"\n  case \"$STATUS\" in\n    completed|failed|cancelled|partial_failed) break ;;\n  esac\n  sleep 5\ndone\n\ncurl https://your-production-domain.com/api/chaos/batch/batch_xyz789/results \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import time\nimport requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nBATCH_ID = \"batch_xyz789\"\nHEADERS = {\"Authorization\": f\"Bearer {API_KEY}\"}\n\nwhile True:\n    status_resp = requests.get(f\"{BASE_URL}/chaos/batch/{BATCH_ID}\", headers=HEADERS)\n    status_resp.raise_for_status()\n    status = status_resp.json()[\"status\"]\n    print(f\"batch status: {status}\")\n    if status in (\"completed\", \"failed\", \"cancelled\", \"partial_failed\"):\n        break\n    time.sleep(5)\n\nresp = requests.get(\n    f\"{BASE_URL}/chaos/batch/{BATCH_ID}/results\",\n    headers=HEADERS,\n)\nresp.raise_for_status()\ndata = resp.json()\nscore = data.get(\"aggregatedResilienceScore\", {})\nprint(f\"Batch {BATCH_ID}  status={data['status']}  \"\n      f\"{data['completedJobs']}/{data['totalJobs']} completed\")\nif score:\n    print(f\"Aggregated resilience: {score['overall']} (Grade: {score['grade']})\")\nfor child in data.get(\"childJobs\", []):\n    child_score = child.get(\"resilienceScore\", {})\n    grade = child_score.get(\"grade\", \"N/A\") if child_score else \"N/A\"\n    print(f\"  child {child['id']}  scenario={child.get('scenarioId', 'custom')}  \"\n          f\"status={child['status']}  grade={grade}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst BATCH_ID = \"batch_xyz789\";\nconst headers = { \"Authorization\": `Bearer ${API_KEY}` };\n\n// The results endpoint returns 400 while the batch is still pending/running,\n// so poll the batch status endpoint until it reaches a terminal state first.\nlet 
status;\ndo {\n  const statusResp = await fetch(`${BASE_URL}/chaos/batch/${BATCH_ID}`, { headers });\n  if (!statusResp.ok) throw new Error(`HTTP ${statusResp.status}: ${await statusResp.text()}`);\n  ({ status } = await statusResp.json());\n  console.log(`batch status: ${status}`);\n  if ([\"completed\", \"failed\", \"cancelled\", \"partial_failed\"].includes(status)) break;\n  await new Promise((r) => setTimeout(r, 5000));\n} while (true);\n\nconst resp = await fetch(`${BASE_URL}/chaos/batch/${BATCH_ID}/results`, { headers });\nif (!resp.ok) throw new Error(`HTTP ${resp.status}: ${await resp.text()}`);\nconst data = await resp.json();\nconsole.log(`Batch ${BATCH_ID}  status=${data.status}  ${data.completedJobs}/${data.totalJobs} completed`);\nif (data.aggregatedResilienceScore) {\n  const s = data.aggregatedResilienceScore;\n  console.log(`Aggregated resilience: ${s.overall} (Grade: ${s.grade})`);\n}\nfor (const child of data.childJobs ?? []) {\n  const grade = child.resilienceScore?.grade ?? \"N/A\";\n  console.log(`  child ${child.id}  scenario=${child.scenarioId ?? 
\"custom\"}  status=${child.status}  grade=${grade}`);\n}\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"batchId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the batch chaos job"}],"responses":{"200":{"description":"Detailed batch chaos test results with full child job data","content":{"application/json":{"schema":{"type":"object","properties":{"batchId":{"type":"string","format":"uuid"},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled","partial_failed"]},"totalJobs":{"type":"integer"},"completedJobs":{"type":"integer"},"failedJobs":{"type":"integer"},"aggregatedResilienceScore":{"$ref":"#/components/schemas/ResilienceScore"},"aggregatedVulnerabilities":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"severity":{"type":"string"},"title":{"type":"string"},"occurrences":{"type":"integer"}}}},"aggregatedRecommendations":{"type":"array","items":{"type":"string"}},"childJobs":{"type":"array","description":"Full details for each child chaos job","items":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled"]},"scenarioId":{"type":"string"},"duration":{"type":"integer"},"resilienceScore":{"$ref":"#/components/schemas/ResilienceScore"},"vulnerabilities":{"type":"array","items":{"$ref":"#/components/schemas/Vulnerability"}},"recommendations":{"type":"array","items":{"type":"string"}},"timeline":{"type":"array","items":{"type":"object","properties":{"time":{"type":"integer"},"event":{"type":"string"},"severity":{"type":"string"}}}},"error":{"type":"string"}}}}}},"examples":{"generic":{"summary":"Generic completed batch — 3 child jobs with aggregated 
resilience","value":{"batchId":"a1b2c3d4-e5f6-7890-abcd-ef1234567890","status":"completed","totalJobs":3,"completedJobs":3,"failedJobs":0,"aggregatedResilienceScore":{"overall":70,"grade":"C","breakdown":{"availability":65,"recoverability":72,"faultTolerance":73}},"aggregatedVulnerabilities":[{"id":"single_zone","severity":"high","title":"Single Zone Dependency","occurrences":2},{"id":"no_db_replica","severity":"medium","title":"No Database Read Replica","occurrences":1}],"aggregatedRecommendations":["Deploy instances across multiple availability zones","Add a read replica and enable automatic failover for the database","Implement circuit breakers to isolate failures"],"childJobs":[{"id":"11111111-0000-0000-0000-000000000001","status":"completed","scenarioId":"zone_failure","duration":120,"resilienceScore":{"overall":68,"grade":"D","breakdown":{"availability":62,"recoverability":71,"faultTolerance":71}},"vulnerabilities":[{"id":"single_zone","severity":"high","title":"Single Zone Dependency","description":"All compute is in a single zone; a zone outage causes full downtime."}],"recommendations":["Spread instances across at least two zones","Use a load balancer with cross-zone routing"],"timeline":[{"time":0,"event":"Chaos test started","severity":"info"},{"time":20,"event":"Zone failure injected","severity":"critical"},{"time":80,"event":"Service partially restored via remaining capacity","severity":"warning"},{"time":120,"event":"Test completed","severity":"info"}]},{"id":"22222222-0000-0000-0000-000000000002","status":"completed","scenarioId":"database_crash","duration":90,"resilienceScore":{"overall":71,"grade":"C","breakdown":{"availability":66,"recoverability":74,"faultTolerance":73}},"vulnerabilities":[{"id":"no_db_replica","severity":"medium","title":"No Database Read Replica","description":"The primary database has no replica; a crash causes total data-layer unavailability."}],"recommendations":["Enable a read replica and automatic failover","Cache 
frequently-read data to reduce database dependency"],"timeline":[{"time":0,"event":"Chaos test started","severity":"info"},{"time":15,"event":"Database crash injected","severity":"critical"},{"time":50,"event":"Error rate peaked at 95% — all DB-dependent requests failing","severity":"critical"},{"time":90,"event":"Database restored; error rate returning to baseline","severity":"info"}]},{"id":"33333333-0000-0000-0000-000000000003","status":"completed","scenarioId":"kill_instance","duration":150,"resilienceScore":{"overall":71,"grade":"C","breakdown":{"availability":67,"recoverability":71,"faultTolerance":75}},"vulnerabilities":[],"recommendations":["Add retry logic with exponential backoff for instance kills"],"timeline":[{"time":0,"event":"Chaos test started","severity":"info"},{"time":30,"event":"Instance kill injected on web-1","severity":"warning"},{"time":90,"event":"Instance recovered","severity":"info"}]}]}},"digitalOcean":{"summary":"DigitalOcean — zone_failure + Droplet crash + database_crash batch result","value":{"batchId":"f7e8d9c0-b1a2-3456-cdef-789012345678","status":"completed","totalJobs":3,"completedJobs":3,"failedJobs":0,"aggregatedResilienceScore":{"overall":63,"grade":"D","breakdown":{"availability":58,"recoverability":66,"faultTolerance":65}},"aggregatedVulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency","occurrences":3},{"id":"no_managed_db_standby","severity":"medium","title":"Managed Database Has No Standby Node","occurrences":2},{"id":"no_droplet_monitoring_alerts","severity":"low","title":"No Droplet Monitoring Alerts Configured","occurrences":1}],"aggregatedRecommendations":["Distribute Droplets across multiple DigitalOcean datacenters (e.g. 
NYC3 + SFO3) and use a Global Load Balancer","Enable the standby node on the Managed Database cluster to reduce failover time from minutes to seconds","Configure Droplet monitoring alerts to trigger auto-remediation or notifications","Use Reserved IPs for fast traffic re-routing when a Droplet is replaced or rebooted"],"childJobs":[{"id":"aaaa1111-0000-0000-0000-000000000001","status":"completed","scenarioId":"zone_failure","duration":120,"resilienceScore":{"overall":60,"grade":"D","breakdown":{"availability":55,"recoverability":63,"faultTolerance":62}},"vulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency","description":"All Droplets are in the NYC3 datacenter. A full datacenter outage would take down the entire service."}],"recommendations":["Add Droplets in a second datacenter (e.g. SFO3) and route traffic via a DigitalOcean Global Load Balancer","Store session state in a Managed Redis cluster so any datacenter can serve returning users"],"timeline":[{"time":0,"event":"Chaos test started against DigitalOcean simulation (NYC3)","severity":"info"},{"time":10,"event":"Zone failure injected — all NYC3 Droplets marked unavailable","severity":"critical"},{"time":12,"event":"Load balancer health checks failing for all Droplet targets","severity":"critical"},{"time":15,"event":"Error rate reached 100% — no healthy Droplets available","severity":"critical"},{"time":90,"event":"NYC3 zone restored; Droplets returning online","severity":"warning"},{"time":105,"event":"Error rate returned to baseline","severity":"info"},{"time":120,"event":"Test completed","severity":"info"}]},{"id":"bbbb2222-0000-0000-0000-000000000002","status":"completed","scenarioId":"droplet_crash","duration":90,"resilienceScore":{"overall":65,"grade":"D","breakdown":{"availability":60,"recoverability":70,"faultTolerance":65}},"vulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency","description":"All Droplets 
are in the NYC3 datacenter. A regional outage would take down the entire service."},{"id":"no_managed_db_standby","severity":"medium","title":"Managed Database Has No Standby Node","description":"The DigitalOcean Managed Database cluster db-primary has no standby node enabled, increasing recovery time after a failure."}],"recommendations":["Enable standby node on the Managed Database cluster to achieve near-instant failover","Use a Reserved IP so a replacement Droplet can take over the same address without a DNS change","Configure Droplet monitoring alerts to notify on-call staff within seconds of an instance crash"],"timeline":[{"time":0,"event":"Chaos test started against DigitalOcean simulation","severity":"info"},{"time":15,"event":"Droplet 'droplet-web-1' (NYC3) killed — instance crashed","severity":"critical"},{"time":17,"event":"Load balancer detected unhealthy Droplet; traffic redistributed to remaining Droplets","severity":"warning"},{"time":20,"event":"Error rate spiked to 38% — remaining Droplets over capacity","severity":"critical"},{"time":75,"event":"Droplet 'droplet-web-1' restored; recovery time 60 steps","severity":"info"},{"time":80,"event":"Error rate returned to baseline","severity":"info"},{"time":90,"event":"Test completed","severity":"info"}],"injections":[{"type":"kill_instance","targetId":"droplet-web-1","injectionTime":15,"duration":60}]},{"id":"cccc3333-0000-0000-0000-000000000003","status":"completed","scenarioId":"database_crash","duration":100,"resilienceScore":{"overall":64,"grade":"D","breakdown":{"availability":59,"recoverability":65,"faultTolerance":68}},"vulnerabilities":[{"id":"single_datacenter","severity":"high","title":"Single Datacenter Dependency","description":"The Managed Database cluster is only available in NYC3. 
A datacenter outage has no failover path."},{"id":"no_managed_db_standby","severity":"medium","title":"Managed Database Has No Standby Node","description":"Without a standby node, the Managed Database cluster db-primary requires manual intervention to recover after a crash, leading to extended downtime."},{"id":"no_droplet_monitoring_alerts","severity":"low","title":"No Droplet Monitoring Alerts Configured","description":"Droplet monitoring alerts are not configured; the team will not be proactively notified when Droplet CPU or memory crosses critical thresholds during a database overload event."}],"recommendations":["Enable standby node on the DigitalOcean Managed Database cluster (db-primary) to reduce recovery time from minutes to seconds","Add a Managed Redis cluster for session and query caching to reduce database load and improve blast-radius isolation during a DB crash","Configure Droplet monitoring alerts so the team is notified immediately when error rates spike due to database unavailability","Use connection pooling (e.g. 
PgBouncer) to prevent Droplets from exhausting database connections during degraded states"],"timeline":[{"time":0,"event":"Chaos test started against DigitalOcean simulation","severity":"info"},{"time":10,"event":"Managed Database cluster 'db-primary' crash injected","severity":"critical"},{"time":12,"event":"All Droplets reporting database connection errors","severity":"critical"},{"time":14,"event":"Error rate reached 97% — all DB-dependent endpoints failing","severity":"critical"},{"time":30,"event":"Connection pool exhausted on droplet-web-2 and droplet-web-3","severity":"critical"},{"time":70,"event":"Managed Database cluster 'db-primary' restored","severity":"warning"},{"time":75,"event":"Droplets re-establishing database connections","severity":"warning"},{"time":85,"event":"Error rate returning to baseline","severity":"info"},{"time":100,"event":"Test completed","severity":"info"}],"injections":[{"type":"database_crash","targetId":"db-primary","injectionTime":10,"duration":60}]}]}},"aws":{"summary":"AWS — AZ outage + RDS crash + EC2 kill batch result (us-east-1)","value":{"batchId":"b2c3d4e5-f6a7-8901-bcde-f12345678901","status":"completed","totalJobs":3,"completedJobs":3,"failedJobs":0,"aggregatedResilienceScore":{"overall":70,"grade":"C","breakdown":{"availability":65,"recoverability":73,"faultTolerance":72}},"aggregatedVulnerabilities":[{"id":"single_az","severity":"high","title":"Single Availability Zone Dependency (us-east-1a)","occurrences":3},{"id":"no_multi_az_rds","severity":"medium","title":"RDS Multi-AZ Not Enabled","occurrences":2}],"aggregatedRecommendations":["Deploy EC2 Auto Scaling groups across at least two AZs and configure the ALB for cross-zone load balancing","Enable RDS Multi-AZ to allow automatic standby promotion within 60 seconds of a primary AZ failure","Configure Route 53 health checks for DNS-level failover to a secondary 
region"],"childJobs":[{"id":"aws-job-001","status":"completed","scenarioId":"zone_failure_us_east_1a","duration":120,"resilienceScore":{"overall":68,"grade":"D","breakdown":{"availability":62,"recoverability":71,"faultTolerance":71}},"vulnerabilities":[{"id":"single_az","severity":"high","title":"Single Availability Zone Dependency (us-east-1a)","description":"All EC2 m5.large instances are in us-east-1a. A zone outage takes the full compute fleet offline."}],"recommendations":["Deploy EC2 instances across us-east-1a and us-east-1b with cross-zone ALB"],"timeline":[{"time":0,"event":"Chaos test started against AWS simulation","severity":"info"},{"time":15,"event":"Zone failure injected — us-east-1a marked unavailable","severity":"critical"},{"time":17,"event":"All EC2 m5.large instances unreachable; ALB health checks failing","severity":"critical"},{"time":90,"event":"us-east-1a restored; EC2 instances restarting","severity":"warning"},{"time":110,"event":"Traffic resumed; error rate returning to baseline","severity":"info"},{"time":120,"event":"Test completed","severity":"info"}]},{"id":"aws-job-002","status":"completed","scenarioId":"rds_crash_us_east_1a","duration":100,"resilienceScore":{"overall":70,"grade":"C","breakdown":{"availability":65,"recoverability":73,"faultTolerance":72}},"vulnerabilities":[{"id":"no_multi_az_rds","severity":"medium","title":"RDS Multi-AZ Not Enabled","description":"The RDS db.r5.large instance has no Multi-AZ standby. 
A crash requires manual failover with extended downtime."}],"recommendations":["Enable RDS Multi-AZ for automatic standby promotion within 60 seconds"],"timeline":[{"time":0,"event":"Chaos test started against AWS simulation","severity":"info"},{"time":18,"event":"RDS db.r5.large crash injected","severity":"critical"},{"time":20,"event":"EC2 instances reporting database connection errors","severity":"critical"},{"time":85,"event":"RDS instance restored; EC2 instances reconnecting","severity":"info"},{"time":100,"event":"Test completed — recovery time 67 steps","severity":"info"}],"injections":[{"type":"database_crash","targetId":"rds-primary","injectionTime":18,"duration":67}]},{"id":"aws-job-003","status":"completed","scenarioId":"kill_instance_ec2","duration":90,"resilienceScore":{"overall":72,"grade":"C","breakdown":{"availability":68,"recoverability":75,"faultTolerance":73}},"vulnerabilities":[],"recommendations":["Configure EC2 Auto Scaling to replace terminated instances automatically within 90 seconds"],"timeline":[{"time":0,"event":"Chaos test started against AWS simulation","severity":"info"},{"time":12,"event":"EC2 instance 'web-1' terminated","severity":"warning"},{"time":14,"event":"ALB redistributed traffic to remaining EC2 instances","severity":"warning"},{"time":75,"event":"Auto Scaling group provisioned replacement instance","severity":"info"},{"time":90,"event":"Test completed","severity":"info"}]}]}},"gcp":{"summary":"GCP — zone failure + Cloud SQL crash batch result (us-central1)","value":{"batchId":"c3d4e5f6-a7b8-9012-cdef-123456789012","status":"completed","totalJobs":2,"completedJobs":2,"failedJobs":0,"aggregatedResilienceScore":{"overall":71,"grade":"C","breakdown":{"availability":65,"recoverability":74,"faultTolerance":74}},"aggregatedVulnerabilities":[{"id":"single_zone_gce","severity":"high","title":"GCE Instances Concentrated in us-central1-a","occurrences":2},{"id":"no_cloud_sql_ha","severity":"medium","title":"Cloud SQL High 
Availability Not Enabled","occurrences":1}],"aggregatedRecommendations":["Use a regional GCE Managed Instance Group spanning us-central1-a, us-central1-b, and us-central1-c","Enable Cloud SQL High Availability to provision a standby in a secondary zone with sub-60 s failover","Configure a global HTTP(S) Load Balancer for cross-zone traffic routing"],"childJobs":[{"id":"gcp-job-001","status":"completed","scenarioId":"zone_failure_us_central1_a","duration":110,"resilienceScore":{"overall":70,"grade":"C","breakdown":{"availability":64,"recoverability":73,"faultTolerance":73}},"vulnerabilities":[{"id":"single_zone_gce","severity":"high","title":"GCE Instances Concentrated in us-central1-a","description":"All e2-standard-4 instances are in us-central1-a. A zone failure takes down all compute capacity."}],"recommendations":["Create a regional MIG across all three us-central1 zones"],"timeline":[{"time":0,"event":"Chaos test started against GCP simulation","severity":"info"},{"time":12,"event":"Zone failure injected — us-central1-a marked unavailable","severity":"critical"},{"time":13,"event":"All GCE e2-standard-4 instances unreachable; Cloud Load Balancing health checks failing","severity":"critical"},{"time":88,"event":"us-central1-a restored; GCE instances restarting","severity":"warning"},{"time":105,"event":"Traffic resumed; error rate returning to baseline","severity":"info"},{"time":110,"event":"Test completed","severity":"info"}]},{"id":"gcp-job-002","status":"completed","scenarioId":"cloud_sql_crash_us_central1","duration":90,"resilienceScore":{"overall":72,"grade":"C","breakdown":{"availability":66,"recoverability":75,"faultTolerance":75}},"vulnerabilities":[{"id":"no_cloud_sql_ha","severity":"medium","title":"Cloud SQL High Availability Not Enabled","description":"The Cloud SQL db-standard-4 instance has no HA standby. 
A zone failure requires manual intervention with extended downtime."}],"recommendations":["Enable Cloud SQL High Availability for automatic standby failover in under 60 seconds"],"timeline":[{"time":0,"event":"Chaos test started against GCP simulation","severity":"info"},{"time":16,"event":"Cloud SQL instance crash injected","severity":"critical"},{"time":18,"event":"GCE instances reporting Cloud SQL connection errors","severity":"critical"},{"time":80,"event":"Cloud SQL restored; GCE instances reconnecting","severity":"info"},{"time":90,"event":"Test completed","severity":"info"}],"injections":[{"type":"database_crash","targetId":"cloudsql-primary","injectionTime":16,"duration":64}]}]}},"azure":{"summary":"Azure — VM Scale Set zone failure + Azure SQL crash batch result (East US)","value":{"batchId":"d4e5f6a7-b8c9-0123-defa-234567890123","status":"completed","totalJobs":2,"completedJobs":2,"failedJobs":0,"aggregatedResilienceScore":{"overall":67,"grade":"D","breakdown":{"availability":61,"recoverability":70,"faultTolerance":70}},"aggregatedVulnerabilities":[{"id":"no_zone_redundant_vmss","severity":"high","title":"VM Scale Set Not Zone-Redundant (East US)","occurrences":2},{"id":"no_zone_redundant_sql","severity":"high","title":"Azure SQL Not Zone-Redundant","occurrences":1}],"aggregatedRecommendations":["Configure the VM Scale Set with zone redundancy across all three East US availability zones","Enable zone-redundant backup for Azure SQL Database","Add an Azure Traffic Manager profile for regional DNS failover to West US 2"],"childJobs":[{"id":"azure-job-001","status":"completed","scenarioId":"zone_failure_east_us_zone1","duration":115,"resilienceScore":{"overall":65,"grade":"D","breakdown":{"availability":59,"recoverability":68,"faultTolerance":68}},"vulnerabilities":[{"id":"no_zone_redundant_vmss","severity":"high","title":"VM Scale Set Not Zone-Redundant (East US)","description":"All Standard_D4s_v3 VMs are pinned to East US zone 1. 
An Azure infrastructure event in that zone takes the entire compute fleet offline."}],"recommendations":["Enable zone redundancy on the VM Scale Set across zones 1, 2, and 3 in East US"],"timeline":[{"time":0,"event":"Chaos test started against Azure simulation","severity":"info"},{"time":14,"event":"Zone failure injected — East US zone 1 marked unavailable","severity":"critical"},{"time":16,"event":"All Standard_D4s_v3 VMs unreachable; Azure Load Balancer health probes failing","severity":"critical"},{"time":95,"event":"East US zone 1 restored; VMs returning online","severity":"warning"},{"time":108,"event":"Traffic resumed; error rate returning to baseline","severity":"info"},{"time":115,"event":"Test completed","severity":"info"}]},{"id":"azure-job-002","status":"completed","scenarioId":"azure_sql_crash_east_us","duration":95,"resilienceScore":{"overall":69,"grade":"D","breakdown":{"availability":63,"recoverability":72,"faultTolerance":72}},"vulnerabilities":[{"id":"no_zone_redundant_sql","severity":"high","title":"Azure SQL Not Zone-Redundant","description":"Azure SQL General Purpose tier is not zone-redundant. 
A zone failure may require a point-in-time restore with extended downtime."}],"recommendations":["Enable zone-redundant backup for Azure SQL to avoid point-in-time restore delays"],"timeline":[{"time":0,"event":"Chaos test started against Azure simulation","severity":"info"},{"time":18,"event":"Azure SQL Database crash injected","severity":"critical"},{"time":20,"event":"VM instances reporting Azure SQL connection errors","severity":"critical"},{"time":82,"event":"Azure SQL restored; VMs reconnecting","severity":"info"},{"time":95,"event":"Test completed","severity":"info"}],"injections":[{"type":"database_crash","targetId":"azure-sql-primary","injectionTime":18,"duration":64}]}]}},"oci":{"summary":"OCI — AD-1 failure + Autonomous DB crash batch result (us-ashburn-1)","value":{"batchId":"e5f6a7b8-c9d0-1234-efab-345678901234","status":"completed","totalJobs":2,"completedJobs":2,"failedJobs":0,"aggregatedResilienceScore":{"overall":68,"grade":"D","breakdown":{"availability":62,"recoverability":71,"faultTolerance":71}},"aggregatedVulnerabilities":[{"id":"single_ad","severity":"high","title":"VM Instances Confined to AD-1 (us-ashburn-1)","occurrences":2},{"id":"no_adb_cross_ad_data_guard","severity":"medium","title":"Autonomous Database Has No Cross-AD Data Guard","occurrences":2}],"aggregatedRecommendations":["Distribute VM.Standard3.Flex instances across AD-1, AD-2, and AD-3 using an OCI instance pool","Enable Autonomous Database Cross-AD Data Guard for automatic failover to AD-2 within 30 seconds","Use OCI Traffic Management Steering Policies for health-based routing across ADs"],"childJobs":[{"id":"oci-job-001","status":"completed","scenarioId":"ad_failure_us_ashburn_1_ad1","duration":120,"resilienceScore":{"overall":67,"grade":"D","breakdown":{"availability":61,"recoverability":70,"faultTolerance":70}},"vulnerabilities":[{"id":"single_ad","severity":"high","title":"VM Instances Confined to AD-1 (us-ashburn-1)","description":"All VM.Standard3.Flex instances are in 
AD-1. An AD-level failure takes the entire compute fleet offline with no cross-AD failover."}],"recommendations":["Distribute instances across AD-1, AD-2, and AD-3 using an OCI instance pool with an OCI Load Balancer"],"timeline":[{"time":0,"event":"Chaos test started against OCI simulation","severity":"info"},{"time":13,"event":"AD-1 failure injected — all AD-1 resources marked unavailable","severity":"critical"},{"time":15,"event":"All VM.Standard3.Flex instances unreachable; OCI Load Balancer health checks failing","severity":"critical"},{"time":92,"event":"AD-1 restored; VM instances restarting","severity":"warning"},{"time":108,"event":"Traffic resumed; error rate returning to baseline","severity":"info"},{"time":120,"event":"Test completed","severity":"info"}]},{"id":"oci-job-002","status":"completed","scenarioId":"adb_crash_us_ashburn_1","duration":100,"resilienceScore":{"overall":69,"grade":"D","breakdown":{"availability":63,"recoverability":72,"faultTolerance":72}},"vulnerabilities":[{"id":"no_adb_cross_ad_data_guard","severity":"medium","title":"Autonomous Database Has No Cross-AD Data Guard","description":"The Autonomous Database is deployed in AD-1 only. 
Enabling Cross-AD Data Guard provides automatic failover to a standby in AD-2."}],"recommendations":["Enable Autonomous Database Cross-AD Data Guard for automatic failover to AD-2 within 30 seconds","Use Universal Connection Pool (UCP) to absorb reconnection bursts after a database failover"],"timeline":[{"time":0,"event":"Chaos test started against OCI simulation","severity":"info"},{"time":16,"event":"Autonomous Database primary crash injected","severity":"critical"},{"time":18,"event":"VM.Standard3.Flex instances reporting Autonomous Database connection errors","severity":"critical"},{"time":82,"event":"Autonomous Database primary restored; VMs reconnecting","severity":"info"},{"time":100,"event":"Test completed","severity":"info"}],"injections":[{"type":"database_crash","targetId":"autonomous-db-ad1","injectionTime":16,"duration":66}]}]}}}}}},"400":{"description":"Batch job not completed yet. Returned when the batch chaos job status is\n`pending` or `running`. Poll `GET /chaos/batch/{batchId}` until status is\n`completed` before fetching results.\n","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/multi-cloud/explore":{"x-stability":"stable","post":{"tags":["Multi-Cloud Strategy"],"summary":"Explore multi-cloud deployment strategies","description":"Analyze a workload profile and generate optimized multi-cloud deployment strategies.\nThe system evaluates different provider combinations across AWS, GCP, Azure, OCI, and DigitalOcean\nbased on cost, latency, and vendor lock-in considerations.\n\n**DigitalOcean as a candidate provider:** DigitalOcean is a first-class candidate in every\nexploration run. It is particularly well-suited for cost-optimized workloads — Droplets and\nManaged Databases typically produce the lowest monthly spend in the comparison report. Set a\nhigh `cost` weight (e.g. 
0.7+) and a moderate budget to see DigitalOcean-primary strategies\nappear at the top of `topStrategies` in the results.\n","operationId":"exploreMultiCloudStrategies","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X POST https://your-production-domain.com/api/multi-cloud/explore \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"workloadProfile\": {\n      \"name\": \"Balanced E-Commerce Platform\",\n      \"expectedRps\": 2500,\n      \"peakRps\": 8000,\n      \"dataResidencyRegions\": [\"us-east-1\", \"eu-west-1\"],\n      \"latencyRequirementMs\": 100,\n      \"monthlyBudget\": 15000,\n      \"complianceRequirements\": []\n    },\n    \"optimizationWeights\": {\n      \"cost\": 0.4,\n      \"latency\": 0.4,\n      \"vendorLockIn\": 0.2\n    },\n    \"webhookUrl\": \"https://your-app.com/webhooks/multicloud\"\n  }'\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\npayload = {\n    \"workloadProfile\": {\n        \"name\": \"Balanced E-Commerce Platform\",\n        \"expectedRps\": 2500,\n        \"peakRps\": 8000,\n        \"dataResidencyRegions\": [\"us-east-1\", \"eu-west-1\"],\n        \"latencyRequirementMs\": 100,\n        \"monthlyBudget\": 15000,\n        \"complianceRequirements\": [],\n    },\n    \"optimizationWeights\": {\"cost\": 0.4, \"latency\": 0.4, \"vendorLockIn\": 0.2},\n    \"webhookUrl\": \"https://your-app.com/webhooks/multicloud\",\n}\n\nresp = requests.post(\n    f\"{BASE_URL}/multi-cloud/explore\",\n    json=payload,\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\njob = data[\"job\"]\nprint(f\"Job started: id={job['id']}  status={job['status']}\")\nprint(f\"Poll status at: GET /api/multi-cloud/jobs/{job['id']}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = 
\"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst payload = {\n  workloadProfile: {\n    name: \"Balanced E-Commerce Platform\",\n    expectedRps: 2500,\n    peakRps: 8000,\n    dataResidencyRegions: [\"us-east-1\", \"eu-west-1\"],\n    latencyRequirementMs: 100,\n    monthlyBudget: 15000,\n    complianceRequirements: [],\n  },\n  optimizationWeights: { cost: 0.4, latency: 0.4, vendorLockIn: 0.2 },\n  webhookUrl: \"https://your-app.com/webhooks/multicloud\",\n};\n\nconst resp = await fetch(`${BASE_URL}/multi-cloud/explore`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify(payload),\n});\nconst data = await resp.json();\nconst { job } = data;\nconsole.log(`Job started: id=${job.id}  status=${job.status}`);\nconsole.log(`Poll status at: GET /api/multi-cloud/jobs/${job.id}`);\n"},{"lang":"curl","label":"curl (OCI Preemptible)","source":"curl -X POST https://your-production-domain.com/api/multi-cloud/explore \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"workloadProfile\": {\n      \"name\": \"OCI Preemptible Batch Worker\",\n      \"expectedRps\": 700,\n      \"peakRps\": 2000,\n      \"dataResidencyRegions\": [\"us-ashburn-1\"],\n      \"latencyRequirementMs\": 600,\n      \"monthlyBudget\": 2000,\n      \"complianceRequirements\": []\n    },\n    \"optimizationWeights\": {\n      \"cost\": 0.75,\n      \"latency\": 0.15,\n      \"vendorLockIn\": 0.10\n    },\n    \"webhookUrl\": \"https://your-app.example.com/webhooks/multicloud\",\n    \"webhookSecret\": \"oci-preemptible-secret\"\n  }'\n"},{"lang":"Python","label":"Python (OCI Preemptible)","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\n\npayload = {\n    \"workloadProfile\": {\n        \"name\": \"OCI Preemptible Batch Worker\",\n        
\"expectedRps\": 700,\n        \"peakRps\": 2000,\n        \"dataResidencyRegions\": [\"us-ashburn-1\"],\n        \"latencyRequirementMs\": 600,\n        \"monthlyBudget\": 2000,\n        \"complianceRequirements\": [],\n    },\n    \"optimizationWeights\": {\"cost\": 0.75, \"latency\": 0.15, \"vendorLockIn\": 0.10},\n    \"webhookUrl\": \"https://your-app.example.com/webhooks/multicloud\",\n    \"webhookSecret\": \"oci-preemptible-secret\",\n}\n\nresp = requests.post(\n    f\"{BASE_URL}/multi-cloud/explore\",\n    json=payload,\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\njob = data[\"job\"]\nprint(f\"Job started: id={job['id']}  status={job['status']}\")\nprint(f\"Poll status at: GET /api/multi-cloud/jobs/{job['id']}\")\n"},{"lang":"TypeScript","label":"TypeScript (OCI Preemptible)","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\n\nconst payload = {\n  workloadProfile: {\n    name: \"OCI Preemptible Batch Worker\",\n    expectedRps: 700,\n    peakRps: 2000,\n    dataResidencyRegions: [\"us-ashburn-1\"],\n    latencyRequirementMs: 600,\n    monthlyBudget: 2000,\n    complianceRequirements: [] as string[],\n  },\n  optimizationWeights: { cost: 0.75, latency: 0.15, vendorLockIn: 0.10 },\n  webhookUrl: \"https://your-app.example.com/webhooks/multicloud\",\n  webhookSecret: \"oci-preemptible-secret\",\n};\n\nconst resp = await fetch(`${BASE_URL}/multi-cloud/explore`, {\n  method: \"POST\",\n  headers: {\n    \"Authorization\": `Bearer ${API_KEY}`,\n    \"Content-Type\": \"application/json\",\n  },\n  body: JSON.stringify(payload),\n});\nconst data = await resp.json();\nconst { job } = data;\nconsole.log(`Job started: id=${job.id}  status=${job.status}`);\nconsole.log(`Poll status at: GET 
/api/multi-cloud/jobs/${job.id}`);\n"}],"security":[{"BearerAuth":["write"]}],"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","required":["workloadProfile"],"properties":{"workloadProfile":{"$ref":"#/components/schemas/WorkloadProfile"},"optimizationWeights":{"type":"object","description":"Weights for optimization objectives (must sum to 1.0)","properties":{"cost":{"type":"number","minimum":0,"maximum":1,"default":0.4,"description":"Weight for cost optimization","example":0.4},"latency":{"type":"number","minimum":0,"maximum":1,"default":0.4,"description":"Weight for latency optimization","example":0.4},"vendorLockIn":{"type":"number","minimum":0,"maximum":1,"default":0.2,"description":"Weight for minimizing vendor lock-in","example":0.2}}},"webhookUrl":{"type":"string","format":"uri","description":"Optional HTTPS URL that receives a webhook notification when the job completes","example":"https://your-app.com/webhooks/multicloud"},"webhookSecret":{"type":"string","description":"Optional secret for HMAC-SHA256 webhook signature verification","example":"your-secret-key-here"}}},"examples":{"balanced":{"summary":"Balanced multi-cloud workload (equal cost and latency weights, lower lock-in weight)","value":{"workloadProfile":{"name":"Balanced E-Commerce Platform","expectedRps":2500,"peakRps":8000,"dataResidencyRegions":["us-east-1","eu-west-1"],"latencyRequirementMs":100,"monthlyBudget":15000,"complianceRequirements":[]},"optimizationWeights":{"cost":0.4,"latency":0.4,"vendorLockIn":0.2},"webhookUrl":"https://your-app.com/webhooks/multicloud"}},"costOptimized":{"summary":"Cost-optimized workload favouring DigitalOcean","value":{"workloadProfile":{"name":"Cost-Sensitive Startup 
App","expectedRps":1000,"peakRps":3000,"dataResidencyRegions":["us-east-1"],"latencyRequirementMs":150,"monthlyBudget":5000,"complianceRequirements":[]},"optimizationWeights":{"cost":0.7,"latency":0.2,"vendorLockIn":0.1},"webhookUrl":"https://your-app.com/webhooks/multicloud"}},"digitalOceanPrimary":{"summary":"DigitalOcean-primary strategy — maximize cost savings with DO Droplets","value":{"workloadProfile":{"name":"DigitalOcean-First SaaS Backend","expectedRps":800,"peakRps":2500,"dataResidencyRegions":["nyc3"],"latencyRequirementMs":200,"monthlyBudget":4000,"complianceRequirements":[]},"optimizationWeights":{"cost":0.75,"latency":0.15,"vendorLockIn":0.1},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"do-multicloud-secret"}},"digitaloceanAMDNVMe":{"summary":"DigitalOcean AMD NVMe Droplet — cost-optimized I/O-intensive workload targeting s-2vcpu-4gb-amd","value":{"workloadProfile":{"name":"DO AMD NVMe I/O-Intensive Backend","expectedRps":600,"peakRps":1800,"dataResidencyRegions":["nyc3"],"latencyRequirementMs":180,"monthlyBudget":2000,"complianceRequirements":[]},"optimizationWeights":{"cost":0.8,"latency":0.15,"vendorLockIn":0.05},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"do-amd-nvme-secret"}},"awsPrimary":{"summary":"AWS-primary strategy — high-performance enterprise workload on EC2 and RDS Multi-AZ","value":{"workloadProfile":{"name":"AWS Enterprise API Platform","expectedRps":5000,"peakRps":15000,"dataResidencyRegions":["us-east-1","us-west-2"],"latencyRequirementMs":50,"monthlyBudget":25000,"complianceRequirements":["soc2","pci-dss"]},"optimizationWeights":{"cost":0.3,"latency":0.5,"vendorLockIn":0.2},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"aws-multicloud-secret"}},"awsSpot":{"summary":"AWS EC2 Spot strategy — cost-optimized fault-tolerant batch workload using Spot Instances","value":{"workloadProfile":{"name":"AWS EC2 Spot Batch 
Worker","expectedRps":1000,"peakRps":3000,"dataResidencyRegions":["us-east-1"],"latencyRequirementMs":600,"monthlyBudget":3500,"complianceRequirements":[]},"optimizationWeights":{"cost":0.75,"latency":0.15,"vendorLockIn":0.1},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"aws-spot-secret"}},"gcpPrimary":{"summary":"GCP-primary strategy — ML analytics workload on Cloud Run and Cloud SQL","value":{"workloadProfile":{"name":"GCP ML Analytics Backend","expectedRps":3000,"peakRps":10000,"dataResidencyRegions":["us-central1","europe-west1"],"latencyRequirementMs":80,"monthlyBudget":18000,"complianceRequirements":["gdpr"]},"optimizationWeights":{"cost":0.3,"latency":0.5,"vendorLockIn":0.2},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"gcp-multicloud-secret"}},"gcpSpot":{"summary":"GCP Spot VM strategy — cost-optimized batch workload using preemptible compute","value":{"workloadProfile":{"name":"GCP Spot Batch Processing","expectedRps":800,"peakRps":2400,"dataResidencyRegions":["us-central1"],"latencyRequirementMs":500,"monthlyBudget":3000,"complianceRequirements":[]},"optimizationWeights":{"cost":0.75,"latency":0.15,"vendorLockIn":0.1},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"gcp-spot-secret"}},"azurePrimary":{"summary":"Azure-primary strategy — compliance-heavy enterprise workload on AKS and Azure Database","value":{"workloadProfile":{"name":"Azure Compliance Enterprise Platform","expectedRps":2000,"peakRps":6000,"dataResidencyRegions":["eastus","westeurope"],"latencyRequirementMs":100,"monthlyBudget":20000,"complianceRequirements":["gdpr","hipaa","iso27001"]},"optimizationWeights":{"cost":0.2,"latency":0.4,"vendorLockIn":0.4},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"azure-multicloud-secret"}},"azureSpot":{"summary":"Azure Spot VM strategy — cost-optimized fault-tolerant workload using Azure Spot 
instances","value":{"workloadProfile":{"name":"Azure Spot Fault-Tolerant Worker","expectedRps":600,"peakRps":1800,"dataResidencyRegions":["eastus"],"latencyRequirementMs":400,"monthlyBudget":2500,"complianceRequirements":[]},"optimizationWeights":{"cost":0.75,"latency":0.15,"vendorLockIn":0.1},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"azure-spot-secret"}},"ociPrimary":{"summary":"OCI-primary strategy — database-intensive workload on OCI Compute and Autonomous Database","value":{"workloadProfile":{"name":"OCI Database-Intensive Backend","expectedRps":1500,"peakRps":4000,"dataResidencyRegions":["us-ashburn-1"],"latencyRequirementMs":120,"monthlyBudget":12000,"complianceRequirements":[]},"optimizationWeights":{"cost":0.5,"latency":0.3,"vendorLockIn":0.2},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"oci-multicloud-secret"}},"ociSpot":{"summary":"OCI Ampere A1 Spot strategy — cost-optimized fault-tolerant batch workload using OCI preemptible Ampere A1 compute","value":{"workloadProfile":{"name":"OCI Ampere A1 Spot Batch Worker","expectedRps":700,"peakRps":2000,"dataResidencyRegions":["us-ashburn-1"],"latencyRequirementMs":600,"monthlyBudget":2000,"complianceRequirements":[]},"optimizationWeights":{"cost":0.75,"latency":0.15,"vendorLockIn":0.1},"webhookUrl":"https://your-app.example.com/webhooks/multicloud","webhookSecret":"oci-spot-secret"}}}}}},"responses":{"202":{"description":"Multi-cloud exploration job started","content":{"application/json":{"schema":{"type":"object","properties":{"job":{"type":"object","properties":{"id":{"type":"string","format":"uuid","example":"job-abc123"},"type":{"type":"string","enum":["multicloud_exploration"],"example":"multicloud_exploration"},"status":{"type":"string","enum":["pending","running"],"example":"running"},"createdAt":{"type":"string","format":"date-time"}}},"message":{"type":"string","example":"Multi-cloud exploration started. 
Use GET /multi-cloud/jobs/{id} to check status."}}}}}},"400":{"$ref":"#/components/responses/BadRequest"},"401":{"$ref":"#/components/responses/Unauthorized"}}}},"/multi-cloud/jobs/{jobId}":{"x-stability":"stable","get":{"tags":["Multi-Cloud Strategy"],"summary":"Get multi-cloud exploration job status","description":"Get the current status and progress of a multi-cloud strategy exploration job","operationId":"getMultiCloudJob","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl https://your-production-domain.com/api/multi-cloud/jobs/job_abc123 \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"job_abc123\"\n\nresp = requests.get(\n    f\"{BASE_URL}/multi-cloud/jobs/{JOB_ID}\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\njob = resp.json()\nprint(f\"Job {JOB_ID}  status={job['status']}  progress={job.get('progress', 0)}%  \"\n      f\"strategies={job.get('strategiesGenerated', 0)}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"job_abc123\";\n\nconst resp = await fetch(`${BASE_URL}/multi-cloud/jobs/${JOB_ID}`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst job = await resp.json();\nconsole.log(`Job ${JOB_ID}  status=${job.status}  progress=${job.progress ?? 0}%  strategies=${job.strategiesGenerated ?? 
0}`);\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the multi-cloud job"}],"responses":{"200":{"description":"Multi-cloud job status","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled"]},"progress":{"type":"number","description":"Completion progress (0-100)","example":75},"strategiesGenerated":{"type":"integer","description":"Number of strategies generated so far","example":12},"workloadProfile":{"$ref":"#/components/schemas/WorkloadProfile"},"optimizationWeights":{"type":"object","properties":{"cost":{"type":"number"},"latency":{"type":"number"},"vendorLockIn":{"type":"number"}}},"createdAt":{"type":"string","format":"date-time"},"updatedAt":{"type":"string","format":"date-time"},"completedAt":{"type":"string","format":"date-time"},"error":{"type":"string"}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}},"delete":{"summary":"Cancel multi-cloud exploration job","description":"Cancel a pending or running multi-cloud exploration job. 
This endpoint is idempotent - calling it multiple times\non the same job will return success without error.\n\n**Cancellation Rules:**\n- Jobs with status \"pending\" or \"running\" will be cancelled\n- Jobs already \"cancelled\" will return success (idempotent behavior)\n- Jobs with status \"completed\" or \"failed\" cannot be cancelled (returns 409)\n- Cancelled jobs will have status set to \"cancelled\" and a cancelledAt timestamp\n","operationId":"cancelMultiCloudJob","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -X DELETE https://your-production-domain.com/api/multi-cloud/jobs/job_abc123 \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"job_abc123\"\n\nresp = requests.delete(\n    f\"{BASE_URL}/multi-cloud/jobs/{JOB_ID}\",\n    headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(f\"Job {data['id']}  status={data['status']}  message={data.get('message')}\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"job_abc123\";\n\nconst resp = await fetch(`${BASE_URL}/multi-cloud/jobs/${JOB_ID}`, {\n  method: \"DELETE\",\n  headers: { \"Authorization\": `Bearer ${API_KEY}` },\n});\nconst data = await resp.json();\nconsole.log(`Job ${data.id}  status=${data.status}  message=${data.message}`);\n"}],"tags":["Multi-Cloud Strategy"],"security":[{"BearerAuth":[]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string"},"description":"Job ID"}],"responses":{"200":{"description":"Job cancelled successfully or was already cancelled.\nReturns the same response whether cancelling for the first time or if already cancelled\n(idempotent 
operation).\n","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"status":{"type":"string","enum":["cancelled"]},"cancelledAt":{"type":"string","format":"date-time"},"message":{"type":"string","description":"Message indicating if job was just cancelled or already cancelled"}}},"examples":{"newlyCancelled":{"value":{"id":"job_abc123","status":"cancelled","cancelledAt":"2024-01-15T10:30:00Z","message":"Job cancelled successfully"}},"alreadyCancelled":{"value":{"id":"job_abc123","status":"cancelled","cancelledAt":"2024-01-15T09:45:00Z","message":"Job already cancelled"}}}}}},"404":{"description":"Job not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"409":{"description":"Cannot cancel job that is already completed or failed","content":{"application/json":{"schema":{"type":"object","properties":{"error":{"type":"string"},"status":{"type":"string"}}},"example":{"error":"Cannot cancel job that is already completed or failed","status":"completed"}}}},"500":{"description":"Failed to cancel job","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/multi-cloud/jobs/{jobId}/results":{"x-stability":"stable","get":{"tags":["Multi-Cloud Strategy"],"summary":"Get multi-cloud exploration results","description":"Retrieve the **final** results from a multi-cloud exploration job.\n\n**Only returns data once the job has finished.** This endpoint returns\nresults exclusively when the job `status` is `completed`. 
While the job\nis still `pending` or `running` it returns `400 INVALID_REQUEST`; a\n`failed` job returns `500`.\n\n**Polling pattern:**\n- Poll `GET /multi-cloud/jobs/{jobId}` (or subscribe to `/stream`) until `status` is `completed`, then call this endpoint once for the final ranked strategies and comparison report.\n- To read strategies as they accumulate while the job is still running, use `GET /multi-cloud/jobs/{jobId}/partial-results` instead — that endpoint returns `isComplete: false` until the job reaches a terminal state.\n- For real-time streaming of progress, use the `/stream` endpoint.\n","operationId":"getMultiCloudResults","x-codeSamples":[{"lang":"curl","label":"curl","source":"until curl -sf \"https://your-production-domain.com/api/multi-cloud/jobs/job_abc123\" \\\n  -H \"Authorization: Bearer $API_KEY\" | grep -q '\"status\":\"completed\"'; do\n  sleep 2\ndone\n\ncurl \"https://your-production-domain.com/api/multi-cloud/jobs/job_abc123/results\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n\ncurl \"https://your-production-domain.com/api/multi-cloud/jobs/job_abc123/results?providers=digitalocean,aws\" \\\n  -H \"Authorization: Bearer $API_KEY\"\n"},{"lang":"Python","label":"Python","source":"import time\nimport requests\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"job_abc123\"\nHEADERS = {\"Authorization\": f\"Bearer {API_KEY}\"}\n\nwhile True:\n    status = requests.get(\n        f\"{BASE_URL}/multi-cloud/jobs/{JOB_ID}\", headers=HEADERS,\n    ).json()\n    if status[\"status\"] in (\"completed\", \"failed\", \"cancelled\"):\n        break\n    time.sleep(2)\n\nresp = requests.get(\n    f\"{BASE_URL}/multi-cloud/jobs/{JOB_ID}/results\",\n    headers=HEADERS,\n)\nresp.raise_for_status()\ndata = resp.json()\nprint(f\"Job {JOB_ID}  status={data['status']}  \"\n      f\"complete={data['isComplete']}  strategies={data['strategiesGenerated']}\")\nfor strategy in data.get(\"topStrategies\", []):\n    cost 
= strategy[\"metrics\"][\"monthlyCost\"]\n    latency = strategy[\"metrics\"][\"avgLatencyMs\"]\n    print(f\"  {strategy['name']}  cost=${cost}/mo  latency={latency}ms\")\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"job_abc123\";\nconst HEADERS = { \"Authorization\": `Bearer ${API_KEY}` };\n\n// Poll status until the job is finished. /results returns 400 while\n// the job is still pending or running; use /partial-results to read\n// strategies as they accumulate during a run.\nlet status;\ndo {\n  await new Promise(r => setTimeout(r, 2000));\n  status = await fetch(`${BASE_URL}/multi-cloud/jobs/${JOB_ID}`, {\n    headers: HEADERS,\n  }).then(r => r.json());\n} while (![\"completed\", \"failed\", \"cancelled\"].includes(status.status));\n\nconst resp = await fetch(`${BASE_URL}/multi-cloud/jobs/${JOB_ID}/results`, {\n  headers: HEADERS,\n});\nconst data = await resp.json();\nconsole.log(`Job ${JOB_ID}  status=${data.status}  complete=${data.isComplete}  strategies=${data.strategiesGenerated}`);\nfor (const strategy of data.topStrategies ?? 
[]) {\n  const { monthlyCost, avgLatencyMs } = strategy.metrics;\n  console.log(`  ${strategy.name}  cost=$${monthlyCost}/mo  latency=${avgLatencyMs}ms`);\n}\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the multi-cloud job"},{"name":"providers","in":"query","required":false,"schema":{"type":"string","example":"digitalocean,aws"},"description":"Comma-separated list of cloud provider names to filter results by preferred primary provider.\nA strategy is included if any of the specified providers holds ≥ 50 % of the traffic allocation.\nSupported values: `aws`, `gcp`, `azure`, `digitalocean`, `oci`.\nIf omitted, all strategies are returned.\n"}],"responses":{"200":{"description":"Final job results (returned only once the job status is `completed`)","content":{"application/json":{"schema":{"type":"object","properties":{"jobId":{"type":"string"},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled"]},"progress":{"type":"number","description":"Job progress percentage (0-100)"},"isComplete":{"type":"boolean","description":"True if job is finished, false if still running or pending"},"strategiesGenerated":{"type":"number","description":"Number of strategies generated so far"},"allStrategies":{"type":"array","items":{"$ref":"#/components/schemas/Strategy"},"description":"All raw strategies generated so far (available during and after generation)"},"topStrategies":{"type":"array","items":{"$ref":"#/components/schemas/Strategy"},"description":"Top 10 optimized strategies (only available when status is completed)"},"comparisonReport":{"type":"string","description":"Markdown comparison report (only available when 
complete)"},"completedAt":{"type":"string","format":"date-time"}}},"examples":{"partialResults":{"value":{"jobId":"job_abc123","status":"running","progress":45,"isComplete":false,"strategiesGenerated":15,"allStrategies":[{"name":"AWS-Dominant Strategy","description":"Primary AWS deployment with GCP failover","allocations":[{"provider":"aws","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":12500,"avgLatencyMs":45,"vendorLockInScore":65}},{"name":"DigitalOcean Cost-Optimized","description":"DigitalOcean primary with AWS failover — lowest monthly spend","allocations":[{"provider":"digitalocean","percentage":80},{"provider":"aws","percentage":20}],"metrics":{"monthlyCost":7800,"avgLatencyMs":55,"vendorLockInScore":48}}],"topStrategies":[]}},"completeResults":{"value":{"jobId":"job_abc123","status":"completed","progress":100,"isComplete":true,"strategiesGenerated":42,"allStrategies":[{"name":"AWS-Dominant Strategy","description":"Primary AWS deployment with GCP failover","allocations":[{"provider":"aws","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":12500,"avgLatencyMs":45,"vendorLockInScore":65}},{"name":"DigitalOcean Cost-Optimized","description":"DigitalOcean primary with AWS failover — lowest monthly spend","allocations":[{"provider":"digitalocean","percentage":80},{"provider":"aws","percentage":20}],"metrics":{"monthlyCost":7800,"avgLatencyMs":55,"vendorLockInScore":48}}],"topStrategies":[{"name":"Multi-Cloud Balanced","description":"Optimized multi-cloud strategy for cost and performance","allocations":[{"provider":"aws","percentage":40},{"provider":"gcp","percentage":35},{"provider":"azure","percentage":25}],"metrics":{"monthlyCost":11200,"avgLatencyMs":42,"vendorLockInScore":35}},{"name":"DigitalOcean Cost-Optimized","description":"DigitalOcean primary with AWS failover — lowest monthly 
spend","allocations":[{"provider":"digitalocean","percentage":80},{"provider":"aws","percentage":20}],"metrics":{"monthlyCost":7800,"avgLatencyMs":55,"vendorLockInScore":48}}],"comparisonReport":"# Multi-Cloud Strategy Comparison Report...","completedAt":"2024-01-15T10:30:00Z"}},"digitalOceanWinner":{"summary":"DigitalOcean wins — cost-optimized workload where DO-primary ranks first","value":{"jobId":"job_do_xyz789","status":"completed","progress":100,"isComplete":true,"strategiesGenerated":38,"allStrategies":[{"name":"DigitalOcean-Primary Strategy","description":"DigitalOcean Droplets (s-4vcpu-8gb) as primary compute, DigitalOcean Managed Databases (PostgreSQL) for persistence, and Spaces for S3-compatible object storage — lowest total monthly spend in the comparison","allocations":[{"provider":"digitalocean","percentage":80},{"provider":"aws","percentage":20}],"metrics":{"monthlyCost":7800,"avgLatencyMs":55,"vendorLockInScore":48}},{"name":"AWS-Dominant Strategy","description":"Primary AWS deployment with GCP failover","allocations":[{"provider":"aws","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":14200,"avgLatencyMs":43,"vendorLockInScore":74}},{"name":"GCP-Primary Strategy","description":"GCP Cloud Run primary with Azure failover","allocations":[{"provider":"gcp","percentage":60},{"provider":"azure","percentage":40}],"metrics":{"monthlyCost":11600,"avgLatencyMs":46,"vendorLockInScore":61}}],"topStrategies":[{"name":"DigitalOcean-Primary Strategy","description":"DigitalOcean Droplets (s-4vcpu-8gb) as primary compute, DigitalOcean Managed Databases (PostgreSQL) for persistence, and Spaces for S3-compatible object storage — lowest total monthly spend in the comparison","allocations":[{"provider":"digitalocean","percentage":80},{"provider":"aws","percentage":20}],"metrics":{"monthlyCost":7800,"avgLatencyMs":55,"vendorLockInScore":48}},{"name":"DigitalOcean + GCP Balanced","description":"DigitalOcean Droplets for primary API tier with 
DigitalOcean Managed Databases, GCP Cloud Run for burst compute — good cost/latency balance","allocations":[{"provider":"digitalocean","percentage":65},{"provider":"gcp","percentage":35}],"metrics":{"monthlyCost":9400,"avgLatencyMs":50,"vendorLockInScore":41}},{"name":"AWS-Dominant Strategy","description":"Primary AWS deployment with GCP failover","allocations":[{"provider":"aws","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":14200,"avgLatencyMs":43,"vendorLockInScore":74}}],"comparisonReport":"# Multi-Cloud Strategy Comparison Report\n\n## Winner: DigitalOcean-Primary Strategy\n\nMonthly cost: **$7,800** — 45% lower than the AWS-dominant baseline ($14,200).\nAverage latency: **55 ms** — comfortably within the 150 ms SLA requirement.\nVendor lock-in score: **48/100** — moderate lock-in, significantly lower than the AWS-dominant baseline (74/100).\n\n## Key Resources\n\n- **Compute:** DigitalOcean Droplets (s-4vcpu-8gb, nyc3) — predictable hourly pricing with no data-transfer surprises within the region.\n- **Database:** DigitalOcean Managed Databases (PostgreSQL, 2-node HA cluster) — automated failover, daily backups, and connection pooling included.\n- **Object Storage:** DigitalOcean Spaces — S3-compatible API, 250 GB included, CDN edge caching available at no extra charge.\n\n## Trade-offs\n\n| Metric | DO-Primary | AWS-Dominant | DO + GCP |\n|---|---|---|---|\n| Monthly cost | $7,800 | $14,200 | $9,400 |\n| Avg latency | 55 ms | 43 ms | 50 ms |\n| Lock-in score | 48 | 74 | 41 |\n\n## Recommendation\n\nFor cost-sensitive workloads with moderate latency requirements, DigitalOcean Droplets\npaired with DigitalOcean Managed Databases and Spaces deliver the best cost efficiency.\nThe 12 ms latency difference versus the AWS baseline is unlikely to impact end-user\nexperience for the given SLA. 
Consider the DigitalOcean + GCP Balanced strategy if\nfuture burst capacity beyond current Droplet limits is anticipated.","completedAt":"2024-01-16T09:45:00Z"}},"awsPrimaryWinner":{"summary":"AWS wins — performance-optimized workload where EC2 m5.xlarge + RDS Multi-AZ ranks","value":{"jobId":"job_aws_perf123","status":"completed","progress":100,"isComplete":true,"strategiesGenerated":40,"allStrategies":[{"name":"AWS-Primary Strategy","description":"EC2 m5.xlarge Auto Scaling group across us-east-1a/1b with RDS db.r5.large Multi-AZ and CloudFront CDN — lowest p95 latency in comparison","allocations":[{"provider":"aws","percentage":100}],"metrics":{"monthlyCost":14500,"avgLatencyMs":38,"vendorLockInScore":72}},{"name":"AWS + GCP Balanced","description":"EC2 primary with GCP Cloud Run burst capacity","allocations":[{"provider":"aws","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":13200,"avgLatencyMs":41,"vendorLockInScore":58}},{"name":"DigitalOcean Cost-Optimized","description":"DigitalOcean Droplets primary — lowest cost but 17 ms higher latency","allocations":[{"provider":"digitalocean","percentage":80},{"provider":"aws","percentage":20}],"metrics":{"monthlyCost":7800,"avgLatencyMs":55,"vendorLockInScore":48}}],"topStrategies":[{"name":"AWS-Primary Strategy","description":"EC2 m5.xlarge Auto Scaling group across us-east-1a/1b with RDS db.r5.large Multi-AZ and CloudFront CDN — lowest p95 latency in comparison","allocations":[{"provider":"aws","percentage":100}],"metrics":{"monthlyCost":14500,"avgLatencyMs":38,"vendorLockInScore":72}},{"name":"AWS + GCP Balanced","description":"EC2 primary with GCP Cloud Run burst capacity","allocations":[{"provider":"aws","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":13200,"avgLatencyMs":41,"vendorLockInScore":58}}],"comparisonReport":"# Multi-Cloud Strategy Comparison Report\n\n## Winner: AWS-Primary Strategy\n\nMonthly cost: **$14,500** — within the $25,000 
budget.\nAverage latency: **38 ms** — best in class, comfortably within the 50 ms SLA.\nVendor lock-in score: **72/100** — acceptable given the performance requirements.\n\n## Key Resources\n\n- **Compute:** EC2 m5.xlarge (4 vCPU / 16 GB) Auto Scaling group, 3–12 instances, us-east-1a/1b.\n- **Database:** RDS db.r5.large Multi-AZ PostgreSQL — automatic failover within 60 seconds.\n- **CDN:** CloudFront with edge caching reduces origin load by ~40%.\n\n## Trade-offs\n\n| Metric | AWS-Primary | AWS + GCP | DO Cost-Optimized |\n|---|---|---|---|\n| Monthly cost | $14,500 | $13,200 | $7,800 |\n| Avg latency | 38 ms | 41 ms | 55 ms |\n| Lock-in score | 72 | 58 | 48 |\n\n## Recommendation\n\nFor workloads with a 50 ms SLA, AWS-Primary delivers the best latency. If budget is a constraint, the AWS + GCP Balanced strategy saves $1,300/month with only 3 ms latency degradation.","completedAt":"2024-01-17T08:20:00Z"}},"gcpPrimaryWinner":{"summary":"GCP wins — ML analytics workload where Cloud Run + Cloud SQL ranks","value":{"jobId":"job_gcp_ml456","status":"completed","progress":100,"isComplete":true,"strategiesGenerated":35,"allStrategies":[{"name":"GCP-Primary Strategy","description":"Cloud Run (fully managed, us-central1) with Cloud SQL for PostgreSQL (HA) and Cloud Storage — best autoscaling fit for variable ML inference load","allocations":[{"provider":"gcp","percentage":100}],"metrics":{"monthlyCost":11800,"avgLatencyMs":42,"vendorLockInScore":63}},{"name":"GCP + AWS Hybrid","description":"GCP Cloud Run primary with AWS Lambda for async processing jobs","allocations":[{"provider":"gcp","percentage":65},{"provider":"aws","percentage":35}],"metrics":{"monthlyCost":12400,"avgLatencyMs":44,"vendorLockInScore":51}},{"name":"AWS-Dominant Strategy","description":"EC2 m5.large fleet — higher cost, comparable latency","allocations":[{"provider":"aws","percentage":100}],"metrics":{"monthlyCost":15600,"avgLatencyMs":45,"vendorLockInScore":74}}],"topStrategies":[{"name":"GCP-Primary 
Strategy","description":"Cloud Run (fully managed, us-central1) with Cloud SQL for PostgreSQL (HA) and Cloud Storage — best autoscaling fit for variable ML inference load","allocations":[{"provider":"gcp","percentage":100}],"metrics":{"monthlyCost":11800,"avgLatencyMs":42,"vendorLockInScore":63}},{"name":"GCP + AWS Hybrid","description":"GCP Cloud Run primary with AWS Lambda for async processing jobs","allocations":[{"provider":"gcp","percentage":65},{"provider":"aws","percentage":35}],"metrics":{"monthlyCost":12400,"avgLatencyMs":44,"vendorLockInScore":51}}],"comparisonReport":"# Multi-Cloud Strategy Comparison Report\n\n## Winner: GCP-Primary Strategy\n\nMonthly cost: **$11,800** — $3,800 below the AWS baseline.\nAverage latency: **42 ms** — within the 80 ms SLA requirement.\nVendor lock-in score: **63/100** — moderate, lower than an AWS-only approach.\n\n## Key Resources\n\n- **Compute:** Cloud Run (fully managed) in us-central1 — scales to zero between inference jobs, eliminating idle compute cost.\n- **Database:** Cloud SQL for PostgreSQL (HA, db-n1-standard-4) — regional failover with 99.95% SLA.\n- **Storage:** Cloud Storage Standard — multi-region bucket with CDN integration for model artefacts.\n\n## Trade-offs\n\n| Metric | GCP-Primary | GCP + AWS | AWS-Dominant |\n|---|---|---|---|\n| Monthly cost | $11,800 | $12,400 | $15,600 |\n| Avg latency | 42 ms | 44 ms | 45 ms |\n| Lock-in score | 63 | 51 | 74 |\n\n## Recommendation\n\nGCP-Primary is optimal for variable ML inference loads. 
Cloud Run's scale-to-zero behaviour saves up to 35% on compute versus always-on EC2 instances at comparable load.","completedAt":"2024-01-18T11:35:00Z"}},"azurePrimaryWinner":{"summary":"Azure wins — compliance workload where AKS + Azure Database for PostgreSQL ranks","value":{"jobId":"job_azure_comp789","status":"completed","progress":100,"isComplete":true,"strategiesGenerated":38,"allStrategies":[{"name":"Azure-Primary Strategy","description":"AKS (eastus, Standard_D4s_v3 nodes) with Azure Database for PostgreSQL Flexible Server and Azure Front Door — strongest GDPR/HIPAA compliance posture","allocations":[{"provider":"azure","percentage":100}],"metrics":{"monthlyCost":13200,"avgLatencyMs":44,"vendorLockInScore":67}},{"name":"Azure + AWS Hybrid","description":"Azure primary with AWS S3 for object storage overflow","allocations":[{"provider":"azure","percentage":75},{"provider":"aws","percentage":25}],"metrics":{"monthlyCost":14100,"avgLatencyMs":46,"vendorLockInScore":55}},{"name":"GCP-Primary Strategy","description":"GCP Cloud Run with Cloud SQL — lower lock-in, weaker native compliance tooling","allocations":[{"provider":"gcp","percentage":100}],"metrics":{"monthlyCost":11800,"avgLatencyMs":42,"vendorLockInScore":63}}],"topStrategies":[{"name":"Azure-Primary Strategy","description":"AKS (eastus, Standard_D4s_v3 nodes) with Azure Database for PostgreSQL Flexible Server and Azure Front Door — strongest GDPR/HIPAA compliance posture","allocations":[{"provider":"azure","percentage":100}],"metrics":{"monthlyCost":13200,"avgLatencyMs":44,"vendorLockInScore":67}},{"name":"Azure + AWS Hybrid","description":"Azure primary with AWS S3 for object storage overflow","allocations":[{"provider":"azure","percentage":75},{"provider":"aws","percentage":25}],"metrics":{"monthlyCost":14100,"avgLatencyMs":46,"vendorLockInScore":55}}],"comparisonReport":"# Multi-Cloud Strategy Comparison Report\n\n## Winner: Azure-Primary Strategy\n\nMonthly cost: **$13,200** — within the $20,000 
budget.\nAverage latency: **44 ms** — within the 100 ms SLA.\nVendor lock-in score: **67/100** — justified by compliance requirements (GDPR, HIPAA, ISO 27001).\n\n## Key Resources\n\n- **Compute:** AKS cluster (Standard_D4s_v3, 3–10 nodes) in eastus with Availability Zones — meets HA requirements for HIPAA.\n- **Database:** Azure Database for PostgreSQL Flexible Server (General Purpose, 4 vCores) with geo-redundant backup.\n- **Networking:** Azure Front Door with WAF — OWASP rule sets satisfy PCI-DSS network security controls.\n\n## Trade-offs\n\n| Metric | Azure-Primary | Azure + AWS | GCP-Primary |\n|---|---|---|---|\n| Monthly cost | $13,200 | $14,100 | $11,800 |\n| Avg latency | 44 ms | 46 ms | 42 ms |\n| Lock-in score | 67 | 55 | 63 |\n\n## Recommendation\n\nFor GDPR/HIPAA/ISO 27001 workloads, Azure-Primary provides the most complete native compliance toolkit. GCP is $1,400/month cheaper but requires third-party tooling to meet the same compliance bar.","completedAt":"2024-01-19T14:00:00Z"}},"ociPrimaryWinner":{"summary":"OCI wins — database-intensive workload where Compute + Autonomous Database ranks","value":{"jobId":"job_oci_db321","status":"completed","progress":100,"isComplete":true,"strategiesGenerated":33,"allStrategies":[{"name":"OCI-Primary Strategy","description":"OCI Compute VM.Standard.E4.Flex (4 OCPU / 64 GB) with Autonomous Database (ATP, 4 OCPU) in us-ashburn-1 — best price-performance for database-heavy workloads","allocations":[{"provider":"oci","percentage":100}],"metrics":{"monthlyCost":9600,"avgLatencyMs":48,"vendorLockInScore":55}},{"name":"OCI + AWS Hybrid","description":"OCI primary database tier with AWS EC2 for the web and API layer","allocations":[{"provider":"oci","percentage":60},{"provider":"aws","percentage":40}],"metrics":{"monthlyCost":11200,"avgLatencyMs":46,"vendorLockInScore":48}},{"name":"AWS-Dominant Strategy","description":"EC2 + RDS — higher cost for equivalent database 
throughput","allocations":[{"provider":"aws","percentage":100}],"metrics":{"monthlyCost":16400,"avgLatencyMs":42,"vendorLockInScore":74}}],"topStrategies":[{"name":"OCI-Primary Strategy","description":"OCI Compute VM.Standard.E4.Flex (4 OCPU / 64 GB) with Autonomous Database (ATP, 4 OCPU) in us-ashburn-1 — best price-performance for database-heavy workloads","allocations":[{"provider":"oci","percentage":100}],"metrics":{"monthlyCost":9600,"avgLatencyMs":48,"vendorLockInScore":55}},{"name":"OCI + AWS Hybrid","description":"OCI primary database tier with AWS EC2 for the web and API layer","allocations":[{"provider":"oci","percentage":60},{"provider":"aws","percentage":40}],"metrics":{"monthlyCost":11200,"avgLatencyMs":46,"vendorLockInScore":48}}],"comparisonReport":"# Multi-Cloud Strategy Comparison Report\n\n## Winner: OCI-Primary Strategy\n\nMonthly cost: **$9,600** — 41% lower than the AWS-dominant baseline ($16,400).\nAverage latency: **48 ms** — within the 120 ms SLA requirement.\nVendor lock-in score: **55/100** — moderate, offset by significant cost savings.\n\n## Key Resources\n\n- **Compute:** OCI VM.Standard.E4.Flex (4 OCPU / 64 GB RAM) — flexible OCPU allocation reduces idle-time waste.\n- **Database:** Autonomous Database (ATP, 4 OCPU) — self-tuning, automatic patching, and built-in connection pooling eliminate DBA overhead.\n- **Networking:** OCI FastConnect to AWS for the OCI + AWS hybrid variant — sub-5 ms inter-cloud latency.\n\n## Trade-offs\n\n| Metric | OCI-Primary | OCI + AWS | AWS-Dominant |\n|---|---|---|---|\n| Monthly cost | $9,600 | $11,200 | $16,400 |\n| Avg latency | 48 ms | 46 ms | 42 ms |\n| Lock-in score | 55 | 48 | 74 |\n\n## Recommendation\n\nFor database-heavy workloads, OCI Autonomous Database delivers the best cost efficiency. 
The OCI + AWS Hybrid strategy is worth considering if the web/API tier already has AWS dependencies, adding only $1,600/month for a 2 ms latency improvement.","completedAt":"2024-01-20T16:45:00Z"}},"digitaloceanAMDNVMeMultiCloud":{"summary":"DigitalOcean AMD NVMe Droplets — cost-optimized strategy using s-2vcpu-4gb-amd for maximum price-performance","value":{"jobId":"job_do_amd_nvme_001","status":"completed","progress":100,"isComplete":true,"strategiesGenerated":36,"allStrategies":[{"name":"DigitalOcean AMD NVMe Primary Strategy","description":"DigitalOcean AMD NVMe Droplets (s-2vcpu-4gb-amd, nyc3) as primary compute — NVMe-backed local SSD storage delivers higher I/O throughput than standard Droplets at the same price point, with DigitalOcean Managed Databases (PostgreSQL) and Spaces for object storage","allocations":[{"provider":"digitalocean","percentage":85},{"provider":"aws","percentage":15}],"metrics":{"monthlyCost":6200,"avgLatencyMs":52,"vendorLockInScore":44}},{"name":"DigitalOcean AMD NVMe + GCP Balanced","description":"s-2vcpu-4gb-amd Droplets for primary API tier with GCP Cloud Run for burst compute — good cost/latency balance","allocations":[{"provider":"digitalocean","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":8100,"avgLatencyMs":49,"vendorLockInScore":38}},{"name":"AWS-Dominant Strategy","description":"EC2 t3.medium fleet — higher baseline cost, comparable single-request latency","allocations":[{"provider":"aws","percentage":100}],"metrics":{"monthlyCost":13800,"avgLatencyMs":41,"vendorLockInScore":74}}],"topStrategies":[{"name":"DigitalOcean AMD NVMe Primary Strategy","description":"DigitalOcean AMD NVMe Droplets (s-2vcpu-4gb-amd, nyc3) as primary compute — NVMe-backed local SSD storage delivers higher I/O throughput than standard Droplets at the same price point, with DigitalOcean Managed Databases (PostgreSQL) and Spaces for object 
storage","allocations":[{"provider":"digitalocean","percentage":85},{"provider":"aws","percentage":15}],"metrics":{"monthlyCost":6200,"avgLatencyMs":52,"vendorLockInScore":44}},{"name":"DigitalOcean AMD NVMe + GCP Balanced","description":"s-2vcpu-4gb-amd Droplets for primary API tier with GCP Cloud Run for burst compute — good cost/latency balance","allocations":[{"provider":"digitalocean","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":8100,"avgLatencyMs":49,"vendorLockInScore":38}},{"name":"AWS-Dominant Strategy","description":"EC2 t3.medium fleet — higher baseline cost, comparable single-request latency","allocations":[{"provider":"aws","percentage":100}],"metrics":{"monthlyCost":13800,"avgLatencyMs":41,"vendorLockInScore":74}}],"comparisonReport":"# Multi-Cloud Strategy Comparison Report\n\n## Winner: DigitalOcean AMD NVMe Primary Strategy\n\nMonthly cost: **$6,200** — 55% lower than the AWS-dominant baseline ($13,800).\nAverage latency: **52 ms** — within the 150 ms SLA requirement.\nVendor lock-in score: **44/100** — low lock-in, easy to migrate if requirements change.\n\n## Key Resources\n\n- **Compute:** DigitalOcean AMD NVMe Droplets (s-2vcpu-4gb-amd, nyc3) — AMD EPYC processors with NVMe-backed local SSD storage deliver higher disk I/O throughput than standard Intel Droplets at the same hourly rate ($0.036/hr per Droplet). Ideal for workloads with frequent local reads/writes or ephemeral scratch space.\n- **Database:** DigitalOcean Managed Databases (PostgreSQL, 2-node HA cluster, nyc3) — automated failover, daily backups, and PgBouncer connection pooling included at no extra charge.\n- **Object Storage:** DigitalOcean Spaces — S3-compatible API, 250 GB included, optional CDN edge caching.\n\n## AMD NVMe vs Standard Droplets\n\nThe `s-2vcpu-4gb-amd` slug selects the AMD NVMe variant of the standard shared-CPU tier. 
Compared to the equivalent Intel Droplet (`s-2vcpu-4gb`):\n- Same vCPU count and RAM\n- Same hourly price\n- NVMe local SSD instead of standard SSD, with up to 3× higher sequential read throughput\n- AMD EPYC \"Milan\" or \"Rome\" core depending on host availability\n\nChoose the AMD NVMe variant when your workload is I/O-bound (e.g. log processing, local caching, build pipelines) or when you want deterministic low-latency disk access without paying for a dedicated CPU plan.\n\n## Trade-offs\n\n| Metric | DO AMD NVMe Primary | DO AMD NVMe + GCP | AWS-Dominant |\n|---|---|---|---|\n| Monthly cost | $6,200 | $8,100 | $13,800 |\n| Avg latency | 52 ms | 49 ms | 41 ms |\n| Lock-in score | 44 | 38 | 74 |\n\n## Recommendation\n\nFor cost-sensitive workloads with moderate I/O requirements, the DigitalOcean AMD NVMe Primary strategy delivers the best value. The 11 ms latency gap versus the AWS baseline is unlikely to affect end-user experience for the given SLA. If burst capacity beyond current Droplet limits is anticipated, consider the DO AMD NVMe + GCP Balanced strategy, which adds only $1,900/month for a 3 ms latency improvement and lower vendor lock-in.","completedAt":"2024-01-21T10:15:00Z"}}}}}},"400":{"description":"Job not completed yet","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/multi-cloud/jobs/{jobId}/partial-results":{"x-stability":"stable","get":{"tags":["Multi-Cloud Strategy"],"summary":"Get partial multi-cloud exploration results while a job is running","description":"Retrieve strategies accumulated so far for a multi-cloud exploration job, even while the job is still running.\n\nThis endpoint is the dedicated channel for polling partial results during job execution.\nThe `/results` endpoint requires the job to be completed; use this endpoint instead when you want\nto start analyzing strategies before the full 
exploration finishes.\n\n**Polling pattern:**\n```\nwhile true:\n  data = GET /api/multi-cloud/jobs/{jobId}/partial-results\n  show data.allStrategies to user\n  if data.isComplete: break\n  sleep(2s)\nfull = GET /api/multi-cloud/jobs/{jobId}/results\n```\n\n**isComplete flag:**\n- `false` — job is still `pending` or `running`; more strategies may arrive\n- `true` — job has reached a terminal state (`completed`, `failed`, or `cancelled`)\n\n**When `isComplete` is true and `status` is `completed`**, fetch the ranked\n`topStrategies` and `comparisonReport` from `GET /api/multi-cloud/jobs/{jobId}/results`.\n","operationId":"getMultiCloudPartialResults","x-codeSamples":[{"lang":"curl","label":"curl","source":"while true; do\n  DATA=$(curl -s \"https://your-production-domain.com/api/multi-cloud/jobs/job_abc123/partial-results\" \\\n    -H \"Authorization: Bearer $API_KEY\")\n  echo \"$DATA\" | jq '{strategies: (.allStrategies | length), isComplete: .isComplete}'\n  [ \"$(echo \"$DATA\" | jq -r '.isComplete')\" = \"true\" ] && break\n  sleep 2\ndone\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"job_abc123\";\n\nasync function pollPartialResults(jobId) {\n  while (true) {\n    const resp = await fetch(`${BASE_URL}/multi-cloud/jobs/${jobId}/partial-results`, {\n      headers: { \"Authorization\": `Bearer ${API_KEY}` },\n    });\n    const data = await resp.json();\n    console.log(`strategies so far: ${data.allStrategies?.length ?? 
0}, isComplete: ${data.isComplete}`);\n    if (data.isComplete) break;\n    await new Promise(r => setTimeout(r, 2000));\n  }\n  // Fetch full ranked results once complete\n  const full = await fetch(`${BASE_URL}/multi-cloud/jobs/${jobId}/results`, {\n    headers: { \"Authorization\": `Bearer ${API_KEY}` },\n  });\n  return full.json();\n}\n"}],"security":[{"BearerAuth":["read"]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string","format":"uuid"},"description":"ID of the multi-cloud job"},{"name":"providers","in":"query","required":false,"schema":{"type":"string","example":"digitalocean,aws"},"description":"Comma-separated list of cloud provider names to filter results by preferred primary provider.\nA strategy is included if any of the specified providers holds ≥ 50 % of the traffic allocation.\nSupported values: `aws`, `gcp`, `azure`, `digitalocean`, `oci`.\nIf omitted, all strategies are returned.\n"}],"responses":{"200":{"description":"Partial or terminal results for the job","content":{"application/json":{"schema":{"type":"object","required":["jobId","status","progress","isComplete","strategiesGenerated","allStrategies"],"properties":{"jobId":{"type":"string"},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled"]},"progress":{"type":"number","description":"Job progress percentage (0-100)"},"isComplete":{"type":"boolean","description":"True if job has reached a terminal state, false if still pending or running"},"strategiesGenerated":{"type":"number","description":"Number of strategies generated so far"},"allStrategies":{"type":"array","items":{"$ref":"#/components/schemas/Strategy"},"description":"All raw strategies generated so far"},"error":{"type":"string","description":"Error message — only present when status is failed"}}},"examples":{"runningPartial":{"summary":"Job still running — 12 strategies accumulated so 
far","value":{"jobId":"job_abc123","status":"running","progress":40,"isComplete":false,"strategiesGenerated":12,"allStrategies":[{"name":"AWS-Dominant Strategy","description":"Primary AWS deployment","allocations":[{"provider":"aws","percentage":100}],"metrics":{"monthlyCost":14500,"avgLatencyMs":38,"vendorLockInScore":72}}]}},"terminalComplete":{"summary":"Job completed — isComplete is true, fetch /results for ranked output","value":{"jobId":"job_abc123","status":"completed","progress":100,"isComplete":true,"strategiesGenerated":42,"allStrategies":[]}},"digitaloceanAMDNVMePartialMultiCloud":{"summary":"AMD NVMe mid-run — s-2vcpu-4gb-amd strategy visible before job completes","value":{"jobId":"job_do_amd_nvme_partial_001","status":"running","progress":55,"isComplete":false,"strategiesGenerated":20,"allStrategies":[{"name":"DigitalOcean AMD NVMe Primary Strategy","description":"DigitalOcean AMD NVMe Droplets (s-2vcpu-4gb-amd, nyc3) as primary compute — NVMe-backed local SSD storage delivers higher I/O throughput than standard Droplets at the same price point, with DigitalOcean Managed Databases (PostgreSQL) and Spaces for object storage","allocations":[{"provider":"digitalocean","percentage":85},{"provider":"aws","percentage":15}],"metrics":{"monthlyCost":6200,"avgLatencyMs":52,"vendorLockInScore":44}},{"name":"DigitalOcean AMD NVMe + GCP Balanced","description":"s-2vcpu-4gb-amd Droplets for primary API tier with GCP Cloud Run for burst compute — good cost/latency balance","allocations":[{"provider":"digitalocean","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":8100,"avgLatencyMs":49,"vendorLockInScore":38}},{"name":"AWS-Dominant Strategy","description":"EC2 t3.medium fleet — higher baseline cost, comparable single-request latency","allocations":[{"provider":"aws","percentage":100}],"metrics":{"monthlyCost":13800,"avgLatencyMs":41,"vendorLockInScore":74}}]}},"digitaloceanAMDNVMeFilteredPartial":{"summary":"AMD NVMe mid-run filtered — 
?providers=digitalocean excludes AWS-dominant entries while job is still running","value":{"jobId":"job_do_amd_nvme_partial_001","status":"running","progress":55,"isComplete":false,"strategiesGenerated":14,"allStrategies":[{"name":"DigitalOcean AMD NVMe Primary Strategy","description":"DigitalOcean AMD NVMe Droplets (s-2vcpu-4gb-amd, nyc3) as primary compute — NVMe-backed local SSD storage delivers higher I/O throughput than standard Droplets at the same price point, with DigitalOcean Managed Databases (PostgreSQL) and Spaces for object storage","allocations":[{"provider":"digitalocean","percentage":85},{"provider":"aws","percentage":15}],"metrics":{"monthlyCost":6200,"avgLatencyMs":52,"vendorLockInScore":44}},{"name":"DigitalOcean AMD NVMe + GCP Balanced","description":"s-2vcpu-4gb-amd Droplets for primary API tier with GCP Cloud Run for burst compute — good cost/latency balance","allocations":[{"provider":"digitalocean","percentage":70},{"provider":"gcp","percentage":30}],"metrics":{"monthlyCost":8100,"avgLatencyMs":49,"vendorLockInScore":38}}]}}}}}},"401":{"$ref":"#/components/responses/Unauthorized"},"404":{"$ref":"#/components/responses/NotFound"}}}},"/multi-cloud/jobs/{jobId}/stream":{"x-stability":"stable","get":{"summary":"Stream multi-cloud exploration results in real-time","description":"Stream multi-cloud exploration results using Server-Sent Events (SSE).\nThis endpoint provides real-time updates as strategies are generated.\n\n**SSE Event Types:**\n- `init`: Initial job state when connection is established\n- `strategyGenerated`: A new strategy has been generated (sent for each strategy)\n- `progressUpdate`: Progress update (sent periodically)\n- `completed`: Job has finished successfully (final event before connection closes)\n- `failed`: Job encountered an unrecoverable error (final event before connection closes)\n- `cancelled`: Job was cancelled via `DELETE /api/multi-cloud/jobs/{jobId}` (final event before connection closes)\n\n**Connection 
Behavior:**\n- Connection remains open until job completes, fails, or is cancelled\n- All existing strategies are streamed immediately upon connection\n- New strategies are sent as they're generated\n- Connection automatically closes when job finishes\n\n**Agent Reconnection and Recovery Guide:**\n\nAfter receiving a terminal SSE event (`completed`, `failed`, or `cancelled`) the server\ncloses the connection. Each event requires a different agent response:\n\n- **`completed`**: The job finished successfully. No reconnection is needed. Fetch the\n  full ranked results from `GET /api/multi-cloud/jobs/{jobId}/results` to retrieve\n  `topStrategies`, `allStrategies`, and the `comparisonReport`. The results endpoint\n  also accepts a `?providers=` filter if you only need strategies for specific clouds.\n\n- **`failed`**: The job encountered an unrecoverable error. Inspect the `error` field in\n  the event payload for the root cause. Transient errors (e.g. a pricing API timeout)\n  are safe to retry — submit a new job via `POST /api/multi-cloud/jobs`. Permanent errors\n  (e.g. an invalid scenario ID) should not be retried without fixing the underlying\n  input first. Do not attempt to reconnect to the same `jobId`; it will not recover.\n\n- **`cancelled`**: Cancellation is terminal and intentional. No retry is needed or\n  recommended. If the cancellation was unintended, submit a new job.\n\n**Handling unexpected connection drops (no terminal event received):**\n\nIf the SSE connection closes without a `completed`, `failed`, or `cancelled` event —\nfor example due to a network interruption, proxy timeout, or server restart — the job\nmay still be running. Use the following fallback strategy:\n\n1. Poll `GET /api/multi-cloud/jobs/{jobId}` to check the current `status` field.\n2. If `status` is `running` or `pending`, reconnect to this stream endpoint. The\n   server replays all strategies generated so far on reconnect, so no data is lost.\n3. 
If `status` is `completed`, `failed`, or `cancelled`, treat it the same as if you\n   had received the corresponding terminal SSE event (see above).\n\nAgents should implement an exponential back-off (e.g. 1 s, 2 s, 4 s, cap at 30 s)\nbefore each reconnection attempt to avoid hammering the server during an outage.\n\n**Use Cases:**\n- Start analyzing strategies while generation continues\n- Real-time progress monitoring\n- Faster decision-making with early access to good strategies\n\n**Client Code Sample (JavaScript / Node.js):**\n\n`EventSource` does not support custom headers, so use `fetch` with a\n`ReadableStream` to pass the Bearer token:\n\n```javascript\nasync function streamMultiCloudJob(jobId, apiToken) {\n  const response = await fetch(\n    `https://your-host/api/multi-cloud/jobs/${jobId}/stream`,\n    {\n      headers: {\n        Authorization: `Bearer ${apiToken}`,\n        Accept: 'text/event-stream',\n      },\n    }\n  );\n\n  if (!response.ok) {\n    throw new Error(`HTTP ${response.status}: ${await response.text()}`);\n  }\n\n  const reader = response.body.getReader();\n  const decoder = new TextDecoder();\n  let buffer = '';\n\n  while (true) {\n    const { value, done } = await reader.read();\n    if (done) break;\n\n    buffer += decoder.decode(value, { stream: true });\n\n    // SSE frames are separated by double newlines\n    const frames = buffer.split(/\\n\\n/);\n    buffer = frames.pop(); // keep incomplete trailing frame\n\n    for (const frame of frames) {\n      const eventLine = frame.match(/^event:\\s*(.+)$/m);\n      const dataLine  = frame.match(/^data:\\s*(.+)$/m);\n      if (!dataLine) continue;\n\n      const eventType = eventLine ? eventLine[1].trim() : 'message';\n      const payload   = JSON.parse(dataLine[1]);\n\n      switch (eventType) {\n        case 'init':\n          console.log('Stream opened. 
Job status:', payload.status);\n          break;\n\n        case 'progressUpdate':\n          console.log(`Progress: ${payload.progress}% — ${payload.message}`);\n          break;\n\n        case 'strategyGenerated':\n          console.log('New strategy:', payload.strategy.name,\n            '| cost $' + payload.strategy.metrics.monthlyCost);\n          break;\n\n        case 'completed':\n          console.log('Job complete. Top strategy:',\n            payload.topStrategy?.name);\n          reader.cancel(); // close the connection\n          return payload;\n      }\n    }\n  }\n}\n\n// Usage\nstreamMultiCloudJob('job_aws_perf123', process.env.API_KEY)\n  .then(result => console.log('Final result:', result))\n  .catch(err  => console.error('Stream error:', err));\n```\n\n**Client Code Sample (Python):**\n\nUse `httpx` with streaming to consume the SSE connection.\nInstall with `pip install httpx`.\n\n```python\nimport httpx\nimport json\nimport os\n\n\ndef stream_multi_cloud_job(job_id: str, api_token: str) -> dict:\n    \"\"\"Stream a multi-cloud exploration job and return the final payload.\"\"\"\n    url = f\"https://your-host/api/multi-cloud/jobs/{job_id}/stream\"\n    headers = {\n        \"Authorization\": f\"Bearer {api_token}\",\n        \"Accept\": \"text/event-stream\",\n    }\n\n    with httpx.stream(\"GET\", url, headers=headers, timeout=None) as response:\n        response.raise_for_status()\n\n        buffer = \"\"\n\n        for chunk in response.iter_text():\n            buffer += chunk\n\n            *frames, buffer = buffer.split(\"\\n\\n\")\n\n            for frame in frames:\n                event_type = \"message\"\n                data_str = None\n\n                for line in frame.splitlines():\n                    if line.startswith(\"event:\"):\n                        event_type = line[len(\"event:\"):].strip()\n                    elif line.startswith(\"data:\"):\n                        data_str = line[len(\"data:\"):].strip()\n\n     
           if data_str is None:\n                    continue\n\n                payload = json.loads(data_str)\n\n                if event_type == \"init\":\n                    print(f\"Stream opened. Job status: {payload['status']}\")\n\n                elif event_type == \"progressUpdate\":\n                    print(f\"Progress: {payload['progress']}% — {payload.get('message', '')}\")\n\n                elif event_type == \"strategyGenerated\":\n                    strategy = payload[\"strategy\"]\n                    cost = strategy[\"metrics\"][\"monthlyCost\"]\n                    print(f\"New strategy: {strategy['name']} | cost ${cost}\")\n\n                elif event_type == \"completed\":\n                    top = payload.get(\"topStrategy\", {})\n                    print(f\"Job complete. Top strategy: {top.get('name')}\")\n                    return payload  # connection closes when the context exits\n\n    return {}\n\n\nif __name__ == \"__main__\":\n    result = stream_multi_cloud_job(\n        job_id=\"job_aws_perf123\",\n        api_token=os.environ[\"API_KEY\"],\n    )\n    print(\"Final result:\", result)\n```\n\n**Agent Reconnection Code Sample — with exponential back-off (JavaScript / Node.js):**\n\nThe samples above cover the happy path. The snippet below adds the full\nreconnect loop: detecting a connection drop without a terminal event,\npolling the status endpoint as a fallback, and reconnecting with\nexponential back-off. All three terminal events (`completed`, `failed`,\n`cancelled`) are handled explicitly.\n\n```javascript\nconst BASE_URL = 'https://your-host/api';\n\n// Fetch the current job state without opening an SSE stream.\nasync function pollJobStatus(jobId, apiToken) {\n  const res = await fetch(`${BASE_URL}/multi-cloud/jobs/${jobId}`, {\n    headers: { Authorization: `Bearer ${apiToken}` },\n  });\n  if (!res.ok) throw new Error(`Poll failed: HTTP ${res.status}`);\n  return res.json(); // { status, progress, ... 
}\n}\n\nasync function streamWithReconnect(jobId, apiToken) {\n  let delay    = 1_000;  // back-off starts at 1 s\n  const MAX_DELAY = 30_000;\n\n  while (true) {\n    let receivedTerminal = false;\n\n    try {\n      // ── Open the SSE connection ──────────────────────────────────\n      const response = await fetch(\n        `${BASE_URL}/multi-cloud/jobs/${jobId}/stream`,\n        {\n          headers: {\n            Authorization: `Bearer ${apiToken}`,\n            Accept: 'text/event-stream',\n          },\n        }\n      );\n\n      if (!response.ok) {\n        throw new Error(`HTTP ${response.status}: ${await response.text()}`);\n      }\n\n      const reader  = response.body.getReader();\n      const decoder = new TextDecoder();\n      let buffer    = '';\n\n      outer: while (true) {\n        const { value, done } = await reader.read();\n        if (done) break; // server closed — check receivedTerminal below\n\n        buffer += decoder.decode(value, { stream: true });\n        const frames = buffer.split(/\\n\\n/);\n        buffer = frames.pop(); // keep any incomplete trailing frame\n\n        for (const frame of frames) {\n          const eventLine = frame.match(/^event:\\s*(.+)$/m);\n          const dataLine  = frame.match(/^data:\\s*(.+)$/m);\n          if (!dataLine) continue;\n\n          const eventType = eventLine ? eventLine[1].trim() : 'message';\n          const payload   = JSON.parse(dataLine[1]);\n\n          switch (eventType) {\n            case 'init':\n              // Successful (re)connect — reset back-off timer.\n              console.log('Connected. 
Job status:', payload.status);\n              delay = 1_000;\n              break;\n\n            case 'progressUpdate':\n              console.log(`Progress: ${payload.progress}% — ${payload.message}`);\n              break;\n\n            case 'strategyGenerated':\n              console.log('Strategy:', payload.strategy.name,\n                '| $' + payload.strategy.metrics.monthlyCost + '/mo');\n              break;\n\n            // ── Terminal events ──────────────────────────────────────\n            case 'completed':\n              console.log('Job complete. Top strategy:', payload.topStrategy?.name);\n              receivedTerminal = true;\n              reader.cancel();\n              return { status: 'completed', payload };\n\n            case 'failed':\n              console.error('Job failed:', payload.error);\n              receivedTerminal = true;\n              reader.cancel();\n              return { status: 'failed', payload };\n\n            case 'cancelled':\n              console.warn('Job cancelled.');\n              receivedTerminal = true;\n              reader.cancel();\n              return { status: 'cancelled', payload };\n          }\n\n          if (receivedTerminal) break outer;\n        }\n      }\n    } catch (err) {\n      // Network error, proxy timeout, or server restart.\n      console.warn('SSE connection error:', err.message);\n    }\n\n    // Already handled cleanly — exit the retry loop.\n    if (receivedTerminal) break;\n\n    // ── Fallback: poll before reconnecting ───────────────────────────\n    // The connection dropped without a terminal event. 
The job may\n    // still be running, or it may have finished while we were offline.\n    try {\n      const job = await pollJobStatus(jobId, apiToken);\n\n      if (job.status === 'completed') {\n        console.log('Recovered via poll — job already completed.');\n        return { status: 'completed', payload: job };\n      }\n      if (job.status === 'failed') {\n        console.error('Recovered via poll — job failed:', job.error);\n        return { status: 'failed', payload: job };\n      }\n      if (job.status === 'cancelled') {\n        console.warn('Recovered via poll — job was cancelled.');\n        return { status: 'cancelled', payload: job };\n      }\n      // status is 'running' or 'pending' — reconnect after back-off.\n      console.log(`Job still ${job.status}. Reconnecting in ${delay / 1000}s…`);\n    } catch (pollErr) {\n      console.warn('Poll also failed:', pollErr.message, '— will retry.');\n    }\n\n    // ── Exponential back-off ─────────────────────────────────────────\n    await new Promise(resolve => setTimeout(resolve, delay));\n    delay = Math.min(delay * 2, MAX_DELAY);\n  }\n}\n\n// Usage\nstreamWithReconnect('job_aws_perf123', process.env.API_KEY)\n  .then(({ status, payload }) => console.log('Done:', status, payload))\n  .catch(err => console.error('Unrecoverable error:', err));\n```\n\n**Agent Reconnection Code Sample — with exponential back-off (Python):**\n\n```python\nimport httpx\nimport json\nimport os\nimport time\n\n\nBASE_URL = \"https://your-host/api\"\nTERMINAL_STATUSES = {\"completed\", \"failed\", \"cancelled\"}\n\n\ndef poll_job_status(job_id: str, api_token: str) -> dict:\n    \"\"\"Fetch the current job state without opening an SSE stream.\"\"\"\n    url = f\"{BASE_URL}/multi-cloud/jobs/{job_id}\"\n    headers = {\"Authorization\": f\"Bearer {api_token}\"}\n    response = httpx.get(url, headers=headers, timeout=10)\n    response.raise_for_status()\n    return response.json()  # {\"status\": ..., \"progress\": ..., 
...}\n\n\ndef stream_with_reconnect(job_id: str, api_token: str) -> dict:\n    \"\"\"\n    Open the SSE stream and reconnect automatically after unexpected drops.\n\n    Polls the status endpoint when the connection closes without a terminal\n    event, and applies exponential back-off before each reconnection attempt.\n\n    Returns {\"status\": <terminal_status>, \"payload\": <event_payload>}.\n    \"\"\"\n    url = f\"{BASE_URL}/multi-cloud/jobs/{job_id}/stream\"\n    headers = {\n        \"Authorization\": f\"Bearer {api_token}\",\n        \"Accept\": \"text/event-stream\",\n    }\n    delay     = 1.0   # back-off starts at 1 s\n    max_delay = 30.0\n\n    while True:\n        received_terminal = False\n\n        try:\n            with httpx.stream(\"GET\", url, headers=headers, timeout=None) as response:\n                response.raise_for_status()\n                delay  = 1.0  # reset back-off on successful connect\n                buffer = \"\"\n\n                for chunk in response.iter_text():\n                    buffer += chunk\n                    *frames, buffer = buffer.split(\"\\n\\n\")\n\n                    for frame in frames:\n                        event_type = \"message\"\n                        data_str   = None\n\n                        for line in frame.splitlines():\n                            if line.startswith(\"event:\"):\n                                event_type = line[len(\"event:\"):].strip()\n                            elif line.startswith(\"data:\"):\n                                data_str = line[len(\"data:\"):].strip()\n\n                        if data_str is None:\n                            continue\n\n                        payload = json.loads(data_str)\n\n                        if event_type == \"init\":\n                            print(f\"Connected. 
Job status: {payload['status']}\")\n\n                        elif event_type == \"progressUpdate\":\n                            print(f\"Progress: {payload['progress']}% — {payload.get('message', '')}\")\n\n                        elif event_type == \"strategyGenerated\":\n                            s = payload[\"strategy\"]\n                            print(f\"Strategy: {s['name']} | ${s['metrics']['monthlyCost']}/mo\")\n\n                        elif event_type == \"completed\":\n                            top = payload.get(\"topStrategy\", {})\n                            print(f\"Job complete. Top strategy: {top.get('name')}\")\n                            received_terminal = True\n                            return {\"status\": \"completed\", \"payload\": payload}\n\n                        elif event_type == \"failed\":\n                            print(f\"Job failed: {payload.get('error')}\")\n                            received_terminal = True\n                            return {\"status\": \"failed\", \"payload\": payload}\n\n                        elif event_type == \"cancelled\":\n                            print(\"Job cancelled.\")\n                            received_terminal = True\n                            return {\"status\": \"cancelled\", \"payload\": payload}\n\n                        if received_terminal:\n                            break\n\n        except (httpx.HTTPError, httpx.StreamError) as exc:\n            print(f\"SSE connection error: {exc}\")\n\n        if received_terminal:\n            break\n\n        try:\n            job = poll_job_status(job_id, api_token)\n\n            if job[\"status\"] in TERMINAL_STATUSES:\n                print(f\"Recovered via poll — job {job['status']}.\")\n                return {\"status\": job[\"status\"], \"payload\": job}\n\n            print(f\"Job still {job['status']}. 
Reconnecting in {delay:.0f}s…\")\n\n        except httpx.HTTPError as exc:\n            print(f\"Poll also failed: {exc} — will retry.\")\n\n        time.sleep(delay)\n        delay = min(delay * 2, max_delay)\n\n    return {}\n\n\nif __name__ == \"__main__\":\n    result = stream_with_reconnect(\n        job_id=\"job_aws_perf123\",\n        api_token=os.environ[\"API_KEY\"],\n    )\n    print(\"Done:\", result[\"status\"], result.get(\"payload\", {}))\n```\n","operationId":"streamMultiCloudJob","x-codeSamples":[{"lang":"curl","label":"curl","source":"curl -N https://your-production-domain.com/api/multi-cloud/jobs/job_abc123/stream \\\n  -H \"Authorization: Bearer $API_KEY\" \\\n  -H \"Accept: text/event-stream\"\n"},{"lang":"Python","label":"Python","source":"import httpx\nimport json\n\nBASE_URL = \"https://your-production-domain.com/api\"\nAPI_KEY = \"your-api-key\"\nJOB_ID = \"job_abc123\"\n\nurl = f\"{BASE_URL}/multi-cloud/jobs/{JOB_ID}/stream\"\nheaders = {\"Authorization\": f\"Bearer {API_KEY}\", \"Accept\": \"text/event-stream\"}\n\nwith httpx.stream(\"GET\", url, headers=headers, timeout=None) as response:\n    response.raise_for_status()\n    buffer = \"\"\n    for chunk in response.iter_text():\n        buffer += chunk\n        *frames, buffer = buffer.split(\"\\n\\n\")\n        for frame in frames:\n            event_type, data_str = \"message\", None\n            for line in frame.splitlines():\n                if line.startswith(\"event:\"):\n                    event_type = line[len(\"event:\"):].strip()\n                elif line.startswith(\"data:\"):\n                    data_str = line[len(\"data:\"):].strip()\n            if not data_str:\n                continue\n            payload = json.loads(data_str)\n            if event_type == \"progressUpdate\":\n                print(f\"Progress: {payload['progress']}%\")\n            elif event_type == \"strategyGenerated\":\n                s = payload[\"strategy\"]\n                
print(f\"Strategy: {s['name']}  cost=${s['metrics']['monthlyCost']}/mo\")\n            elif event_type == \"completed\":\n                top = payload.get(\"topStrategy\", {})\n                print(f\"Done. Top strategy: {top.get('name')}\")\n                break\n        else:\n            continue  # no terminal event in this chunk; keep reading\n        break  # terminal event seen; stop reading so the stream closes\n"},{"lang":"Node.js","label":"Node.js","source":"const BASE_URL = \"https://your-production-domain.com/api\";\nconst API_KEY = \"your-api-key\";\nconst JOB_ID = \"job_abc123\";\n\nconst response = await fetch(`${BASE_URL}/multi-cloud/jobs/${JOB_ID}/stream`, {\n  headers: { \"Authorization\": `Bearer ${API_KEY}`, \"Accept\": \"text/event-stream\" },\n});\nif (!response.ok) throw new Error(`HTTP ${response.status}`);\n\nconst reader = response.body.getReader();\nconst decoder = new TextDecoder();\nlet buffer = \"\";\n\nwhile (true) {\n  const { value, done } = await reader.read();\n  if (done) break;\n  buffer += decoder.decode(value, { stream: true });\n  const frames = buffer.split(/\\n\\n/);\n  buffer = frames.pop();\n  for (const frame of frames) {\n    const eventLine = frame.match(/^event:\\s*(.+)$/m);\n    const dataLine  = frame.match(/^data:\\s*(.+)$/m);\n    if (!dataLine) continue;\n    const eventType = eventLine ? eventLine[1].trim() : \"message\";\n    const payload   = JSON.parse(dataLine[1]);\n    if (eventType === \"progressUpdate\") {\n      console.log(`Progress: ${payload.progress}%`);\n    } else if (eventType === \"strategyGenerated\") {\n      const s = payload.strategy;\n      console.log(`Strategy: ${s.name}  cost=$${s.metrics.monthlyCost}/mo`);\n    } else if (eventType === \"completed\") {\n      console.log(\"Done. 
Top strategy:\", payload.topStrategy?.name);\n      reader.cancel();\n    }\n  }\n}\n"}],"tags":["Multi-Cloud Strategy"],"security":[{"BearerAuth":[]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string"},"description":"Multi-cloud job ID"}],"responses":{"200":{"description":"SSE stream of job updates","content":{"text/event-stream":{"schema":{"type":"string","description":"Server-Sent Events stream"},"examples":{"awsStream":{"summary":"AWS — EC2 m5.xlarge + RDS Multi-AZ performance workload stream","value":"event: init\ndata: {\"jobId\":\"job_aws_perf123\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_aws_perf123\",\"progress\":20,\"strategiesGenerated\":8,\"message\":\"Evaluating EC2 instance families and RDS configurations\"}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_aws_perf123\",\"strategy\":{\"name\":\"AWS-Primary Strategy\",\"description\":\"EC2 m5.xlarge Auto Scaling group across us-east-1a/1b with RDS db.r5.large Multi-AZ and CloudFront CDN — lowest p95 latency in comparison\",\"allocations\":[{\"provider\":\"aws\",\"percentage\":100}],\"metrics\":{\"monthlyCost\":14500,\"avgLatencyMs\":38,\"vendorLockInScore\":72}},\"strategiesGenerated\":9}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_aws_perf123\",\"strategy\":{\"name\":\"AWS + GCP Balanced\",\"description\":\"EC2 primary with GCP Cloud Run burst capacity — reduces lock-in by 14 points with only 3 ms latency increase\",\"allocations\":[{\"provider\":\"aws\",\"percentage\":70},{\"provider\":\"gcp\",\"percentage\":30}],\"metrics\":{\"monthlyCost\":13200,\"avgLatencyMs\":41,\"vendorLockInScore\":58}},\"strategiesGenerated\":10}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_aws_perf123\",\"progress\":65,\"strategiesGenerated\":26,\"message\":\"Ranking strategies by p95 latency against 50 ms SLA\"}\n\nevent: completed\ndata: 
{\"jobId\":\"job_aws_perf123\",\"status\":\"completed\",\"progress\":100,\"strategiesGenerated\":40,\"topStrategy\":{\"name\":\"AWS-Primary Strategy\",\"metrics\":{\"monthlyCost\":14500,\"avgLatencyMs\":38,\"vendorLockInScore\":72}},\"comparisonReport\":\"# Multi-Cloud Strategy Comparison Report\\n\\n## Winner: AWS-Primary Strategy\\n\\nAverage latency: **38 ms** — best in class, within the 50 ms SLA.\\nMonthly cost: **$14,500** — within the $25,000 budget.\\nVendor lock-in score: **72/100** — acceptable given performance requirements.\",\"completedAt\":\"2024-01-17T08:20:00Z\"}\n"},"gcpStream":{"summary":"GCP — Cloud Run + Cloud SQL ML analytics workload stream","value":"event: init\ndata: {\"jobId\":\"job_gcp_ml456\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_gcp_ml456\",\"progress\":18,\"strategiesGenerated\":6,\"message\":\"Evaluating Cloud Run autoscaling profiles for variable ML inference load\"}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_gcp_ml456\",\"strategy\":{\"name\":\"GCP-Primary Strategy\",\"description\":\"Cloud Run (fully managed, us-central1) with Cloud SQL for PostgreSQL (HA) and Cloud Storage — best autoscaling fit for variable ML inference load; scales to zero between jobs\",\"allocations\":[{\"provider\":\"gcp\",\"percentage\":100}],\"metrics\":{\"monthlyCost\":11800,\"avgLatencyMs\":42,\"vendorLockInScore\":63}},\"strategiesGenerated\":7}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_gcp_ml456\",\"strategy\":{\"name\":\"GCP + AWS Hybrid\",\"description\":\"GCP Cloud Run primary with AWS Lambda for async batch processing — costs about 5% more than GCP-only in exchange for cross-cloud redundancy\",\"allocations\":[{\"provider\":\"gcp\",\"percentage\":65},{\"provider\":\"aws\",\"percentage\":35}],\"metrics\":{\"monthlyCost\":12400,\"avgLatencyMs\":44,\"vendorLockInScore\":51}},\"strategiesGenerated\":8}\n\nevent: progressUpdate\ndata: 
{\"jobId\":\"job_gcp_ml456\",\"progress\":70,\"strategiesGenerated\":24,\"message\":\"Comparing scale-to-zero savings across Cloud Run, Lambda, and Container Apps\"}\n\nevent: completed\ndata: {\"jobId\":\"job_gcp_ml456\",\"status\":\"completed\",\"progress\":100,\"strategiesGenerated\":35,\"topStrategy\":{\"name\":\"GCP-Primary Strategy\",\"metrics\":{\"monthlyCost\":11800,\"avgLatencyMs\":42,\"vendorLockInScore\":63}},\"comparisonReport\":\"# Multi-Cloud Strategy Comparison Report\\n\\n## Winner: GCP-Primary Strategy\\n\\nMonthly cost: **$11,800** — $3,800 below the AWS baseline.\\nAverage latency: **42 ms** — within the 80 ms SLA.\\nCloud Run scale-to-zero eliminates idle compute cost between ML inference jobs.\",\"completedAt\":\"2024-01-18T11:35:00Z\"}\n"},"azureStream":{"summary":"Azure — AKS + Azure Database for PostgreSQL compliance workload stream","value":"event: init\ndata: {\"jobId\":\"job_azure_comp789\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_azure_comp789\",\"progress\":22,\"strategiesGenerated\":8,\"message\":\"Evaluating GDPR/HIPAA compliance posture across AKS, GKE, and EKS configurations\"}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_azure_comp789\",\"strategy\":{\"name\":\"Azure-Primary Strategy\",\"description\":\"AKS (eastus, Standard_D4s_v3 nodes) with Azure Database for PostgreSQL Flexible Server and Azure Front Door — strongest GDPR/HIPAA compliance posture with native Policy and Defender integration\",\"allocations\":[{\"provider\":\"azure\",\"percentage\":100}],\"metrics\":{\"monthlyCost\":13200,\"avgLatencyMs\":44,\"vendorLockInScore\":67}},\"strategiesGenerated\":9}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_azure_comp789\",\"strategy\":{\"name\":\"Azure + AWS Hybrid\",\"description\":\"AKS primary with AWS S3 for object storage overflow — reduces storage cost by 12% while maintaining Azure compliance perimeter for compute and database 
tiers\",\"allocations\":[{\"provider\":\"azure\",\"percentage\":75},{\"provider\":\"aws\",\"percentage\":25}],\"metrics\":{\"monthlyCost\":14100,\"avgLatencyMs\":46,\"vendorLockInScore\":55}},\"strategiesGenerated\":10}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_azure_comp789\",\"progress\":60,\"strategiesGenerated\":23,\"message\":\"Scoring compliance coverage for GDPR, HIPAA, ISO 27001, and PCI-DSS controls\"}\n\nevent: completed\ndata: {\"jobId\":\"job_azure_comp789\",\"status\":\"completed\",\"progress\":100,\"strategiesGenerated\":38,\"topStrategy\":{\"name\":\"Azure-Primary Strategy\",\"metrics\":{\"monthlyCost\":13200,\"avgLatencyMs\":44,\"vendorLockInScore\":67}},\"comparisonReport\":\"# Multi-Cloud Strategy Comparison Report\\n\\n## Winner: Azure-Primary Strategy\\n\\nMonthly cost: **$13,200** — within the $20,000 budget.\\nAverage latency: **44 ms** — within the 100 ms SLA.\\nCompliance: native GDPR, HIPAA, and ISO 27001 tooling; no third-party additions required.\",\"completedAt\":\"2024-01-19T14:00:00Z\"}\n"},"ociStream":{"summary":"OCI — Compute + Autonomous Database database-intensive workload stream","value":"event: init\ndata: {\"jobId\":\"job_oci_db321\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_oci_db321\",\"progress\":25,\"strategiesGenerated\":8,\"message\":\"Benchmarking Autonomous Database ATP throughput against RDS and Cloud SQL at equivalent OCPU counts\"}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_oci_db321\",\"strategy\":{\"name\":\"OCI-Primary Strategy\",\"description\":\"OCI Compute VM.Standard.E4.Flex (4 OCPU / 64 GB) with Autonomous Database ATP (4 OCPU) in us-ashburn-1 — best price-performance for OLTP-heavy workloads; Autonomous Database self-tunes indexes and query plans\",\"allocations\":[{\"provider\":\"oci\",\"percentage\":100}],\"metrics\":{\"monthlyCost\":9600,\"avgLatencyMs\":48,\"vendorLockInScore\":55}},\"strategiesGenerated\":9}\n\nevent: 
strategyGenerated\ndata: {\"jobId\":\"job_oci_db321\",\"strategy\":{\"name\":\"OCI + AWS Hybrid\",\"description\":\"OCI Autonomous Database for primary OLTP workload with AWS S3 and Lambda for analytics offload — keeps database cost advantage while leveraging mature AWS analytics ecosystem\",\"allocations\":[{\"provider\":\"oci\",\"percentage\":70},{\"provider\":\"aws\",\"percentage\":30}],\"metrics\":{\"monthlyCost\":11200,\"avgLatencyMs\":50,\"vendorLockInScore\":44}},\"strategiesGenerated\":10}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_oci_db321\",\"progress\":72,\"strategiesGenerated\":24,\"message\":\"Calculating total cost of ownership including Autonomous Database OCPU licensing\"}\n\nevent: completed\ndata: {\"jobId\":\"job_oci_db321\",\"status\":\"completed\",\"progress\":100,\"strategiesGenerated\":33,\"topStrategy\":{\"name\":\"OCI-Primary Strategy\",\"metrics\":{\"monthlyCost\":9600,\"avgLatencyMs\":48,\"vendorLockInScore\":55}},\"comparisonReport\":\"# Multi-Cloud Strategy Comparison Report\\n\\n## Winner: OCI-Primary Strategy\\n\\nMonthly cost: **$9,600** — 34% lower than the AWS RDS baseline ($14,500).\\nAverage latency: **48 ms** — within the 75 ms SLA.\\nAutonomous Database eliminates DBA overhead for index tuning, patching, and vacuuming.\",\"completedAt\":\"2024-01-20T09:10:00Z\"}\n"},"digitalOceanStream":{"summary":"DigitalOcean — Droplets + Managed Databases cost-optimized workload stream","value":"event: init\ndata: {\"jobId\":\"job_do_xyz789\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_do_xyz789\",\"progress\":20,\"strategiesGenerated\":7,\"message\":\"Evaluating DigitalOcean Droplet sizes and Managed Database tiers against workload traffic profile\"}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_do_xyz789\",\"strategy\":{\"name\":\"DigitalOcean-Primary Strategy\",\"description\":\"DigitalOcean Droplets (s-4vcpu-8gb, nyc3) as primary compute with Managed 
Databases (PostgreSQL, 2-node HA) and Spaces for S3-compatible object storage — lowest total monthly spend; predictable flat-rate pricing with no data-transfer surprises within region\",\"allocations\":[{\"provider\":\"digitalocean\",\"percentage\":80},{\"provider\":\"aws\",\"percentage\":20}],\"metrics\":{\"monthlyCost\":7800,\"avgLatencyMs\":55,\"vendorLockInScore\":48}},\"strategiesGenerated\":8}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_do_xyz789\",\"strategy\":{\"name\":\"DigitalOcean + GCP Balanced\",\"description\":\"DigitalOcean Droplets for primary API tier with Managed Databases, GCP Cloud Run for burst compute — good cost/latency balance with lower lock-in than DO-only\",\"allocations\":[{\"provider\":\"digitalocean\",\"percentage\":65},{\"provider\":\"gcp\",\"percentage\":35}],\"metrics\":{\"monthlyCost\":9400,\"avgLatencyMs\":50,\"vendorLockInScore\":41}},\"strategiesGenerated\":9}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_do_xyz789\",\"progress\":68,\"strategiesGenerated\":26,\"message\":\"Ranking strategies by monthly cost; verifying all options satisfy 150 ms SLA\"}\n\nevent: completed\ndata: {\"jobId\":\"job_do_xyz789\",\"status\":\"completed\",\"progress\":100,\"strategiesGenerated\":38,\"topStrategy\":{\"name\":\"DigitalOcean-Primary Strategy\",\"metrics\":{\"monthlyCost\":7800,\"avgLatencyMs\":55,\"vendorLockInScore\":48}},\"comparisonReport\":\"# Multi-Cloud Strategy Comparison Report\\n\\n## Winner: DigitalOcean-Primary Strategy\\n\\nMonthly cost: **$7,800** — 45% lower than the AWS-dominant baseline ($14,200).\\nAverage latency: **55 ms** — within the 150 ms SLA.\\nVendor lock-in score: **48/100** — moderate, significantly lower than the AWS-only baseline (74/100).\",\"completedAt\":\"2024-01-16T09:45:00Z\"}\n"},"digitaloceanAMDNVMeStream":{"summary":"DigitalOcean AMD NVMe Droplets — in-flight stream with s-2vcpu-4gb-amd strategies accumulating","value":"event: init\ndata: 
{\"jobId\":\"job_do_amd_nvme001\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_do_amd_nvme001\",\"progress\":22,\"strategiesGenerated\":6,\"message\":\"Evaluating DigitalOcean AMD NVMe Droplet sizes and Managed Database tiers against I/O-intensive workload profile\"}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_do_amd_nvme001\",\"strategy\":{\"name\":\"DigitalOcean AMD NVMe Primary Strategy\",\"description\":\"DigitalOcean AMD NVMe Droplets (s-2vcpu-4gb-amd, nyc3) as primary compute — NVMe-backed local SSD delivers higher I/O throughput than standard Droplets at the same price point, with DigitalOcean Managed Databases (PostgreSQL, 2-node HA) and Spaces for object storage\",\"allocations\":[{\"provider\":\"digitalocean\",\"percentage\":100}],\"metrics\":{\"monthlyCost\":6200,\"avgLatencyMs\":52,\"vendorLockInScore\":44}},\"strategiesGenerated\":7}\n\nevent: strategyGenerated\ndata: {\"jobId\":\"job_do_amd_nvme001\",\"strategy\":{\"name\":\"DO AMD NVMe + GCP Balanced\",\"description\":\"s-2vcpu-4gb-amd Droplets for primary API tier with GCP Cloud Run for burst compute — good cost/latency balance with lower lock-in than DO-only\",\"allocations\":[{\"provider\":\"digitalocean\",\"percentage\":65},{\"provider\":\"gcp\",\"percentage\":35}],\"metrics\":{\"monthlyCost\":8100,\"avgLatencyMs\":49,\"vendorLockInScore\":38}},\"strategiesGenerated\":8}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_do_amd_nvme001\",\"progress\":55,\"strategiesGenerated\":19,\"message\":\"Ranking strategies by monthly cost; verifying all options satisfy 150 ms SLA\"}\n"},"failedStream":{"summary":"Job failure — upstream provider pricing API unavailable mid-run","value":"event: init\ndata: {\"jobId\":\"job_aws_fail001\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_aws_fail001\",\"progress\":35,\"strategiesGenerated\":14,\"message\":\"Fetching live 
pricing data for EC2, RDS, and CloudFront\"}\n\nevent: failed\ndata: {\"jobId\":\"job_aws_fail001\",\"status\":\"failed\",\"progress\":35,\"error\":\"PricingFetchError: AWS Pricing API returned 503 after 3 retries — unable to compute accurate cost estimates\",\"failedAt\":\"2024-01-21T10:14:22Z\"}\n"},"cancelledStream":{"summary":"Job cancellation — agent issued DELETE before completion","value":"event: init\ndata: {\"jobId\":\"job_gcp_cancel002\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_gcp_cancel002\",\"progress\":52,\"strategiesGenerated\":21,\"message\":\"Evaluating GCP Cloud Run burst strategies against 80 ms SLA\"}\n\nevent: cancelled\ndata: {\"jobId\":\"job_gcp_cancel002\",\"status\":\"cancelled\",\"progress\":52,\"strategiesGenerated\":21,\"reason\":\"Cancelled by agent request via DELETE /api/multi-cloud/jobs/job_gcp_cancel002\",\"cancelledAt\":\"2024-01-21T11:03:45Z\"}\n"},"digitaloceanAMDNVMeFailedStream":{"summary":"DigitalOcean AMD NVMe — job failure mid-run (pricing API unavailable)","value":"event: init\ndata: {\"jobId\":\"job_do_amd_fail003\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_do_amd_fail003\",\"progress\":41,\"strategiesGenerated\":16,\"message\":\"Fetching live pricing data for DigitalOcean AMD NVMe Droplets (s-2vcpu-4gb-amd) and Managed Databases\"}\n\nevent: failed\ndata: {\"jobId\":\"job_do_amd_fail003\",\"status\":\"failed\",\"progress\":41,\"error\":\"PricingFetchError: DigitalOcean Pricing API returned 503 after 3 retries — unable to compute accurate cost estimates for s-2vcpu-4gb-amd AMD NVMe Droplet configurations\",\"failedAt\":\"2024-01-22T08:27:14Z\"}\n"},"digitaloceanAMDNVMeCancelledStream":{"summary":"DigitalOcean AMD NVMe — job cancellation via DELETE before completion","value":"event: init\ndata: 
{\"jobId\":\"job_do_amd_cancel004\",\"status\":\"running\",\"progress\":0,\"strategiesGenerated\":0}\n\nevent: progressUpdate\ndata: {\"jobId\":\"job_do_amd_cancel004\",\"progress\":47,\"strategiesGenerated\":18,\"message\":\"Ranking DigitalOcean AMD NVMe Droplet strategies (s-2vcpu-4gb-amd) by monthly cost; verifying all options satisfy 150 ms SLA\"}\n\nevent: cancelled\ndata: {\"jobId\":\"job_do_amd_cancel004\",\"status\":\"cancelled\",\"progress\":47,\"strategiesGenerated\":18,\"reason\":\"Cancelled by agent request via DELETE /api/multi-cloud/jobs/job_do_amd_cancel004\",\"cancelledAt\":\"2024-01-22T09:15:33Z\"}\n"}}}}},"401":{"description":"Unauthorized - invalid or missing API key","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"404":{"description":"Job not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"429":{"description":"Rate limit exceeded","headers":{"Retry-After":{"description":"Seconds until the rate-limit window resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}}},"components":{"schemas":{"ValidationErrorCode":{"type":"string","description":"Machine-readable error code indicating the category of validation failure","enum":["INVALID_FIELD","MISSING_REQUIRED","UNKNOWN_PROVIDER","UNKNOWN_SIZE","UNKNOWN_REGION","INVALID_REQUEST"]},"FieldError":{"type":"object","description":"A single structured field-level error returned on HTTP 400 responses","required":["code","pointer","message"],"properties":{"code":{"$ref":"#/components/schemas/ValidationErrorCode"},"pointer":{"type":"string","description":"JSON Pointer (RFC 6901) to the field that caused the error, or an empty string for request-level errors","example":"/resources/0/provider"},"message":{"type":"string","description":"Human-readable description of the error","example":"Unknown provider 'gce'; valid values are aws, gcp, azure, oci, 
digitalocean"},"suggestion":{"type":"string","description":"Optional actionable hint for self-correction","example":"Use one of: aws, gcp, azure, oci, digitalocean"},"suggestions":{"type":"array","description":"For UNKNOWN_REGION errors, the 3-5 closest matching regions ranked by similarity to the invalid input","items":{"type":"object","required":["regionKey","regionLabel"],"properties":{"regionKey":{"type":"string","description":"Canonical shortcode for the region","example":"use2"},"regionLabel":{"type":"string","description":"Standard provider label for the region","example":"us-east-2"}}}}}},"ValidationErrorResponse":{"type":"object","description":"Response body for HTTP 400 errors — either a single error or an array of field errors","oneOf":[{"type":"object","required":["error"],"properties":{"error":{"$ref":"#/components/schemas/FieldError"}}},{"type":"object","required":["errors"],"properties":{"errors":{"type":"array","items":{"$ref":"#/components/schemas/FieldError"},"minItems":1}}}]},"ApiKeyCreatedResponse":{"type":"object","description":"Response returned when an API key is successfully created (either via admin POST /keys or via POST /keys/register)","properties":{"id":{"type":"string","description":"Unique identifier for the created API key","example":"3f9a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c"},"key":{"type":"string","description":"The plain-text API key. 
Store this immediately — it is shown only once.","example":"cwm_live_a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2"},"keyPrefix":{"type":"string","description":"Redacted prefix used to identify the key in listings","example":"cwm_live_a1b2c3d4e5f6a1b2..."},"name":{"type":"string","example":"canvas-cloud-ai-prod"},"scopes":{"type":"array","items":{"type":"string","enum":["read","write","admin"]},"example":["read","write"]},"rateLimit":{"type":"integer","example":1000},"createdAt":{"type":"string","format":"date-time","example":"2026-05-10T17:00:00.000Z"},"expiresAt":{"type":"string","format":"date-time","nullable":true},"message":{"type":"string","example":"Store this API key securely. You won't be able to see it again."}}},"RegistrationTokenCreatedResponse":{"type":"object","description":"Response returned when a registration token is minted. The plain `token` value is shown only once.","properties":{"id":{"type":"string","description":"Unique identifier for the registration token","example":"7a8b9c0d-1e2f-3a4b-5c6d-7e8f9a0b1c2d"},"token":{"type":"string","description":"The plain-text registration token. Share this with the recipient immediately — it will not be shown again.","example":"cwm_reg_a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2"},"tokenPrefix":{"type":"string","description":"Redacted prefix used to identify the token in listings","example":"cwm_reg_a1b2c3d4e5f6a1b2..."},"name":{"type":"string","example":"canvas-cloud-ai"},"scopes":{"type":"array","items":{"type":"string","enum":["read","write","admin"]},"example":["read","write"]},"rateLimit":{"type":"integer","example":1000},"expiresAt":{"type":"string","format":"date-time","nullable":true,"example":"2026-06-01T00:00:00Z"},"createdAt":{"type":"string","format":"date-time","example":"2026-05-10T17:00:00.000Z"},"message":{"type":"string","example":"Share this token with the client. 
It can only be used once and will not be shown again."}}},"RegistrationTokenSummary":{"type":"object","description":"Summary of a registration token as returned by GET /register-tokens (plain token value is never included)","properties":{"id":{"type":"string","example":"7a8b9c0d-1e2f-3a4b-5c6d-7e8f9a0b1c2d"},"tokenPrefix":{"type":"string","example":"cwm_reg_a1b2c3d4e5f6a1b2..."},"name":{"type":"string","example":"canvas-cloud-ai"},"scopes":{"type":"array","items":{"type":"string","enum":["read","write","admin"]},"example":["read","write"]},"rateLimit":{"type":"integer","example":1000},"expiresAt":{"type":"string","format":"date-time","nullable":true,"example":"2026-06-01T00:00:00Z"},"usedAt":{"type":"string","format":"date-time","nullable":true,"description":"Set when the token was consumed. Null if still pending."},"isActive":{"type":"boolean","example":true},"createdByKeyId":{"type":"string","description":"ID of the admin API key that created this token","example":"1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d"},"createdAt":{"type":"string","format":"date-time","example":"2026-05-10T17:00:00.000Z"},"status":{"type":"string","enum":["pending","used","expired","revoked"],"description":"Computed status of the token","example":"pending"}}},"Resource":{"type":"object","required":["id","type","name","provider","status"],"properties":{"id":{"type":"string","description":"Unique resource identifier","example":"ec2-1"},"type":{"type":"string","enum":["compute","network","database","storage","kubernetes"],"description":"Resource type. 
`kubernetes` models a managed container node pool (EKS / GKE / AKS / OKE / DOKS): its `characteristics.nodeCount` worker nodes autoscale between `characteristics.minNodes` and `characteristics.maxNodes`, growing both serving capacity (`maxThroughput`) and hourly cost proportionally.\n","example":"compute"},"name":{"type":"string","description":"Human-readable resource name","example":"Web Server 1"},"provider":{"type":"string","enum":["aws","gcp","azure","oci","digitalocean"],"description":"Cloud provider","example":"aws"},"status":{"type":"string","enum":["healthy","warning","critical","offline"],"description":"Current resource status","default":"healthy"},"characteristics":{"type":"object","description":"Provider-specific resource characteristics","properties":{"serviceFamily":{"type":"string","description":"Provider-specific service family identifier. AWS values include: `ec2`, `rds`, `aurora-postgresql`, `aurora-dsql`, `aurora-serverless`, `dynamodb`, `s3`, `elb`, `cloudfront`, `lambda`. GCP values include: `gce`, `cloud-sql`, `cloud-spanner`, `bigtable-ssd`, `bigtable-hdd`, `cloud-run`, `gcs`, `cloud-load-balancing`. Azure values include: `azure_vm`, `azure-sql`, `azure-functions`, `blob-storage`, `azure-lb`. OCI values include: `oci_vm`, `autonomous-db`, `oci-container`, `oci-lb`. DigitalOcean values include: `droplets`, `managed_postgresql`, `spaces`, `load_balancer`.\n","example":"ec2"},"size":{"type":"string","example":"m5.large"},"maxThroughput":{"type":"number","example":2000},"baseLatency":{"type":"number","example":3},"maxConnections":{"type":"number","description":"Database only — the connection-pool capacity (maximum concurrent connections) for a database resource. The simulation models connection-pool pressure as `activeConnections / maxConnections` (surfaced as `metrics.connection_pressure`), so raising this value gives the database more headroom before the pool saturates. 
Defaults to 100 when omitted.\n","example":800},"capacityGB":{"type":"number","description":"Storage only — provisioned storage capacity in gigabytes for a storage/volume resource. Used together with `maxIops` to model block-storage throughput and IOPS utilization.\n","example":500},"maxIops":{"type":"number","description":"Storage only — the provisioned maximum IOPS (I/O operations per second) for a block-storage volume. The simulation models storage IOPS utilization against this ceiling (surfaced as `metrics.storageIopsUtilization`). Falls back to the provider's default volume profile IOPS when omitted.\n","example":3000},"cacheHitRate":{"type":"number","description":"Cache / CDN only — the expected cache hit rate as a fraction between 0 and 1 (e.g. `0.85` = 85%). Higher values reduce the load that reaches downstream origin/database resources and lower effective latency. Defaults to 0.8 when omitted for cache resources.\n","example":0.85},"autoscaling":{"type":"boolean","description":"Compute only — marks a compute resource as the autoscaling primary. When multiple compute resources exist, the engine uses the one flagged `autoscaling: true` as the autoscaling target; otherwise it falls back to the first compute resource.\n","example":true},"nodeCount":{"type":"number","description":"Kubernetes only — the current number of worker nodes in the managed node pool. The cluster's `maxThroughput` and hourly cost scale proportionally with this count as autoscaling adds/removes nodes.\n","example":2},"minNodes":{"type":"number","description":"Kubernetes only — the minimum number of worker nodes the node pool will scale in to. Defaults to the provider autoscaling profile minimum when omitted.\n","example":2},"maxNodes":{"type":"number","description":"Kubernetes only — the maximum number of worker nodes the node pool will scale out to. 
Defaults to the provider autoscaling profile maximum when omitted.\n","example":10},"nodePools":{"type":"array","description":"Kubernetes only — multiple node pools per cluster. When present, each pool scales independently within its own min/max bounds and is billed at its own per-node hourly rate. The legacy single-pool fields (`nodeCount`, `minNodes`, `maxNodes`) are used only when `nodePools` is absent.\n","items":{"type":"object","properties":{"name":{"type":"string","description":"Human-readable node pool name.","example":"general"},"nodeCount":{"type":"number","description":"Current number of worker nodes in this pool. Scales proportionally with the pool's serving capacity and cost as autoscaling adds/removes nodes.\n","example":2},"minNodes":{"type":"number","description":"Minimum number of worker nodes this pool will scale in to. Defaults to the provider autoscaling profile minimum when omitted.\n","example":2},"maxNodes":{"type":"number","description":"Maximum number of worker nodes this pool will scale out to. Defaults to the provider autoscaling profile maximum when omitted.\n","example":10},"perNodeRate":{"type":"number","description":"Hourly cost per node for this pool. Falls back to the provider default rate when omitted.\n","example":0.096}}}}}},"recoveryPolicy":{"type":"object","description":"Per-resource recovery thresholds that control how quickly the simulation transitions a resource from critical → warning → healthy. All four fields default to the global values (criticalCpuThreshold: 80, criticalSteps: 4, warningCpuThreshold: 70, warningSteps: 3) when omitted. 
Stateless microservices can use lower thresholds and fewer steps to heal faster; databases can use higher thresholds and more steps for a stricter recovery window.\n","properties":{"criticalCpuThreshold":{"type":"number","minimum":0,"maximum":100,"default":80,"description":"CPU must drop to or below this percentage before the critical → warning cooldown clock starts.","example":75},"criticalSteps":{"type":"integer","minimum":1,"default":4,"description":"Number of consecutive simulation steps the CPU must stay at or below criticalCpuThreshold before transitioning from critical to warning.","example":2},"warningCpuThreshold":{"type":"number","minimum":0,"maximum":100,"default":70,"description":"CPU must drop to or below this percentage before the warning → healthy cooldown clock starts.","example":60},"warningSteps":{"type":"integer","minimum":1,"default":3,"description":"Number of consecutive simulation steps the CPU must stay at or below warningCpuThreshold before transitioning from warning to healthy.","example":1}}}}},"Connection":{"type":"object","required":["sourceId","targetId"],"properties":{"sourceId":{"type":"string","description":"ID of the source resource"},"targetId":{"type":"string","description":"ID of the target resource"}}},"Simulation":{"type":"object","properties":{"id":{"type":"string","format":"uuid","description":"Unique simulation identifier"},"name":{"type":"string"},"description":{"type":"string"},"resources":{"type":"array","items":{"$ref":"#/components/schemas/Resource"}},"connections":{"type":"array","items":{"$ref":"#/components/schemas/Connection"}},"traffic":{"type":"number"},"currentTime":{"type":"integer","description":"Current simulation time step"}}},"EpisodeConfig":{"type":"object","required":["maxSteps","targetSLA"],"properties":{"maxSteps":{"type":"integer","description":"Maximum steps per 
episode","minimum":1,"example":300},"targetTrafficPattern":{"type":"string","enum":["constant","ramp","burst","step","wave","custom"],"description":"Traffic pattern to simulate","default":"constant","example":"ramp"},"initialTraffic":{"type":"number","description":"Starting traffic load (requests/sec)","default":1000,"example":5000},"targetSLA":{"type":"object","required":["maxLatencyP95","maxErrorRate"],"properties":{"maxLatencyP95":{"type":"number","description":"Target P95 latency threshold (ms)","example":200},"maxErrorRate":{"type":"number","description":"Target maximum error rate (%)","example":1}}},"costBudgetPerHour":{"type":"number","description":"Target cost budget per hour (USD)","default":10,"example":5},"enableFailures":{"type":"boolean","description":"Whether to inject random failures","default":false},"tick_seconds":{"type":"integer","minimum":1,"maximum":3600,"default":60,"description":"Number of simulated seconds each step advances the simulation clock.\nDefault is 60 (one simulated minute per step). Set higher values to\nmodel longer time horizons (e.g. 300 for 5-minute ticks, 3600 for\nhourly ticks). The value is fixed for the lifetime of the episode.\n","example":60}}},"RLEnvironment":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"simulationId":{"type":"string","format":"uuid"},"episodeConfig":{"$ref":"#/components/schemas/EpisodeConfig"},"currentStep":{"type":"integer","description":"Current step number in episode","example":42},"totalReward":{"type":"number","description":"Cumulative reward so far","example":12.45},"isActive":{"type":"boolean","description":"Whether episode is still running","example":true},"lastSimTimeHuman":{"type":"string","description":"Human-readable simulated elapsed time as of the last step or reset (e.g. \"1h 30m\"). 
Absent before the first step.","example":"1h 30m"},"createdAt":{"type":"string","format":"date-time"},"updatedAt":{"type":"string","format":"date-time"},"webhookDeliveryStatus":{"type":"string","enum":["pending","delivered","failed"],"description":"Webhook delivery status (for episode completion)","example":"delivered"},"webhookDeliveryAttempts":{"type":"integer","description":"Number of webhook delivery attempts made","example":1},"webhookDeliveryError":{"type":"string","description":"Error message if webhook delivery failed"},"webhookDeliveredAt":{"type":"string","format":"date-time","description":"Timestamp when webhook was successfully delivered"},"idleExpiresAt":{"type":"string","format":"date-time","description":"ISO 8601 timestamp at which this environment will be automatically expired\ndue to inactivity. Computed as `lastActivityAt` (or `createdAt` if no step\nor reset has been called yet) plus the 2-hour idle TTL. Agents can use\nthis value to schedule keep-alive calls (`POST /rl/environments/{environmentId}/step`\nor `POST /rl/environments/{environmentId}/reset`) before expiry.\n","example":"2024-01-15T14:00:00.000Z"}}},"Action":{"type":"object","required":["type","parameters"],"properties":{"type":{"type":"string","enum":["adjust_threshold","scale_out","scale_in","add_resource","remove_resource","no_op","set_recovery_policy"],"description":"Type of action to execute"},"parameters":{"oneOf":[{"$ref":"#/components/schemas/AdjustThresholdParams"},{"$ref":"#/components/schemas/ScaleParams"},{"$ref":"#/components/schemas/ResourceParams"},{"$ref":"#/components/schemas/SetRecoveryPolicyParams"}]}}},"AdjustThresholdParams":{"type":"object","description":"Parameters for adjust_threshold action","properties":{"cpuThreshold":{"type":"number","minimum":0,"maximum":100,"description":"CPU utilization threshold for scaling (%)","example":70},"throughputThreshold":{"type":"number","minimum":0,"maximum":100,"description":"Throughput utilization threshold 
(%)","example":75},"latencyThreshold":{"type":"number","minimum":0,"description":"Latency threshold for scaling (ms)","example":180}}},"ScaleParams":{"type":"object","required":["instanceCount"],"properties":{"instanceCount":{"type":"integer","minimum":1,"description":"Number of instances to add or remove","example":2}}},"ResourceParams":{"type":"object","properties":{"resource":{"$ref":"#/components/schemas/Resource"},"resourceId":{"type":"string","description":"For remove_resource action, ID of resource to remove"},"resourceType":{"type":"string","enum":["compute","database","storage","network","security"],"description":"For add_resource action, type of resource to add"},"provider":{"type":"string","enum":["aws","gcp","azure","oci","digitalocean"],"description":"For add_resource action, cloud provider for the new resource"},"targetSize":{"type":"string","description":"For add_resource action, size label for the new resource (e.g. \"t3.micro\" for aws)"},"regionKey":{"type":"string","description":"For add_resource action, region key to assign to the new resource (e.g. \"us-east-1\" for aws). Must be a valid region for the specified provider. 
Returns UNKNOWN_REGION if invalid."}}},"SetRecoveryPolicyParams":{"type":"object","description":"Parameters for set_recovery_policy action — override healing thresholds on a specific resource","required":["resourceId","recoveryPolicy"],"properties":{"resourceId":{"type":"string","description":"ID of the resource whose recovery policy should be updated","example":"res-abc123"},"recoveryPolicy":{"type":"object","description":"Recovery policy thresholds to apply","required":["criticalCpuThreshold","criticalSteps","warningCpuThreshold","warningSteps"],"properties":{"criticalCpuThreshold":{"type":"number","minimum":0,"maximum":100,"default":80,"description":"CPU % above which a resource is considered critical","example":90},"criticalSteps":{"type":"integer","minimum":1,"default":4,"description":"Steps the resource must stay at critical CPU before recovery triggers","example":2},"warningCpuThreshold":{"type":"number","minimum":0,"maximum":100,"default":70,"description":"CPU % above which a resource is considered in warning state","example":75},"warningSteps":{"type":"integer","minimum":1,"default":3,"description":"Steps the resource must stay at warning CPU before recovery triggers","example":2}}}}},"Observation":{"type":"object","required":["metrics","resources","traffic","currentTime"],"properties":{"metrics":{"type":"object","description":"Current system metrics","properties":{"cpuUsage":{"type":"number","description":"Average CPU utilization (%)","example":65.3},"latencyP50":{"type":"number","description":"P50 latency (ms)","example":45},"latencyP95":{"type":"number","description":"P95 latency (ms)","example":91},"errorRate":{"type":"number","description":"Error rate (%)","example":0.5},"throughput":{"type":"number","description":"Current throughput (requests/sec)","example":4500},"costPerHour":{"type":"number","description":"Current hourly cost (USD)","example":0.31},"connectionPressure":{"type":"number","description":"Database connection-pool pressure — the ratio 
of estimated active connections to total pool capacity (capped at 3.0). Only present when the simulation contains a database resource; absent otherwise.\n","example":0.42}}},"resources":{"type":"array","description":"Current resources in simulation","items":{"$ref":"#/components/schemas/Resource"}},"traffic":{"type":"number","description":"Current traffic load","example":5000},"currentTime":{"type":"integer","description":"Current simulation time (total elapsed simulated seconds)","example":2520},"sim_time_seconds":{"type":"integer","description":"Total elapsed simulated seconds since the episode started.\nEquals `currentTime` (which is now expressed in seconds).\nIncluded so agents can compute real-world time durations without\nre-reading the environment config.\n","example":2520},"tick_seconds":{"type":"integer","minimum":1,"maximum":3600,"description":"Number of simulated seconds this step advanced the clock.\nReflects the per-step `tick_seconds` override if one was supplied in\nthe step request body; otherwise matches the `tick_seconds` value set\nin `episodeConfig` at environment creation time.\n","example":60},"autoscalingConfig":{"type":"object","description":"Current autoscaling configuration","properties":{"scaleOutCpuThreshold":{"type":"number"},"scaleInCpuThreshold":{"type":"number"},"maxInstances":{"type":"number"},"minInstances":{"type":"number"}}},"scalingHistory":{"type":"array","description":"Recent scaling actions","items":{"type":"object","properties":{"timestamp":{"type":"integer"},"action":{"type":"string"},"reason":{"type":"string"}}}},"recentEvents":{"type":"array","description":"Recent system events","items":{"type":"object"}}}},"Reward":{"type":"object","required":["total","components"],"properties":{"total":{"type":"number","description":"Total reward for this step (weighted sum of components)","example":0.312},"components":{"type":"object","description":"Individual reward components before 
weighting","properties":{"performance":{"type":"number","description":"Performance score (0-1, based on latency and errors)","example":0.578},"cost":{"type":"number","description":"Cost efficiency score (0-1, based on budget)","example":1},"stability":{"type":"number","description":"Stability score (-1 to 1, penalizes excessive changes)","example":-0.2},"sla":{"type":"number","description":"SLA compliance score (-1 to 0, penalizes violations)","example":-0.5}}},"metrics":{"type":"object","description":"Raw metrics used for reward calculation","properties":{"avgLatency":{"type":"number"},"errorRate":{"type":"number"},"costPerHour":{"type":"number"},"slaViolations":{"type":"integer"}}}}},"RLObs":{"type":"object","description":"Agent-facing decision state returned by the step and observation endpoints. Contains only the signals an RL agent needs to choose its next action. `cpu_util` is a fraction in the range [0, 1]; the `error_rate` and `uptime` fractions are reported in `RLMetrics`.\n","required":["rps","cpu_util","instances","traffic","currentTime"],"properties":{"rps":{"type":"number","description":"Throughput in requests per second","example":1620},"cpu_util":{"type":"number","description":"CPU utilization as a fraction (0–1)","example":0.582},"instances":{"type":"integer","description":"Total number of active compute instances","example":3},"traffic":{"type":"number","description":"Current traffic load (requests per second)","example":1620},"currentTime":{"type":"integer","description":"Current simulation time step","example":15},"tick_seconds":{"type":"integer","minimum":1,"maximum":3600,"description":"Number of simulated seconds this step advanced the clock.\nReflects the per-step `tick_seconds` override supplied in the step\nrequest body, or the episode-level `episodeConfig.tick_seconds` when\nno per-step override was given.\n","example":60}}},"RLMetrics":{"type":"object","description":"Evaluation-oriented metrics returned alongside obs by the step and observation endpoints. 
These are the outputs your reward function and monitoring dashboards should read. Fractions (error_rate, uptime) are in the range [0, 1]. The `connection_pressure` field is optional — it is only present when the simulation contains at least one database resource (e.g. RDS, Cloud SQL, Azure SQL, Autonomous DB, Managed PostgreSQL). Agents should guard against its absence when no database resource is configured.\n","required":["cost_usd_hr","latency_p95","error_rate","uptime","sla_violations"],"properties":{"cost_usd_hr":{"type":"number","description":"Current hourly cost in USD","example":1.08},"latency_p95":{"type":"number","description":"P95 latency in milliseconds","example":104},"error_rate":{"type":"number","description":"Error rate as a fraction (0–1)","example":0.003},"uptime":{"type":"number","description":"Uptime fraction (0–1); equals 1 − error_rate","example":0.997},"sla_violations":{"type":"integer","description":"Number of SLA dimension violations at this step (0 = fully compliant)","example":0},"connection_pressure":{"type":"number","description":"DB connection-pool pressure ratio (activeConnections / maxConnections), capped at 3.0. Only present when the simulation contains database resources. Values > 1.0 indicate pool exhaustion; values > 1.5 indicate severe saturation. 
Use this field to build reward functions that penalise connection-pool exhaustion independently of the aggregate error_rate signal.\n","example":1.25}}},"AutoscalingConfig":{"type":"object","properties":{"scaleOutCpuThreshold":{"type":"number"},"scaleInCpuThreshold":{"type":"number"},"scaleOutThroughputThreshold":{"type":"number"},"scaleInThroughputThreshold":{"type":"number"},"scaleOutLatencyThreshold":{"type":"number"},"cooldownSeconds":{"type":"number"},"minInstances":{"type":"number"},"maxInstances":{"type":"number"}}},"OptimizationGoals":{"type":"object","required":["primary"],"properties":{"primary":{"type":"string","enum":["minimize_cost","maximize_performance","balance"],"description":"Primary optimization objective","example":"minimize_cost"},"constraints":{"type":"object","properties":{"max_cost_per_hour":{"type":"number","description":"Maximum acceptable cost per hour (USD)","example":10},"min_throughput":{"type":"number","description":"Minimum required throughput (requests/second)","example":5000},"max_latency_p95":{"type":"number","description":"Maximum acceptable P95 latency (milliseconds)","example":200}}},"weights":{"type":"object","description":"Custom weights for multi-objective optimization","properties":{"cost":{"type":"number","example":0.4},"performance":{"type":"number","example":0.4},"stability":{"type":"number","example":0.2}}}}},"OptimizationRecommendation":{"type":"object","properties":{"rank":{"type":"integer","description":"Recommendation ranking (1 is best)","example":1},"name":{"type":"string","description":"Descriptive name","example":"Cost-Optimized Configuration"},"description":{"type":"string","example":"Reduces costs by 38% while maintaining performance"},"simulationSnapshot":{"type":"object","description":"Modified simulation 
configuration","properties":{"resources":{"type":"array","items":{"$ref":"#/components/schemas/Resource"}},"connections":{"type":"array","items":{"$ref":"#/components/schemas/Connection"}},"autoscalingConfig":{"$ref":"#/components/schemas/AutoscalingConfig"}}},"metrics":{"type":"object","properties":{"cost_per_hour":{"type":"number"},"latency_p95":{"type":"number"},"throughput":{"type":"number"},"error_rate":{"type":"number"}}},"improvements":{"type":"array","items":{"type":"string"},"example":["Reduced cost by 38%","Resolved CPU saturation"]},"changes":{"type":"array","items":{"type":"string"},"example":["Changed web-server-1 from m5.large to t3.medium"]},"score":{"type":"number","description":"Overall score based on goals","example":87.5},"costSavingsPercent":{"type":"number","description":"Cost savings vs baseline","example":38}}},"OptimizationJob":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"status":{"type":"string","enum":["pending","running","completed","failed"],"example":"completed"},"variantsGenerated":{"type":"integer","example":57},"variantsCompleted":{"type":"integer","example":57},"createdAt":{"type":"string","format":"date-time"},"completedAt":{"type":"string","format":"date-time"},"error":{"type":"string","description":"Error message if failed"},"webhookDeliveryStatus":{"type":"string","enum":["pending","delivered","failed"],"description":"Webhook delivery status","example":"delivered"},"webhookDeliveryAttempts":{"type":"integer","description":"Number of webhook delivery attempts made","example":1},"webhookDeliveryError":{"type":"string","description":"Error message if webhook delivery failed"},"webhookDeliveredAt":{"type":"string","format":"date-time","description":"Timestamp when webhook was successfully delivered"}}},"TrafficForecastPoint":{"type":"object","required":["timestamp","rps"],"properties":{"timestamp":{"type":"number","description":"Time offset in simulation 
steps","example":30},"rps":{"type":"number","description":"Requests per second at this timestamp","example":12000},"label":{"type":"string","description":"Optional label for this data point","example":"Peak - Noon"}}},"TrafficForecast":{"type":"object","required":["name","dataPoints"],"properties":{"name":{"type":"string","description":"Name of the traffic forecast","example":"Black Friday 2025"},"description":{"type":"string","description":"Optional description","example":"Predicted traffic spike for sale event"},"dataPoints":{"type":"array","description":"Traffic data points over time","items":{"$ref":"#/components/schemas/TrafficForecastPoint"}},"peakRPS":{"type":"number","description":"Peak requests per second (calculated)","example":12000},"avgRPS":{"type":"number","description":"Average requests per second (calculated)","example":5200}}},"ValidationResult":{"type":"object","properties":{"passed":{"type":"boolean","description":"Whether infrastructure can handle the forecast","example":false},"summary":{"type":"string","description":"Summary of validation results","example":"Infrastructure will fail under peak load due to CPU saturation"},"peakMetrics":{"type":"object","description":"Metrics at peak traffic","properties":{"timestamp":{"type":"number"},"traffic":{"type":"number"},"cpuUsage":{"type":"number"},"latencyP95":{"type":"number"},"errorRate":{"type":"number"},"costPerHour":{"type":"number"}}},"bottlenecksDetected":{"type":"array","description":"List of bottlenecks found","items":{"type":"string"},"example":["CPU saturation at 98%","Error rate exceeds 5%"]},"failurePoints":{"type":"array","description":"Time points where infrastructure fails","items":{"type":"object","properties":{"timestamp":{"type":"number"},"traffic":{"type":"number"},"reason":{"type":"string"}}}},"recommendations":{"type":"array","description":"Recommended fixes","items":{"type":"string"},"example":["Scale out to 5 instances before peak","Increase CPU threshold to 
75%"]}}},"ThresholdTestResult":{"type":"object","properties":{"scaleOutCpuThreshold":{"type":"number","description":"CPU threshold for scaling out","example":70},"scaleInCpuThreshold":{"type":"number","description":"CPU threshold for scaling in","example":30},"scaleOutThroughputThreshold":{"type":"number","example":80},"scaleInThroughputThreshold":{"type":"number","example":40},"metrics":{"type":"object","description":"Performance metrics for this configuration","properties":{"cost_per_hour":{"type":"number","example":4.5},"latency_p95":{"type":"number","example":142},"error_rate":{"type":"number","example":0.8},"throughput":{"type":"number","example":11500},"scaling_events":{"type":"integer","example":12}}},"bottlenecks":{"type":"array","items":{"type":"string"},"example":["CPU spike during rapid scale-out"]},"score":{"type":"number","description":"Overall score (0-100)","example":87.5},"passed":{"type":"boolean","description":"Whether this configuration meets requirements","example":true}}},"PredictionRecommendation":{"type":"object","properties":{"rank":{"type":"integer","description":"Recommendation ranking (1 is best)","example":1},"title":{"type":"string","description":"Short title for recommendation","example":"Proactive scaling before peak"},"description":{"type":"string","example":"Scale out 2 hours before predicted peak to avoid saturation"},"priority":{"type":"string","enum":["critical","high","medium","low"],"example":"critical"},"action":{"type":"string","description":"Specific action to take","example":"Set CPU threshold to 65% and enable predictive scaling"},"expectedImpact":{"type":"string","example":"Prevents 98% CPU saturation, reduces error rate to <1%"},"autoscalingConfig":{"$ref":"#/components/schemas/AutoscalingConfig"},"resourceChanges":{"type":"array","description":"Specific resource 
modifications","items":{"type":"object","properties":{"resourceId":{"type":"string"},"change":{"type":"string"},"reason":{"type":"string"}}}}}},"PredictionJob":{"type":"object","properties":{"id":{"type":"string","format":"uuid"},"type":{"type":"string","enum":["validation","threshold_optimization"],"description":"Type of prediction job","example":"validation"},"status":{"type":"string","enum":["pending","running","completed","failed"],"example":"completed"},"baseSimulationId":{"type":"string","description":"ID of simulation being tested"},"trafficForecast":{"$ref":"#/components/schemas/TrafficForecast"},"validationResult":{"$ref":"#/components/schemas/ValidationResult"},"thresholdTests":{"type":"array","description":"Results from testing different thresholds","items":{"$ref":"#/components/schemas/ThresholdTestResult"}},"bestThresholds":{"$ref":"#/components/schemas/AutoscalingConfig"},"recommendations":{"type":"array","items":{"$ref":"#/components/schemas/PredictionRecommendation"}},"createdAt":{"type":"string","format":"date-time"},"completedAt":{"type":"string","format":"date-time"},"error":{"type":"string","description":"Error message if failed"},"webhookDeliveryStatus":{"type":"string","enum":["pending","delivered","failed"],"description":"Webhook delivery status","example":"delivered"},"webhookDeliveryAttempts":{"type":"integer","description":"Number of webhook delivery attempts made","example":1},"webhookDeliveryError":{"type":"string","description":"Error message if webhook delivery failed"},"webhookDeliveredAt":{"type":"string","format":"date-time","description":"Timestamp when webhook was successfully delivered"}}},"ChaosInjectionConfig":{"type":"object","required":["type","targetId","injectionTime"],"properties":{"type":{"type":"string","enum":["kill_instance","network_delay","database_slowdown","database_overload","cpu_spike","memory_pressure","kill_zone"],"description":"Type of failure to 
inject","example":"kill_instance"},"targetId":{"type":"string","description":"ID of the resource to target (or zone ID for kill_zone)","example":"web-1"},"injectionTime":{"type":"integer","description":"Simulation step when failure should be injected","example":50},"duration":{"type":"integer","description":"How long the failure lasts (in steps, optional)","example":20},"severity":{"type":"number","description":"Severity multiplier (0.0-1.0, optional)","example":0.8}}},"ResilienceScore":{"type":"object","properties":{"overall":{"type":"number","description":"Overall resilience score (0-100)","example":72.5},"grade":{"type":"string","enum":["A","B","C","D","F"],"description":"Letter grade for resilience","example":"C"},"metrics":{"$ref":"#/components/schemas/ResilienceMetrics"}}},"ResilienceMetrics":{"type":"object","properties":{"recoveryTimeSeconds":{"type":"number","description":"Time to recover from failures (in simulation seconds)","example":45.2},"availabilityPercent":{"type":"number","description":"Percentage of time system was available","example":94.3},"meanTimeToDetect":{"type":"number","description":"Average time to detect failures (seconds)","example":3.5},"meanTimeToRecover":{"type":"number","description":"Average time to recover from failures (seconds)","example":12.8},"errorRateDuringFailure":{"type":"number","description":"Error rate percentage during failures","example":15.7}}},"Vulnerability":{"type":"object","properties":{"id":{"type":"string","description":"Unique vulnerability identifier","example":"zone_dependency"},"severity":{"type":"string","enum":["critical","high","medium","low"],"description":"Severity level","example":"high"},"title":{"type":"string","description":"Short vulnerability title","example":"Single Availability Zone Dependency"},"description":{"type":"string","description":"Detailed description of the vulnerability","example":"All web servers are in the same availability zone. 
A zone failure causes complete service outage."},"impact":{"type":"string","description":"Impact on the system","example":"100% service downtime if us-east-1a fails"},"recommendation":{"type":"string","description":"How to fix the vulnerability","example":"Distribute web servers across at least 2 availability zones"},"detectedAt":{"type":"integer","description":"Simulation step when vulnerability was detected","example":52},"affectedResources":{"type":"array","description":"Resources affected by this vulnerability","items":{"type":"string"},"example":["web-1","web-2","web-3"]}}},"ChaosScenario":{"type":"object","properties":{"id":{"type":"string","description":"Unique scenario identifier","example":"zone_failure"},"name":{"type":"string","description":"Human-readable scenario name","example":"Availability Zone Failure"},"description":{"type":"string","description":"What this scenario tests","example":"Tests resilience to complete availability zone failure"},"injections":{"type":"array","description":"Failure injections in this scenario","items":{"$ref":"#/components/schemas/ChaosInjectionConfig"}},"expectedVulnerabilities":{"type":"array","description":"Vulnerabilities this scenario typically detects","items":{"type":"string"},"example":["zone_dependency","insufficient_capacity"]}}},"ChaosJob":{"type":"object","properties":{"id":{"type":"string","format":"uuid","description":"Unique job identifier"},"type":{"type":"string","enum":["chaos_test"],"description":"Job type"},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled","partial_failed"],"description":"Current job status"},"simulationId":{"type":"string","format":"uuid","description":"Base simulation being tested"},"scenarioId":{"type":"string","description":"Scenario used (if applicable)"},"customInjections":{"type":"array","description":"Custom injections (if 
applicable)","items":{"$ref":"#/components/schemas/ChaosInjectionConfig"}},"duration":{"type":"integer","description":"Test duration in steps"},"createdAt":{"type":"string","format":"date-time"},"completedAt":{"type":"string","format":"date-time"},"webhookDeliveryStatus":{"type":"string","enum":["pending","delivered","failed"],"description":"Webhook delivery status","example":"delivered"},"webhookDeliveryAttempts":{"type":"integer","description":"Number of webhook delivery attempts made","example":1},"webhookDeliveryError":{"type":"string","description":"Error message if webhook delivery failed"},"webhookDeliveredAt":{"type":"string","format":"date-time","description":"Timestamp when webhook was successfully delivered"},"error":{"type":"string","description":"Error message if failed"}}},"BatchChaosRequest":{"type":"object","required":["simulationId","scenarios"],"properties":{"simulationId":{"type":"string","format":"uuid","description":"ID of the base simulation to test","example":"sim_abc123"},"scenarios":{"type":"array","description":"Array of chaos test scenarios to execute in parallel","minItems":1,"maxItems":10,"items":{"type":"object","properties":{"scenarioId":{"type":"string","description":"Pre-built scenario ID (optional, mutually exclusive with customInjections)","example":"zone_failure","enum":["zone_failure","database_crash","network_partition","cascading_failure","random_instance_failure","database_slowdown"]},"customInjections":{"type":"array","description":"Custom failure injections (optional, mutually exclusive with scenarioId)","items":{"$ref":"#/components/schemas/ChaosInjectionConfig"}},"duration":{"type":"integer","description":"Test duration in simulation steps","minimum":10,"maximum":300,"default":300,"example":120}}}},"webhookUrl":{"type":"string","format":"uri","description":"Optional HTTPS URL to receive webhook notification when batch 
completes","example":"https://example.com/webhook"},"webhookSecret":{"type":"string","description":"Optional secret for HMAC-SHA256 webhook signature verification","example":"secret123"}}},"BatchChaosJob":{"type":"object","properties":{"id":{"type":"string","format":"uuid","description":"Unique batch job identifier","example":"batch_xyz789"},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled","partial_failed"],"description":"Current batch status. 'partial_failed' indicates some child jobs failed but others succeeded.\n","example":"completed"},"childJobIds":{"type":"array","description":"IDs of all child chaos jobs in this batch","items":{"type":"string","format":"uuid"},"example":["job_1","job_2","job_3"]},"totalJobs":{"type":"integer","description":"Total number of child jobs in this batch","example":3},"completedJobs":{"type":"integer","description":"Number of child jobs that completed successfully","example":2},"failedJobs":{"type":"integer","description":"Number of child jobs that failed","example":1},"cancelledJobs":{"type":"integer","description":"Number of child jobs that were cancelled","example":0},"aggregatedResilienceScore":{"allOf":[{"$ref":"#/components/schemas/ResilienceScore"}],"description":"Aggregated resilience score across all completed child jobs. Only present when status is \"completed\"."},"aggregatedVulnerabilities":{"type":"array","description":"Aggregated vulnerabilities from all child jobs (deduplicated with occurrence counts). 
Only present when status is \"completed\".","items":{"type":"object","properties":{"id":{"type":"string","description":"Vulnerability identifier","example":"zone_dependency"},"severity":{"type":"string","enum":["critical","high","medium","low"],"description":"Severity level","example":"high"},"title":{"type":"string","description":"Vulnerability title","example":"Single Availability Zone Dependency"},"description":{"type":"string","description":"Detailed description","example":"Resources are not distributed across availability zones"},"occurrences":{"type":"integer","description":"Number of child jobs where this vulnerability was detected","example":2}}}},"aggregatedRecommendations":{"type":"array","description":"Aggregated recommendations from all child jobs (deduplicated). Only present when status is \"completed\".","items":{"type":"string"},"example":["Distribute resources across multiple availability zones","Implement database connection pooling","Add circuit breakers for external dependencies"]},"childJobResults":{"type":"array","description":"Summary results for each child job (lightweight version for status endpoint)","items":{"type":"object","properties":{"jobId":{"type":"string","format":"uuid","description":"Child job ID","example":"job_1"},"status":{"type":"string","enum":["pending","running","completed","failed","cancelled"],"description":"Child job status","example":"completed"},"resilienceScore":{"allOf":[{"$ref":"#/components/schemas/ResilienceScore"}],"description":"Resilience score (only present if completed)"},"vulnerabilities":{"type":"array","description":"Vulnerabilities detected (only present if completed)","items":{"type":"string"},"example":["zone_dependency","insufficient_capacity"]},"error":{"type":"string","description":"Error message if failed","example":"Simulation failed to initialize"}}}},"webhookUrl":{"type":"string","format":"uri","description":"Webhook URL for batch completion 
notification"},"webhookDeliveryStatus":{"type":"string","enum":["pending","delivered","failed"],"description":"Webhook delivery status","example":"delivered"},"webhookDeliveryAttempts":{"type":"integer","description":"Number of webhook delivery attempts made","example":1},"webhookDeliveryError":{"type":"string","description":"Error message if webhook delivery failed"},"webhookDeliveredAt":{"type":"string","format":"date-time","description":"Timestamp when webhook was successfully delivered"},"createdAt":{"type":"string","format":"date-time","description":"When the batch was created","example":"2025-11-23T10:00:00Z"},"updatedAt":{"type":"string","format":"date-time","description":"When the batch was last updated","example":"2025-11-23T10:15:00Z"},"completedAt":{"type":"string","format":"date-time","description":"When the batch completed (or failed)","example":"2025-11-23T10:15:00Z"},"cancelledAt":{"type":"string","format":"date-time","description":"When the batch was cancelled"},"error":{"type":"string","description":"Error message if batch failed"}}},"WorkloadProfile":{"type":"object","required":["computeInstances","databaseInstances","storageGB","trafficRPS","latencyRequirementMs","primaryRegion"],"properties":{"computeInstances":{"type":"integer","minimum":1,"description":"Number of compute instances required","example":10},"databaseInstances":{"type":"integer","minimum":1,"description":"Number of database instances required","example":2},"storageGB":{"type":"integer","minimum":1,"description":"Storage capacity in gigabytes","example":500},"trafficRPS":{"type":"integer","minimum":1,"description":"Expected traffic in requests per second","example":5000},"latencyRequirementMs":{"type":"integer","minimum":1,"description":"Maximum acceptable latency in milliseconds","example":100},"primaryRegion":{"type":"string","description":"Primary deployment region","example":"us-east-1"},"secondaryRegions":{"type":"array","description":"Optional secondary regions for 
multi-region deployment","items":{"type":"string"},"example":["eu-west-1","ap-southeast-1"]},"requiresMultiRegion":{"type":"boolean","description":"Whether multi-region deployment is required","default":false},"dataResidencyRequirements":{"type":"array","description":"Data residency constraints (e.g., GDPR regions)","items":{"type":"string"},"example":["eu","us"]}}},"ProviderAllocation":{"type":"object","properties":{"provider":{"type":"string","enum":["aws","gcp","azure","oci","digitalocean"],"description":"Cloud provider name","example":"aws"},"computeInstances":{"type":"integer","description":"Number of compute instances on this provider","example":6},"databaseInstances":{"type":"integer","description":"Number of database instances on this provider","example":1},"storageGB":{"type":"integer","description":"Storage allocated to this provider (GB)","example":300},"trafficPercentage":{"type":"number","description":"Percentage of traffic routed to this provider","example":60},"regions":{"type":"array","description":"Regions used on this provider","items":{"type":"string"},"example":["us-east-1","us-west-2"]}}},"StrategyMetrics":{"type":"object","properties":{"totalCostPerHour":{"type":"number","description":"Total hourly cost across all providers","example":12.5},"avgLatencyMs":{"type":"number","description":"Average latency in milliseconds","example":45.2},"vendorLockInScore":{"type":"number","description":"Vendor lock-in score (0-100, higher = more lock-in)","example":35},"dataPortabilityScore":{"type":"number","description":"Data portability score (0-100, higher = more portable)","example":75},"geographicCoverage":{"type":"number","description":"Geographic coverage score (0-100)","example":85},"compositeScore":{"type":"number","description":"Overall weighted composite score (0-10)","example":8.5}}},"Strategy":{"type":"object","properties":{"id":{"type":"string","description":"Unique strategy 
identifier","example":"strategy-1"},"name":{"type":"string","description":"Strategy name","example":"AWS-Primary Multi-Region"},"description":{"type":"string","description":"Strategy description","example":"AWS-heavy deployment with GCP for data redundancy and reduced vendor lock-in"},"allocations":{"type":"array","description":"Provider allocations in this strategy","items":{"$ref":"#/components/schemas/ProviderAllocation"}},"metrics":{"$ref":"#/components/schemas/StrategyMetrics"},"tradeoffs":{"type":"array","description":"Key tradeoffs of this strategy","items":{"type":"string"},"example":["Higher cost for improved redundancy","Lower vendor lock-in at expense of complexity"]},"recommendations":{"type":"array","description":"Recommendations for this strategy","items":{"type":"string"},"example":["Best for workloads requiring high availability","Consider multi-region replication for databases"]},"suggestedResources":{"type":"array","description":"Canvas resource type IDs that are recommended for this strategy. Consumers can use these identifiers to pre-populate a canvas or present actionable \"Add to canvas\" shortcuts. 
Example values include `oci-waf`, `aws-waf`, `gcp-cloud-armor`, `azure-waf`.\n","items":{"type":"string"},"example":["oci-waf"]}}},"MultiCloudJob":{"type":"object","properties":{"id":{"type":"string","format":"uuid","description":"Unique job identifier","example":"job-abc123"},"workloadProfile":{"$ref":"#/components/schemas/WorkloadProfile"},"optimizationWeights":{"type":"object","description":"Optimization weights used","properties":{"cost":{"type":"number","example":0.4},"latency":{"type":"number","example":0.4},"vendorLockIn":{"type":"number","example":0.2}}},"status":{"type":"string","enum":["pending","running","completed","failed"],"description":"Current job status","example":"completed"},"progress":{"type":"number","description":"Progress percentage (0-100)","example":100},"strategiesGenerated":{"type":"integer","description":"Number of strategies generated","example":15},"topStrategies":{"type":"array","description":"Top-ranked strategies (available when completed)","items":{"$ref":"#/components/schemas/Strategy"}},"comparisonReport":{"type":"string","description":"Detailed comparison report (available when completed)","example":"Multi-Cloud Strategy Analysis Report..."},"createdAt":{"type":"string","format":"date-time"},"updatedAt":{"type":"string","format":"date-time"},"completedAt":{"type":"string","format":"date-time"},"error":{"type":"string","description":"Error message if failed"},"webhookDeliveryStatus":{"type":"string","enum":["pending","delivered","failed"],"description":"Webhook delivery status","example":"delivered"},"webhookDeliveryAttempts":{"type":"integer","description":"Number of webhook delivery attempts made","example":1},"webhookDeliveryError":{"type":"string","description":"Error message if webhook delivery failed"},"webhookDeliveredAt":{"type":"string","format":"date-time","description":"Timestamp when webhook was successfully delivered"}}},"OptimizationWebhookPayload":{"type":"object","description":"Webhook payload sent when an 
optimization job completes","properties":{"event":{"type":"string","enum":["optimization.completed","optimization.failed"],"description":"Event type","example":"optimization.completed"},"jobId":{"type":"string","format":"uuid","description":"Job identifier","example":"abc-123-def"},"status":{"type":"string","enum":["completed","failed"],"description":"Final job status","example":"completed"},"timestamp":{"type":"string","format":"date-time","description":"When the webhook was sent","example":"2025-11-23T10:30:00Z"},"data":{"type":"object","description":"Optimization job results","properties":{"variantsGenerated":{"type":"integer","example":57},"recommendations":{"type":"array","items":{"$ref":"#/components/schemas/OptimizationRecommendation"}},"error":{"type":"string","description":"Error message if status is failed"}}}}},"ChaosWebhookPayload":{"type":"object","description":"Webhook payload sent when a chaos engineering test completes","properties":{"event":{"type":"string","enum":["chaos.completed","chaos.failed"],"description":"Event type","example":"chaos.completed"},"jobId":{"type":"string","format":"uuid","description":"Job identifier"},"status":{"type":"string","enum":["completed","failed"],"description":"Final job status","example":"completed"},"timestamp":{"type":"string","format":"date-time","description":"When the webhook was sent"},"data":{"type":"object","description":"Chaos test results","properties":{"resilienceScore":{"$ref":"#/components/schemas/ResilienceScore"},"vulnerabilities":{"type":"array","items":{"$ref":"#/components/schemas/Vulnerability"}},"error":{"type":"string","description":"Error message if status is failed"}}}}},"PredictionWebhookPayload":{"type":"object","description":"Webhook payload sent when a prediction job (validation or threshold optimization) completes. When `status` is `\"failed\"`, inspect `data.error` to determine the recovery action. 
See the **Handling Failures** section in the API description for a full classification of retryable vs non-retryable error conditions and per-error recovery steps.\n","properties":{"event":{"type":"string","enum":["prediction.validation.completed","prediction.validation.failed","prediction.threshold_optimization.completed","prediction.threshold_optimization.failed"],"description":"Event type","example":"prediction.validation.completed"},"jobId":{"type":"string","format":"uuid","description":"Job identifier"},"status":{"type":"string","enum":["completed","failed"],"description":"Final job status","example":"completed"},"timestamp":{"type":"string","format":"date-time","description":"When the webhook was sent"},"data":{"type":"object","description":"Prediction job results","properties":{"type":{"type":"string","enum":["validation","threshold_optimization"],"description":"Type of prediction job"},"validationResult":{"$ref":"#/components/schemas/ValidationResult"},"bestThresholds":{"$ref":"#/components/schemas/AutoscalingConfig"},"recommendations":{"type":"array","items":{"$ref":"#/components/schemas/PredictionRecommendation"}},"error":{"type":"string","description":"Error message if status is failed"}}}},"examples":{"awsThresholdOptimizationWebhook":{"summary":"AWS — EC2 Auto Scaling threshold optimization completed (Black Friday)","value":{"event":"prediction.threshold_optimization.completed","jobId":"a1b2c3d4-e5f6-7890-abcd-ef1234567890","status":"completed","timestamp":"2025-11-23T10:35:00Z","data":{"type":"threshold_optimization","bestThresholds":{"scaleOutCpuThreshold":70,"scaleInCpuThreshold":30,"scaleOutThroughputThreshold":75,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":120,"cooldownSeconds":180,"minInstances":3,"maxInstances":15},"recommendations":[{"rank":1,"title":"Lower CPU scale-out threshold to 70%","description":"Triggering scale-out at 70% CPU instead of 80% gives 60–90 seconds of lead time before saturation under the Black Friday 
ramp","priority":"high","action":"Set scaleOutCpuThreshold to 70 in the EC2 Auto Scaling policy","expectedImpact":"Reduces peak CPU from 81% to ~68%, drops error rate from 0.4% to <0.1%"},{"rank":2,"title":"Keep scale-in threshold at 30% to avoid flapping","description":"A conservative scale-in threshold prevents the Auto Scaling group from terminating instances too quickly after the Black Friday peak, avoiding a secondary spike during wind-down","priority":"medium","action":"Set scaleInCpuThreshold to 30 in the EC2 Auto Scaling policy","expectedImpact":"Eliminates post-peak scale-in flap; saves one unnecessary scale-out cycle during wind-down"}]}}},"gcpThresholdOptimizationWebhook":{"summary":"GCP — Cloud Run threshold optimization completed (Holiday seasonal burst)","value":{"event":"prediction.threshold_optimization.completed","jobId":"b2c3d4e5-f6a7-8901-bcde-f12345678901","status":"completed","timestamp":"2025-12-15T08:20:00Z","data":{"type":"threshold_optimization","bestThresholds":{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"scaleOutThroughputThreshold":65,"scaleInThroughputThreshold":30,"scaleOutLatencyThreshold":150,"cooldownSeconds":60,"minInstances":3,"maxInstances":25},"recommendations":[{"rank":1,"title":"Set CPU scale-out threshold to 60% to prevent concurrency saturation","description":"Cloud Run saturates when per-instance concurrency fills before CPU-based scale-out fires. 
Triggering at 60% CPU ensures new instances are warm before the holiday ramp overwhelms the active pool","priority":"critical","action":"Configure Cloud Run --cpu-throttling and set Knative autoscaling target annotation to 60","expectedImpact":"Peak error rate drops from 12.1% to <0.5%; p95 latency drops from 310 ms to ~58 ms"},{"rank":2,"title":"Set minimum instances to 3 to eliminate cold-start lag","description":"Keeping 3 instances warm prevents the 45-second scale-out delay at ramp start that causes early-stage errors before the optimizer thresholds can take effect","priority":"high","action":"Set --min-instances=3 on the Cloud Run service revision","expectedImpact":"Removes cold-start lag; threshold optimizer can respond within 5 s instead of 45 s"}]}}},"azureThresholdOptimizationWebhook":{"summary":"Azure — AKS HPA threshold optimization completed (product launch spike)","value":{"event":"prediction.threshold_optimization.completed","jobId":"c3d4e5f6-a7b8-9012-cdef-123456789012","status":"completed","timestamp":"2025-09-10T14:05:00Z","data":{"type":"threshold_optimization","bestThresholds":{"scaleOutCpuThreshold":60,"scaleInCpuThreshold":25,"scaleOutThroughputThreshold":70,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":80,"cooldownSeconds":120,"minInstances":3,"maxInstances":12},"recommendations":[{"rank":1,"title":"Lower HPA CPU target to 60%","description":"Scaling out at 60% CPU instead of 75% gives the AKS node pool an extra 45 seconds of lead time for the product launch spike, eliminating the brief 110 ms latency overshoot at step 21","priority":"high","action":"Update HorizontalPodAutoscaler targetCPUUtilizationPercentage to 60","expectedImpact":"Eliminates p95 latency spike at launch; steady-state p95 drops from 62 ms to 48 ms"},{"rank":2,"title":"Set cooldown to 120 s to prevent HPA thrashing during sustained load","description":"The product launch pattern holds elevated traffic for 50 steps — a 120-second cooldown prevents the HPA from 
oscillating between scale-in and scale-out during the sustained period","priority":"medium","action":"Set HorizontalPodAutoscaler spec.behavior.scaleDown.stabilizationWindowSeconds to 120","expectedImpact":"Eliminates 3 unnecessary scale-in/scale-out cycles during the sustained traffic window"}]}}},"ociThresholdOptimizationWebhook":{"summary":"OCI — VM.Standard.E4.Flex threshold optimization completed (month-end batch)","value":{"event":"prediction.threshold_optimization.completed","jobId":"d4e5f6a7-b8c9-0123-defa-234567890123","status":"completed","timestamp":"2025-10-25T02:00:00Z","data":{"type":"threshold_optimization","bestThresholds":{"scaleOutCpuThreshold":65,"scaleInCpuThreshold":30,"scaleOutThroughputThreshold":70,"scaleInThroughputThreshold":35,"scaleOutLatencyThreshold":200,"cooldownSeconds":300,"minInstances":2,"maxInstances":8},"recommendations":[{"rank":1,"title":"Set scale-out CPU threshold to 65% for the batch window","description":"Month-end batch load ramps gradually over 35 steps — triggering at 65% CPU provides a 2-instance buffer before peak query load hits, keeping ATP connection pool below 60%","priority":"medium","action":"Update OCI Autoscaling policy CPU threshold to 65% for the VM.Standard.E4.Flex instance pool","expectedImpact":"Peak CPU drops from 72% to ~63%; ATP connection pool pressure drops from 74% to ~58%"},{"rank":2,"title":"Use a 300-second cooldown to prevent premature scale-in mid-batch","description":"Month-end batch jobs run for 55 steps — a short cooldown causes the autoscaler to prematurely scale in between reporting sub-jobs, then immediately scale out again","priority":"low","action":"Set OCI Autoscaling policy cooldown period to 300 seconds","expectedImpact":"Eliminates 2 mid-batch scale-in/out cycles; reduces ATP reconnect overhead"}]}}},"digitalOceanThresholdOptimizationWebhook":{"summary":"DigitalOcean — Droplet autoscaling threshold optimization completed (viral traffic 
spike)","value":{"event":"prediction.threshold_optimization.completed","jobId":"e5f6a7b8-c9d0-1234-efab-345678901234","status":"completed","timestamp":"2025-08-03T19:45:00Z","data":{"type":"threshold_optimization","bestThresholds":{"scaleOutCpuThreshold":55,"scaleInCpuThreshold":20,"scaleOutThroughputThreshold":60,"scaleInThroughputThreshold":25,"scaleOutLatencyThreshold":100,"cooldownSeconds":90,"minInstances":6,"maxInstances":20},"recommendations":[{"rank":1,"title":"Lower scale-out CPU threshold to 55% to react before viral saturation","description":"The viral spike reaches full intensity within 15 steps — the default 75% threshold fires too late for DigitalOcean App Platform to provision new Droplets in time. 55% gives a 10-step head start","priority":"critical","action":"Update DigitalOcean App Platform autoscaling CPU threshold to 55%","expectedImpact":"Peak CPU drops from 98% to ~58%; error rate drops from 18% to <0.5%"},{"rank":2,"title":"Set minimum Droplet pool to 6 as a prerequisite for the threshold to take effect","description":"Even the optimized 55% threshold cannot compensate if the starting pool is too small — the 15-step viral ramp outpaces Droplet provisioning speed from a pool of 3","priority":"critical","action":"Set DigitalOcean App Platform min_instance_count to 6","expectedImpact":"Ensures the threshold optimizer has sufficient baseline capacity; CPU peak drops to ~52% when combined with the 55% scale-out trigger"}]}}},"awsThresholdOptimizationFailedInvalidSimulation":{"summary":"AWS — threshold optimization failed (invalid simulation ID)","value":{"event":"prediction.threshold_optimization.failed","jobId":"f6a7b8c9-d0e1-2345-fabc-456789012345","status":"failed","timestamp":"2025-11-23T10:36:42Z","data":{"type":"threshold_optimization","error":"Simulation 'sim_nonexistent_abc123' not found. 
Verify the simulationId references an existing simulation before submitting a threshold optimization job."}}},"awsThresholdOptimizationFailedInsufficientTraffic":{"summary":"AWS — EC2 threshold optimization failed (insufficient traffic data)","value":{"event":"prediction.threshold_optimization.failed","jobId":"a7b8c9d0-e1f2-3456-abcd-567890123456","status":"failed","timestamp":"2025-11-24T14:22:10Z","data":{"type":"threshold_optimization","error":"No valid threshold combinations found for the provided traffic forecast on AWS EC2 Auto Scaling. The forecast contains fewer than 3 distinct traffic levels, which is insufficient to evaluate scale-out and scale-in thresholds independently. Supply a traffic pattern with at least 3 discrete load steps (e.g. low, medium, high) and resubmit."}}},"gcpValidationFailedNoTrafficForecast":{"summary":"GCP — Cloud Run validation failed (no traffic forecast data)","value":{"event":"prediction.validation.failed","jobId":"b8c9d0e1-f2a3-4567-bcde-678901234567","status":"failed","timestamp":"2025-12-15T09:12:33Z","data":{"type":"validation","error":"Validation job for GCP Cloud Run scenario 'sim_gcp_prod_cr_88f2' failed: no traffic forecast data found. A traffic pattern must be associated with the simulation before validation can run. Attach a trafficPatternId to the simulation and resubmit."}}},"awsValidationFailedSimulationNotFound":{"summary":"AWS — EC2 validation failed (simulation not found)","value":{"event":"prediction.validation.failed","jobId":"c9d0e1f2-a3b4-5678-cdef-789012345678","status":"failed","timestamp":"2025-11-25T16:04:51Z","data":{"type":"validation","error":"Simulation 'sim_nonexistent_xyz999' not found. 
Verify the simulationId references an existing simulation before submitting a validation job."}}}}},"MultiCloudWebhookPayload":{"type":"object","description":"Webhook payload sent when a multi-cloud exploration job completes","properties":{"event":{"type":"string","enum":["multicloud.completed","multicloud.failed"],"description":"Event type","example":"multicloud.completed"},"jobId":{"type":"string","format":"uuid","description":"Job identifier"},"status":{"type":"string","enum":["completed","failed"],"description":"Final job status","example":"completed"},"timestamp":{"type":"string","format":"date-time","description":"When the webhook was sent"},"data":{"type":"object","description":"Multi-cloud exploration results","properties":{"strategiesGenerated":{"type":"integer","example":15},"topStrategies":{"type":"array","items":{"$ref":"#/components/schemas/Strategy"}},"comparisonReport":{"type":"string","example":"Multi-Cloud Strategy Analysis Report..."},"error":{"type":"string","description":"Error message if status is failed"}}}}},"RLEpisodeWebhookPayload":{"type":"object","description":"Webhook payload sent when an RL training episode completes","properties":{"event":{"type":"string","enum":["rl_episode.completed"],"description":"Event type","example":"rl_episode.completed"},"environmentId":{"type":"string","format":"uuid","description":"RL environment identifier"},"simulationId":{"type":"string","format":"uuid","description":"Simulation identifier"},"timestamp":{"type":"string","format":"date-time","description":"When the webhook was sent"},"data":{"type":"object","description":"Episode completion data","properties":{"totalSteps":{"type":"integer","description":"Total steps in the episode","example":300},"totalReward":{"type":"number","description":"Cumulative reward achieved","example":145.67},"sim_time_human":{"type":"string","description":"Human-readable simulated elapsed time at episode completion (e.g. 
\"1h 0m\")","example":"1h 0m"},"episodeConfig":{"$ref":"#/components/schemas/EpisodeConfig"},"finalMetrics":{"type":"object","description":"Final system metrics","properties":{"avgCost":{"type":"number"},"avgLatency":{"type":"number"},"avgErrorRate":{"type":"number"}}}}}}},"Error":{"type":"object","required":["error"],"properties":{"error":{"type":"string","description":"Error message","example":"Simulation not found"},"details":{"type":"array","description":"Detailed validation errors (for 400 responses)","items":{"type":"object"}}}}},"responses":{"BadRequest":{"description":"Invalid request data — structured field-level error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ValidationErrorResponse"}}}},"Unauthorized":{"description":"Authentication required or invalid API key","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"NotFound":{"description":"Resource not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"InternalError":{"description":"Internal server error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}},"TooManyRequests":{"description":"Rate limit exceeded — slow down requests or upgrade your API key's rate limit","headers":{"Retry-After":{"description":"Seconds until the rate-limit window resets","schema":{"type":"integer"}}},"content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}},"securitySchemes":{"BearerAuth":{"type":"http","scheme":"bearer","bearerFormat":"API Key","description":"API key authentication using Bearer tokens in the Authorization header.\n\n**How to authenticate:**\n1. Create an API key using POST /api/keys\n2. Include the key in the Authorization header: `Bearer cwm_live_<your_key>`\n3. All RL environment endpoints require authentication\n\n**Rate Limits:**\n- Default: 1000 requests/hour per key\n- Configurable per key\n- Returns 429 when exceeded\n"}}},"security":[]}