Cloud World Model - RL Training API
Version 1.0.0 (OAS 3.1)

OpenAPI document: /api-docs/swagger.json

RESTful API for training reinforcement learning agents on cloud infrastructure autoscaling.

This API enables external AI agents to learn optimal autoscaling policies through trial and error. Agents can create training environments, execute actions, receive rewards, and observe system state.

Authentication:

  • Simulation endpoints (/simulations, /events, /traffic-patterns, /failure-injections): Public, no authentication required
  • RL Environment endpoints (/rl/environments/*): Require API key authentication via Bearer token
  • API Key Management (/keys): Public for demonstration purposes (should be secured in production)

Use Cases:

  • Train agents for cost-efficient autoscaling
  • Test scaling policies before production deployment
  • Optimize multi-cloud resource allocation
  • Simulate months of production traffic in minutes

Episode Lifecycle:

  1. Create a simulation with cloud resources (no auth required)
  2. Generate an API key for RL training (no auth required for demo)
  3. Create an RL environment linked to the simulation (requires API key)
  4. Training loop: observe → select action → step → receive reward (requires API key)
  5. Reset episode when done or max steps reached (requires API key)

Simulation Lifecycle

The simulation API provides a complete lifecycle for creating, evolving, and analyzing virtual cloud environments without touching real infrastructure. Use this flow for load testing, architecture validation, chaos experiments, and generating realistic training data for RL agents.

The snippets below use shell variables; set them once and every subsequent command works end-to-end. Authentication requirements vary by step:

  • Step 1 accepts an optional API key. Omit it for a guest demo simulation; include one to link the simulation to your account and unlock unlimited steps.
  • Step 2 requires auth for owned simulations. Demo (keyless) simulations can step without a key, up to 20 steps.
  • Steps 3–5 and 7–8 require an API key.
  • Step 6 (bottleneck analysis) accepts an optional key.
  • Step 9 (RL training) is an optional advanced branch that forks off after Step 5. Run it before or instead of cleanup (Step 8).

Response samples below are illustrative abbreviations; see each endpoint's schema in this spec for the full payload shape. The shell variable SIM_ID below holds the simulation UUID returned by Step 1; it maps to the {simulationId} path parameter in all subsequent API calls.

export BASE_URL="https://your-app.replit.app"

# One-time bootstrap: mint your first admin key.
# Requires BOOTSTRAP_SECRET to be set as an environment variable on the server.
# Only succeeds when no admin key exists yet — subsequent calls return 409.
# If bootstrap is already consumed, use an existing key or an admin-issued
# registration token (POST /api/keys/register with a token from POST /api/register-tokens).
export API_KEY=$(curl -s -X POST "$BASE_URL/api/keys/bootstrap-admin" \
  -H "Content-Type: application/json" \
  -d '{ "bootstrapSecret": "your-bootstrap-secret" }' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['key'])")

# Canvas Cloud AI users: exchange your CCA token instead:
#   export API_KEY=$(curl -s -X POST "$BASE_URL/api/keys/register" \
#     -H "Content-Type: application/json" \
#     -d '{ "token": "cca_live_..." }' \
#     | python3 -c "import sys,json; print(json.load(sys.stdin)['key'])")

Step 1 — Create a simulation (POST /api/simulations)

Provision a named virtual environment with cloud resources (compute, database, network, storage) and receive a simulationId. Include the resources array in the request body to configure provider-specific settings (instance type, region, autoscaling bounds). Mix AWS, GCP, Azure, OCI, and DigitalOcean resources in a single simulation to model multi-cloud topologies. No authentication required; pass an API key to link the simulation to your account.

# Authenticated (owned) simulation — unlimited steps:
SIM_ID=$(curl -s -X POST "$BASE_URL/api/simulations" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-first-sim",
    "resources": [
      {
        "id": "web-1",
        "name": "Web Server",
        "type": "compute",
        "provider": "aws",
        "characteristics": {
          "instanceType": "t3.medium",
          "region": "us-east-1",
          "minInstances": 1,
          "maxInstances": 5
        }
      }
    ]
  }' | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")
echo "Simulation ID: $SIM_ID"
# Response: { "id": "sim_abc123", "name": "my-first-sim", "resources": [...], ... }

# Guest/demo mode (omit Authorization header) — no key needed, capped at 20 steps:
# SIM_ID=$(curl -s -X POST "$BASE_URL/api/simulations" \
#   -H "Content-Type: application/json" \
#   -d '{ "name": "demo-sim", "resources": [...] }' \
#   | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")

Step 2 — Advance the simulation (POST /api/simulations/{simulationId}/step)

Drive the simulation forward one timestep. The hybrid prediction engine applies registered traffic patterns, evaluates autoscaling rules, calculates CPU utilization / error-rate / throughput metrics, and returns updated resource states. Call this in a loop to model minutes, hours, or months of production traffic in seconds. No request body is needed. Auth is required for owned simulations; demo (keyless) simulations can step without a key (limited to 20 steps).

curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/step" \
  -H "Authorization: Bearer $API_KEY"
# Response: { "simulation": { ... }, "metrics": { "cpuUsage": 42.1, "latencyP95": 180,
#             "errorRate": 0.002, "throughput": 298 }, "events": [] }
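The loop described above could be sketched in Python roughly as follows. This is a minimal sketch using only the standard library, not the project's own client: the helper names step_simulation and run_steps are hypothetical, the metrics key follows the illustrative response sample above, and the post callable is injectable so the loop can be exercised without a live server.

```python
import json
import urllib.request
from typing import Callable, List, Optional

DEMO_STEP_CAP = 20  # keyless demo simulations are limited to 20 steps


def step_simulation(base_url: str, sim_id: str, api_key: Optional[str] = None) -> dict:
    """POST /api/simulations/{simulationId}/step (no request body needed)."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    req = urllib.request.Request(
        f"{base_url}/api/simulations/{sim_id}/step", method="POST", headers=headers
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def run_steps(n: int, post: Callable[[], dict], authenticated: bool = True) -> List[dict]:
    """Advance the simulation up to n times and collect each step's metrics."""
    if not authenticated:
        n = min(n, DEMO_STEP_CAP)  # stay within the demo cap
    history = []
    for _ in range(n):
        result = post()
        history.append(result.get("metrics", {}))
    return history
```

With a live server, the call would look something like run_steps(100, lambda: step_simulation(BASE_URL, SIM_ID, API_KEY)).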

Step 3 — Inspect metrics and events

  • GET /api/simulations/{simulationId}/metrics — time-series performance metrics (CPU utilization, error rate, throughput, latency) indexed by simulation step.
  • GET /api/simulations/{simulationId}/events — the event log: scale-out decisions, failure triggers, cost spikes, and autoscaling threshold crossings.

Use these to validate that the simulation is behaving as expected before running expensive analysis jobs.

# Fetch time-series metrics
curl -s "$BASE_URL/api/simulations/$SIM_ID/metrics" \
  -H "Authorization: Bearer $API_KEY"
# Response: [{ "timestamp": 1, "cpuUsage": 42.1, "latencyP95": 180, "errorRate": 0.002,
#              "throughput": 298 }, ...]

# Fetch the event log
curl -s "$BASE_URL/api/simulations/$SIM_ID/events" \
  -H "Authorization: Bearer $API_KEY"
# Response: [{ "id": "evt_1", "type": "scale_out", "message": "Scaled out to 2 instances",
#              "severity": "info", "timestamp": "2026-05-01T12:00:00Z" }, ...]

Step 4 — Add and activate traffic patterns

  • POST /api/simulations/{simulationId}/patterns — register a named traffic pattern (ramp, burst, step, wave, or custom) that is applied on every subsequent step call. Multiple patterns compose automatically.
  • POST /api/simulations/{simulationId}/inject-traffic — inject an immediate random traffic spike into the simulation, independent of registered patterns.

# Register a ramp-up traffic pattern (applied on every /step call)
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/patterns" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gradual-ramp",
    "type": "ramp",
    "startTime": 0,
    "parameters": { "startTraffic": 100, "endTraffic": 900, "duration": 20 }
  }'
# Response: { "id": "pat_xyz", "name": "gradual-ramp", "type": "ramp", ... }

# Inject an immediate one-off traffic spike
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/inject-traffic" \
  -H "Authorization: Bearer $API_KEY"
# Response: { "simulation": { ... }, "event": { "message": "Traffic spike injected", ... } }

Step 5 — Add and trigger failure injections

  • POST /api/simulations/{simulationId}/failures — register a failure scenario (database crash, zone outage, network partition, CPU stress) against the simulation.
  • POST /api/simulations/{simulationId}/inject-failure — trigger an immediate node failure on a random healthy instance.

Pair with GET /api/simulations/{simulationId}/events to observe how the simulation detects, reacts to, and recovers from each failure.

# Register a zone-outage failure scenario
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/failures" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "us-east-1a outage",
    "type": "az_outage",
    "targetResourceId": "web-1",
    "severity": "severe",
    "startTime": 0,
    "parameters": { "errorRateIncrease": 0.4 }
  }'
# Response: { "id": "fail_abc", "name": "us-east-1a outage", "type": "az_outage", ... }

# Trigger a node failure on a random healthy instance
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/inject-failure" \
  -H "Authorization: Bearer $API_KEY"
# Response: { "simulation": { ... }, "event": { "message": "Node failure injected", ... } }

Step 6 — Analyze bottlenecks (POST /api/simulations/{simulationId}/analyze-bottlenecks)

Run an AI-backed bottleneck analysis over the current simulation state. The engine identifies saturated resources, latency hotspots, and single points of failure, and returns natural-language recommendations. Pass beginnerMode: true to receive simplified explanations suitable for developers who are new to cloud architecture. Authentication is optional on this endpoint (the API key is accepted but not required).

curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/analyze-bottlenecks" \
  -H "Content-Type: application/json" \
  -d '{ "beginnerMode": false }'
# Response: { "analysis": "web-1 is running at 91% CPU. Consider scaling out to 3
#             instances or upgrading to c5.large before traffic doubles.",
#             "doRecommendation": "Add a second c5.large instance to distribute load." }

Step 7 — Optimize the architecture (POST /api/analysis/optimize)

Submit an asynchronous optimization job. The engine evaluates cost, performance, and reliability trade-offs and returns ranked recommendations with projected savings and risk scores. Provide a webhookUrl to receive the result asynchronously instead of polling GET /api/analysis/jobs/{jobId}.

JOB_ID=$(curl -s -X POST "$BASE_URL/api/analysis/optimize" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "simulationId": "'"$SIM_ID"'",
    "goals": {
      "primary": "balance",
      "weights": { "cost": 0.4, "performance": 0.4, "stability": 0.2 }
    },
    "testScenario": {
      "traffic_pattern": "spike",
      "duration_steps": 10,
      "include_failures": false
    }
  }' | python3 -c "import sys,json; print(json.load(sys.stdin)['job']['id'])")
echo "Optimization job: $JOB_ID"
# Poll for status:
curl -s "$BASE_URL/api/analysis/jobs/$JOB_ID" \
  -H "Authorization: Bearer $API_KEY"
# Response: { "id": "opt_xyz789", "status": "completed", "variantsGenerated": 47,
#             "variantsCompleted": 47 }
# Fetch recommendations once status === "completed":
curl -s "$BASE_URL/api/analysis/jobs/$JOB_ID/recommendations" \
  -H "Authorization: Bearer $API_KEY"
# Response: { "recommendations": [{ "rank": 1, "name": "Serverless First",
#             "costSavingsPercent": 18, "score": 0.87 }], "totalVariants": 47 }
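For agents that poll rather than use a webhook, the wait between submitting the job and fetching recommendations might be wrapped like this. It is a sketch under assumptions: wait_for_job is a hypothetical helper, the polling interval and poll limit are arbitrary, and the terminal statuses "completed" and "failed" are taken from the webhook section of this document. fetch_status stands in for a GET on /api/analysis/jobs/{jobId}.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,   # assumed polling interval
    max_polls: int = 60,       # assumed upper bound on polls
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports "completed" or "failed", or give up."""
    for attempt in range(max_polls):
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        if attempt < max_polls - 1:
            sleep(interval_s)  # no need to sleep after the final poll
    raise TimeoutError(f"job still not finished after {max_polls} polls")
```

The sleep parameter is injectable only so the helper can be tested without real delays.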

Step 8 — Clean up (DELETE /api/simulations/{simulationId})

Delete the simulation and all its associated resources when the experiment is complete.

curl -s -o /dev/null -w "%{http_code}" -X DELETE "$BASE_URL/api/simulations/$SIM_ID" \
  -H "Authorization: Bearer $API_KEY"
# Response: HTTP 204 No Content (empty body)

Step 9 — RL Training (optional branch — fork here after Step 5, before or instead of Step 8)

Once the simulation is populated and behaving realistically (steps 1–5 above), you can attach an RL environment to it and start training your agent. Run this before Step 8 (cleanup) or skip it entirely if you only need the analysis features. Requires an API key (see preamble).

  1. POST /api/rl/environments — create an RL environment linked to the simulation
  2. POST /api/rl/environments/{environmentId}/step — execute actions and observe rewards in a loop
  3. POST /api/rl/environments/{environmentId}/reset — begin a new episode when the current one ends

This lets you pre-warm a simulation with a realistic traffic baseline before starting RL training, so your agent begins from a meaningful initial state rather than an empty environment.

# Prerequisites: API_KEY and SIM_ID are already set (see preamble + Step 1 above).

# 1. Create an RL environment linked to the simulation
ENV_ID=$(curl -s -X POST "$BASE_URL/api/rl/environments" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "simulationId": "'"$SIM_ID"'",
    "episodeConfig": {
      "maxSteps": 100,
      "initialTraffic": 1000,
      "targetSLA": { "maxLatencyP95": 200, "maxErrorRate": 1 },
      "enableFailures": false
    }
  }' | python3 -c "import sys,json; print(json.load(sys.stdin)['environment']['id'])")
echo "RL Environment: $ENV_ID"
# Response: { "environment": { "id": "env_abc", "simulationId": "...", ... },
#             "observation": { "metrics": {...}, "resources": [...], "traffic": 1000 } }

# 2. Training loop — execute an action and receive next obs + reward
curl -s -X POST "$BASE_URL/api/rl/environments/$ENV_ID/step" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "action": { "type": "scale_out", "parameters": {} } }'
# Response: { "t": 1, "obs": { "rps": 1000, "cpu_util": 0.45, "instances": 2, ... },
#             "metrics": { "cost_usd_hr": 0.38, "latency_p95": 112, "error_rate": 0.003, ... },
#             "reward": 0.72, "reward_components": { "performance": 0.8, ... },
#             "done": false, "info": {} }

# 3. Reset to start a new episode when done=true
curl -s -X POST "$BASE_URL/api/rl/environments/$ENV_ID/reset" \
  -H "Authorization: Bearer $API_KEY"
# Response: { "environment": { "currentStep": 0, ... }, "observation": { "metrics": {...} } }

Python training-loop example

A self-contained Python script that wires all three steps above into a runnable multi-episode training loop is available at examples/rl_training_loop.py. It uses only the Python standard library (no third-party packages) and demonstrates how to read obs["cpu_util"], metrics["latency_p95"], reward, and done from each step response, reset between episodes, and print per-episode reward totals:

# Run against the local dev server (--token must have admin scope):
python examples/rl_training_loop.py --token $ADMIN_KEY

# Already have a write-scoped key? Skip key minting with --skip-mint:
python examples/rl_training_loop.py --token $API_KEY --skip-mint

# Run against a deployed instance with custom episode count:
python examples/rl_training_loop.py \
  --base-url https://your-deployment.replit.app \
  --token $ADMIN_KEY \
  --episodes 5 \
  --steps 50

JavaScript/Node.js training-loop example

A parallel Node.js script is available at examples/rl_training_loop.js for JS-first developers. It mirrors the Python script exactly — same episode flow, same action set, same printed output — and uses only Node.js built-ins (node:https / node:http), so no npm install is required:

# Run against the local dev server (--token must have admin scope):
node examples/rl_training_loop.js --token $ADMIN_KEY

# Already have a write-scoped key? Skip key minting with --skip-mint:
node examples/rl_training_loop.js --token $API_KEY --skip-mint

# Run against a deployed instance with custom episode count:
node examples/rl_training_loop.js \
  --base-url https://your-deployment.replit.app \
  --token $ADMIN_KEY \
  --episodes 5 \
  --steps 50

Fetch API training-loop example (Node.js 18+ / Deno / Bun)

A self-contained counterpart to the Node.js script above is available at examples/rl_training_loop_fetch.mjs. It mirrors the Node.js script exactly — same episode flow, same action set, same printed output — but uses the standard Fetch API throughout instead of node:https built-ins, so it runs unchanged on Node.js 18+, Deno, and Bun with no npm install required:

# Node.js 18+ (run against the local dev server):
node examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY

# Already have a write-scoped key? Skip key minting with --skip-mint:
node examples/rl_training_loop_fetch.mjs --token $API_KEY --skip-mint

# Deno (requires --allow-net):
deno run --allow-net examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY

# Bun:
bun examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY

# Run against a deployed instance with custom episode count:
node examples/rl_training_loop_fetch.mjs \
  --base-url https://your-deployment.replit.app \
  --token $ADMIN_KEY \
  --episodes 5 \
  --steps 50

If you only need a minimal inline snippet (e.g. to embed in a browser script or REPL), here is a compact fetch-based step loop:

// fetch-based RL step loop — works in browser, Deno, Bun, Node.js 18+
// (wrapped in an async IIFE so no top-level await is required)
const BASE_URL = "https://your-deployment.replit.app"; // or http://localhost:5000
const API_KEY  = "your-write-scoped-api-key";
const ENV_ID   = "your-environment-id"; // from POST /api/rl/environments

async function rlStep(action) {
  const res = await fetch(`${BASE_URL}/api/rl/environments/${ENV_ID}/step`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ action }),
  });
  if (!res.ok) throw new Error(`Step failed: ${res.status} ${await res.text()}`);
  return res.json();
}

(async () => {
  for (let i = 0; i < 20; i++) {
    // The step response exposes the observation under "obs"; the action is an object.
    const { obs, metrics, reward, done } = await rlStep({ type: "scale_out", parameters: {} });
    console.log(
      `step ${i + 1} | cpu=${(obs.cpu_util * 100).toFixed(1)}%` +
      ` | p95=${metrics.latency_p95}ms | reward=${reward.toFixed(3)}`
    );
    if (done) { console.log("Episode finished."); break; }
  }
})();

TypeScript training-loop example

A typed counterpart is available at examples/rl_training_loop.ts for TypeScript projects. It imports the generated SDK types from sdk/typescript/src/openapi-types.ts so every request body and step response is fully typed — enabling autocomplete and compile-time safety. Run it with npx tsx (no separate compile step needed):

# Run against the local dev server (--token must have admin scope):
npx tsx examples/rl_training_loop.ts --token $ADMIN_KEY

# Already have a write-scoped key? Skip key minting with --skip-mint:
npx tsx examples/rl_training_loop.ts --token $API_KEY --skip-mint

# Run against a deployed instance with custom episode count:
npx tsx examples/rl_training_loop.ts \
  --base-url https://your-deployment.replit.app \
  --token $ADMIN_KEY \
  --episodes 5 \
  --steps 50

Webhook Notifications

The Cloud World Model API supports webhook notifications for asynchronous job completion events. Instead of polling job status endpoints, you can provide a webhook URL when creating jobs, and the API will send an HTTP POST request to your endpoint when the job completes.

Supported Jobs:

  • Infrastructure Optimization jobs (POST /api/analysis/optimize)
  • Chaos Engineering tests (POST /api/chaos/run)
  • Batch Chaos Engineering tests (POST /api/chaos/batch)
  • Predictive Scaling validation (POST /api/predictions/validate)
  • Predictive Scaling threshold optimization (POST /api/predictions/optimize-thresholds)
  • Multi-Cloud Strategy exploration (POST /api/multi-cloud/explore)
  • RL Environment episode completion (POST /api/rl/environments)

How to Use Webhooks:

When creating a job, include two optional fields in your request:

  • webhookUrl (string): HTTPS URL where the webhook should be delivered
  • webhookSecret (string): Secret used to sign the webhook payload (for verification)

Webhook Delivery Mechanism:

  • Asynchronous: Webhooks are sent asynchronously when the job completes (status: completed or failed)
  • Fire-and-forget: The API does not wait for your webhook endpoint to respond before marking the job complete
  • Retry Logic: Up to 3 delivery attempts with exponential backoff (0 s, 2 s, 8 s)
  • Timeout: Each delivery attempt has a 10-second timeout
  • HTTPS Only: Webhook URLs must use HTTPS (HTTP URLs are rejected for security)
  • SSRF Protection: Private IP addresses and localhost are blocked to prevent server-side request forgery
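Because HTTP URLs, private IP addresses, and localhost are rejected, it can be useful to pre-check a webhookUrl on the client before submitting a job. The sketch below is only an approximation of those rules: the server's exact checks are not specified here, looks_deliverable is a hypothetical name, and a DNS name that resolves to a private address would pass this check yet could still be rejected by the API.

```python
import ipaddress
from urllib.parse import urlparse


def looks_deliverable(webhook_url: str) -> bool:
    """Client-side approximation of the HTTPS-only and SSRF rules."""
    parsed = urlparse(webhook_url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a DNS name, which cannot be judged without resolving it
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```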

Webhook Payload:

The webhook payload is a JSON object containing:

  • event: Event type (e.g., "optimization.completed", "chaos.completed", "rl_episode.completed")
  • jobId: Unique identifier for the job
  • status: Final status ("completed" or "failed")
  • data: Job-specific result data (structure varies by job type)
  • timestamp: ISO 8601 timestamp when the webhook was sent

Security - Signature Verification:

All webhooks include an X-Webhook-Signature header containing an HMAC-SHA256 signature. You should verify this signature to ensure the webhook came from the Cloud World Model API:

  1. Extract the raw request body as bytes
  2. Compute HMAC-SHA256 using your webhookSecret as the key and the raw body as the message
  3. Compare the computed signature with the X-Webhook-Signature header value
  4. Use constant-time comparison to prevent timing attacks

Example verification (Python):

import hmac
import hashlib

def verify_webhook_signature(payload: bytes, signature: str, secret: str) -> bool:
    expected_signature = hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(signature, expected_signature)

Webhook Delivery Status:

All job response objects include webhook delivery tracking fields:

  • webhookDeliveryStatus: "pending", "delivered", or "failed"
  • webhookDeliveryAttempts: Number of delivery attempts made
  • webhookDeliveryError: Error message if delivery failed (e.g., timeout, connection refused)
  • webhookDeliveredAt: ISO 8601 timestamp of successful delivery

Best Practices:

  • Use a unique webhookSecret for each job or use a rotating secret system
  • Always verify the webhook signature before processing the payload
  • Return a 2xx status code from your webhook endpoint to acknowledge receipt
  • Process webhooks asynchronously to avoid blocking the delivery request
  • Store webhook payloads for debugging and audit trails
  • Implement idempotency using the jobId (webhooks may be delivered multiple times)
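Several of these practices (verify first, acknowledge with 2xx, defer processing, deduplicate on jobId) can be combined in one handler. The following is a minimal sketch, assuming the signature is the hex digest shown in the verification example above. handle_webhook is a hypothetical name, and the in-memory set and list stand in for the durable store and work queue a real service would use.

```python
import hashlib
import hmac
import json
from typing import List, Set


def handle_webhook(raw_body: bytes, signature: str, secret: str,
                   seen_job_ids: Set[str], queue: List[dict]) -> int:
    """Return the HTTP status code the endpoint should respond with."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes, done before the payload is parsed.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return 401
    payload = json.loads(raw_body)
    job_id = payload["jobId"]
    if job_id in seen_job_ids:
        return 200  # duplicate delivery: acknowledge and do nothing else
    seen_job_ids.add(job_id)
    queue.append(payload)  # hand off for asynchronous processing
    return 200
```

Returning quickly and leaving the heavy work to a queue consumer keeps the handler well inside the 10-second delivery timeout.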

Handling Failures

When a webhook arrives with "status": "failed", the data.error field contains a human-readable message that tells you exactly why the job could not complete. Agent recovery logic should inspect this message and classify the failure before deciding whether to retry the same request, fix the input and resubmit, or escalate.

Two Categories of Job Failure

Category | When to use | Agent action
Invalid input | The error message describes a specific problem with the request parameters (missing resource, malformed data, etc.) | Fix the input; do not retry the same request
Transient / engine error | The error message mentions an internal error, unexpected computation result, or does not identify a user-correctable input problem | Wait, then retry the same request up to 3 times with exponential backoff (2 s, 8 s, 30 s)

A useful rule of thumb: if the error message ends with "resubmit" after listing corrective steps, it is an invalid-input failure. If the message does not provide corrective steps or mentions internal state, treat it as transient and retry.

Error Types and Recovery Steps

1. Simulation not found

Example error text: "Simulation 'sim_abc123' not found…"

  • Category: Invalid input (non-retryable as-is)
  • Cause: The simulationId supplied when creating the job references a simulation that no longer exists (deleted between job submission and execution) or was never created.
  • Recovery:
    1. Call POST /api/simulations to create a new simulation with the same resource configuration.
    2. Re-submit the job using the new simulationId.
  • Do not retry the original job request; it will fail again with the same error.

2. Simulation contains no resources

Example error text: "Simulation '…' contains no resources and cannot be validated…"

  • Category: Invalid input (non-retryable as-is)
  • Cause: The referenced simulation exists but has no compute or database resources attached.
  • Recovery:
    1. Add at least one compute resource and one database resource to the simulation via POST /api/simulations/{simulationId}/resources.
    2. Re-submit the job.

3. simulationId belongs to a different API key scope

Example error text: "simulationId '…' does not exist or belongs to a different API key scope…"

  • Category: Invalid input / authorization (non-retryable as-is)
  • Cause: The API key used to submit the job does not have read access to the simulation referenced by simulationId.
  • Recovery:
    1. Verify that you are using the correct API key for the target simulation.
    2. If multiple keys are in use, ensure the key used to create the simulation is the same key (or a key with the same scope) used to submit the job.
    3. Re-submit with the correct key.

4. Traffic forecast malformed

Example error text: "Traffic forecast '…' is malformed: timestamps are not strictly increasing…"

  • Category: Invalid input (non-retryable as-is)
  • Cause: The traffic forecast data provided in the request is structurally invalid (e.g., non-monotonic timestamps, missing fields, duplicate step numbers).
  • Recovery:
    1. Inspect the forecast array and sort steps so timestamps are strictly increasing.
    2. Remove any duplicate step entries.
    3. Re-submit the job with the corrected forecast.

5. Traffic forecast has insufficient data points

Example error text: "…contains only N data points spanning M simulation steps…requires at least 5 data points covering a minimum of 60 steps…"

  • Category: Invalid input (non-retryable as-is)
  • Cause: The traffic forecast is too short for the engine to evaluate scale-out and scale-in thresholds across a complete ramp-and-drain traffic cycle.
  • Recovery:
    1. Extend the forecast to at least 5 distinct load-level steps covering a minimum of 60 simulation steps.
    2. Make sure the forecast includes a clear ramp-up phase, a sustained peak, and a ramp-down (drain) phase.
    3. Re-submit the job.
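The two forecast errors above (malformed and too short) can be caught locally before a job is submitted. The sketch below assumes a forecast shape that this document does not define: a list of objects with an integer step field. It also assumes the span is measured as last step minus first step. The thresholds (5 points, 60 steps) come from the error text; forecast_problems is a hypothetical helper.

```python
from typing import Dict, List

MIN_POINTS = 5        # "at least 5 data points"
MIN_SPAN_STEPS = 60   # "a minimum of 60 steps"


def forecast_problems(forecast: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    steps = [point["step"] for point in forecast]  # "step" is an assumed field name
    # Duplicates and out-of-order entries both violate strict ordering.
    if any(b <= a for a, b in zip(steps, steps[1:])):
        problems.append("steps are not strictly increasing")
    if len(steps) < MIN_POINTS:
        problems.append(f"only {len(steps)} data points; need at least {MIN_POINTS}")
    span = (max(steps) - min(steps)) if steps else 0
    if span < MIN_SPAN_STEPS:
        problems.append(f"spans {span} steps; need at least {MIN_SPAN_STEPS}")
    return problems
```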

6. No valid threshold combination found

Example error text: "No valid threshold combination found…all N candidate combinations…produced peak error rates above the SLA limit…"

  • Category: Infrastructure constraint (non-retryable without parameter changes)
  • Cause: Every threshold combination the optimizer tested exceeded the SLA error-rate limit for the given traffic pattern. This means the current infrastructure configuration (instance sizes, instance counts, or both) cannot handle the forecast load regardless of autoscaling thresholds.
  • Recovery (choose one or more):
    • Increase maxInstances in the simulation's autoscaling config so the optimizer has more headroom to test higher-scale configurations.
    • Raise the minimum instance count (minInstances) so the pool can absorb the initial traffic burst before autoscaling adds capacity.
    • Upgrade the node or instance SKU to a larger size in the simulation resource definition.
    • If the spike is extremely sudden (viral traffic), increase minInstances first since autoscaling provisioning time may exceed the ramp duration.
    • After making any of the above changes, re-submit the optimization job.
  • Do not retry without changing the infrastructure parameters; the optimizer will produce the same result.

7. Validation engine internal error

Example error text: "Validation engine encountered an internal error…capacity model returned a negative throughput value at step N…"

  • Category: Likely invalid input (resource misconfiguration), occasionally transient
  • Cause: The simulation engine encountered an inconsistency it could not recover from. This is usually caused by a resource configuration that produces a logically impossible state (e.g., zero or negative instance counts, throughput capacity below zero).
  • Recovery:
    1. Check that all resources in the simulation have positive, non-zero values for instance counts, vCPU allocations, and memory.
    2. Verify that replica counts and node pool sizes are set correctly.
    3. Re-submit the job.
    4. If the error persists after verifying the configuration, treat it as transient and retry up to 3 times total with exponential backoff (2 s, 8 s, 30 s).
    5. If all retries fail, escalate by recording the jobId and full error payload for support investigation.
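The retry-then-escalate sequence in steps 4 and 5 might look like the sketch below. retry_job is a hypothetical helper, and submit_and_wait stands in for whatever re-submits the job and waits for its final status. The delays are the 2 s, 8 s, 30 s schedule given above.

```python
import time
from typing import Callable, Sequence

BACKOFF_S = (2, 8, 30)  # wait before each of the up-to-3 retries


def retry_job(
    submit_and_wait: Callable[[], dict],
    delays: Sequence[float] = BACKOFF_S,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Re-submit after each delay; return the first job that did not fail.

    If every retry fails, the last failed job is returned so the caller
    can record its jobId and error payload for escalation.
    """
    job = {"status": "failed"}
    for delay in delays:
        sleep(delay)
        job = submit_and_wait()
        if job.get("status") != "failed":
            return job
    return job
```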

Webhook Delivery Failures vs. Job Failures

Job failures (described above) are different from webhook delivery failures. The platform retries webhook delivery up to 3 times with exponential backoff (0 s, 2 s, 8 s). If all delivery attempts fail, the job response object reflects this:

  • webhookDeliveryStatus: "failed"
  • webhookDeliveryAttempts: 3
  • webhookDeliveryError: description of the network error (e.g., "connection refused", "timeout")

In this case the job itself may have completed successfully; only the notification failed to reach your endpoint. Agent recovery steps:

  1. Poll the job status endpoint (e.g., GET /api/predictions/optimize-thresholds/{jobId}) to retrieve the final result directly.
  2. Inspect status in the polled response:
    • "completed" → process the result as you would a successful webhook payload.
    • "failed" → apply the job-failure recovery steps above.
  3. Fix your webhook endpoint (connectivity, TLS certificate, response code) so future deliveries succeed.

The jobId included in every webhook payload is stable and idempotent — you can safely poll the same jobId multiple times without triggering side effects.

Retryable vs. Non-Retryable Quick Reference

Error pattern in data.error | Category | Retry same request?
"…not found…" (simulation or resource) | Invalid input | No; fix simulationId first
"…no resources…" | Invalid input | No; add resources first
"…different API key scope…" | Authorization | No; fix key/scope first
"…malformed…" or "…not strictly increasing…" | Invalid input | No; fix forecast first
"…insufficient data points…" or "…too short…" | Invalid input | No; extend forecast first
"No valid threshold combination found…" | Infrastructure constraint | No; change infra params first
"…internal error…" or unexpected computation message | Transient | Yes; retry up to 3× with exponential backoff (2 s, 8 s, 30 s)
Any other unrecognized error | Unknown | Retry once; escalate if it recurs
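The quick reference can be turned into a small lookup that an agent runs on data.error. This is a sketch only: it matches the substrings listed in the table case-insensitively, real error messages may be worded differently, and the names classify_error and RULES are hypothetical. It returns the category together with how many retries of the same request the table allows.

```python
from typing import Tuple

# (substrings to look for, category, retries of the same request allowed)
RULES = [
    (("not found",), "invalid_input", 0),
    (("no resources",), "invalid_input", 0),
    (("different api key scope",), "authorization", 0),
    (("malformed", "not strictly increasing"), "invalid_input", 0),
    (("insufficient data points", "too short"), "invalid_input", 0),
    (("no valid threshold combination found",), "infrastructure_constraint", 0),
    (("internal error",), "transient", 3),
]


def classify_error(message: str) -> Tuple[str, int]:
    """Map a data.error message to (category, max retries of the same request)."""
    text = message.lower()
    for needles, category, retries in RULES:
        if any(needle in text for needle in needles):
            return category, retries
    return "unknown", 1  # retry once, escalate if it recurs
```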

RL Environments

Manage reinforcement learning training environments.

This walkthrough shows the complete episode lifecycle for a DigitalOcean simulation: create → reset → step → observe. DigitalOcean simulations use Droplet-based compute and Managed Database resources. The hybrid prediction engine automatically models provider-specific behaviour — Droplet cold-start overhead (~30 s provisioning latency on first request after a scale-out) and shared-tenant network jitter — so your agent learns realistic scaling dynamics without spending real cloud budget.

Step 1 — Create a DigitalOcean simulation (no auth required)

POST /simulations with resources that include at least one Droplet (provider: "digitalocean", type: "compute") and, optionally, a Managed PostgreSQL node and a Load Balancer. Record the returned id — this is your simulationId.

Step 2 — Mint an API key (no auth required for demo)

POST /keys → copy the key field from the response.

Step 3 — Create the RL environment (Bearer auth required)

POST /rl/environments
Authorization: Bearer <your-key>
Content-Type: application/json

{
  "simulationId": "<id-from-step-1>",
  "episodeConfig": {
    "maxSteps": 200,
    "targetTrafficPattern": "wave",
    "initialTraffic": 1500,
    "targetSLA": { "maxLatencyP95": 180, "maxErrorRate": 1.0 },
    "costBudgetPerHour": 3.50
  }
}

The response contains the environment id and the initial observation — your agent's first view of the Droplet cluster state.
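
As a rough illustration of Step 3, the request can be assembled in Python before handing it to any HTTP client. This is a sketch under stated assumptions: `BASE_URL` is a placeholder host, and `build_create_environment_request` is a hypothetical helper, not something the API provides. The path, headers, and body fields are taken from the request shown above.

```python
import json

BASE_URL = "https://example.invalid"  # placeholder, replace with your server URL

# Episode settings copied from the DigitalOcean request in Step 3.
DIGITALOCEAN_EPISODE_CONFIG = {
    "maxSteps": 200,
    "targetTrafficPattern": "wave",
    "initialTraffic": 1500,
    "targetSLA": {"maxLatencyP95": 180, "maxErrorRate": 1.0},
    "costBudgetPerHour": 3.50,
}


def build_create_environment_request(simulation_id, api_key, episode_config):
    """Return a plain dict describing the POST /rl/environments call."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/rl/environments",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps(
            {"simulationId": simulation_id, "episodeConfig": episode_config}
        ),
    }
```

Keeping request construction separate from transport makes it easy to log or unit-test the payload without contacting the server.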

Step 4 — Training loop (Bearer auth required)

repeat until done == true:
  POST /rl/environments/{environmentId}/step
  { "action": { "type": "adjust_threshold",
                "parameters": { "cpuThreshold": 65, "throughputThreshold": 70 } } }
  ← { t, obs, metrics, reward, reward_components, done, info }

Step 5 — Reset for the next episode

POST /rl/environments/{environmentId}/reset
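
Steps 4 and 5 can be sketched as a single episode runner. This is a minimal sketch, assuming a caller-supplied `post(path, body)` function that sends an authenticated JSON request and returns the decoded response. `run_episode` and `fixed_threshold_action` are hypothetical names, and sending an empty body to the reset endpoint is an assumption. The `obs`, `reward`, and `done` fields come from the step response shown above.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def fixed_threshold_action(_obs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Trivial policy: always send the thresholds used in the walkthrough."""
    return {
        "type": "adjust_threshold",
        "parameters": {"cpuThreshold": 65, "throughputThreshold": 70},
    }


def run_episode(
    post: Post,
    environment_id: str,
    choose_action: Callable[[Optional[Dict[str, Any]]], Dict[str, Any]],
    max_steps: int = 200,
    initial_obs: Optional[Dict[str, Any]] = None,
) -> Tuple[float, int]:
    """Step until done or max_steps, then reset. Returns (total reward, steps)."""
    obs = initial_obs
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        result = post(
            f"/rl/environments/{environment_id}/step",
            {"action": choose_action(obs)},
        )
        total_reward += result["reward"]
        steps += 1
        obs = result["obs"]
        if result["done"]:
            break
    # Step 5: reset so the environment is ready for the next episode.
    post(f"/rl/environments/{environment_id}/reset", {})
    return total_reward, steps
```

Injecting `post` keeps the loop independent of any particular HTTP library and lets a learning agent replace `fixed_threshold_action` with its own policy.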

DigitalOcean-specific notes:

  • costPerHour in observations reflects Droplet + Managed Database pricing from the nyc3 region benchmark data.
  • action.type: "scale_out" provisions a new Droplet replica; the first observation after scaling models the ~30 s cold-start latency overhead automatically.
  • action.type: "adjust_threshold" tunes CPU/throughput triggers on the DigitalOcean autoscaling profile (default: CPU-weighted scoring, 180 s cooldown).
  • Droplet s-2vcpu-4gb (maxThroughput: 1800 req/s) is the recommended starting instance type for moderate workloads; upgrade to c-4 CPU-Optimized when your agent consistently saturates CPU.

This walkthrough shows the complete episode lifecycle for a GCP simulation: create → reset → step → observe. GCP simulations use GCE compute instances and Cloud SQL for managed databases in the us-central1 region. The hybrid prediction engine models GCP-specific behaviour including managed instance group warm-up latency (~45 s for the first scale-out) and Cloud SQL connection pooling characteristics.

Step 1 — Create a GCP simulation (no auth required)

POST /simulations with resources using provider: "gcp". Include at least one GCE instance (serviceFamily: "gce") and optionally a Cloud SQL node and Cloud Load Balancing frontend. Record the returned id — this is your simulationId.

Step 2 — Mint an API key (no auth required for demo)

POST /keys → copy the key field from the response.

Step 3 — Create the RL environment (Bearer auth required)

POST /rl/environments
Authorization: Bearer <your-key>
Content-Type: application/json

{
  "simulationId": "<id-from-step-1>",
  "episodeConfig": {
    "maxSteps": 150,
    "targetTrafficPattern": "ramp",
    "initialTraffic": 4000,
    "targetSLA": { "maxLatencyP95": 180, "maxErrorRate": 1.0 },
    "costBudgetPerHour": 6.0
  }
}

Step 4 — Training loop (Bearer auth required)

repeat until done == true:
  POST /rl/environments/{environmentId}/step
  { "action": { "type": "adjust_threshold",
                "parameters": { "cpuThreshold": 68, "throughputThreshold": 72 } } }
  ← { t, obs, metrics, reward, reward_components, done, info }

Step 5 — Reset for the next episode

POST /rl/environments/{environmentId}/reset

GCP-specific notes:

  • costPerHour in observations reflects GCE e2-standard-4 + Cloud SQL db-standard-4 pricing from the us-central1 region benchmark.
  • action.type: "scale_out" provisions a new GCE instance; managed instance group warm-up adds ~45 s latency overhead to the first observation after scaling.
  • action.type: "adjust_threshold" tunes the GCP autoscaling profile (default: CPU-weighted scoring, 120 s cooldown on Compute Engine autoscaler).
  • Upgrade GCE instances from e2-standard-4 to n2-standard-8 in the resource definition when your agent consistently saturates CPU.
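
The notes do not define what "consistently saturates CPU" means. One possible reading is sketched below: treat the instance type as saturated when most of the recent CPU readings sit at or above a limit. Everything here is an assumption for illustration, including the `SaturationMonitor` name, the 90% limit, the 20-step window, and the 80% fraction. The caller is also assumed to extract a CPU percentage from each step response, since the field name is not given in this section.

```python
from collections import deque


class SaturationMonitor:
    """Flags sustained CPU saturation over a sliding window of readings."""

    def __init__(self, window: int = 20, cpu_limit: float = 90.0,
                 min_fraction: float = 0.8) -> None:
        self.cpu_limit = cpu_limit
        self.min_fraction = min_fraction
        # Each entry records whether one reading was at or above the limit.
        self.samples = deque(maxlen=window)

    def record(self, cpu_percent: float) -> bool:
        """Add a reading; return True once the window is full and mostly saturated."""
        self.samples.append(cpu_percent >= self.cpu_limit)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        return sum(self.samples) / len(self.samples) >= self.min_fraction
```

When `record` starts returning True, that would be the cue to try a larger instance type in the resource definition and compare episode rewards.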

This walkthrough shows the complete episode lifecycle for an Azure simulation: create → reset → step → observe. Azure simulations use Azure VM compute and Azure SQL Database in the East US region. The hybrid prediction engine models Azure-specific behaviour including VM Scale Set provisioning latency (~60 s for the first scale-out) and Azure SQL DTU burst characteristics.

Step 1 — Create an Azure simulation (no auth required)

POST /simulations with resources using provider: "azure". Include at least one Azure VM (serviceFamily: "azure_vm", size: "Standard_D4s_v3") and optionally an Azure SQL Database node and Azure Load Balancer. Record the returned id; this is your simulationId.

Step 2 — Mint an API key (no auth required for demo)

POST /keys → copy the key field from the response.

Step 3 — Create the RL environment (Bearer auth required)

POST /rl/environments
Authorization: Bearer <your-key>
Content-Type: application/json

{
  "simulationId": "<id-from-step-1>",
  "episodeConfig": {
    "maxSteps": 150,
    "targetTrafficPattern": "wave",
    "initialTraffic": 4500,
    "targetSLA": { "maxLatencyP95": 200, "maxErrorRate": 1.0 },
    "costBudgetPerHour": 7.0
  }
}

Step 4 — Training loop (Bearer auth required)

repeat until done == true:
  POST /rl/environments/{environmentId}/step
  { "action": { "type": "adjust_threshold",
                "parameters": { "cpuThreshold": 70, "throughputThreshold": 75 } } }
  ← { t, obs, metrics, reward, reward_components, done, info }

Step 5 — Reset for the next episode

POST /rl/environments/{environmentId}/reset

Azure-specific notes:

  • costPerHour in observations reflects Azure VM Standard_D4s_v3 + Azure SQL General Purpose 4 vCores pricing from the East US region benchmark.
  • action.type: "scale_out" provisions a new Standard_D4s_v3 VM via VM Scale Set; the first observation after scaling models the ~60 s warm-up latency.
  • action.type: "adjust_threshold" tunes the Azure autoscaling profile (default: CPU-weighted scoring, 300 s cooldown on Azure Monitor autoscale).
  • Consider Standard_F8s_v2 (compute-optimized) when your agent consistently saturates CPU on Standard_D4s_v3.
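
An agent or evaluation script may want to check each step's metrics against the targetSLA from Step 3. The sketch below shows one way to do that. The metric field names `latencyP95` and `errorRate` are assumptions (this section does not list the contents of `metrics`), and `sla_violations` is a hypothetical helper.

```python
from typing import Dict, List

# targetSLA values copied from the Azure request in Step 3.
AZURE_TARGET_SLA = {"maxLatencyP95": 200, "maxErrorRate": 1.0}


def sla_violations(metrics: Dict[str, float],
                   target_sla: Dict[str, float]) -> List[str]:
    """Return a description of every SLA limit that the metrics exceed."""
    violations = []
    # Field names on the metrics side are assumed, not taken from the API schema.
    if metrics["latencyP95"] > target_sla["maxLatencyP95"]:
        violations.append(
            f"latencyP95 {metrics['latencyP95']} > {target_sla['maxLatencyP95']}"
        )
    if metrics["errorRate"] > target_sla["maxErrorRate"]:
        violations.append(
            f"errorRate {metrics['errorRate']} > {target_sla['maxErrorRate']}"
        )
    return violations
```

An empty list means the step stayed within the SLA; a non-empty list can be logged alongside `reward_components` to see which limit drove a penalty.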

This walkthrough shows the complete episode lifecycle for an OCI simulation: create → reset → step → observe. OCI simulations use OCI Compute (VM.Standard3.Flex) and Autonomous Database in the us-ashburn-1 region. The hybrid prediction engine models OCI-specific behaviour including Flex OCPU scaling dynamics and Autonomous Database auto-scaling characteristics.

Step 1 — Create an OCI simulation (no auth required)

POST /simulations with resources using provider: "oci". Include at least one OCI VM (serviceFamily: "oci_vm", size: "VM.Standard3.Flex") and optionally an Autonomous Database node and OCI Load Balancer. Record the returned id; this is your simulationId.

Step 2 — Mint an API key (no auth required for demo)

POST /keys → copy the key field from the response.

Step 3 — Create the RL environment (Bearer auth required)

POST /rl/environments
Authorization: Bearer <your-key>
Content-Type: application/json

{
  "simulationId": "<id-from-step-1>",
  "episodeConfig": {
    "maxSteps": 150,
    "targetTrafficPattern": "burst",
    "initialTraffic": 5000,
    "targetSLA": { "maxLatencyP95": 160, "maxErrorRate": 0.5 },
    "costBudgetPerHour": 4.0
  }
}

Step 4 — Training loop (Bearer auth required)

repeat until done == true:
  POST /rl/environments/{environmentId}/step
  { "action": { "type": "adjust_threshold",
                "parameters": { "cpuThreshold": 65, "throughputThreshold": 70 } } }
  ← { t, obs, metrics, reward, reward_components, done, info }

Step 5 — Reset for the next episode

POST /rl/environments/{environmentId}/reset

OCI-specific notes:

  • costPerHour in observations reflects OCI VM.Standard3.Flex + Autonomous Database pricing from the us-ashburn-1 region benchmark.
  • OCI VM.Standard3.Flex uses flexible OCPU/memory allocation; the simulation models 4 OCPUs / 64 GB RAM per instance by default.
  • Autonomous Database auto-scales OCPU and storage independently, so database cost varies with query load rather than staying fixed.
  • OCI typically offers the lowest per-OCPU compute cost among the five providers, making it attractive for cost-optimization agents.
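
For cost-optimization agents, it can help to track how much of costBudgetPerHour each observation consumes. The following is a small sketch, assuming `costPerHour` is a top-level key of the observation; `budget_report` is a hypothetical helper, and the 4.0 budget is the value from the OCI request in Step 3.

```python
from typing import Any, Dict


def budget_report(obs: Dict[str, Any], cost_budget_per_hour: float) -> Dict[str, Any]:
    """Compare the observed hourly cost with the episode's hourly budget."""
    if cost_budget_per_hour <= 0:
        raise ValueError("cost_budget_per_hour must be positive")
    cost = float(obs["costPerHour"])  # assumed to be a top-level observation key
    return {
        "headroom": cost_budget_per_hour - cost,      # currency units per hour left
        "utilisation": cost / cost_budget_per_hour,   # 1.0 means exactly on budget
        "over_budget": cost > cost_budget_per_hour,
    }
```

For example, `budget_report({"costPerHour": 3.0}, 4.0)` reports a headroom of 1.0 and a utilisation of 0.75.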

Simulations

Create and manage cloud infrastructure simulations

Scenarios

Browse pre-built infrastructure scenario templates

Pricing History

Cloud provider pricing history and trends

API Keys

Manage API keys for authentication

Infrastructure Optimization

Automated infrastructure analysis and optimization

Predictive Scaling

Test infrastructure against traffic forecasts and optimize autoscaling thresholds

Chaos Engineering

Test infrastructure resilience by injecting failures and analyzing recovery

Multi-Cloud Strategy

Explore and compare multi-cloud deployment strategies for optimal cost, performance, and vendor independence

Discovery

Machine-readable description of the simulator for AI agents and onboarding tooling

API Key Self-Service

Bootstrap and self-service path for obtaining API keys without direct database access.

  1. Platform operator runs the bootstrap script once:

    npx tsx scripts/bootstrap-admin-key.ts
    

    This prints a one-time admin key. Store it as a secret immediately.

  2. Admin mints a scoped, time-limited registration token for the external client:

    POST /register-tokens
    Authorization: Bearer <admin-key>
    Content-Type: application/json
    { "name": "canvas-cloud-ai", "scopes": ["read","write"], "expiresAt": "2026-06-01T00:00:00Z" }
    
  3. External client exchanges the token once (manually or via a one-off script):

    POST /keys/register
    Content-Type: application/json
    { "token": "<registration-token>", "name": "canvas-cloud-ai-prod" }
    

    The response contains a permanent API key. Store it as an environment secret.

  4. The token is burned on use and can never be reused. All subsequent API calls use the permanent key directly.
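
Steps 2 and 3 of the flow above can be sketched as two small functions. In practice they run on different sides (the admin mints the token, the external client exchanges it); they appear together here only for illustration. The `post(path, headers, body)` transport is supplied by the caller, both function names are hypothetical, and the response field names `token` and `key` are assumptions since this section does not show the response bodies.

```python
from typing import Any, Callable, Dict, List

Post = Callable[[str, Dict[str, str], Dict[str, Any]], Dict[str, Any]]


def mint_registration_token(post: Post, admin_key: str, name: str,
                            scopes: List[str], expires_at: str) -> str:
    """Admin side: request a scoped, time-limited registration token."""
    headers = {
        "Authorization": f"Bearer {admin_key}",
        "Content-Type": "application/json",
    }
    body = {"name": name, "scopes": scopes, "expiresAt": expires_at}
    return post("/register-tokens", headers, body)["token"]  # field name assumed


def exchange_token_for_key(post: Post, token: str, key_name: str) -> str:
    """Client side: exchange the one-time token for a permanent API key."""
    headers = {"Content-Type": "application/json"}  # no Authorization header is shown for this call
    body = {"token": token, "name": key_name}
    return post("/keys/register", headers, body)["key"]  # field name assumed
```

Because the token is burned on use, the client should store the returned key before doing anything else with it.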

Simulation Fidelity
