1.0.0
OAS 3.1
RESTful API for training reinforcement learning agents on cloud infrastructure autoscaling.
This API enables external AI agents to learn optimal autoscaling policies through trial and error. Agents can create training environments, execute actions, receive rewards, and observe system state.
Authentication:
- /simulations, /events, /traffic-patterns, /failure-injections: Public, no authentication required
- /rl/environments/*: Require API key authentication via Bearer token
- /keys: Public for demonstration purposes (should be secured in production)
Use Cases:
Episode Lifecycle:
The simulation API provides a complete lifecycle for creating, evolving, and analyzing virtual cloud environments without touching real infrastructure. Use this flow for load testing, architecture validation, chaos experiments, and generating realistic training data for RL agents.
The snippets below use shell variables — set them once and every subsequent command works end-to-end. Step 1 accepts an optional API key (omit for a guest demo simulation; include one to link it to your account and unlock unlimited steps). Step 2 requires auth for owned simulations; demo (keyless) simulations can step without a key, up to 20 steps. Steps 3–5 and 7–8 require an API key; Step 6 (bottleneck analysis) accepts an optional key. Step 9 (RL training) is an optional advanced branch that forks off after Step 5 — run it before or instead of cleanup.
Response samples below are illustrative abbreviations; see each endpoint's schema in this
spec for the full payload shape. The shell variable SIM_ID below holds the simulation
UUID returned by Step 1; it maps to the {simulationId} path parameter in all subsequent
API calls.
export BASE_URL="https://your-app.replit.app"
# One-time bootstrap: mint your first admin key.
# Requires BOOTSTRAP_SECRET to be set as an environment variable on the server.
# Only succeeds when no admin key exists yet — subsequent calls return 409.
# If bootstrap is already consumed, use an existing key or an admin-issued
# registration token (POST /api/keys/register with a token from POST /api/register-tokens).
export API_KEY=$(curl -s -X POST "$BASE_URL/api/keys/bootstrap-admin" \
-H "Content-Type: application/json" \
-d '{ "bootstrapSecret": "your-bootstrap-secret" }' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['key'])")
# Canvas Cloud AI users: exchange your CCA token instead:
# export API_KEY=$(curl -s -X POST "$BASE_URL/api/keys/register" \
# -H "Content-Type: application/json" \
# -d '{ "token": "cca_live_..." }' \
# | python3 -c "import sys,json; print(json.load(sys.stdin)['key'])")
Step 1 — Create a simulation (POST /api/simulations)
Provision a named virtual environment with cloud resources (compute, database, network, storage)
and receive a simulationId. Include the resources array in the request body to configure
provider-specific settings (instance type, region, autoscaling bounds). Mix AWS, GCP, Azure,
OCI, and DigitalOcean resources in a single simulation to model multi-cloud topologies.
No authentication required; pass an API key to link the simulation to your account.
# Authenticated (owned) simulation — unlimited steps:
SIM_ID=$(curl -s -X POST "$BASE_URL/api/simulations" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "my-first-sim",
"resources": [
{
"id": "web-1",
"name": "Web Server",
"type": "compute",
"provider": "aws",
"characteristics": {
"instanceType": "t3.medium",
"region": "us-east-1",
"minInstances": 1,
"maxInstances": 5
}
}
]
}' | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")
echo "Simulation ID: $SIM_ID"
# Response: { "id": "sim_abc123", "name": "my-first-sim", "resources": [...], ... }
# Guest/demo mode (omit Authorization header) — no key needed, capped at 20 steps:
# SIM_ID=$(curl -s -X POST "$BASE_URL/api/simulations" \
# -H "Content-Type: application/json" \
# -d '{ "name": "demo-sim", "resources": [...] }' \
# | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")
Step 2 — Advance the simulation (POST /api/simulations/{simulationId}/step)
Drive the simulation forward one timestep. The hybrid prediction engine applies registered traffic patterns, evaluates autoscaling rules, calculates CPU utilization / error-rate / throughput metrics, and returns updated resource states. Call this in a loop to model minutes, hours, or months of production traffic in seconds. No request body is needed. Auth is required for owned simulations; demo (keyless) simulations can step without a key (limited to 20 steps).
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/step" \
-H "Authorization: Bearer $API_KEY"
# Response: { "simulation": { ... }, "metrics": { "cpuUsage": 42.1, "latencyP95": 180,
# "errorRate": 0.002, "throughput": 298 }, "events": [] }
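The "call this in a loop" advice can be sketched in Python with only the standard library. This is a minimal sketch, not the project's client: `http_post`, `run_steps`, and the injectable `post` callable are hypothetical names introduced here. Only the endpoint path, the Bearer header, the 20-step demo cap, and the `metrics` key come from the text and samples above.

```python
import json
import urllib.request
from typing import Optional

DEMO_STEP_CAP = 20  # keyless demo simulations are limited to 20 steps (see above)


def http_post(url: str, api_key: Optional[str] = None) -> dict:
    """POST with an empty body and decode the JSON response."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    req = urllib.request.Request(url, data=b"", headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_steps(base_url, sim_id, n_steps, api_key=None, post=http_post):
    """Advance the simulation n_steps times and collect each step's metrics."""
    if api_key is None and n_steps > DEMO_STEP_CAP:
        raise ValueError("keyless demo simulations are capped at 20 steps")
    url = f"{base_url}/api/simulations/{sim_id}/step"
    history = []
    for _ in range(n_steps):
        history.append(post(url, api_key)["metrics"])
    return history
```

Passing `post` as a parameter keeps the loop testable without a running server; in normal use the default `http_post` is enough.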
Step 3 — Inspect metrics and events
- GET /api/simulations/{simulationId}/metrics: time-series performance metrics (CPU utilization, error rate, throughput, latency) indexed by simulation step.
- GET /api/simulations/{simulationId}/events: the event log, covering scale-out decisions, failure triggers, cost spikes, and autoscaling threshold crossings.
Use these to validate that the simulation is behaving as expected before running expensive analysis jobs.
# Fetch time-series metrics
curl -s "$BASE_URL/api/simulations/$SIM_ID/metrics" \
-H "Authorization: Bearer $API_KEY"
# Response: [{ "timestamp": 1, "cpuUsage": 42.1, "latencyP95": 180, "errorRate": 0.002,
# "throughput": 298 }, ...]
# Fetch the event log
curl -s "$BASE_URL/api/simulations/$SIM_ID/events" \
-H "Authorization: Bearer $API_KEY"
# Response: [{ "id": "evt_1", "type": "scale_out", "message": "Scaled out to 2 instances",
# "severity": "info", "timestamp": "2026-05-01T12:00:00Z" }, ...]
Step 4 — Add and activate traffic patterns
- POST /api/simulations/{simulationId}/patterns: register a named traffic pattern (ramp, burst, step, wave, or custom) that is applied on every subsequent step call. Multiple patterns compose automatically.
- POST /api/simulations/{simulationId}/inject-traffic: inject an immediate random traffic spike into the simulation, independent of registered patterns.
# Register a ramp-up traffic pattern (applied on every /step call)
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/patterns" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "gradual-ramp",
"type": "ramp",
"startTime": 0,
"parameters": { "startTraffic": 100, "endTraffic": 900, "duration": 20 }
}'
# Response: { "id": "pat_xyz", "name": "gradual-ramp", "type": "ramp", ... }
# Inject an immediate one-off traffic spike
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/inject-traffic" \
-H "Authorization: Bearer $API_KEY"
# Response: { "simulation": { ... }, "event": { "message": "Traffic spike injected", ... } }
Step 5 — Add and trigger failure injections
- POST /api/simulations/{simulationId}/failures: register a failure scenario (database crash, zone outage, network partition, CPU stress) against the simulation.
- POST /api/simulations/{simulationId}/inject-failure: trigger the failure injection.
Pair with GET /api/simulations/{simulationId}/events to observe how the simulation detects, reacts to, and recovers from each failure.
# Register a zone-outage failure scenario
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/failures" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "us-east-1a outage",
"type": "az_outage",
"targetResourceId": "web-1",
"severity": "severe",
"startTime": 0,
"parameters": { "errorRateIncrease": 0.4 }
}'
# Response: { "id": "fail_abc", "name": "us-east-1a outage", "type": "az_outage", ... }
# Trigger a node failure on a random healthy instance
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/inject-failure" \
-H "Authorization: Bearer $API_KEY"
# Response: { "simulation": { ... }, "event": { "message": "Node failure injected", ... } }
Step 6 — Analyze bottlenecks (POST /api/simulations/{simulationId}/analyze-bottlenecks)
Run an AI-backed bottleneck analysis over the current simulation state. The engine identifies
saturated resources, latency hotspots, and single points of failure, and returns natural-language
recommendations. Pass beginnerMode: true to receive simplified explanations suitable for
developers who are new to cloud architecture. Authentication is optional on this endpoint
(the API key is accepted but not required).
curl -s -X POST "$BASE_URL/api/simulations/$SIM_ID/analyze-bottlenecks" \
-H "Content-Type: application/json" \
-d '{ "beginnerMode": false }'
# Response: { "analysis": "web-1 is running at 91% CPU. Consider scaling out to 3
# instances or upgrading to c5.large before traffic doubles.",
# "doRecommendation": "Add a second c5.large instance to distribute load." }
Step 7 — Optimize the architecture (POST /api/analysis/optimize)
Submit an asynchronous optimization job. The engine evaluates cost, performance, and reliability
trade-offs and returns ranked recommendations with projected savings and risk scores. Provide a
webhookUrl to receive the result asynchronously instead of polling
GET /api/analysis/jobs/{jobId}.
JOB_ID=$(curl -s -X POST "$BASE_URL/api/analysis/optimize" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"simulationId": "'"$SIM_ID"'",
"goals": {
"primary": "balance",
"weights": { "cost": 0.4, "performance": 0.4, "stability": 0.2 }
},
"testScenario": {
"traffic_pattern": "spike",
"duration_steps": 10,
"include_failures": false
}
}' | python3 -c "import sys,json; print(json.load(sys.stdin)['job']['id'])")
echo "Optimization job: $JOB_ID"
# Poll for status:
curl -s "$BASE_URL/api/analysis/jobs/$JOB_ID" \
-H "Authorization: Bearer $API_KEY"
# Response: { "id": "opt_xyz789", "status": "completed", "variantsGenerated": 47,
# "variantsCompleted": 47 }
# Fetch recommendations once status === "completed":
curl -s "$BASE_URL/api/analysis/jobs/$JOB_ID/recommendations" \
-H "Authorization: Bearer $API_KEY"
# Response: { "recommendations": [{ "rank": 1, "name": "Serverless First",
# "costSavingsPercent": 18, "score": 0.87 }], "totalVariants": 47 }
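If you poll instead of using a webhook, the loop might look like the sketch below. It is illustrative only: `wait_for_job`, `http_get`, the 2-second interval, and the 60-poll limit are assumptions made here, and any status other than "completed" or "failed" is treated as "still running" because intermediate status names are not listed in this section.

```python
import json
import time
import urllib.request


def http_get(url: str, api_key: str) -> dict:
    """GET a JSON document using Bearer authentication."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(base_url, job_id, api_key, get=http_get, sleep=time.sleep,
                 interval_s=2.0, max_polls=60):
    """Poll the job until it reaches a final status, then return the job record."""
    url = f"{base_url}/api/analysis/jobs/{job_id}"
    for _ in range(max_polls):
        job = get(url, api_key)
        if job.get("status") in ("completed", "failed"):
            return job
        sleep(interval_s)  # hypothetical fixed interval between polls
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

Once the returned record has status "completed", fetch the recommendations endpoint as shown above.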
Step 8 — Clean up (DELETE /api/simulations/{simulationId})
Delete the simulation and all its associated resources when the experiment is complete.
curl -s -o /dev/null -w "%{http_code}" -X DELETE "$BASE_URL/api/simulations/$SIM_ID" \
-H "Authorization: Bearer $API_KEY"
# Response: HTTP 204 No Content (empty body)
Step 9 — RL Training (optional branch — fork here after Step 5, before or instead of Step 8)
Once the simulation is populated and behaving realistically (steps 1–5 above), you can attach an RL environment to it and start training your agent. Run this before Step 8 (cleanup) or skip it entirely if you only need the analysis features. Requires an API key (see preamble).
- POST /api/rl/environments: create an RL environment linked to the simulation
- POST /api/rl/environments/{environmentId}/step: execute actions and observe rewards in a loop
- POST /api/rl/environments/{environmentId}/reset: begin a new episode when the current one ends
This lets you pre-warm a simulation with a realistic traffic baseline before starting RL training, so your agent begins from a meaningful initial state rather than an empty environment.
# Prerequisites: API_KEY and SIM_ID are already set (see preamble + Step 1 above).
# 1. Create an RL environment linked to the simulation
ENV_ID=$(curl -s -X POST "$BASE_URL/api/rl/environments" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"simulationId": "'"$SIM_ID"'",
"episodeConfig": {
"maxSteps": 100,
"initialTraffic": 1000,
"targetSLA": { "maxLatencyP95": 200, "maxErrorRate": 1 },
"enableFailures": false
}
}' | python3 -c "import sys,json; print(json.load(sys.stdin)['environment']['id'])")
echo "RL Environment: $ENV_ID"
# Response: { "environment": { "id": "env_abc", "simulationId": "...", ... },
# "observation": { "metrics": {...}, "resources": [...], "traffic": 1000 } }
# 2. Training loop — execute an action and receive next obs + reward
curl -s -X POST "$BASE_URL/api/rl/environments/$ENV_ID/step" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{ "action": { "type": "scale_out", "parameters": {} } }'
# Response: { "t": 1, "obs": { "rps": 1000, "cpu_util": 0.45, "instances": 2, ... },
# "metrics": { "cost_usd_hr": 0.38, "latency_p95": 112, "error_rate": 0.003, ... },
# "reward": 0.72, "reward_components": { "performance": 0.8, ... },
# "done": false, "info": {} }
# 3. Reset to start a new episode when done=true
curl -s -X POST "$BASE_URL/api/rl/environments/$ENV_ID/reset" \
-H "Authorization: Bearer $API_KEY"
# Response: { "environment": { "currentStep": 0, ... }, "observation": { "metrics": {...} } }
Python training-loop example
A self-contained Python script that wires all three steps above into a
runnable multi-episode training loop is available at
examples/rl_training_loop.py.
It uses only the Python standard library (no third-party packages) and
demonstrates how to read obs["cpu_util"], metrics["latency_p95"],
reward, and done from each step response, reset between episodes, and
print per-episode reward totals:
# Run against the local dev server (--token must have admin scope):
python examples/rl_training_loop.py --token $ADMIN_KEY
# Already have a write-scoped key? Skip key minting with --skip-mint:
python examples/rl_training_loop.py --token $API_KEY --skip-mint
# Run against a deployed instance with custom episode count:
python examples/rl_training_loop.py \
--base-url https://your-deployment.replit.app \
--token $ADMIN_KEY \
--episodes 5 \
--steps 50
JavaScript/Node.js training-loop example
A parallel Node.js script is available at
examples/rl_training_loop.js for
JS-first developers. It mirrors the Python script exactly — same episode
flow, same action set, same printed output — and uses only Node.js
built-ins (node:https / node:http), so no npm install is required:
# Run against the local dev server (--token must have admin scope):
node examples/rl_training_loop.js --token $ADMIN_KEY
# Already have a write-scoped key? Skip key minting with --skip-mint:
node examples/rl_training_loop.js --token $API_KEY --skip-mint
# Run against a deployed instance with custom episode count:
node examples/rl_training_loop.js \
--base-url https://your-deployment.replit.app \
--token $ADMIN_KEY \
--episodes 5 \
--steps 50
Fetch API training-loop example (Node.js 18+ / Deno / Bun)
A self-contained counterpart to the Node.js script above is available at
examples/rl_training_loop_fetch.mjs.
It mirrors the Node.js script exactly — same episode flow, same action set,
same printed output — but uses the standard Fetch API throughout instead of
node:https built-ins, so it runs unchanged on Node.js 18+, Deno, and Bun
with no npm install required:
# Node.js 18+ (run against the local dev server):
node examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY
# Already have a write-scoped key? Skip key minting with --skip-mint:
node examples/rl_training_loop_fetch.mjs --token $API_KEY --skip-mint
# Deno (requires --allow-net):
deno run --allow-net examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY
# Bun:
bun examples/rl_training_loop_fetch.mjs --token $ADMIN_KEY
# Run against a deployed instance with custom episode count:
node examples/rl_training_loop_fetch.mjs \
--base-url https://your-deployment.replit.app \
--token $ADMIN_KEY \
--episodes 5 \
--steps 50
If you only need a minimal inline snippet (e.g. to embed in a browser script or REPL), here is a compact fetch-based step loop:
// fetch-based RL step loop — works in browser, Deno, Bun, Node.js 18+
// (wrapped in an async IIFE so no top-level await is required)
const BASE_URL = "https://your-deployment.replit.app"; // or http://localhost:5000
const API_KEY = "your-write-scoped-api-key";
const ENV_ID = "your-environment-id"; // from POST /api/rl/environments
async function rlStep(action) {
  const res = await fetch(`${BASE_URL}/api/rl/environments/${ENV_ID}/step`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ action }),
  });
  if (!res.ok) throw new Error(`Step failed: ${res.status} ${await res.text()}`);
  return res.json();
}
(async () => {
  for (let i = 0; i < 20; i++) {
    // The action is an object ({ type, parameters }); the step response exposes the observation as `obs`.
    const { obs, metrics, reward, done } = await rlStep({ type: "scale_out", parameters: {} });
    console.log(
      `step ${i + 1} | cpu=${(obs.cpu_util * 100).toFixed(1)}%` +
      ` | p95=${metrics.latency_p95}ms | reward=${reward.toFixed(3)}`
    );
    if (done) { console.log("Episode finished."); break; }
  }
})();
TypeScript training-loop example
A typed counterpart is available at
examples/rl_training_loop.ts for
TypeScript projects. It imports the generated SDK types from
sdk/typescript/src/openapi-types.ts so every request body and step
response is fully typed — enabling autocomplete and compile-time safety.
Run it with npx tsx (no separate compile step needed):
# Run against the local dev server (--token must have admin scope):
npx tsx examples/rl_training_loop.ts --token $ADMIN_KEY
# Already have a write-scoped key? Skip key minting with --skip-mint:
npx tsx examples/rl_training_loop.ts --token $API_KEY --skip-mint
# Run against a deployed instance with custom episode count:
npx tsx examples/rl_training_loop.ts \
--base-url https://your-deployment.replit.app \
--token $ADMIN_KEY \
--episodes 5 \
--steps 50
The Cloud World Model API supports webhook notifications for asynchronous job completion events. Instead of polling job status endpoints, you can provide a webhook URL when creating jobs, and the API will send an HTTP POST request to your endpoint when the job completes.
Supported Jobs:
- POST /api/analysis/optimize
- POST /api/chaos/run
- POST /api/chaos/batch
- POST /api/predictions/validate
- POST /api/predictions/optimize-thresholds
- POST /api/multi-cloud/explore
- POST /api/rl/environments
How to Use Webhooks:
When creating a job, include two optional fields in your request:
- webhookUrl (string): HTTPS URL where the webhook should be delivered
- webhookSecret (string): Secret used to sign the webhook payload (for verification)
Webhook Delivery Mechanism:
Webhook Payload:
The webhook payload is a JSON object containing:
- event: Event type (e.g., "optimization.completed", "chaos.completed", "rl_episode.completed")
- jobId: Unique identifier for the job
- status: Final status ("completed" or "failed")
- data: Job-specific result data (structure varies by job type)
- timestamp: ISO 8601 timestamp when the webhook was sent
Security - Signature Verification:
All webhooks include an X-Webhook-Signature header containing an HMAC-SHA256 signature.
You should verify this signature to ensure the webhook came from the Cloud World Model API:
- Compute an HMAC-SHA256 digest using your webhookSecret as the key and the raw body as the message
- Compare the result with the X-Webhook-Signature header value
Example verification (Python):
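import hmac
import hashlib

def verify_webhook_signature(payload: bytes, signature: str, secret: str) -> bool:
    # Recompute the hex HMAC-SHA256 of the raw request body with the shared secret.
    expected_signature = hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(signature, expected_signature)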
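A short round-trip check can help confirm that a receiver is wired correctly. The sketch below assumes, as the verification example does, that the header carries the bare hex digest. The `sign` helper, the secret value, and the sample body are hypothetical and exist only to simulate the sender. It also shows why the raw bytes must be verified rather than a re-serialized copy of the parsed JSON.

```python
import hashlib
import hmac
import json


def sign(payload: bytes, secret: str) -> str:
    """Hex HMAC-SHA256 of the raw body, as the sender would compute it."""
    return hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()


def verify_webhook_signature(payload: bytes, signature: str, secret: str) -> bool:
    return hmac.compare_digest(signature, sign(payload, secret))


secret = "example-secret"  # hypothetical value
raw_body = b'{"event":"optimization.completed","jobId":"opt_xyz789","status":"completed"}'
header_value = sign(raw_body, secret)  # stands in for X-Webhook-Signature

valid = verify_webhook_signature(raw_body, header_value, secret)

# json.dumps inserts spaces after ':' and ',', so the bytes differ from the
# body that was signed and verification fails.
reserialized = json.dumps(json.loads(raw_body)).encode()
mismatch = verify_webhook_signature(reserialized, header_value, secret)

print(valid, mismatch)  # prints: True False
```

In a web framework, read the request body as bytes before any JSON parsing and pass those bytes to the verifier.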
Webhook Delivery Status:
All job response objects include webhook delivery tracking fields:
- webhookDeliveryStatus: "pending", "delivered", or "failed"
- webhookDeliveryAttempts: Number of delivery attempts made
- webhookDeliveryError: Error message if delivery failed (e.g., timeout, connection refused)
- webhookDeliveredAt: ISO 8601 timestamp of successful delivery
Best Practices:
- …webhookSecret for each job or use a rotating secret system
- …jobId (webhooks may be delivered multiple times)
When a webhook arrives with "status": "failed", the data.error field contains a
human-readable message that tells you exactly why the job could not complete. Agent
recovery logic should inspect this message and classify the failure before deciding
whether to retry the same request, fix the input and resubmit, or escalate.
| Category | When to use | Agent action |
|---|---|---|
| Invalid input | The error message describes a specific problem with the request parameters (missing resource, malformed data, etc.) | Fix the input; do not retry the same request |
| Transient / engine error | The error message mentions an internal error, unexpected computation result, or does not identify a user-correctable input problem | Wait, then retry the same request up to 3 times with exponential backoff (2 s, 8 s, 30 s) |
A useful rule of thumb: if the error message ends with "resubmit" after listing corrective steps, it is an invalid-input failure. If the message does not provide corrective steps or mentions internal state, treat it as transient and retry.
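The transient-retry row can be sketched as a small helper. This is an assumption-laden illustration rather than a prescribed client: `retry_transient`, `submit`, and `is_transient` are hypothetical names, and `submit` is assumed to resubmit the job and return its final payload (the same shape as the webhook payload, with `status` and `data.error`).

```python
import time

BACKOFF_DELAYS_S = (2, 8, 30)  # delays from the table above


def retry_transient(submit, is_transient, sleep=time.sleep):
    """Run submit(); retry up to 3 times while the failure looks transient."""
    result = submit()
    for delay in BACKOFF_DELAYS_S:
        if result.get("status") != "failed" or not is_transient(result):
            return result  # success, or a failure that needs an input fix
        sleep(delay)
        result = submit()
    return result  # still failing after 3 retries: escalate
```

An invalid-input failure returns immediately without sleeping, which matches the "do not retry the same request" guidance.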
1. Simulation not found
Example error text: "Simulation 'sim_abc123' not found…"
- …simulationId supplied when creating the job references a simulation that no longer exists (deleted between job submission and execution) or was never created.
- …POST /api/simulations to create a new simulation with the same resource configuration.
- …simulationId.
2. Simulation contains no resources
Example error text: "Simulation '…' contains no resources and cannot be validated…"
- …POST /api/simulations/{simulationId}/resources.
3. simulationId belongs to a different API key scope
Example error text: "simulationId '…' does not exist or belongs to a different API key scope…"
- …read access to the simulation referenced by simulationId.
4. Traffic forecast malformed
Example error text: "Traffic forecast '…' is malformed: timestamps are not strictly increasing…"
5. Traffic forecast has insufficient data points
Example error text: "…contains only N data points spanning M simulation steps…requires at least 5 data points covering a minimum of 60 steps…"
6. No valid threshold combination found
Example error text: "No valid threshold combination found…all N candidate combinations…produced peak error rates above the SLA limit…"
- …maxInstances in the simulation's autoscaling config so the optimizer has more headroom to test higher-scale configurations.
- …minInstances) so the pool can absorb the initial traffic burst before autoscaling adds capacity.
- …minInstances first since autoscaling provisioning time may exceed the ramp duration.
7. Validation engine internal error
Example error text: "Validation engine encountered an internal error…capacity model returned a negative throughput value at step N…"
- …jobId and full error payload for support investigation.
Job failures (described above) are different from webhook delivery failures. The platform retries webhook delivery up to 3 times with exponential backoff (0 s, 2 s, 8 s). If all delivery attempts fail, the job response object reflects this:
- webhookDeliveryStatus: "failed"
- webhookDeliveryAttempts: 3
- webhookDeliveryError: description of the network error (e.g., "connection refused", "timeout")
In this case the job itself may have completed successfully; only the notification failed to reach your endpoint. Agent recovery steps:
- …GET /api/predictions/optimize-thresholds/{jobId}) to retrieve the final result directly.
- …status in the polled response:
  - "completed" → process the result as you would a successful webhook payload.
  - "failed" → apply the job-failure recovery steps above.
The jobId included in every webhook payload is stable and idempotent: you can safely poll the same jobId multiple times without triggering side effects.
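The distinction between a failed job and a failed delivery can be expressed as a small decision function over a polled job record. This is a sketch under assumptions: `next_action` and its return labels are hypothetical, and the record is assumed to expose `status` alongside the `webhookDeliveryStatus` field described above.

```python
def next_action(job: dict) -> str:
    """Decide what to do with a polled job record."""
    delivery = job.get("webhookDeliveryStatus")
    if delivery == "pending":
        return "wait_for_webhook"
    if delivery == "delivered":
        return "already_notified"
    # Delivery failed (or the field is absent): rely on the polled status.
    status = job.get("status")
    if status == "completed":
        return "process_result"
    if status == "failed":
        return "apply_job_failure_recovery"
    return "keep_polling"
```

Because polling the same jobId has no side effects, this check can run on a timer for any job whose webhook has not arrived.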
| Error pattern in data.error | Category | Retry same request? |
|---|---|---|
| "…not found…" (simulation or resource) | Invalid input | No — fix simulationId first |
| "…no resources…" | Invalid input | No — add resources first |
| "…different API key scope…" | Authorization | No — fix key/scope first |
| "…malformed…" or "…not strictly increasing…" | Invalid input | No — fix forecast first |
| "…insufficient data points…" or "…too short…" | Invalid input | No — extend forecast first |
| "No valid threshold combination found…" | Infrastructure constraint | No — change infra params first |
| "…internal error…" or unexpected computation message | Transient | Yes — retry up to 3× with exponential backoff (2 s, 8 s, 30 s) |
| Any other unrecognized error | Unknown | Retry once; escalate if it recurs |
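The quick-reference table translates fairly directly into a lookup. The sketch below is one possible encoding, not an official classifier: `triage`, `Triage`, and `_RULES` are hypothetical names, case-insensitive substring matching is an assumption, and the "unexpected computation message" case is not detected and falls through to "Unknown".

```python
from typing import NamedTuple


class Triage(NamedTuple):
    category: str
    retry_same_request: bool


# (substrings, category, retry), in the same order as the table above.
_RULES = [
    (("not found",), "Invalid input", False),
    (("no resources",), "Invalid input", False),
    (("different API key scope",), "Authorization", False),
    (("malformed", "not strictly increasing"), "Invalid input", False),
    (("insufficient data points", "too short"), "Invalid input", False),
    (("No valid threshold combination found",), "Infrastructure constraint", False),
    (("internal error",), "Transient", True),
]


def triage(error_text: str) -> Triage:
    """Map a data.error message to a category and a retry decision."""
    lowered = error_text.lower()
    for needles, category, retry in _RULES:
        if any(needle.lower() in lowered for needle in needles):
            return Triage(category, retry)
    # Last table row: retry once, then escalate if the error recurs.
    return Triage("Unknown", True)
```

For example, `triage("Simulation 'sim_abc123' not found")` yields the "Invalid input" category with no retry, while an internal-error message yields "Transient" with retry enabled.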
Manage reinforcement learning training environments.
This walkthrough shows the complete episode lifecycle for a DigitalOcean simulation: create → reset → step → observe. DigitalOcean simulations use Droplet-based compute and Managed Database resources. The hybrid prediction engine automatically models provider-specific behaviour — Droplet cold-start overhead (~30 s provisioning latency on first request after a scale-out) and shared-tenant network jitter — so your agent learns realistic scaling dynamics without spending real cloud budget.
Step 1 — Create a DigitalOcean simulation (no auth required)
POST /simulations with resources that include at least one Droplet
(provider: "digitalocean", type: "compute") and, optionally, a Managed PostgreSQL
node and a Load Balancer. Record the returned id — this is your simulationId.
Step 2 — Mint an API key (no auth required for demo)
POST /keys → copy the key field from the response.
Step 3 — Create the RL environment (Bearer auth required)
POST /rl/environments
Authorization: Bearer <your-key>
Content-Type: application/json
{
"simulationId": "<id-from-step-1>",
"episodeConfig": {
"maxSteps": 200,
"targetTrafficPattern": "wave",
"initialTraffic": 1500,
"targetSLA": { "maxLatencyP95": 180, "maxErrorRate": 1.0 },
"costBudgetPerHour": 3.50
}
}
The response contains the environment id and the initial observation — your agent's
first view of the Droplet cluster state.
Step 4 — Training loop (Bearer auth required)
repeat until done == true:
POST /rl/environments/{environmentId}/step
{ "action": { "type": "adjust_threshold",
"parameters": { "cpuThreshold": 65, "throughputThreshold": 70 } } }
← { t, obs, metrics, reward, reward_components, done, info }
Step 5 — Reset for the next episode
POST /rl/environments/{environmentId}/reset
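Steps 4 and 5 can be combined into one episode function. The following is a minimal sketch using only the standard library, not the shipped example script: `run_episode`, `choose_action`, `http_post`, and the 0.65 CPU cut-off are hypothetical, and `api_root` is assumed to already include any path prefix (such as /api) that your deployment uses. The reset response may not carry `cpu_util`, so the toy policy falls back to 0.0 for the first step.

```python
import json
import urllib.request


def http_post(url, api_key, body=None):
    """POST a JSON body with Bearer auth and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body or {}).encode(),
        method="POST",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def choose_action(obs):
    """Toy policy (an assumption, not a recommendation): scale out when CPU is hot."""
    if obs.get("cpu_util", 0.0) > 0.65:
        return {"type": "scale_out", "parameters": {}}
    return {"type": "adjust_threshold",
            "parameters": {"cpuThreshold": 65, "throughputThreshold": 70}}


def run_episode(api_root, env_id, api_key, max_steps=200, post=http_post):
    """Reset the environment, step until done, and return the total reward."""
    root = f"{api_root}/rl/environments/{env_id}"
    obs = post(f"{root}/reset", api_key).get("observation", {})
    total = 0.0
    for _ in range(max_steps):
        result = post(f"{root}/step", api_key, {"action": choose_action(obs)})
        total += result["reward"]
        obs = result["obs"]
        if result["done"]:
            break
    return total
```

A real agent would replace `choose_action` with its learned policy and record `reward_components` for diagnostics.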
DigitalOcean-specific notes:
- costPerHour in observations reflects Droplet + Managed Database pricing from the nyc3 region benchmark data.
- action.type: "scale_out" provisions a new Droplet replica; the first observation after scaling models the ~30 s cold-start latency overhead automatically.
- action.type: "adjust_threshold" tunes CPU/throughput triggers on the DigitalOcean autoscaling profile (default: CPU-weighted scoring, 180 s cooldown).
- s-2vcpu-4gb (maxThroughput: 1800 req/s) is the recommended starting instance type for moderate workloads; upgrade to c-4 CPU-Optimized when your agent consistently saturates CPU.
This walkthrough shows the complete episode lifecycle for a GCP simulation:
create → reset → step → observe. GCP simulations use GCE compute instances and
Cloud SQL for managed databases in the us-central1 region. The hybrid prediction engine
models GCP-specific behaviour including managed instance group warm-up latency (~45 s for
the first scale-out) and Cloud SQL connection pooling characteristics.
Step 1 — Create a GCP simulation (no auth required)
POST /simulations with resources using provider: "gcp". Include at least one GCE
instance (serviceFamily: "gce") and optionally a Cloud SQL node and Cloud Load Balancing
frontend. Record the returned id — this is your simulationId.
Step 2 — Mint an API key (no auth required for demo)
POST /keys → copy the key field from the response.
Step 3 — Create the RL environment (Bearer auth required)
POST /rl/environments
Authorization: Bearer <your-key>
Content-Type: application/json
{
"simulationId": "<id-from-step-1>",
"episodeConfig": {
"maxSteps": 150,
"targetTrafficPattern": "ramp",
"initialTraffic": 4000,
"targetSLA": { "maxLatencyP95": 180, "maxErrorRate": 1.0 },
"costBudgetPerHour": 6.0
}
}
Step 4 — Training loop (Bearer auth required)
repeat until done == true:
POST /rl/environments/{environmentId}/step
{ "action": { "type": "adjust_threshold",
"parameters": { "cpuThreshold": 68, "throughputThreshold": 72 } } }
← { t, obs, metrics, reward, reward_components, done, info }
Step 5 — Reset for the next episode
POST /rl/environments/{environmentId}/reset
GCP-specific notes:
- costPerHour in observations reflects GCE e2-standard-4 + Cloud SQL db-standard-4 pricing from the us-central1 region benchmark.
- action.type: "scale_out" provisions a new GCE instance; managed instance group warm-up adds ~45 s latency overhead to the first observation after scaling.
- action.type: "adjust_threshold" tunes the GCP autoscaling profile (default: CPU-weighted scoring, 120 s cooldown on Compute Engine autoscaler).
- …e2-standard-4 to n2-standard-8 in the resource definition when your agent consistently saturates CPU.
This walkthrough shows the complete episode lifecycle for an Azure simulation:
create → reset → step → observe. Azure simulations use Azure VM compute and
Azure SQL Database in the East US region. The hybrid prediction engine models
Azure-specific behaviour including VM Scale Set provisioning latency (~60 s for the
first scale-out) and Azure SQL DTU burst characteristics.
Step 1 — Create an Azure simulation (no auth required)
POST /simulations with resources using provider: "azure". Include at least one Azure VM
(serviceFamily: "azure_vm", size: "Standard_D4s_v3") and optionally an Azure SQL
Database node and Azure Load Balancer. Record the returned id.
Step 2 — Mint an API key (no auth required for demo)
POST /keys → copy the key field from the response.
Step 3 — Create the RL environment (Bearer auth required)
POST /rl/environments
Authorization: Bearer <your-key>
Content-Type: application/json
{
"simulationId": "<id-from-step-1>",
"episodeConfig": {
"maxSteps": 150,
"targetTrafficPattern": "wave",
"initialTraffic": 4500,
"targetSLA": { "maxLatencyP95": 200, "maxErrorRate": 1.0 },
"costBudgetPerHour": 7.0
}
}
Step 4 — Training loop (Bearer auth required)
repeat until done == true:
POST /rl/environments/{environmentId}/step
{ "action": { "type": "adjust_threshold",
"parameters": { "cpuThreshold": 70, "throughputThreshold": 75 } } }
← { t, obs, metrics, reward, reward_components, done, info }
Step 5 — Reset for the next episode
POST /rl/environments/{environmentId}/reset
Azure-specific notes:
- costPerHour in observations reflects Azure VM Standard_D4s_v3 + Azure SQL General Purpose 4 vCores pricing from the East US region benchmark.
- action.type: "scale_out" provisions a new Standard_D4s_v3 VM via VM Scale Set; the first observation after scaling models the ~60 s warm-up latency.
- action.type: "adjust_threshold" tunes the Azure autoscaling profile (default: CPU-weighted scoring, 300 s cooldown on Azure Monitor autoscale).
- …Standard_F8s_v2 (compute-optimized) when your agent consistently saturates CPU on Standard_D4s_v3.
This walkthrough shows the complete episode lifecycle for an OCI simulation:
create → reset → step → observe. OCI simulations use OCI Compute (VM.Standard3.Flex)
and Autonomous Database in the us-ashburn-1 region. The hybrid prediction engine models
OCI-specific behaviour including Flex OCPU scaling dynamics and Autonomous Database
auto-scaling characteristics.
Step 1 — Create an OCI simulation (no auth required)
POST /simulations with resources using provider: "oci". Include at least one OCI VM
(serviceFamily: "oci_vm", size: "VM.Standard3.Flex") and optionally an Autonomous
Database node and OCI Load Balancer. Record the returned id.
Step 2 — Mint an API key (no auth required for demo)
POST /keys → copy the key field from the response.
Step 3 — Create the RL environment (Bearer auth required)
POST /rl/environments
Authorization: Bearer <your-key>
Content-Type: application/json
{
"simulationId": "<id-from-step-1>",
"episodeConfig": {
"maxSteps": 150,
"targetTrafficPattern": "burst",
"initialTraffic": 5000,
"targetSLA": { "maxLatencyP95": 160, "maxErrorRate": 0.5 },
"costBudgetPerHour": 4.0
}
}
Step 4 — Training loop (Bearer auth required)
repeat until done == true:
POST /rl/environments/{environmentId}/step
{ "action": { "type": "adjust_threshold",
"parameters": { "cpuThreshold": 65, "throughputThreshold": 70 } } }
← { t, obs, metrics, reward, reward_components, done, info }
Step 5 — Reset for the next episode
POST /rl/environments/{environmentId}/reset
OCI-specific notes:
- costPerHour in observations reflects OCI VM.Standard3.Flex + Autonomous Database pricing from the us-ashburn-1 region benchmark.
Create and manage cloud infrastructure simulations
Browse pre-built infrastructure scenario templates
Cloud provider pricing history and trends
Manage API keys for authentication
Automated infrastructure analysis and optimization
Test infrastructure against traffic forecasts and optimize autoscaling thresholds
Test infrastructure resilience by injecting failures and analyzing recovery
Explore and compare multi-cloud deployment strategies for optimal cost, performance, and vendor independence
Machine-readable description of the simulator for AI agents and onboarding tooling
Bootstrap and self-service path for obtaining API keys without direct database access.
Platform operator runs the bootstrap script once:
npx tsx scripts/bootstrap-admin-key.ts
This prints a one-time admin key. Store it as a secret immediately.
Admin mints a scoped, time-limited registration token for the external client:
POST /register-tokens
Authorization: Bearer <admin-key>
Content-Type: application/json
{ "name": "canvas-cloud-ai", "scopes": ["read","write"], "expiresAt": "2026-06-01T00:00:00Z" }
External client exchanges the token once (manually or via a one-off script):
POST /keys/register
Content-Type: application/json
{ "token": "<registration-token>", "name": "canvas-cloud-ai-prod" }
The response contains a permanent API key. Store it as an environment secret.
The token is burned on use and can never be reused. All subsequent API calls use the permanent key directly.
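The two HTTP calls in this flow can be prepared with the standard library as shown below. This is a sketch only: `register_token_request`, `exchange_token_request`, `api_root`, and the 30-day default lifetime are hypothetical choices made here, while the paths, headers, and body fields follow the requests above. The functions build `urllib.request.Request` objects without sending them, so the admin key and the one-time token are never used by accident.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def register_token_request(api_root, admin_key, name, scopes, ttl_days=30, now=None):
    """Admin side: build the request that mints a time-limited registration token."""
    now = now or datetime.now(timezone.utc)
    expires = (now + timedelta(days=ttl_days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    body = {"name": name, "scopes": scopes, "expiresAt": expires}
    return urllib.request.Request(
        f"{api_root}/register-tokens",
        data=json.dumps(body).encode(),
        method="POST",
        headers={"Authorization": f"Bearer {admin_key}",
                 "Content-Type": "application/json"},
    )


def exchange_token_request(api_root, token, name):
    """Client side: build the one-time exchange of the token for a permanent key."""
    body = {"token": token, "name": name}
    return urllib.request.Request(
        f"{api_root}/keys/register",
        data=json.dumps(body).encode(),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Send either request with `urllib.request.urlopen(req)` and store the returned key as a secret; the exchange request carries no Authorization header because the token itself is the credential.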