Real-World Scenarios

    What You Can Build With the Cloud World Model

    Concrete workflows showing how Canvas Cloud AI learners and AI agents use the simulator today, from hands-on practice to RL training, chaos experiments, and cost optimization.

    Which AI Agents Benefit Most

    The simulator is purpose-built as an environment for AI agents that need to learn, plan, or reason about cloud infrastructure without touching real resources.

    Autoscaling RL Agents (Best fit)
    Agents that learn when to scale compute up or down by training on thousands of simulated traffic episodes. The Gym-compatible step/reset loop maps directly to standard RL frameworks like Stable-Baselines3 or RLlib.

    FinOps & Cost Optimization Agents (Best fit)
    Agents that reason over multi-cloud pricing, utilization data, and SLA constraints to recommend or automatically apply the cheapest viable architecture. The multi-cloud explore API returns scored strategies with full cost breakdowns.

    Resilience & Chaos Agents (Best fit)
    Agents that systematically probe an architecture for single points of failure by iterating over failure scenarios and measuring resilience scores. Useful for red-team automation and compliance validation pipelines.

    Capacity Planning Agents (Best fit)
    Agents that ingest traffic forecasts (from monitoring, analytics, or demand models) and validate whether current infrastructure will hold. They can also tune autoscaling thresholds automatically based on predicted headroom.

    Multi-Cloud Routing Agents (Good fit)
    Agents that continuously evaluate provider cost and latency signals to decide how to split traffic across AWS, GCP, Azure, and DigitalOcean. The simulator lets them test routing strategies without real provider commitments.

    LLM Infrastructure Advisors (Good fit)
    LLM-based agents that interpret simulation metrics, event logs, and AI coaching output to produce natural-language recommendations for human operators. The AI insights and optimization endpoints provide the structured context they need.

    Use Case 1

    Canvas Cloud AI Learners

    Put your Canvas Cloud AI lessons into practice by building and stress-testing real architectures in a safe, cost-free simulator.

    Without this platform

    Spinning up real AWS, GCP, Azure, or OCI resources every time you want to experiment — accumulating cloud bills, waiting for provisioning, and risking misconfigured infrastructure that leaks cost or breaks unexpectedly.

    With this platform

    An instant, zero-cost sandbox that mirrors real provider behavior. Drag resources onto the canvas, inject traffic and failures, and watch live metrics respond — all without a cloud account or a dollar of spend.

    Zero cloud spend — practice unlimited architectures for free

    Key Endpoints

    POST /api/simulations
    GET /api/simulations/{id}
    POST /api/simulations/{id}/step
    POST /api/simulations/{id}/inject-traffic
    POST /api/simulations/{id}/inject-failure
    GET /api/simulations/{id}/metrics

    Quick Start

    # BASE_URL is a placeholder for your API origin, which this page does not specify.
    # URLs are quoted so the shell does not treat <id> as a redirection.

    # 1. Create a practice simulation (no API key needed for the demo)
    curl -X POST "$BASE_URL/api/simulations" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "my-first-arch",
        "provider": "aws",
        "resources": [
          { "type": "ec2", "name": "web-server", "config": { "instanceType": "t3.medium" } },
          { "type": "rds", "name": "database", "config": { "instanceType": "db.t3.micro" } }
        ]
      }'

    # 2. Run a simulation step and observe metrics
    curl -X POST "$BASE_URL/api/simulations/<id>/step" \
      -H "Content-Type: application/json" \
      -d '{ "trafficRPS": 500 }'
    # Returns: { metrics: { cpu, latency, errorRate, cost }, events: [] }

    # 3. Inject a failure to test resilience
    curl -X POST "$BASE_URL/api/simulations/<id>/inject-failure" \
      -H "Content-Type: application/json" \
      -d '{ "type": "az_outage", "targetResourceId": "<resourceId>" }'
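
The same three calls can be scripted. The following is a minimal Python sketch, not an official client: `BASE_URL`, the `send` transport convention, and the response fields `id` and `resources[].id` are assumptions made for illustration, since this page does not document the origin or the response shape of `POST /api/simulations`.

```python
import json
import urllib.request
from typing import Callable, Optional

# Assumption: the page does not state the API origin; replace with your own.
BASE_URL = "http://localhost:3000"

# A transport takes (method, path, body) and returns the decoded JSON reply.
Send = Callable[[str, str, Optional[dict]], dict]


def http_send(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request with the standard library and decode the reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def practice_session(send: Send, traffic_rps: int = 500) -> dict:
    """Create a simulation, run one step, inject an AZ outage, read metrics."""
    sim = send("POST", "/api/simulations", {
        "name": "my-first-arch",
        "provider": "aws",
        "resources": [
            {"type": "ec2", "name": "web-server",
             "config": {"instanceType": "t3.medium"}},
            {"type": "rds", "name": "database",
             "config": {"instanceType": "db.t3.micro"}},
        ],
    })
    sim_id = sim["id"]                  # assumed response field
    target = sim["resources"][0]["id"]  # assumed response field

    step = send("POST", f"/api/simulations/{sim_id}/step",
                {"trafficRPS": traffic_rps})
    send("POST", f"/api/simulations/{sim_id}/inject-failure",
         {"type": "az_outage", "targetResourceId": target})
    after = send("GET", f"/api/simulations/{sim_id}/metrics", None)

    # Metrics from the healthy step, then whatever the metrics endpoint returns.
    return {"before": step["metrics"], "after": after}
```

Calling `practice_session(http_send)` would run against a live server. Passing the transport as an argument keeps the workflow testable offline with a stub.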

    Use Case 2

    RL Agent Training

    Train a reinforcement learning agent to optimize cloud autoscaling without real infrastructure costs or risk.

    Without this platform

    Months of production data collection, thousands of dollars in cloud spend per training run, and risk of degrading real user traffic during exploration.

    With this platform

    Compress months of production traffic patterns into minutes of safe simulation. No AWS bill, no production risk, no waiting for real scaling events to occur.

    Zero AWS bill — train for hours, not months

    Key Endpoints

    POST /api/simulations
    POST /api/keys
    POST /api/rl/environments
    POST /api/rl/environments/{id}/step
    POST /api/rl/environments/{id}/reset
    GET /api/rl/environments/{id}/observation

    Quick Start

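    # BASE_URL is a placeholder for your API origin, which this page does not specify.
    # <key> is an API key (see POST /api/keys above).

    # 1. Create a simulation
    curl -X POST "$BASE_URL/api/simulations" \
      -H "Content-Type: application/json" \
      -d '{"name":"autoscale-lab","provider":"aws","resources":[...]}'

    # 2. Create an RL environment
    curl -X POST "$BASE_URL/api/rl/environments" \
      -H "Authorization: Bearer <key>" \
      -H "Content-Type: application/json" \
      -d '{"simulationId":"<id>","maxSteps":1000}'

    # 3. Training loop: repeat this call once per agent action
    curl -X POST "$BASE_URL/api/rl/environments/<envId>/step" \
      -H "Authorization: Bearer <key>" \
      -H "Content-Type: application/json" \
      -d '{"action":"scale_up","targetResourceId":"<resourceId>"}'
    # Returns: { observation, reward, done, info }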
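
To show how the step/reset endpoints might sit behind a Gym-style interface, here is a small Python sketch. It is an illustration under stated assumptions: the class name `CloudScalingEnv`, the `send` transport, the `scale_down` and `no_op` actions, and the `observation` field in the reset reply are all hypothetical. Only `scale_up` and the `{ observation, reward, done, info }` step reply come from this page.

```python
from typing import Any, Callable, Optional, Tuple

# A transport takes (method, path, body) and returns the decoded JSON reply.
Send = Callable[[str, str, Optional[dict]], dict]


class CloudScalingEnv:
    """Minimal Gym-style wrapper around the RL environment endpoints."""

    # Only "scale_up" appears on this page; the other two are hypothetical.
    ACTIONS = ("scale_up", "scale_down", "no_op")

    def __init__(self, send: Send, env_id: str, resource_id: str) -> None:
        self._send = send
        self.env_id = env_id
        self.resource_id = resource_id

    def reset(self) -> Any:
        """Start a new episode and return the first observation."""
        reply = self._send(
            "POST", f"/api/rl/environments/{self.env_id}/reset", None)
        return reply["observation"]  # assumed reply field

    def step(self, action_index: int) -> Tuple[Any, float, bool, dict]:
        """Apply one discrete action and return (obs, reward, done, info)."""
        body = {
            "action": self.ACTIONS[action_index],
            "targetResourceId": self.resource_id,
        }
        reply = self._send(
            "POST", f"/api/rl/environments/{self.env_id}/step", body)
        return (reply["observation"], float(reply["reward"]),
                bool(reply["done"]), reply.get("info", {}))


def run_episode(env: CloudScalingEnv,
                policy: Callable[[Any], int],
                max_steps: int = 1000) -> float:
    """Roll out one episode with the given policy; return the total reward."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        observation, reward, done, _info = env.step(policy(observation))
        total_reward += reward
        if done:
            break
    return total_reward
```

The wrapper returns the four-value `(observation, reward, done, info)` tuple that mirrors the step reply. Frameworks built on Gymnasium generally expect a five-value step result with separate `terminated` and `truncated` flags, so a thin adapter would likely be needed before plugging this into one of them.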

    Use Case 3

    Chaos Engineering & Resilience Testing

    Inject failures — AZ outages, DB crashes, network partitions — to find architectural weak points before production.

    Without this platform

    Expensive game days with real production risk, manual failure simulation, and no repeatable way to measure resilience scores across architecture changes.

    With this platform

    Inject any failure type into a virtual architecture in seconds. Get a quantified resilience score, a ranked list of vulnerabilities, and specific remediation recommendations — all without touching production.

    No production incidents — find weak points safely

    Key Endpoints

    GET /api/chaos/scenarios
    POST /api/chaos/run
    GET /api/chaos/jobs/{jobId}
    GET /api/chaos/jobs/{jobId}/results
    POST /api/chaos/batch
    GET /api/chaos/batch/{batchId}/results

    Quick Start

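    # BASE_URL is a placeholder for your API origin, which this page does not specify.

    # 1. Browse built-in failure scenarios
    curl "$BASE_URL/api/chaos/scenarios"

    # 2. Run a chaos test (AZ outage scenario)
    curl -X POST "$BASE_URL/api/chaos/run" \
      -H "Authorization: Bearer <key>" \
      -H "Content-Type: application/json" \
      -d '{"simulationId":"<id>","scenarioId":"az_outage","duration":300}'
    # Returns: { job: { id, status } }

    # 3. Fetch results once the job finishes (check status with GET /api/chaos/jobs/<jobId>)
    curl "$BASE_URL/api/chaos/jobs/<jobId>/results" \
      -H "Authorization: Bearer <key>"
    # Returns: resilienceScore, vulnerabilities[], recommendations[]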
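
Because the chaos run is asynchronous, a client usually polls the job before fetching results. The sketch below shows one way to do that in Python. The status strings `"completed"` and `"failed"`, the `{ job: { status } }` shape of the job lookup reply, and the `send` transport are assumptions; the page only documents that shape for the reply to `POST /api/chaos/run`.

```python
import time
from typing import Callable, Optional

# A transport takes (method, path, body) and returns the decoded JSON reply.
Send = Callable[[str, str, Optional[dict]], dict]


def wait_for_chaos_results(
    send: Send,
    job_id: str,
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll the job until it finishes, then return the results payload."""
    deadline = clock() + timeout_seconds
    while True:
        reply = send("GET", f"/api/chaos/jobs/{job_id}", None)
        status = reply["job"]["status"]  # assumed reply shape

        if status == "completed":        # hypothetical status value
            return send("GET", f"/api/chaos/jobs/{job_id}/results", None)
        if status == "failed":           # hypothetical status value
            raise RuntimeError(f"chaos job {job_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"chaos job {job_id} still {status!r}")

        # Wait before asking again so the API is not hammered.
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting; in normal use the defaults are enough.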

    Use Case 4

    Multi-Cloud Cost Optimization

    Compare AWS, GCP, Azure, and DigitalOcean strategies to find the cheapest architecture that meets your SLAs.

    Without this platform

    Running parallel production workloads on multiple providers for weeks, or relying on rough estimates that miss provider-specific pricing nuances and latency trade-offs.

    With this platform

    Evaluate every provider combination and traffic-split ratio in minutes. Explorations typically uncover 20–40% cost savings, and each strategy comes with detailed cost, latency, and vendor lock-in scores.

    Typically 20–40% cost savings vs. single-provider

    Key Endpoints

    POST /api/multi-cloud/explore
    GET /api/multi-cloud/jobs/{jobId}
    GET /api/multi-cloud/jobs/{jobId}/results

    Quick Start

    # BASE_URL is a placeholder for your API origin, which this page does not specify.

    # 1. Start a multi-cloud exploration job
    curl -X POST "$BASE_URL/api/multi-cloud/explore" \
      -H "Authorization: Bearer <key>" \
      -H "Content-Type: application/json" \
      -d '{
        "simulationId": "<id>",
        "workloadProfile": {
          "computeInstances": 8,
          "trafficRPS": 5000,
          "latencyRequirementMs": 100
        },
        "optimizationWeights": { "cost": 0.5, "latency": 0.3, "vendorLockIn": 0.2 }
      }'

    # 2. Get ranked strategies
    curl "$BASE_URL/api/multi-cloud/jobs/<jobId>/results" \
      -H "Authorization: Bearer <key>"
    # Returns: rankedStrategies[], comparisonReport, estimatedSavings
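
To build intuition for what `optimizationWeights` could mean, the following Python sketch ranks a few made-up strategies with a weighted sum over min-max normalized metrics. This is an illustrative scoring scheme, not the platform's documented formula: the function names, the `penalty` field, and the sample numbers are all hypothetical.

```python
from typing import Dict, List

# Metric names match the optimizationWeights keys in the request above.
METRICS = ("cost", "latency", "vendorLockIn")


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights so they sum to 1."""
    total = sum(weights[m] for m in METRICS)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {m: weights[m] / total for m in METRICS}


def rank_strategies(strategies: List[dict],
                    weights: Dict[str, float]) -> List[dict]:
    """Return strategies sorted best-first by weighted penalty (lower wins)."""
    if not strategies:
        return []
    w = normalize_weights(weights)

    # Record the min and max of each metric for min-max normalization.
    spans = {}
    for m in METRICS:
        values = [s[m] for s in strategies]
        spans[m] = (min(values), max(values))

    ranked = []
    for s in strategies:
        penalty = 0.0
        for m in METRICS:
            low, high = spans[m]
            # 0.0 is the best observed value for the metric, 1.0 the worst.
            scaled = 0.0 if high == low else (s[m] - low) / (high - low)
            penalty += w[m] * scaled
        ranked.append({**s, "penalty": penalty})
    return sorted(ranked, key=lambda s: s["penalty"])


# Made-up candidates: monthly cost, latency in ms, lock-in score in [0, 1].
candidates = [
    {"name": "A", "cost": 1000, "latency": 80, "vendorLockIn": 0.9},
    {"name": "B", "cost": 700, "latency": 95, "vendorLockIn": 0.4},
    {"name": "C", "cost": 850, "latency": 60, "vendorLockIn": 0.6},
]
ranking = rank_strategies(
    candidates, {"cost": 0.5, "latency": 0.3, "vendorLockIn": 0.2})
```

With these sample numbers the cheapest strategy, B, ranks first even though it has the worst latency, because cost carries the largest weight. Shifting weight toward latency would favor C instead.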

    Use Case 5

    Predictive Scaling Validation

    Validate autoscaling thresholds against traffic forecasts before deploying changes to production.

    Without this platform

    Discovering under-provisioning during a real launch or promotion, scrambling to scale reactively, and absorbing the revenue impact of a degraded user experience.

    With this platform

    Run what-if scenarios against any traffic shape in seconds. Get specific SLA violation windows and recommended threshold values before a single line of config changes in production.

    Catch capacity gaps before launch day, not after

    Key Endpoints

    POST /api/predictions/validate
    POST /api/predictions/optimize-thresholds
    GET /api/predictions/jobs/{jobId}
    GET /api/predictions/jobs/{jobId}/results

    Quick Start

    # BASE_URL is a placeholder for your API origin, which this page does not specify.

    # 1. Validate infrastructure against a traffic forecast
    curl -X POST "$BASE_URL/api/predictions/validate" \
      -H "Authorization: Bearer <key>" \
      -H "Content-Type: application/json" \
      -d '{
        "simulationId": "<id>",
        "trafficForecast": {
          "peakRPS": 12000,
          "rampDurationSeconds": 300,
          "sustainDurationSeconds": 3600
        },
        "optimizeThresholds": true
      }'

    # 2. Get results: SLA violations and recommended thresholds
    curl "$BASE_URL/api/predictions/jobs/<jobId>/results" \
      -H "Authorization: Bearer <key>"
    # Returns: bottlenecks[], slaViolations[], recommendedThresholds
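
The `trafficForecast` fields describe a traffic shape, and a local sketch can make that shape concrete. The Python below assumes a linear ramp from zero to `peakRPS` followed by a flat sustain phase, and compares it with a fixed capacity to find violation windows. Both the linear-ramp reading and the fixed `capacity_rps` figure are assumptions for illustration; the platform's own validation logic is not documented on this page.

```python
from typing import Dict, List, Tuple


def rps_at(t: int, forecast: Dict[str, int]) -> float:
    """Requests per second at second t, assuming a linear ramp then a plateau."""
    peak = forecast["peakRPS"]
    ramp = forecast["rampDurationSeconds"]
    sustain = forecast["sustainDurationSeconds"]
    if t < 0 or t > ramp + sustain:
        return 0.0
    if t < ramp:
        return peak * t / ramp
    return float(peak)


def violation_windows(forecast: Dict[str, int],
                      capacity_rps: float) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) second ranges where demand exceeds capacity."""
    total = forecast["rampDurationSeconds"] + forecast["sustainDurationSeconds"]
    windows: List[Tuple[int, int]] = []
    start = None
    for t in range(total + 1):
        over = rps_at(t, forecast) > capacity_rps
        if over and start is None:
            start = t                       # a violation window opens
        elif not over and start is not None:
            windows.append((start, t - 1))  # the window closed one second ago
            start = None
    if start is not None:
        windows.append((start, total))      # still over capacity at the end
    return windows


forecast = {
    "peakRPS": 12000,
    "rampDurationSeconds": 300,
    "sustainDurationSeconds": 3600,
}
# Hypothetical fixed capacity of 8000 RPS with no autoscaling.
windows = violation_windows(forecast, capacity_rps=8000)
```

Under these assumptions demand first exceeds 8000 RPS two thirds of the way through the ramp and stays above it for the rest of the forecast, which is the kind of gap the validation job is meant to surface before launch.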

    Ready to Dive In?

    Get an API key, explore the interactive workspace, or browse the full OpenAPI reference — all free to start.