From Multi-Cloud GPUs to Edge
and Intelligent Tooling
GPU scarcity forces workloads across AWS, GCP, Azure, and bare-metal clusters
Latency-sensitive AI runs at the edge — retail, automotive, IoT, mobile
AI-native IDEs, MCP servers, and agents embedded in the SDLC
Model versioning, drift detection, cost optimization — at every layer
→ The gap between "we trained a model" and "it runs reliably everywhere" is a platform engineering problem.
ML teams request GPU clusters, model endpoints, and edge deployments through a unified API — no tickets, no waiting
Opinionated but flexible pipelines: from training → evaluation → registry → deployment → monitoring
Same deployment model whether targeting a cloud GPU cluster, an edge node, or a developer's laptop
Kueue for queuing and quotas. NVIDIA GPU Operator for driver lifecycle. Gang scheduling for distributed training across nodes.
Abstract GPU types (A100, H100, TPU) behind a unified API. Platform decides placement based on cost, availability, and SLAs.
Spot/preemptible instances for training. Reserved capacity for inference. Automatic failover across providers when preempted.
KubeFlow Training Operator, Ray on K8s, PyTorch Elastic — all managed as CRDs with auto-scaling and checkpointing
OCI-based model storage. Version control, lineage tracking, evaluation gates before promotion to production
Argo Workflows or Tekton for CI/CD. Automated retraining triggers. Blue/green and canary model deployments
GPU utilization, training loss curves, inference latency, model drift — unified dashboards with Prometheus + Grafana
Model cards, audit trails, data lineage. Policy-as-code for model deployment approvals via OPA/Gatekeeper
MLflow or Weights & Biases integrated into the platform. Auto-logged metrics, hyperparams, and artifacts
Single-digit ms latency. Data sovereignty compliance. Offline capability. Reduced cloud costs for high-volume inference.
Automatic quantization (INT8/INT4). ONNX Runtime or TensorRT for hardware-specific optimization. OTA updates via the platform.
Telemetry flows back to central observability. Model performance compared edge vs. cloud. Automatic promotion or rollback.
Not every inference request is equal. The routing engine dynamically selects the optimal execution target based on multiple dimensions:
Edge for <10ms, cloud for <200ms, batch for async
Small models → edge. Large LLMs → cloud GPU. Specialized → TPU
Budget-aware routing. Spot instances for non-critical. Reserved for SLA-bound
Current utilization across targets. Queue depth. Auto-spillover to cloud
Model Context Protocol (MCP) servers provide the coordination layer — a standardized interface between AI models, services, and developer tools.
Every platform capability exposed as an MCP tool: deploy models, query metrics, trigger retraining, manage edge nodes
AI agents discover platform tools via MCP. "Deploy vision-v3 to all edge nodes in US-South" becomes a natural language command
MCP servers coordinate between models, handle context passing, manage conversation state across distributed inference endpoints
Via CLI, IDE, or MCP-connected AI agent. Platform selects optimal GPU cluster automatically.
OCI registry stores the model. Automatic quantization for edge targets. Evaluation gates passed.
Routing policy determines targets. Canary rollout at edge. Cloud endpoints for fallback.
78% of requests served from edge at 4ms. Cloud handles complex cases. 43% cost reduction.
Self-service everything. No tickets for GPUs, no manual edge deployments, no bespoke monitoring setup. Platform handles it.
Same abstractions from laptop to cloud to edge. Dev/staging/prod parity. Reproducible training runs. Consistent model serving.
MCP servers in the IDE. AI agents that know the platform. Natural language deploys. Context-aware code assistance.
The metric that matters: Time from idea → production
Default to edge. Fall back to cloud. Not the other way around. This inverts the traditional cloud-first architecture for AI workloads with strict latency requirements.
Replace static load balancers with AI-aware routers that understand model capabilities, latency budgets, cost constraints, and real-time system load.
MCP servers become the standard coordination layer. Models discover tools, services discover models, and agents orchestrate complex multi-model workflows.
Treat the internal AI platform like a product: user research, feedback loops, versioned APIs, deprecation policies, and measured developer satisfaction.
Build AI platforms with multi-cloud and edge as first-class targets. Same abstractions everywhere.
Push inference to the edge by default. Cloud becomes the fallback, not the primary.
Use MCP servers to bridge AI models, platform services, and developer tools into a unified workflow.
Without a proper platform, AI ops collapses under operational complexity. Invest in golden paths and self-service.
Platform Engineering for Modern AI
Arc of AI Conference 2026 · Austin, TX