Arc of AI 2026  ·  AI New and Noteworthy

Platform Engineering
for Modern AI

From Multi-Cloud GPUs to Edge
and Intelligent Tooling

Kubernetes · MLOps · Edge Inference · MCP Servers · GPU Orchestration
Adao Oliveira Junior

Speaker

Adao Oliveira Junior

AI / Cloud Solutions Architect · Author · Speaker
CKA · CKAD · KCNA · KCSA · Terraform

The Challenge

AI Has Escaped the Data Center

☁️

Multi-Cloud Training

GPU scarcity forces workloads across AWS, GCP, Azure, and bare-metal clusters

Edge Inference

Latency-sensitive AI runs at the edge — retail, automotive, IoT, mobile

🔧

Developer Tooling

AI-native IDEs, MCP servers, and agents embedded in the SDLC

📊

Operational Complexity

Model versioning, drift detection, cost optimization — at every layer

→ The gap between "we trained a model" and "it runs reliably everywhere" is a platform engineering problem.

Foundation

Why Platform Engineering for AI?

🏗️  Self-Service Infrastructure

ML teams request GPU clusters, model endpoints, and edge deployments through a unified API — no tickets, no waiting

🛤️  Golden Paths for AI Workloads

Opinionated but flexible pipelines: from training → evaluation → registry → deployment → monitoring

🔄  Consistent Abstractions

Same deployment model whether targeting a cloud GPU cluster, an edge node, or a developer's laptop

# Kubeflow Training Operator: distributed fine-tune
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetune-v3
  namespace: ml-platform
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.01-py3
              resources:
                limits:
                  nvidia.com/gpu: "8"
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-a100-80gb
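
One way to picture the "unified API" idea: a small self-service request is expanded by the platform into the PyTorchJob manifest above. The sketch below is only an illustration under that assumption. `FineTuneRequest` and `render_pytorch_job` are hypothetical names, not part of Kubeflow or of any platform described in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneRequest:
    """Hypothetical self-service request; field names are illustrative."""
    name: str
    image: str
    gpus: int
    accelerator: str  # value for the GKE accelerator node label
    namespace: str = "ml-platform"


def render_pytorch_job(req: FineTuneRequest) -> dict:
    """Expand a small request into a PyTorchJob manifest (as a plain dict)."""
    if req.gpus < 1:
        raise ValueError("gpus must be at least 1")
    pod_spec = {
        "containers": [{
            "name": "pytorch",
            "image": req.image,
            "resources": {"limits": {"nvidia.com/gpu": str(req.gpus)}},
        }],
        "nodeSelector": {"cloud.google.com/gke-accelerator": req.accelerator},
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": req.name, "namespace": req.namespace},
        "spec": {"pytorchReplicaSpecs": {"Master": {
            "replicas": 1,
            "restartPolicy": "OnFailure",
            "template": {"spec": pod_spec},
        }}},
    }


job = render_pytorch_job(FineTuneRequest(
    name="llm-finetune-v3",
    image="nvcr.io/nvidia/pytorch:24.01-py3",
    gpus=8,
    accelerator="nvidia-a100-80gb",
))
```

The ML team only supplies the four fields in the request; everything else is a platform default, which is what makes the path "golden".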

The Big Picture

Reference Architecture

DEVELOPER EXPERIENCE: AI-Native IDE · MCP Clients · Platform CLI · Dashboards
▼
EDGE EXECUTION: Edge Nodes · Model Runtime · Local Cache · Telemetry
▼
ORCHESTRATION & ROUTING: MCP Servers · Workload Router · Model Registry · Policy Engine
▼
AI/ML PLATFORM: Training Operators · Experiment Tracking · Feature Store · MLOps Pipelines
▼
MULTI-CLOUD GPU INFRA: AWS (A100) · GCP (TPU v5) · Azure (H100) · Bare Metal
Detailed reference architecture
From multi-cloud GPU training to edge inference and intelligent tooling

Developers, agents & tooling
  IDE / AI coding assistant: VS Code + Copilot / Cursor · IntelliJ + AI plugin
  Internal dev portal: Service catalog · Self-service provisioning
  CI/CD pipelines: Build → Test → Scan → Deploy
  AI agents / MCP clients: Developer assistant · Automation agent · Ops assistant
  Unified DX for apps, models, and platform ops

AI platform control plane (K8s)
  Kubernetes platform core: Multi-cluster orchestration · Scheduling & resource mgmt · Policy enforcement
  MCP server / coordination (key): Exposes models, tools, services · Standard interface for agents
  Platform APIs / service catalog: Model & tool endpoints · Deployment templates
  Secrets / identity / access: Vault / KMS · IAM / RBAC
  Policy & governance: Guardrails · Quotas · Compliance
  Standardizes access, deployment, policy, and coordination across clouds and edge

Multi-cloud GPU training & model engineering
  Cloud A · AWS: GPU training cluster · Distributed training · Fine-tuning pipelines
  Cloud B · GCP: Specialized accelerators · Batch inference / eval · Experimentation env
  Data & model lifecycle: Data pipelines · Model registry · Artifact store
  Train where the right compute exists

Distributed inference runtime
  Workload-aware routing (intelligent): Latency · Cost · Data locality · Capability
  Central / regional: K8s inference services · LLM / model serving · Autoscaling + gateway
  Edge inference: Edge K8s / lightweight RT · Factory / retail / branch · Offline / low-latency exec
  Cloud for scale, edge for speed and locality

Observability, governance & continuous feedback (signals feed continuous improvement)
  Metrics, logs, traces: Prometheus / Grafana · OpenTelemetry
  Model telemetry: Latency / token usage · Accuracy / drift
  Cost & capacity: GPU utilization · Spend optimization
  Security & audit: Access logs · Compliance trails
  Feedback loops: Quality signals · Retraining triggers

Legend: Control / data flow · Telemetry / feedback · MCP coordination · GPU / accelerator · Cloud provider · Edge / on-prem

Layer 1

Multi-Cloud GPU Infrastructure

Kubernetes as the Control Plane

Kueue for queuing and quotas. NVIDIA GPU Operator for driver lifecycle. Gang scheduling for distributed training across nodes.

Cross-Cloud Resource Pooling

Abstract accelerator types (A100, H100, TPU) behind a unified API. Platform decides placement based on cost, availability, and SLAs.

Cost-Aware Scheduling

Spot/preemptible instances for training. Reserved capacity for inference. Automatic failover across providers when preempted.

# Kueue ClusterQueue: multi-cloud GPU pool
# Assumes ResourceFlavor objects named aws-a100 and gcp-h100 already exist.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-pool
spec:
  namespaceSelector: {}  # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: aws-a100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 64
        - name: gcp-h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 32
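
To make "platform decides placement based on cost, availability, and SLAs" concrete, here is a minimal sketch of a cost-aware, all-or-nothing placement decision over the two flavors in the queue above. This is not Kueue's own admission algorithm (Kueue tries flavors in the order they are listed). The `Flavor` class, the `place` function, the in-use counts, and the prices are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flavor:
    name: str
    quota: int                 # nominalQuota from the ClusterQueue
    in_use: int                # GPUs already admitted (assumed tracked elsewhere)
    cost_per_gpu_hour: float   # made-up price, for illustration only


def place(gpus_needed: int, flavors: list[Flavor]) -> Optional[str]:
    """Pick the cheapest flavor that can hold the whole job.

    All-or-nothing: a job is never split across flavors, mirroring the
    gang-scheduling requirement of distributed training.
    """
    candidates = [f for f in flavors if f.quota - f.in_use >= gpus_needed]
    if not candidates:
        return None  # nothing fits; the job stays queued
    best = min(candidates, key=lambda f: f.cost_per_gpu_hour)
    best.in_use += gpus_needed
    return best.name


pool = [
    Flavor("aws-a100", quota=64, in_use=48, cost_per_gpu_hour=3.0),
    Flavor("gcp-h100", quota=32, in_use=0, cost_per_gpu_hour=5.0),
]
first = place(16, pool)   # fits the cheaper aws-a100 flavor
second = place(16, pool)  # aws-a100 is now full, so this goes to gcp-h100
third = place(64, pool)   # no flavor has 64 free GPUs
```

The second job shows the cross-cloud spillover described above: once the cheaper pool is exhausted, the job lands on the other provider instead of waiting.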

Layer 2

AI Ops & MLOps on Kubernetes

🏋️

Training Operators

Kubeflow Training Operator, Ray on K8s, PyTorch Elastic — all managed as CRDs with auto-scaling and checkpointing

📦

Model Registry

OCI-based model storage. Version control, lineage tracking, evaluation gates before promotion to production

🔄

MLOps Pipelines

Argo Workflows or Tekton for CI/CD. Automated retraining triggers. Blue/green and canary model deployments

📈

Observability

GPU utilization, training loss curves, inference latency, model drift — unified dashboards with Prometheus + Grafana

🔒

Governance

Model cards, audit trails, data lineage. Policy-as-code for model deployment approvals via OPA/Gatekeeper

🧪

Experiment Tracking

MLflow or Weights & Biases integrated into the platform. Auto-logged metrics, hyperparams, and artifacts
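
The "evaluation gates before promotion" idea from the Model Registry card can be sketched as a simple comparison between a candidate and the current production model. The metric names and thresholds below are assumptions chosen for illustration; a real gate would read them from the experiment tracker and from policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    accuracy: float
    p95_latency_ms: float


def passes_gate(candidate: EvalReport, production: EvalReport,
                max_accuracy_drop: float = 0.005,
                max_latency_ms: float = 200.0) -> tuple[bool, list[str]]:
    """Return (ok, reasons). Thresholds are illustrative defaults."""
    reasons = []
    if candidate.accuracy < production.accuracy - max_accuracy_drop:
        reasons.append("accuracy regressed beyond tolerance")
    if candidate.p95_latency_ms > max_latency_ms:
        reasons.append("p95 latency over budget")
    return (not reasons, reasons)


production = EvalReport(accuracy=0.940, p95_latency_ms=85.0)
good_ok, good_reasons = passes_gate(EvalReport(0.942, 89.0), production)
bad_ok, bad_reasons = passes_gate(EvalReport(0.900, 250.0), production)
```

Returning the reasons alongside the verdict gives the audit trail mentioned under Governance something concrete to record.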

Layer 3

Edge-First Inference

⚡ Why Edge?

Single-digit ms latency. Data sovereignty compliance. Offline capability. Reduced cloud costs for high-volume inference.

📦 Model Lifecycle at the Edge

Automatic quantization (INT8/INT4). ONNX Runtime or TensorRT for hardware-specific optimization. OTA updates via the platform.

📡 Edge ↔ Cloud Sync

Telemetry flows back to central observability. Model performance compared edge vs. cloud. Automatic promotion or rollback.
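
A minimal sketch of the "automatic promotion or rollback" decision, assuming the platform collects a per-batch quality score from both the edge and the cloud deployment of the same model. The function name, the tolerance, and the minimum sample count are hypothetical.

```python
from statistics import mean


def rollout_decision(edge_scores: list[float], cloud_scores: list[float],
                     tolerance: float = 0.02, min_samples: int = 5) -> str:
    """Return "promote", "rollback", or "hold" for an edge rollout.

    Rolls back when the edge model trails the cloud baseline by more
    than `tolerance`; holds until both sides have enough samples.
    """
    if len(edge_scores) < min_samples or len(cloud_scores) < min_samples:
        return "hold"
    gap = mean(cloud_scores) - mean(edge_scores)
    return "rollback" if gap > tolerance else "promote"


cloud = [0.94, 0.94, 0.95, 0.94, 0.94]
healthy = rollout_decision([0.93, 0.94, 0.92, 0.93, 0.94], cloud)
degraded = rollout_decision([0.85, 0.86, 0.84, 0.85, 0.86], cloud)
too_early = rollout_decision([0.93, 0.94], cloud)
```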

Train (Cloud) → Quantize → Push to Edge → Serve
# KServe: INT8 edge inference service
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vision-v2-edge
  namespace: edge-us-south
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: onnx
      runtime: kserve-onnxruntime
      storageUri: s3://model-registry/vision-v2.1-int8
      resources:
        limits:
          nvidia.com/gpu: "1"
    nodeSelector:
      node-role: edge-gpu
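
For readers unfamiliar with what the "Quantize" step does, the sketch below shows symmetric per-tensor INT8 quantization in plain Python. It illustrates the arithmetic only. In practice the quantization tooling that ships with ONNX Runtime or TensorRT would be used, and those tools apply calibration and per-channel scales that this sketch leaves out.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: w is approximated by q * scale,
    with q an integer in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map the integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.6, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding bounds the per-weight error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now takes one byte instead of four, at the price of an error of at most `scale / 2` per weight.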

Emerging Pattern

Workload-Aware Routing

Not every inference request is equal. The routing engine dynamically selects the optimal execution target based on multiple dimensions:

Latency

Edge for <10ms, cloud for <200ms, batch for async

Capability

Small models → edge. Large LLMs → cloud GPU. Specialized → TPU

Cost

Budget-aware routing. Spot instances for non-critical. Reserved for SLA-bound

Load

Current utilization across targets. Queue depth. Auto-spillover to cloud

# LiteLLM proxy: workload-aware routing
model_list:
  - model_name: vision-small
    litellm_params:
      model: openai/vision-v2-int8
      api_base: http://kserve.edge-us-south.svc
  - model_name: llm-large
    litellm_params:
      model: openai/llama-3.3-70b
      api_base: http://vllm.gpu-h100.svc

router_settings:
  routing_strategy: latency-based-routing
  num_retries: 3
  # Fallback targets must also be defined in model_list (entries not shown).
  fallbacks:
    - {"vision-small": ["vision-cloud-spot"]}
    - {"llm-large": ["llm-a100"]}
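
The four routing dimensions can be combined in one decision function: filter targets by capability, latency budget, and load, then choose the cheapest survivor. The sketch below is a hypothetical illustration of that idea, not LiteLLM's routing strategy. The `Target` fields and all numbers except the 4.2 ms and 89 ms latencies (taken from the example flow in this deck) are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    name: str
    typical_latency_ms: float
    max_model_params_b: float    # largest model (billions of params) it can serve
    cost_per_1k_requests: float
    utilization: float           # 0.0 to 1.0


def route(latency_budget_ms: float, model_params_b: float,
          targets: list[Target], max_utilization: float = 0.9) -> Optional[str]:
    """Filter by capability, latency, and load; then pick the cheapest target."""
    eligible = [
        t for t in targets
        if t.max_model_params_b >= model_params_b
        and t.typical_latency_ms <= latency_budget_ms
        and t.utilization < max_utilization
    ]
    if not eligible:
        return None  # caller decides: queue for batch, or reject
    return min(eligible, key=lambda t: t.cost_per_1k_requests).name


targets = [
    Target("edge-us-south", 4.2, 1.0, 0.02, 0.55),
    Target("cloud-gpu-h100", 89.0, 70.0, 0.40, 0.70),
]
small_fast = route(10, 0.3, targets)     # small model, tight budget
large_llm = route(200, 70, targets)      # too big for the edge
impossible = route(10, 70, targets)      # no target meets both constraints

# Spillover: the same small request when the edge node is saturated.
busy = [Target("edge-us-south", 4.2, 1.0, 0.02, 0.95), targets[1]]
spilled = route(200, 0.3, busy)
```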

The Glue

MCP-Based Coordination

Model Context Protocol (MCP) servers provide the coordination layer — a standardized interface between AI models, services, and developer tools.

🔗  Platform-as-MCP

Every platform capability exposed as an MCP tool: deploy models, query metrics, trigger retraining, manage edge nodes

🤖  Agent Orchestration

AI agents discover platform tools via MCP. "Deploy vision-v3 to all edge nodes in US-South" becomes a natural language command

🔄  Service Mesh for AI

MCP servers coordinate between models, handle context passing, manage conversation state across distributed inference endpoints

// MCP Server: Platform Tools
// Illustrative pseudocode: `MCPServer` and `platform` are placeholders, not a specific SDK.
const server = new MCPServer({
  name: "ai-platform",
  tools: [{
    name: "deploy_model",
    description: "Deploy a model to cloud or edge targets",
    parameters: {
      model: { type: "string" },
      target: { enum: ["cloud", "edge", "auto"] },
      routing: { type: "string" }
    },
    handler: async (params) => {
      const result = await platform.deploy(params);
      return result.status;
    }
  }]
});
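
On the wire, an agent invokes an MCP tool with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for the `deploy_model` tool and checks the arguments against the parameter shapes shown above. The validator is deliberately minimal and is not a full JSON Schema implementation. `validate_args` and `tools_call_request` are hypothetical helper names.

```python
import json

# Parameter shapes mirroring the deploy_model tool definition above.
DEPLOY_MODEL_SCHEMA = {
    "model": {"type": "string"},
    "target": {"enum": ["cloud", "edge", "auto"]},
    "routing": {"type": "string"},
}


def validate_args(args: dict, schema: dict) -> list[str]:
    """Check only the two constraint kinds used above (type string, enum)."""
    errors = []
    for key, value in args.items():
        rule = schema.get(key)
        if rule is None:
            errors.append(f"unknown parameter: {key}")
        elif rule.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{key} must be a string")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"{key} must be one of {rule['enum']}")
    return errors


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Build the JSON-RPC 2.0 message an MCP client sends to invoke a tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


args = {"model": "vision-v3", "target": "edge"}
errors = validate_args(args, DEPLOY_MODEL_SCHEMA)
message = tools_call_request(1, "deploy_model", args)
bad_errors = validate_args({"target": "moon"}, DEPLOY_MODEL_SCHEMA)
```

Validating before dispatch matters more for agent callers than for humans, since a natural-language command can be translated into arguments the platform never anticipated.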

Putting It Together

End-to-End Example Flow

example · platform-cli
$ platform model train --config finetune.yaml
✓ Job scheduled → aws-a100 (64 GPUs)
✓ Training complete — accuracy: 94.2%
 
$ platform model register vision-v3
✓ Registered → oci://registry/vision-v3
✓ Auto-quantized: INT8 (234MB)
 
$ platform deploy vision-v3 --target auto
✓ Cloud endpoint: live (us-east, eu-west)
✓ Edge rollout: 47/47 nodes (canary 100%)
 
$ platform routing status vision-v3
→ Edge: 78% of requests (avg 4.2ms)
→ Cloud: 22% of requests (avg 89ms)
✓ Cost savings: 43% vs cloud-only

① Developer triggers training

Via CLI, IDE, or MCP-connected AI agent. Platform selects optimal GPU cluster automatically.

② Model registered & optimized

OCI registry stores the model. Automatic quantization for edge targets. Evaluation gates passed.

③ Deployed to cloud + edge

Routing policy determines targets. Canary rollout at edge. Cloud endpoints for fallback.

④ Intelligent routing in action

78% of requests served from edge at 4ms. Cloud handles complex cases. 43% cost reduction.
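
The routing figures in the example output imply an overall latency that can be worked out as a request-weighted mean. The sketch below does that arithmetic with the example's own numbers; the function name is hypothetical. The 43% cost saving cannot be reproduced the same way, since the example does not show per-request prices.

```python
def blended_latency_ms(shares_and_latencies: list[tuple[float, float]]) -> float:
    """Request-weighted mean latency; traffic shares must sum to 1."""
    total_share = sum(share for share, _ in shares_and_latencies)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * latency for share, latency in shares_and_latencies)


# Figures from the example routing status output: (traffic share, avg latency in ms).
overall = blended_latency_ms([(0.78, 4.2), (0.22, 89.0)])
speedup_vs_cloud_only = 89.0 / overall
```

With these numbers the blended mean is about 22.9 ms, roughly a quarter of the 89 ms cloud-only average.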

The Goal

Developer Experience as the Product

🚀

Reduce Friction

Self-service everything. No tickets for GPUs, no manual edge deployments, no bespoke monitoring setup. Platform handles it.

📐

Standardize Environments

Same abstractions from laptop to cloud to edge. Dev/staging/prod parity. Reproducible training runs. Consistent model serving.

🧠

Embed Intelligence

MCP servers in the IDE. AI agents that know the platform. Natural language deploys. Context-aware code assistance.

The metric that matters: Time from idea → production

Without platform: weeks
With platform: hours

Looking Ahead

Emerging Patterns

Edge-First Inference

Default to edge. Fall back to cloud. Not the other way around. This inverts the traditional cloud-first architecture for AI workloads with strict latency requirements.

Workload-Aware Routing

Replace static load balancers with AI-aware routers that understand model capabilities, latency budgets, cost constraints, and real-time system load.

MCP as Service Mesh for AI

MCP servers become the standard coordination layer. Models discover tools, services discover models, and agents orchestrate complex multi-model workflows.

Platform as Product

Treat the internal AI platform like a product: user research, feedback loops, versioned APIs, deprecation policies, and measured developer satisfaction.

Summary

Key Takeaways

01

Design for Portability

Build AI platforms with multi-cloud and edge as first-class targets. Same abstractions everywhere.

02

Embrace Edge-First Inference

Push inference to the edge by default. Cloud becomes the fallback, not the primary.

03

MCP as the Coordination Layer

Use MCP servers to bridge AI models, platform services, and developer tools into a unified workflow.

04

Platform Engineering is the Foundation

Without a proper platform, AI ops collapses under operational complexity. Invest in golden paths and self-service.

Thank You

Questions & Discussion

Platform Engineering for Modern AI
Arc of AI Conference 2026 · Austin, TX

Slides & Reference Arch → arcofai.com