Arc of AI 2026 · AI New and Noteworthy

Platform Engineering for Modern AI

From Multi-Cloud GPUs to Edge and Intelligent Tooling

Kubernetes · MLOps · Edge Inference · MCP Servers · GPU Orchestration

Speaker

Adao Oliveira Junior

AI / Cloud Solutions Architect · Author · Speaker

adao.dev · LinkedIn · GitHub · Slides

CKA · CKAD · KCNA · KCSA · Terraform

The Challenge

AI Has Escaped the Data Center

☁️

Multi-Cloud Training

GPU scarcity forces workloads across AWS, GCP, Azure, and bare-metal clusters

⚡

Edge Inference

Latency-sensitive AI runs at the edge — retail, automotive, IoT, mobile

🔧

Developer Tooling

AI-native IDEs, MCP servers, and agents embedded in the SDLC

📊

Operational Complexity

Model versioning, drift detection, cost optimization — at every layer

→ The gap between "we trained a model" and "it runs reliably everywhere" is a platform engineering problem.

Foundation

Why Platform Engineering for AI?

🏗️ Self-Service Infrastructure

ML teams request GPU clusters, model endpoints, and edge deployments through a unified API — no tickets, no waiting

🛤️ Golden Paths for AI Workloads

Opinionated but flexible pipelines: from training → evaluation → registry → deployment → monitoring

🔄 Consistent Abstractions

Same deployment model whether targeting a cloud GPU cluster, an edge node, or a developer's laptop

# Kubeflow Training Operator: fine-tune on one 8-GPU node (add a Worker replica spec to span multiple nodes)
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetune-v3
  namespace: ml-platform
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: nvcr.io/nvidia/pytorch:24.01-py3
            resources:
              limits:
                nvidia.com/gpu: "8"
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-a100-80gb
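
The self-service idea above can be sketched as a small function that expands a short request into the full manifest, so ML teams never hand-write the YAML. This is a minimal illustration, not the platform's actual API: `TrainingRequest`, `render_pytorch_job`, and their field names are hypothetical, and the output simply mirrors the PyTorchJob shown on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRequest:
    """What an ML team submits. Every field name here is hypothetical."""
    name: str
    image: str
    gpus_per_node: int
    accelerator: str  # value for the GKE accelerator node label


def render_pytorch_job(req: TrainingRequest, namespace: str = "ml-platform") -> dict:
    """Expand a short request into a PyTorchJob manifest (as a plain dict)."""
    if req.gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    container = {
        "name": "pytorch",
        "image": req.image,
        "resources": {"limits": {"nvidia.com/gpu": str(req.gpus_per_node)}},
    }
    pod_spec = {
        "containers": [container],
        "nodeSelector": {"cloud.google.com/gke-accelerator": req.accelerator},
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": req.name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": {
                    "replicas": 1,
                    "restartPolicy": "OnFailure",
                    "template": {"spec": pod_spec},
                }
            }
        },
    }


job = render_pytorch_job(
    TrainingRequest(
        name="llm-finetune-v3",
        image="nvcr.io/nvidia/pytorch:24.01-py3",
        gpus_per_node=8,
        accelerator="nvidia-a100-80gb",
    )
)
```

A real platform would validate the request against quotas and then submit the dict through the Kubernetes API; both steps are left out here.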
        

The Big Picture

Reference Architecture

DEVELOPER EXPERIENCE: AI-Native IDE · MCP Clients · Platform CLI · Dashboards

▼

EDGE EXECUTION: Edge Nodes · Model Runtime · Local Cache · Telemetry

▼

ORCHESTRATION & ROUTING: MCP Servers · Workload Router · Model Registry · Policy Engine

▼

AI/ML PLATFORM: Training Operators · Experiment Tracking · Feature Store · MLOps Pipelines

▼

MULTI-CLOUD GPU INFRA: AWS (A100) · GCP (TPU v5) · Azure (H100) · Bare Metal

Detailed reference architecture

Platform engineering for modern AI

From multi-cloud GPU training to edge inference and intelligent tooling

Developers, agents & tooling

IDE / AI coding assistant

VS Code + Copilot / Cursor

IntelliJ + AI plugin

Internal dev portal

Service catalog

Self-service provisioning

CI/CD pipelines

Build → Test → Scan → Deploy

AI agents / MCP clients

Developer assistant

Automation agent

Ops assistant

Unified DX for apps, models, and platform ops

AI platform control plane (K8s)

Kubernetes platform core

Multi-cluster orchestration

Scheduling & resource mgmt

Policy enforcement

MCP server / coordination (key)

Exposes models, tools, services

Standard interface for agents

Platform APIs / service catalog

Model & tool endpoints

Deployment templates

Secrets / identity / access

Vault / KMS · IAM / RBAC

Policy & governance

Guardrails · Quotas · Compliance

Standardizes access, deployment, policy, and coordination across clouds and edge

Multi-cloud GPU training & model engineering

Cloud A · AWS

GPU training cluster

Distributed training

Fine-tuning pipelines

Cloud B · GCP

Specialized accelerators

Batch inference / eval

Experimentation env

Data & model lifecycle

Data pipelines

Model registry

Artifact store

Train where the right compute exists

Distributed inference runtime

Workload-aware routing (intelligent)

Latency · Cost · Data locality · Capability

Central / regional

K8s inference services

LLM / model serving

Autoscaling + gateway

Edge inference

Edge K8s / lightweight RT

Factory / retail / branch

Offline / low-latency exec

Cloud for scale, edge for speed and locality

Observability, governance & continuous feedback — signals feed continuous improvement

Metrics, logs, traces

Prometheus / Grafana

OpenTelemetry

Model telemetry

Latency / token usage

Accuracy / drift

Cost & capacity

GPU utilization

Spend optimization

Security & audit

Access logs

Compliance trails

Feedback loops

Quality signals

Retraining triggers

Legend: control / data flow · telemetry / feedback · MCP coordination · GPU / accelerator · cloud provider · edge / on-prem

Layer 1

Multi-Cloud GPU Infrastructure

Kubernetes as the Control Plane

Kueue for queuing and quotas. NVIDIA GPU Operator for driver lifecycle. Gang scheduling for distributed training across nodes.

Cross-Cloud Resource Pooling

Abstract GPU types (A100, H100, TPU) behind a unified API. Platform decides placement based on cost, availability, and SLAs.

Cost-Aware Scheduling

Spot/preemptible instances for training. Reserved capacity for inference. Automatic failover across providers when preempted.

# Kueue ClusterQueue: multi-cloud GPU pool
# Each flavor name must match an existing ResourceFlavor object.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-pool
spec:
  namespaceSelector: {}  # admit workloads from all namespaces; omitted, none are admitted
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: aws-a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 64
    - name: gcp-h100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 32
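
The cost-aware placement described above can be illustrated with a few lines of Python. This is a sketch under stated assumptions: `GpuPool`, `place`, the `aws-a100-spot` pool, and all prices are made up for the example, and a real scheduler would also weigh SLAs and data locality.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuPool:
    name: str
    free_gpus: int
    usd_per_gpu_hour: float  # illustrative price, not a real quote
    preemptible: bool


def place(pools: list[GpuPool], gpus: int, allow_preemptible: bool) -> Optional[GpuPool]:
    """Return the cheapest pool that can hold the whole job, or None."""
    candidates = [
        p for p in pools
        if p.free_gpus >= gpus and (allow_preemptible or not p.preemptible)
    ]
    return min(candidates, key=lambda p: p.usd_per_gpu_hour, default=None)


pools = [
    GpuPool("aws-a100", free_gpus=64, usd_per_gpu_hour=4.0, preemptible=False),
    GpuPool("aws-a100-spot", free_gpus=16, usd_per_gpu_hour=1.5, preemptible=True),
    GpuPool("gcp-h100", free_gpus=32, usd_per_gpu_hour=9.0, preemptible=False),
]

training = place(pools, gpus=8, allow_preemptible=True)    # may land on spot capacity
inference = place(pools, gpus=8, allow_preemptible=False)  # reserved capacity only
too_big = place(pools, gpus=128, allow_preemptible=True)   # no pool fits
```

Training tolerates preemption, so it is allowed onto the cheaper spot pool, while inference is restricted to reserved capacity, matching the split on this slide.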
        

Layer 2

AI Ops & MLOps on Kubernetes

🏋️

Training Operators

Kubeflow Training Operator, Ray on K8s, PyTorch Elastic: all managed as CRDs with auto-scaling and checkpointing

📦

Model Registry

OCI-based model storage. Version control, lineage tracking, evaluation gates before promotion to production

🔄

MLOps Pipelines

Argo Workflows or Tekton for CI/CD. Automated retraining triggers. Blue/green and canary model deployments

📈

Observability

GPU utilization, training loss curves, inference latency, model drift — unified dashboards with Prometheus + Grafana

🔒

Governance

Model cards, audit trails, data lineage. Policy-as-code for model deployment approvals via OPA/Gatekeeper

🧪

Experiment Tracking

MLflow or Weights & Biases integrated into the platform. Auto-logged metrics, hyperparams, and artifacts
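
The "evaluation gates before promotion" mentioned under Model Registry could look roughly like the check below. It is a sketch, not a registry feature: the metric names, thresholds, and the `evaluation_gate` function are assumptions chosen for illustration.

```python
def evaluation_gate(
    candidate: dict[str, float],
    production: dict[str, float],
    min_accuracy: float = 0.90,
    max_latency_regression: float = 0.10,
) -> tuple[bool, list[str]]:
    """Decide whether a candidate model may be promoted.

    Returns (passed, reasons); reasons lists every failed check.
    """
    reasons: list[str] = []
    if candidate["accuracy"] < min_accuracy:
        reasons.append("accuracy below absolute floor")
    if candidate["accuracy"] < production["accuracy"]:
        reasons.append("accuracy below current production model")
    latency_limit = production["p95_latency_ms"] * (1 + max_latency_regression)
    if candidate["p95_latency_ms"] > latency_limit:
        reasons.append("p95 latency regressed beyond the allowed margin")
    return (not reasons, reasons)


production = {"accuracy": 0.931, "p95_latency_ms": 40.0}

good_ok, good_reasons = evaluation_gate(
    {"accuracy": 0.942, "p95_latency_ms": 42.0}, production
)
bad_ok, bad_reasons = evaluation_gate(
    {"accuracy": 0.880, "p95_latency_ms": 50.0}, production
)
```

Returning every failed reason, rather than stopping at the first, gives the audit trail something concrete to record when a promotion is blocked.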

Layer 3

Edge-First Inference

⚡ Why Edge?

Single-digit ms latency. Data sovereignty compliance. Offline capability. Reduced cloud costs for high-volume inference.

📦 Model Lifecycle at the Edge

Automatic quantization (INT8/INT4). ONNX Runtime or TensorRT for hardware-specific optimization. OTA updates via the platform.

📡 Edge ↔ Cloud Sync

Telemetry flows back to central observability. Model performance compared edge vs. cloud. Automatic promotion or rollback.

Train (Cloud)

→

Quantize

→

Push to Edge

→

Serve
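
To make the "Quantize" step concrete, here is symmetric per-tensor INT8 quantization in plain Python. It is a teaching sketch only: production pipelines would use the quantization tooling that ships with ONNX Runtime or TensorRT, and the weights below are invented.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: w is approximated by q * scale,
    with q an integer in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map the integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.50, -1.27, 0.003, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale; the small weight `0.003` rounds to zero, which is the kind of loss an evaluation gate should catch before rollout.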

# KServe: INT8 edge inference service
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vision-v2-edge
  namespace: edge-us-south
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: onnx
      # Assumes a custom ServingRuntime with this name is installed;
      # stock KServe serves ONNX through kserve-tritonserver.
      runtime: kserve-onnxruntime
      storageUri: s3://model-registry/vision-v2.1-int8
      resources:
        limits:
          nvidia.com/gpu: "1"
    nodeSelector:
      node-role: edge-gpu
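
The "automatic promotion or rollback" from Edge ↔ Cloud Sync could be driven by a rule like the one below, which compares canary telemetry from the edge with the cloud baseline. The function name, metric names, and thresholds are all assumptions for illustration.

```python
def rollout_decision(
    edge: dict[str, float],
    cloud: dict[str, float],
    max_accuracy_drop: float = 0.01,
    max_error_rate: float = 0.02,
    min_samples: int = 1000,
) -> str:
    """Return 'promote', 'hold', or 'rollback' for an edge canary."""
    # Hard failures trigger a rollback even with few samples.
    if edge["error_rate"] > max_error_rate:
        return "rollback"
    if cloud["accuracy"] - edge["accuracy"] > max_accuracy_drop:
        return "rollback"
    # Healthy so far, but not enough traffic to be confident.
    if edge["samples"] < min_samples:
        return "hold"
    return "promote"


cloud_baseline = {"accuracy": 0.942}

healthy = rollout_decision(
    {"accuracy": 0.938, "error_rate": 0.004, "samples": 5000}, cloud_baseline
)
degraded = rollout_decision(
    {"accuracy": 0.900, "error_rate": 0.004, "samples": 5000}, cloud_baseline
)
early = rollout_decision(
    {"accuracy": 0.938, "error_rate": 0.004, "samples": 100}, cloud_baseline
)
```

A "promote" result would raise `canaryTrafficPercent` in the manifest above; a "rollback" would set it back to zero.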
        

Emerging Pattern

Workload-Aware Routing

Not every inference request is equal. The routing engine dynamically selects the optimal execution target based on multiple dimensions:

Latency

Edge for <10ms, cloud for <200ms, batch for async

Capability

Small models → edge. Large LLMs → cloud GPU. Specialized → TPU

Cost

Budget-aware routing. Spot instances for non-critical. Reserved for SLA-bound

Load

Current utilization across targets. Queue depth. Auto-spillover to cloud
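
The four dimensions above can be combined into one routing decision: filter targets by latency, capability, and load, then pick the cheapest survivor. This is a minimal sketch; `Target`, `Request`, `route`, and every number in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    name: str
    typical_latency_ms: float
    max_model_params_b: float   # largest model it can host, in billions of parameters
    usd_per_1k_requests: float  # illustrative cost
    utilization: float          # current load, 0.0 to 1.0


@dataclass(frozen=True)
class Request:
    latency_budget_ms: float
    model_params_b: float


def route(request: Request, targets: list[Target],
          max_utilization: float = 0.85) -> Optional[Target]:
    """Pick the cheapest target that meets latency, capability, and load limits."""
    eligible = [
        t for t in targets
        if t.typical_latency_ms <= request.latency_budget_ms   # latency
        and t.max_model_params_b >= request.model_params_b     # capability
        and t.utilization < max_utilization                    # load
    ]
    return min(eligible, key=lambda t: t.usd_per_1k_requests, default=None)  # cost


edge = Target("edge", 4.2, 1.0, 0.05, utilization=0.40)
cloud = Target("cloud-gpu", 89.0, 70.0, 0.60, utilization=0.50)
busy_edge = Target("edge", 4.2, 1.0, 0.05, utilization=0.95)

vision = route(Request(latency_budget_ms=10, model_params_b=0.1), [edge, cloud])
llm = route(Request(latency_budget_ms=200, model_params_b=70), [edge, cloud])
spillover = route(Request(latency_budget_ms=200, model_params_b=0.1), [busy_edge, cloud])
```

When the edge is saturated and the latency budget allows it, the request spills over to the cloud; if no target qualifies, `route` returns `None` and the caller can queue the request for batch processing.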

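# LiteLLM proxy: workload-aware routing
router_settings:
  # Latency-based routing chooses among deployments that share a model_name.
  routing_strategy: latency-based-routing
  num_retries: 3
  # Fallbacks belong under router_settings (or litellm_settings), not at the
  # top level. The fallback model names must also be defined in model_list;
  # those entries are omitted here.
  fallbacks:
    - {"vision-small": ["vision-cloud-spot"]}
    - {"llm-large": ["llm-a100"]}

model_list:
  # The openai/ prefix assumes each endpoint exposes an OpenAI-compatible API.
  - model_name: vision-small
    litellm_params:
      model: openai/vision-v2-int8
      api_base: http://kserve.edge-us-south.svc

  - model_name: llm-large
    litellm_params:
      model: openai/llama-3.3-70b
      api_base: http://vllm.gpu-h100.svc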
        

The Glue

MCP-Based Coordination

Model Context Protocol (MCP) servers provide the coordination layer — a standardized interface between AI models, services, and developer tools.

🔗 Platform-as-MCP

Every platform capability exposed as an MCP tool: deploy models, query metrics, trigger retraining, manage edge nodes

🤖 Agent Orchestration

AI agents discover platform tools via MCP. "Deploy vision-v3 to all edge nodes in US-South" becomes a natural language command

🔄 Service Mesh for AI

MCP servers coordinate between models, handle context passing, manage conversation state across distributed inference endpoints

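// MCP server: platform tools (official TypeScript SDK)
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { platform } from "./platform.js"; // hypothetical internal platform client

const server = new McpServer({ name: "ai-platform", version: "1.0.0" });

server.tool(
  "deploy_model",
  "Deploy a model to cloud or edge targets",
  {
    model: z.string(),
    target: z.enum(["cloud", "edge", "auto"]),
    routing: z.string(),
  },
  async (params) => {
    const result = await platform.deploy(params);
    // Tool results are returned to the client as content blocks.
    return { content: [{ type: "text", text: String(result.status) }] };
  }
);

await server.connect(new StdioServerTransport());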
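
Under the SDK, an agent discovers and invokes platform tools through two JSON-RPC methods, `tools/list` and `tools/call`. The Python sketch below imitates that exchange with the standard library only, to show the message shapes. It is not an MCP implementation: it skips the initialization handshake and transports, and `deploy_model` is a stand-in for the real platform call.

```python
import json

TOOLS = {
    "deploy_model": {
        "description": "Deploy a model to cloud or edge targets",
        "inputSchema": {
            "type": "object",
            "properties": {
                "model": {"type": "string"},
                "target": {"enum": ["cloud", "edge", "auto"]},
            },
            "required": ["model", "target"],
        },
    }
}


def deploy_model(model: str, target: str) -> str:
    """Stand-in for the real platform call (hypothetical)."""
    return f"deployed {model} to {target}"


HANDLERS = {"deploy_model": deploy_model}


def _error(req_id, code: int, message: str) -> str:
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "error": {"code": code, "message": message}})


def handle(message: str) -> str:
    """Answer one JSON-RPC request with a JSON-RPC response."""
    req = json.loads(message)
    if req["method"] == "tools/list":
        result = {"tools": [{"name": n, **spec} for n, spec in TOOLS.items()]}
    elif req["method"] == "tools/call":
        params = req["params"]
        if params["name"] not in HANDLERS:
            return _error(req["id"], -32602, f"Unknown tool: {params['name']}")
        text = HANDLERS[params["name"]](**params["arguments"])
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return _error(req["id"], -32601, "Method not found")
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})


listed = json.loads(handle(json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})))
called = json.loads(handle(json.dumps(
    {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
     "params": {"name": "deploy_model",
                "arguments": {"model": "vision-v3", "target": "edge"}}})))
```

The agent first reads `inputSchema` from the `tools/list` reply, which is how a natural-language request such as "deploy vision-v3 to the edge" gets turned into well-formed arguments.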
        

Putting It Together

End-to-End Example Flow

example · platform-cli

$ platform model train --config finetune.yaml

✓ Job scheduled → aws-a100 (64 GPUs)

✓ Training complete — accuracy: 94.2%

$ platform model register vision-v3

✓ Registered → oci://registry/vision-v3

✓ Auto-quantized: INT8 (234MB)

$ platform deploy vision-v3 --target auto

✓ Cloud endpoint: live (us-east, eu-west)

✓ Edge rollout: 47/47 nodes (canary 100%)

$ platform routing status vision-v3

→ Edge: 78% of requests (avg 4.2ms)

→ Cloud: 22% of requests (avg 89ms)

✓ Cost savings: 43% vs cloud-only

① Developer triggers training

Via CLI, IDE, or MCP-connected AI agent. Platform selects optimal GPU cluster automatically.

② Model registered & optimized

OCI registry stores the model. Automatic quantization for edge targets. Evaluation gates passed.

③ Deployed to cloud + edge

Routing policy determines targets. Canary rollout at edge. Cloud endpoints for fallback.

④ Intelligent routing in action

78% of requests served from edge at about 4 ms on average. Cloud handles complex cases. 43% cost reduction versus cloud-only.
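
The routing figures in this example imply two further numbers worth working out. The calculation below uses only the values from the CLI output, plus one labelled assumption: that a cloud-served request costs the same as it would in a cloud-only setup, so the whole saving comes from the edge share.

```python
edge_share, cloud_share = 0.78, 0.22
edge_ms, cloud_ms = 4.2, 89.0

# Traffic-weighted average latency across both targets (about 22.9 ms).
blended_ms = edge_share * edge_ms + cloud_share * cloud_ms

# Let r be the cost of an edge request relative to a cloud request.
# Total cost relative to cloud-only is cloud_share + edge_share * r,
# so savings = edge_share * (1 - r). Solve for r (about 0.45).
savings = 0.43
implied_edge_cost_ratio = 1 - savings / edge_share
```

Under that assumption, an edge request would cost a little under half of a cloud request; the slide's 43% figure is the author's example, not a derived result.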

The Goal

Developer Experience as the Product

🚀

Reduce Friction

Self-service everything. No tickets for GPUs, no manual edge deployments, no bespoke monitoring setup. Platform handles it.

📐

Standardize Environments

Same abstractions from laptop to cloud to edge. Dev/staging/prod parity. Reproducible training runs. Consistent model serving.

🧠

Embed Intelligence

MCP servers in the IDE. AI agents that know the platform. Natural language deploys. Context-aware code assistance.

The metric that matters: Time from idea → production

Weeks

Without platform

→

Hours

With platform

Looking Ahead

Emerging Patterns

Edge-First Inference

Default to edge. Fall back to cloud. Not the other way around. This inverts the traditional cloud-first architecture for AI workloads with strict latency requirements.

Workload-Aware Routing

Replace static load balancers with AI-aware routers that understand model capabilities, latency budgets, cost constraints, and real-time system load.

MCP as Service Mesh for AI

MCP servers become the standard coordination layer. Models discover tools, services discover models, and agents orchestrate complex multi-model workflows.

Platform as Product

Treat the internal AI platform like a product: user research, feedback loops, versioned APIs, deprecation policies, and measured developer satisfaction.
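
The edge-first pattern listed above ("default to edge, fall back to cloud") can be sketched with a latency budget on the edge call. This is an illustration using only the standard library; `infer_edge_first` and the stub endpoints are hypothetical, and a production client would use its HTTP library's own timeouts instead of a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def infer_edge_first(payload, edge_call, cloud_call, edge_timeout_s=0.010):
    """Try the edge within a latency budget; use the cloud on timeout or error."""
    future = _pool.submit(edge_call, payload)
    try:
        return "edge", future.result(timeout=edge_timeout_s)
    except Exception:  # timed out, or the edge call itself failed
        future.cancel()  # has no effect if the call is already running
        return "cloud", cloud_call(payload)


# Stub endpoints standing in for real inference services.
def fast_edge(payload):
    return f"edge:{payload}"


def slow_edge(payload):
    time.sleep(0.5)
    return f"edge:{payload}"


def broken_edge(payload):
    raise ConnectionError("edge node unreachable")


def cloud(payload):
    return f"cloud:{payload}"


fast = infer_edge_first("img-1", fast_edge, cloud, edge_timeout_s=1.0)
slow = infer_edge_first("img-2", slow_edge, cloud, edge_timeout_s=0.05)
broken = infer_edge_first("img-3", broken_edge, cloud, edge_timeout_s=1.0)
```

The returned label records which target actually served the request, which is the raw signal behind a routing split like the 78% / 22% shown in the end-to-end example.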

Summary

Key Takeaways

01

Design for Portability

Build AI platforms with multi-cloud and edge as first-class targets. Same abstractions everywhere.

02

Embrace Edge-First Inference

Push inference to the edge by default. Cloud becomes the fallback, not the primary.

03

MCP as the Coordination Layer

Use MCP servers to bridge AI models, platform services, and developer tools into a unified workflow.

04

Platform Engineering is the Foundation

Without a proper platform, AI ops collapses under operational complexity. Invest in golden paths and self-service.

Thank You

Questions & Discussion

Platform Engineering for Modern AI
Arc of AI Conference 2026 · Austin, TX

Slides & Reference Arch → arcofai.com
