Module 0: Introduction & Architecture

Overview

Before diving into hands-on exercises, let’s understand how the Self-Healing Platform works. This module covers the architecture, key components, and the hybrid approach that combines deterministic automation with AI-driven analysis.

What is the Self-Healing Platform?

The OpenShift AI Ops Self-Healing Platform is a production-ready AIOps solution that:

  • πŸ€– Hybrid Approach: Combines deterministic automation (rule-based) with AI-driven analysis (ML models)

  • πŸ”§ Self-Healing: Automatically detects and remediates common cluster issues

  • πŸ“Š ML-Powered: Uses Isolation Forest and LSTM models for anomaly detection

  • πŸš€ OpenShift Native: Built on Red Hat OpenShift AI, KServe, Tekton, ArgoCD

  • πŸ’¬ Natural Language Interface: Chat with your cluster via OpenShift Lightspeed

The Hybrid Architecture

The platform uses a hybrid deterministic-AI self-healing approach:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Self-Healing Platform                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Coordination Engine (Go REST API)                         β”‚
β”‚  β”œβ”€ Conflict Resolution                                    β”‚
β”‚  β”œβ”€ Priority Management                                    β”‚
β”‚  └─ Action Orchestration                                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Deterministic Layer    β”‚    AI-Driven Layer               β”‚
β”‚  β”œβ”€ Machine Config      β”‚    β”œβ”€ Anomaly Detection          β”‚
β”‚  β”‚  Operator            β”‚    β”‚  (Isolation Forest, LSTM)   β”‚
β”‚  β”œβ”€ Known Remediation   β”‚    β”œβ”€ Root Cause Analysis        β”‚
β”‚  β”‚  Procedures          β”‚    β”œβ”€ Predictive Analytics       β”‚
β”‚  └─ Rule-Based Actions  β”‚    └─ Adaptive Responses         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Shared Observability Layer (Prometheus, AlertManager)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why Hybrid?

Approach                Strengths                        Weaknesses
Deterministic Only      Fast, predictable, well-tested   Can’t handle novel issues
AI Only                 Adapts to new patterns           May be slow, unpredictable
Hybrid (Our Approach)   Best of both worlds              More complex to build

The Coordination Engine decides which layer handles each issue:

  • Known issues β†’ Deterministic layer (fast, reliable)

  • Novel/complex issues β†’ AI layer (adaptive, intelligent)

  • Conflicts β†’ Coordination Engine resolves priority
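
The routing rules above can be sketched in a few lines. This is an illustrative sketch only: the issue names, remediation names, and the 0.8 threshold are hypothetical, not the Coordination Engine's actual Go code.

```python
# Hypothetical routing table: known issue -> deterministic remediation.
KNOWN_ISSUES = {
    "pod-crash-loop": "restart-pod",
    "node-not-ready": "drain-and-reboot",
    "pvc-full": "expand-volume",
}

def route(issue_type: str, anomaly_score: float) -> str:
    """Send known issues to the deterministic layer, novel ones to AI."""
    if issue_type in KNOWN_ISSUES:
        return f"deterministic:{KNOWN_ISSUES[issue_type]}"  # fast, reliable
    if anomaly_score >= 0.8:                                # assumed threshold
        return "ai:root-cause-analysis"                     # adaptive layer
    return "observe"                                        # keep watching

print(route("pod-crash-loop", 0.3))   # deterministic:restart-pod
print(route("unknown-latency", 0.9))  # ai:root-cause-analysis
```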

Platform Components

1. OpenShift Lightspeed

What it is: AI assistant integrated into the OpenShift console

What it does:

  • Answers questions about your cluster in natural language

  • Triggers ML-powered analysis and predictions

  • Initiates automated remediation with your approval

Access: Click the sparkle icon ✨ in the OpenShift console header

2. MCP Server (Go)

What it is: Model Context Protocol server connecting Lightspeed to platform tools

What it does:

  • Exposes 7 tools that Lightspeed can call

  • Translates natural language intents to API calls

  • Returns structured responses for Lightspeed to format

MCP Tools Available:

1. get-cluster-health    β†’ Check namespace status, pods, models
2. list-pods             β†’ Query pods with filtering
3. analyze-anomalies     β†’ Call ML models for detection
4. trigger-remediation   β†’ Apply fixes via Coordination Engine
5. list-incidents        β†’ Query historical incidents
6. get-model-status      β†’ Check KServe InferenceService health
7. list-models           β†’ List available ML models
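
MCP is built on JSON-RPC 2.0, so a tool invocation from Lightspeed arrives as a `tools/call` request. A sketch of what such a request might look like for `get-cluster-health` (the `arguments` shape is an assumption; the real tool schema lives in the MCP Server):

```python
import json

# Minimal MCP "tools/call" request as Lightspeed might send it.
# JSON-RPC 2.0 envelope is standard MCP; the argument name is hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get-cluster-health",
        "arguments": {"namespace": "self-healing-platform"},
    },
}
print(json.dumps(request, indent=2))
```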

3. Coordination Engine (Go)

What it is: REST API service orchestrating remediation workflows

What it does:

  • Receives anomaly detection requests

  • Queries Prometheus for current metrics

  • Calls KServe ML models for predictions

  • Applies remediation actions to the cluster

  • Tracks incidents and resolution history

API Endpoints:

GET  /health           β†’ Health check
POST /api/v1/detect    β†’ Detect anomalies
POST /api/v1/remediate β†’ Apply remediation
GET  /api/v1/incidents β†’ List incidents
GET  /metrics          β†’ Prometheus metrics (scrape endpoint)
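
From inside the cluster you could call the detect endpoint like this. The service name comes from the workshop environment; the request body is a hypothetical sketch, since the real schema isn't shown in this module:

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://coordination-engine:8080"  # in-cluster service name

# Hypothetical request body -- check the API docs for the real schema.
payload = {"namespace": "self-healing-platform", "metric": "cpu_usage"}

def detect_anomalies(payload: dict) -> dict:
    """POST the payload to /api/v1/detect and return the parsed response."""
    req = Request(
        f"{BASE_URL}/api/v1/detect",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)

# Inside the cluster: detect_anomalies(payload)
print(json.dumps(payload))
```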

4. KServe ML Models

What they are: Machine learning models served via KServe InferenceService

Models deployed:

Model                  Purpose                               Notebook
anomaly-detector       Detects unusual patterns in metrics   01-isolation-forest-implementation.ipynb
predictive-analytics   Forecasts future resource usage       05-predictive-analytics-kserve.ipynb

Key Point: These models are trained on YOUR cluster’s data. Predictions reflect your workload patterns, not generic benchmarks.
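
Assuming the models expose KServe's v1 inference protocol, a prediction request wraps inputs in an `instances` list. The feature layout below (cpu, memory, network) is an assumption; the real shape depends on how the training notebook defined the features:

```python
import json

# KServe v1 protocol: POST {"instances": [...]} to the :predict endpoint.
# The three-feature row here is hypothetical.
instances = [[0.72, 0.55, 120.0]]  # e.g. cpu, memory, network (assumed)
body = {"instances": instances}
url = "http://anomaly-detector-predictor:8080/v1/models/anomaly-detector:predict"
print("POST", url)
print(json.dumps(body))
```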

5. Jupyter Workbench

What it is: JupyterLab environment for ML development

What it contains:

  • Training notebooks for all ML models

  • Utility functions for Prometheus queries

  • Integration code for Coordination Engine

  • Persistent storage for models and data

Access: Via OpenShift AI console or direct URL

6. ArgoCD (GitOps)

What it is: GitOps deployment via Validated Patterns framework

What it does:

  • Manages all platform components declaratively

  • Syncs from Git repository automatically

  • Provides drift detection and rollback

Data Flow

Let’s trace what happens when you ask Lightspeed a question:

You: "Predict CPU at 3 PM"
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Lightspeed    β”‚ ◄─── Natural language understanding
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ MCP protocol
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   MCP Server    β”‚ ◄─── Routes to correct tool
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ REST API
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Coordination   β”‚
β”‚    Engine       β”‚ ◄─── Orchestrates the workflow
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
      β”Œβ”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β–Ό            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Prometheus β”‚ β”‚ KServe β”‚ ◄─── Get metrics, call ML model
β”‚            β”‚ β”‚ Model  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
       β”‚           β”‚
      β””β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Lightspeed    β”‚ ◄─── Formats response
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
You: "CPU: 74.5% at 3 PM (85% confidence)"

Why Cluster-Specific Training Matters

The ML models are trained specifically for YOUR cluster:

  • Your 3 PM is different from our 3 PM: A retail cluster spikes at lunch, financial services at market open

  • Workload fingerprints are unique: Your apps' memory patterns, CPU bursts, scaling behaviors

  • Anomaly baselines are contextual: "Normal" for your cluster might be "abnormal" for another

Training Pipeline

Prometheus Metrics ──► Jupyter Notebook ──► Trained Model ──► KServe
     (your data)        (automated via       (your patterns)   (inference)
                         Tekton pipeline)

Models retrain automatically (weekly by default) to capture evolving patterns.
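
The kind of training the anomaly-detector notebook performs can be sketched as follows. This is a sketch under stated assumptions: real training pulls metrics from Prometheus, while here we generate synthetic data so the example is self-contained; the feature count and `contamination` value are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for Prometheus metrics: mostly "normal" rows
# plus a few injected spikes to act as anomalies.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.5, scale=0.05, size=(500, 3))
spikes = rng.normal(loc=0.95, scale=0.02, size=(5, 3))
X = np.vstack([normal, spikes])

# Isolation Forest isolates outliers with short random-split paths.
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)  # 1 = normal, -1 = anomaly
print("anomalies flagged:", int((labels == -1).sum()))
```

In the platform this fitted model would then be exported and served through KServe rather than used in-process.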

Workshop Environment

Your pre-deployed environment includes:

Component              Service                               Namespace
OpenShift Lightspeed   AI assistant in console               openshift-lightspeed
MCP Server             mcp-server:8080                       self-healing-platform
Coordination Engine    coordination-engine:8080              self-healing-platform
Anomaly Detector       anomaly-detector-predictor:8080       self-healing-platform
Predictive Analytics   predictive-analytics-predictor:8080   self-healing-platform
Jupyter Workbench      self-healing-workbench                self-healing-platform

Verify Your Environment

Let’s verify everything is running:

OpenShift Console
  1. Open the OpenShift Console

  2. Navigate to Workloads β†’ Pods

  3. Select namespace: self-healing-platform

  4. Verify these pods are Running:

    • coordination-engine-*

    • mcp-server-*

    • anomaly-detector-predictor-*

    • predictive-analytics-predictor-*

    • self-healing-workbench-0

CLI

oc get pods -n self-healing-platform

Expected output:

NAME                                         READY   STATUS    RESTARTS   AGE
coordination-engine-xxx                      1/1     Running   0          2h
mcp-server-xxx                               1/1     Running   0          2h
anomaly-detector-predictor-xxx               2/2     Running   0          2h
predictive-analytics-predictor-xxx           2/2     Running   0          2h
self-healing-workbench-0                     2/2     Running   0          2h
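
If you want to script this check, a small helper can parse the `oc get pods` output and flag anything that isn't Running. The sample below embeds a shortened copy of the expected output so it runs standalone; in practice you would pipe in the live command output:

```python
# Parse `oc get pods` output and list pods whose STATUS is not "Running".
SAMPLE = """\
NAME                                         READY   STATUS    RESTARTS   AGE
coordination-engine-xxx                      1/1     Running   0          2h
mcp-server-xxx                               1/1     Running   0          2h
anomaly-detector-predictor-xxx               2/2     Running   0          2h
"""

def not_running(output: str) -> list[str]:
    """Return names of pods whose STATUS column is not 'Running'."""
    rows = output.strip().splitlines()[1:]  # skip the header row
    return [r.split()[0] for r in rows if r.split()[2] != "Running"]

print(not_running(SAMPLE))  # [] means everything is healthy
```
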

Lightspeed
  1. Click the Lightspeed icon (✨) in the console header

  2. Type: What’s the health of the self-healing-platform namespace?

  3. If you get a response with component status, everything is working!

Key Concepts Summary

Concept                        Description
Hybrid Approach                Deterministic + AI layers working together
Coordination Engine            Central orchestrator that routes issues to the right layer
MCP (Model Context Protocol)   Standard protocol connecting LLMs to external tools
KServe                         Kubernetes-native model serving infrastructure
Cluster-Specific Training      ML models learn YOUR cluster’s patterns

Next Steps

Now that you understand the architecture, let’s get hands-on!

  • Train anomaly detection models manually

  • Explore different data sources (Prometheus vs synthetic)

  • Understand the automated training pipeline

  • Monitor training runs and validate models


Architecture Reference: For detailed architectural decisions, see the ADRs (Architectural Decision Records) in the platform repository.