Module 0: Introduction & Architecture

Overview

Before diving into hands-on exercises, let’s understand how the Self-Healing Platform works. This module covers the architecture, key components, and the hybrid approach that combines deterministic automation with AI-driven analysis.

What is the Self-Healing Platform?

The OpenShift AI Ops Self-Healing Platform is a production-ready AIOps solution that:

  • πŸ€– Hybrid Approach: Combines deterministic automation (rule-based) with AI-driven analysis (ML models)

  • πŸ”§ Self-Healing: Automatically detects and remediates common cluster issues

  • πŸ“Š ML-Powered: Uses Isolation Forest and LSTM models for anomaly detection

  • πŸš€ OpenShift Native: Built on Red Hat OpenShift AI, KServe, Tekton, ArgoCD

  • πŸ’¬ Natural Language Interface: Chat with your cluster via OpenShift Lightspeed

The Hybrid Architecture

The platform uses a hybrid deterministic-AI self-healing approach:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Self-Healing Platform                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Coordination Engine (Go REST API)                         β”‚
β”‚  β”œβ”€ Conflict Resolution                                    β”‚
β”‚  β”œβ”€ Priority Management                                    β”‚
β”‚  └─ Action Orchestration                                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Deterministic Layer    β”‚    AI-Driven Layer               β”‚
β”‚  β”œβ”€ Machine Config      β”‚    β”œβ”€ Anomaly Detection          β”‚
β”‚  β”‚  Operator            β”‚    β”‚  (Isolation Forest, LSTM)   β”‚
β”‚  β”œβ”€ Known Remediation   β”‚    β”œβ”€ Root Cause Analysis        β”‚
β”‚  β”‚  Procedures          β”‚    β”œβ”€ Predictive Analytics       β”‚
β”‚  └─ Rule-Based Actions  β”‚    └─ Adaptive Responses         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Shared Observability Layer (Prometheus, AlertManager)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why Hybrid?

Approach                Strengths                        Weaknesses
Deterministic Only      Fast, predictable, well-tested   Can’t handle novel issues
AI Only                 Adapts to new patterns           May be slow, unpredictable
Hybrid (Our Approach)   Best of both worlds              More complex to build

The Coordination Engine decides which layer handles each issue:

  • Known issues β†’ Deterministic layer (fast, reliable)

  • Novel/complex issues β†’ AI layer (adaptive, intelligent)

  • Conflicts β†’ Coordination Engine resolves priority
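
The routing rules above can be sketched in a few lines. This is an illustrative sketch only: the issue names, remediation names, and the 0.8 threshold are hypothetical, not the Coordination Engine's actual Go code.

```python
# Hypothetical routing table: known issue -> deterministic remediation.
KNOWN_ISSUES = {
    "pod-crash-loop": "restart-pod",
    "node-not-ready": "drain-and-reboot",
    "pvc-full": "expand-volume",
}

def route(issue_type: str, anomaly_score: float) -> str:
    """Send known issues to the deterministic layer, novel ones to AI."""
    if issue_type in KNOWN_ISSUES:
        return f"deterministic:{KNOWN_ISSUES[issue_type]}"  # fast, reliable
    if anomaly_score >= 0.8:                                # assumed threshold
        return "ai:root-cause-analysis"                     # adaptive layer
    return "observe"                                        # keep watching

print(route("pod-crash-loop", 0.3))   # deterministic:restart-pod
print(route("unknown-latency", 0.9))  # ai:root-cause-analysis
```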

Platform Components

1. OpenShift Lightspeed

What it is: AI assistant integrated into the OpenShift console

What it does:

  • Answers questions about your cluster in natural language

  • Triggers ML-powered analysis and predictions

  • Initiates automated remediation with your approval

Access: Click the sparkle icon ✨ in the OpenShift console header

2. MCP Server (Go)

What it is: Model Context Protocol server connecting Lightspeed to platform tools

What it does:

  • Exposes 7 tools that Lightspeed can call

  • Translates natural language intents to API calls

  • Returns structured responses for Lightspeed to format

MCP Tools Available:

1. get-cluster-health    β†’ Check namespace status, pods, models
2. list-pods             β†’ Query pods with filtering
3. analyze-anomalies     β†’ Call ML models for detection
4. trigger-remediation   β†’ Apply fixes via Coordination Engine
5. list-incidents        β†’ Query historical incidents
6. get-model-status      β†’ Check KServe InferenceService health
7. list-models           β†’ List available ML models
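
MCP is built on JSON-RPC 2.0, so a tool invocation from Lightspeed arrives as a `tools/call` request. A sketch of what such a request might look like for `get-cluster-health` (the `arguments` shape is an assumption; the real tool schema lives in the MCP Server):

```python
import json

# Minimal MCP "tools/call" request as Lightspeed might send it.
# JSON-RPC 2.0 envelope is standard MCP; the argument name is hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get-cluster-health",
        "arguments": {"namespace": "self-healing-platform"},
    },
}
print(json.dumps(request, indent=2))
```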

3. Coordination Engine (Go)

What it is: REST API service orchestrating remediation workflows

What it does:

  • Receives anomaly detection requests

  • Queries Prometheus for current metrics

  • Calls KServe ML models for predictions

  • Applies remediation actions to the cluster

  • Tracks incidents and resolution history

API Endpoints:

GET  /health           β†’ Health check
POST /api/v1/detect    β†’ Detect anomalies
POST /api/v1/remediate β†’ Apply remediation
GET  /api/v1/incidents β†’ List incidents
GET  /metrics          β†’ Prometheus metrics (scrape endpoint)
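
From inside the cluster you could call the detect endpoint like this. The service name comes from the workshop environment; the request body is a hypothetical sketch, since the real schema isn't shown in this module:

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://coordination-engine:8080"  # in-cluster service name

# Hypothetical request body -- check the API docs for the real schema.
payload = {"namespace": "self-healing-platform", "metric": "cpu_usage"}

def detect_anomalies(payload: dict) -> dict:
    """POST the payload to /api/v1/detect and return the parsed response."""
    req = Request(
        f"{BASE_URL}/api/v1/detect",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)

# Inside the cluster: detect_anomalies(payload)
print(json.dumps(payload))
```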

4. KServe ML Models

What they are: Machine learning models served via KServe InferenceService

Models deployed:

Model                  Purpose                               Notebook
anomaly-detector       Detects unusual patterns in metrics   01-isolation-forest-implementation.ipynb
predictive-analytics   Forecasts future resource usage       05-predictive-analytics-kserve.ipynb

Key Point: These models are trained on YOUR cluster’s data. Predictions reflect your workload patterns, not generic benchmarks.
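
Assuming the models expose KServe's v1 inference protocol, a prediction request wraps inputs in an `instances` list. The feature layout below (cpu, memory, network) is an assumption; the real shape depends on how the training notebook defined the features:

```python
import json

# KServe v1 protocol: POST {"instances": [...]} to the :predict endpoint.
# The three-feature row here is hypothetical.
instances = [[0.72, 0.55, 120.0]]  # e.g. cpu, memory, network (assumed)
body = {"instances": instances}
url = "http://anomaly-detector-predictor:8080/v1/models/anomaly-detector:predict"
print("POST", url)
print(json.dumps(body))
```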

5. Jupyter Workbench

What it is: JupyterLab environment for ML development

What it contains:

  • Training notebooks for all ML models

  • Utility functions for Prometheus queries

  • Integration code for Coordination Engine

  • Persistent storage for models and data

Access: Via OpenShift AI console or direct URL

6. ArgoCD (GitOps)

What it is: GitOps deployment via Validated Patterns framework

What it does:

  • Manages all platform components declaratively

  • Syncs from Git repository automatically

  • Provides drift detection and rollback

Data Flow

Let’s trace what happens when you ask Lightspeed a question:

You: "Predict CPU at 3 PM"
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Lightspeed    β”‚ ◄─── Natural language understanding
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ MCP protocol
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   MCP Server    β”‚ ◄─── Routes to correct tool
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ REST API
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Coordination   β”‚
β”‚    Engine       β”‚ ◄─── Orchestrates the workflow
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
      β”Œβ”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β–Ό            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Prometheus β”‚ β”‚ KServe β”‚ ◄─── Get metrics, call ML model
β”‚            β”‚ β”‚ Model  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
       β”‚           β”‚
      β””β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Lightspeed    β”‚ ◄─── Formats response
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
You: "CPU: 74.5% at 3 PM (85% confidence)"

Why Cluster-Specific Training Matters

The ML models are trained specifically for YOUR cluster:

  • Your 3 PM is different from our 3 PM: A retail cluster spikes at lunch, financial services at market open

  • Workload fingerprints are unique: Your apps' memory patterns, CPU bursts, scaling behaviors

  • Anomaly baselines are contextual: "Normal" for your cluster might be "abnormal" for another

Training Pipeline

Prometheus Metrics ──► Jupyter Notebook ──► Trained Model ──► KServe
     (your data)        (automated via       (your patterns)   (inference)
                         Tekton pipeline)

Models retrain automatically (weekly by default) to capture evolving patterns.
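
The kind of training the anomaly-detector notebook performs can be sketched as follows. This is a sketch under stated assumptions: real training pulls metrics from Prometheus, while here we generate synthetic data so the example is self-contained; the feature count and `contamination` value are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for Prometheus metrics: mostly "normal" rows
# plus a few injected spikes to act as anomalies.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.5, scale=0.05, size=(500, 3))
spikes = rng.normal(loc=0.95, scale=0.02, size=(5, 3))
X = np.vstack([normal, spikes])

# Isolation Forest isolates outliers with short random-split paths.
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)  # 1 = normal, -1 = anomaly
print("anomalies flagged:", int((labels == -1).sum()))
```

In the platform this fitted model would then be exported and served through KServe rather than used in-process.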

Workshop Environment

Your pre-deployed environment includes:

Component              Service                               Namespace
OpenShift Lightspeed   AI assistant in console               openshift-lightspeed
MCP Server             mcp-server:8080                       self-healing-platform
Coordination Engine    coordination-engine:8080              self-healing-platform
Anomaly Detector       anomaly-detector-predictor:8080       self-healing-platform
Predictive Analytics   predictive-analytics-predictor:8080   self-healing-platform
Jupyter Workbench      self-healing-workbench                self-healing-platform

Verify Your Environment

Let’s verify everything is running:

OpenShift Console
  1. Open the OpenShift Console

  2. Navigate to Workloads β†’ Pods

  3. Select namespace: self-healing-platform

  4. Verify these pods are Running:

    • coordination-engine-*

    • mcp-server-*

    • anomaly-detector-predictor-*

    • predictive-analytics-predictor-*

    • self-healing-workbench-0

CLI

oc get pods -n self-healing-platform

Expected output:

NAME                                         READY   STATUS    RESTARTS   AGE
coordination-engine-xxx                      1/1     Running   0          2h
mcp-server-xxx                               1/1     Running   0          2h
anomaly-detector-predictor-xxx               2/2     Running   0          2h
predictive-analytics-predictor-xxx           2/2     Running   0          2h
self-healing-workbench-0                     2/2     Running   0          2h
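
If you want to script this check, a small helper can parse the `oc get pods` output and flag anything that isn't Running. The sample below embeds a shortened copy of the expected output so it runs standalone; in practice you would pipe in the live command output:

```python
# Parse `oc get pods` output and list pods whose STATUS is not "Running".
SAMPLE = """\
NAME                                         READY   STATUS    RESTARTS   AGE
coordination-engine-xxx                      1/1     Running   0          2h
mcp-server-xxx                               1/1     Running   0          2h
anomaly-detector-predictor-xxx               2/2     Running   0          2h
"""

def not_running(output: str) -> list[str]:
    """Return names of pods whose STATUS column is not 'Running'."""
    rows = output.strip().splitlines()[1:]  # skip the header row
    return [r.split()[0] for r in rows if r.split()[2] != "Running"]

print(not_running(SAMPLE))  # [] means everything is healthy
```
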

Lightspeed
  1. Click the Lightspeed icon (✨) in the console header

  2. Type: What’s the health of the self-healing-platform namespace?

  3. If you get a response with component status, everything is working!

Key Concepts Summary

Concept                        Description
Hybrid Approach                Deterministic + AI layers working together
Coordination Engine            Central orchestrator that routes issues to the right layer
MCP (Model Context Protocol)   Standard protocol connecting LLMs to external tools
KServe                         Kubernetes-native model serving infrastructure
Cluster-Specific Training      ML models learn YOUR cluster’s patterns

Next Steps

Now that you understand the architecture, let’s get hands-on!

  • Train anomaly detection models manually

  • Explore different data sources (Prometheus vs synthetic)

  • Understand the automated training pipeline

  • Monitor training runs and validate models


Architecture Reference: For detailed architectural decisions, see the ADRs (Architectural Decision Records) in the platform repository.