Module 0: Introduction & Architecture
Overview
Before diving into hands-on exercises, let’s understand how the Self-Healing Platform works. This module covers the architecture, key components, and the hybrid approach that combines deterministic automation with AI-driven analysis.
What is the Self-Healing Platform?
The OpenShift AI Ops Self-Healing Platform is a production-ready AIOps solution that:
- Hybrid Approach: Combines deterministic automation (rule-based) with AI-driven analysis (ML models)
- Self-Healing: Automatically detects and remediates common cluster issues
- ML-Powered: Uses Isolation Forest and LSTM models for anomaly detection
- OpenShift Native: Built on Red Hat OpenShift AI, KServe, Tekton, ArgoCD
- Natural Language Interface: Chat with your cluster via OpenShift Lightspeed
The Hybrid Architecture
The platform uses a hybrid deterministic-AI self-healing approach:
```
┌──────────────────────────────────────────────────────────────┐
│                    Self-Healing Platform                     │
├──────────────────────────────────────────────────────────────┤
│            Coordination Engine (Go REST API)                 │
│   ├─ Conflict Resolution                                     │
│   ├─ Priority Management                                     │
│   └─ Action Orchestration                                    │
├──────────────────────────────────────────────────────────────┤
│ Deterministic Layer          │ AI-Driven Layer               │
│  ├─ Machine Config           │  ├─ Anomaly Detection         │
│  │   Operator                │  │   (Isolation Forest, LSTM) │
│  ├─ Known Remediation        │  ├─ Root Cause Analysis       │
│  │   Procedures              │  ├─ Predictive Analytics      │
│  └─ Rule-Based Actions       │  └─ Adaptive Responses        │
├──────────────────────────────────────────────────────────────┤
│   Shared Observability Layer (Prometheus, AlertManager)      │
└──────────────────────────────────────────────────────────────┘
```
Why Hybrid?
| Approach | Strengths | Weaknesses |
|---|---|---|
| Deterministic Only | Fast, predictable, well-tested | Can't handle novel issues |
| AI Only | Adapts to new patterns | May be slow, unpredictable |
| Hybrid (Our Approach) | Best of both worlds | More complex to build |
The Coordination Engine decides which layer handles each issue:
- Known issues → Deterministic layer (fast, reliable)
- Novel/complex issues → AI layer (adaptive, intelligent)
- Conflicts → Coordination Engine resolves priority
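The routing rules can be sketched as a small policy function. This is a hypothetical illustration only: the real Coordination Engine is a Go service with richer logic, and the issue names and confidence threshold here are assumptions.

```python
# Hypothetical sketch of the Coordination Engine's routing policy.
# Issue names and the 0.7 confidence threshold are illustrative.
KNOWN_ISSUES = {"pod-crashloop", "disk-pressure", "config-drift"}

def route_issue(issue_type: str, confidence: float) -> str:
    """Decide which layer handles an issue."""
    if issue_type in KNOWN_ISSUES:
        return "deterministic"   # fast, well-tested remediation
    if confidence >= 0.7:
        return "ai"              # adaptive handling of novel issues
    return "escalate"            # low confidence: hand off to a human

print(route_issue("pod-crashloop", 0.99))   # deterministic
print(route_issue("unseen-issue", 0.85))    # ai
```

The key design point is that the deterministic path is always preferred when a known playbook exists, so the AI layer only sees genuinely novel cases.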
Platform Components
1. OpenShift Lightspeed
What it is: AI assistant integrated into the OpenShift console
What it does:
- Answers questions about your cluster in natural language
- Triggers ML-powered analysis and predictions
- Initiates automated remediation with your approval
Access: Click the sparkle icon ✨ in the OpenShift console header
2. MCP Server (Go)
What it is: Model Context Protocol server connecting Lightspeed to platform tools
What it does:
- Exposes 7 tools that Lightspeed can call
- Translates natural-language intents into API calls
- Returns structured responses for Lightspeed to format
MCP Tools Available:
1. `get-cluster-health` → Check namespace status, pods, models
2. `list-pods` → Query pods with filtering
3. `analyze-anomalies` → Call ML models for detection
4. `trigger-remediation` → Apply fixes via Coordination Engine
5. `list-incidents` → Query historical incidents
6. `get-model-status` → Check KServe InferenceService health
7. `list-models` → List available ML models
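A tool call from Lightspeed to the MCP Server travels as a JSON-RPC 2.0 message, per the Model Context Protocol. The sketch below builds such a request for the `get-cluster-health` tool; the framing follows the MCP spec, but the argument schema (`namespace`) is an assumption about this server's tool definition.

```python
import json

# Illustrative MCP tool call using JSON-RPC 2.0 framing.
# The tool name comes from the list above; the "arguments"
# schema is an assumed example, not the server's actual contract.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get-cluster-health",
        "arguments": {"namespace": "self-healing-platform"},
    },
}
print(json.dumps(request, indent=2))
```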
3. Coordination Engine (Go)
What it is: REST API service orchestrating remediation workflows
What it does:
- Receives anomaly detection requests
- Queries Prometheus for current metrics
- Calls KServe ML models for predictions
- Applies remediation actions to the cluster
- Tracks incidents and resolution history
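These steps chain together into one workflow. The sketch below mirrors that sequence with stubbed Python functions; all names and return values are illustrative (the real engine is written in Go and talks to live Prometheus and KServe endpoints).

```python
# Hypothetical sketch of the Coordination Engine's workflow.
# Every function body is a stub with made-up values.
def query_prometheus(metric: str) -> list[float]:
    return [41.0, 43.5, 88.0]                 # stubbed metric samples

def call_ml_model(samples: list[float]) -> dict:
    return {"anomaly": max(samples) > 80}     # stubbed prediction

def apply_remediation(finding: dict) -> str:
    return "restarted pod" if finding["anomaly"] else "no action"

def handle_detection(metric: str) -> dict:
    samples = query_prometheus(metric)          # 1. query current metrics
    finding = call_ml_model(samples)            # 2. get an ML prediction
    action = apply_remediation(finding)         # 3. remediate if needed
    return {"metric": metric, "action": action} # 4. record the incident

print(handle_detection("cpu_usage"))
```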
API Endpoints:
```
GET  /health            → Health check
POST /api/v1/detect     → Detect anomalies
POST /api/v1/remediate  → Apply remediation
GET  /api/v1/incidents  → List incidents
GET  /metrics           → Prometheus metrics
```
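As a sketch of how a client would hit the detect endpoint: the path comes from the list above, but the in-cluster Service URL and the payload fields (`namespace`, `metric`, `window`) are illustrative assumptions, not the engine's documented schema.

```python
import json
from urllib import request as urlrequest

# Assumed in-cluster Service URL; adjust to your environment.
ENGINE_URL = "http://coordination-engine:8080"

# Hypothetical request body for POST /api/v1/detect.
payload = {"namespace": "self-healing-platform", "metric": "cpu_usage", "window": "15m"}
req = urlrequest.Request(
    f"{ENGINE_URL}/api/v1/detect",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# On the workshop cluster you would send it with:
# resp = urlrequest.urlopen(req)
print(req.get_method(), req.full_url)
```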
4. KServe ML Models
What they are: Machine learning models served via KServe InferenceService
Models deployed:
| Model | Purpose | Notebook |
|---|---|---|
| Anomaly Detector (Isolation Forest) | Detects unusual patterns in metrics | |
| Predictive Analytics (LSTM) | Forecasts future resource usage | |
Key Point: These models are trained on YOUR cluster’s data. Predictions reflect your workload patterns, not generic benchmarks.
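A deployed InferenceService accepts standard KServe inference requests. The sketch below builds one using KServe's v1 protocol (`/v1/models/<name>:predict` with an `instances` list); the Service hostname and the feature-vector shape are assumptions for illustration.

```python
import json

# Assumed in-cluster hostname, derived from the predictor pod name;
# verify against your cluster before using.
MODEL_URL = "http://anomaly-detector-predictor.self-healing-platform.svc.cluster.local"

# KServe v1 protocol body: a list of feature vectors. The four
# metric features here are made-up example values.
payload = {"instances": [[0.74, 0.51, 0.12, 0.33]]}
print(f"POST {MODEL_URL}/v1/models/anomaly-detector:predict")
print(json.dumps(payload))
```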
Data Flow
Let’s trace what happens when you ask Lightspeed a question:
```
You: "Predict CPU at 3 PM"
         │
         ▼
┌─────────────────┐
│   Lightspeed    │ ←── Natural language understanding
└────────┬────────┘
         │ MCP protocol
         ▼
┌─────────────────┐
│   MCP Server    │ ←── Routes to correct tool
└────────┬────────┘
         │ REST API
         ▼
┌─────────────────┐
│  Coordination   │
│     Engine      │ ←── Orchestrates the workflow
└────────┬────────┘
         │
    ┌────┴─────┐
    ▼          ▼
┌────────┐ ┌────────┐
│Promethe│ │ KServe │ ←── Get metrics, call ML model
│us      │ │ Model  │
└───┬────┘ └───┬────┘
    │          │
    └────┬─────┘
         ▼
┌─────────────────┐
│   Lightspeed    │ ←── Formats response
└────────┬────────┘
         │
         ▼
You: "CPU: 74.5% at 3 PM (85% confidence)"
```
Why Cluster-Specific Training Matters
The ML models are trained specifically for YOUR cluster:
- Your 3 PM is different from our 3 PM: a retail cluster spikes at lunch, financial services at market open
- Workload fingerprints are unique: your apps' memory patterns, CPU bursts, and scaling behaviors
- Anomaly baselines are contextual: "normal" for your cluster might be "abnormal" for another
Workshop Environment
Your pre-deployed environment includes:
| Component | Service | Namespace |
|---|---|---|
| OpenShift Lightspeed | AI assistant in console | |
| MCP Server | Connects Lightspeed to platform tools | self-healing-platform |
| Coordination Engine | Orchestrates remediation workflows | self-healing-platform |
| Anomaly Detector | KServe InferenceService (Isolation Forest) | self-healing-platform |
| Predictive Analytics | KServe InferenceService (LSTM) | self-healing-platform |
| Jupyter Workbench | Notebook environment for model training | self-healing-platform |
Verify Your Environment
Let’s verify everything is running:
OpenShift Console:

1. Open the OpenShift Console
2. Navigate to Workloads → Pods
3. Select namespace: `self-healing-platform`
4. Verify these pods are Running:
   - `coordination-engine-*`
   - `mcp-server-*`
   - `anomaly-detector-predictor-*`
   - `predictive-analytics-predictor-*`
   - `self-healing-workbench-0`

CLI:

```
oc get pods -n self-healing-platform
```

Expected output:

```
NAME                                  READY   STATUS    RESTARTS   AGE
coordination-engine-xxx               1/1     Running   0          2h
mcp-server-xxx                        1/1     Running   0          2h
anomaly-detector-predictor-xxx        2/2     Running   0          2h
predictive-analytics-predictor-xxx    2/2     Running   0          2h
self-healing-workbench-0              2/2     Running   0          2h
```

Lightspeed:

1. Click the Lightspeed icon (✨) in the console header
2. Type: `What's the health of the self-healing-platform namespace?`
3. If you get a response with component status, everything is working!
Key Concepts Summary
| Concept | Description |
|---|---|
| Hybrid Approach | Deterministic + AI layers working together |
| Coordination Engine | Central orchestrator that routes issues to the right layer |
| MCP (Model Context Protocol) | Standard protocol connecting LLMs to external tools |
| KServe | Kubernetes-native model serving infrastructure |
| Cluster-Specific Training | ML models learn YOUR cluster's patterns |
Next Steps
Now that you understand the architecture, let’s get hands-on!
In Module 1: ML Model Training with Tekton, you’ll:
- Train anomaly detection models manually
- Explore different data sources (Prometheus vs synthetic)
- Understand the automated training pipeline
- Monitor training runs and validate models
Architecture Reference: For detailed architectural decisions, see the ADRs (Architectural Decision Records) in the platform repository.