Module 3: End-to-End Self-Healing with Lightspeed
Overview
Imagine talking to your Kubernetes cluster like you would talk to a colleague:
- "Are there any failing pods?"
- "What will my memory usage be at 3 PM?"
- "Fix the broken pods automatically"
This is now possible with OpenShift Lightspeed connected to our self-healing platform. In this module, we’ll explore how to interact with an already-deployed platform using just natural language.
Prerequisite: Complete Module 2: Deploy MCP Server & Configure Lightspeed before starting this module.
What’s already deployed:
- ✅ OpenShift Lightspeed (AI assistant) - Configured in Module 2
- ✅ MCP Server (Go service - connects Lightspeed to cluster tools) - Verified in Module 2
- ✅ Coordination Engine (Go service - orchestrates remediation)
- ✅ KServe ML Models (anomaly detection + capacity forecasting)
Architecture Note: The MCP Server and Coordination Engine are Go services for production performance. The notebooks are Python for ML/data science. You don’t need to write Go code to use the platform!
Part 1: Your First Conversation - Health Check
Building on Module 2: In Module 2, you configured Lightspeed and tested basic queries using the MCP tools. The platform provides 12 MCP tools for cluster interaction, grouped into three categories:
- Core Health & Status
- ML-Powered Analysis
- Remediation & Incident Management
Now we’ll use these tools for real troubleshooting workflows - all through natural language in the UI!
About Lightspeed Responses in This Module: The example conversations below show sample interactions for demonstration purposes. Your actual Lightspeed responses will differ based on the LLM model you configured, your cluster’s current state, and when you run the queries. The key is successful interaction, not matching exact text. As long as Lightspeed understands your questions and provides relevant cluster information, you’re successfully using the self-healing platform!
About Response Verbosity: The example conversations in this module show condensed responses for readability. Your actual Lightspeed responses will be more verbose and may include additional explanations, suggested follow-up commands, and educational context. This is normal! Lightspeed is designed to be helpful and educational, not just execute commands. The key is that the MCP tools work and provide accurate cluster data, even if the response format differs.
Open OpenShift Lightspeed
- Navigate to your OpenShift web console: https://console-openshift-console.apps.{guid}.example.com
- Click the Lightspeed icon (✨) in the top-right corner
- The chat interface opens
Ask About Cluster Health
You type:
What's the health of the {namespace} namespace?
Example response:
✅ Cluster Health Summary for {namespace}:
Healthy Components (4):
- coordination-engine: Running
- mcp-server: Running
- anomaly-detector-predictor: 2/2 Running
- predictive-analytics-predictor: 2/2 Running
ML Models Status:
- anomaly-detector: READY ✅
- predictive-analytics: READY ✅
Recent Activity:
- 0 incidents in the last hour
- All systems operational
Everything looks healthy!
🎯 What just happened? Lightspeed used the get-cluster-health MCP tool to query your cluster, check pod status, and verify ML models are ready.
Part 2: Deploy a Sample Application
Let’s deploy a simple Flask app that we’ll use throughout the rest of this module.
We use the Red Hat UBI Python image (registry.access.redhat.com/ubi9/python-311) for this deployment.
oc apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: sample-flask-app
namespace: self-healing-platform
spec:
replicas: 2
selector:
matchLabels:
app: sample-flask
template:
metadata:
labels:
app: sample-flask
spec:
containers:
- name: web
image: registry.access.redhat.com/ubi9/python-311:latest
command: ["/bin/bash", "-c"]
args:
- |
pip install flask && python -c '
from flask import Flask
app = Flask(__name__)
@app.route("/")
def hello():
return "Hello from OpenShift!"
app.run(host="0.0.0.0", port=8080)
'
ports:
- containerPort: 8080
resources:
limits:
memory: "256Mi"
cpu: "200m"
requests:
memory: "128Mi"
cpu: "100m"
EOF
Wait about 30 seconds for the pods to start, then use Lightspeed to verify:
You type:
Show me the sample-flask-app pods in the {namespace} namespace
Example response:
Found 2 pods for sample-flask-app:
1. sample-flask-app-xxx
Status: Running ✅
CPU: 45m / 200m (22%)
Memory: 82Mi / 128Mi (64%)
Restarts: 0
2. sample-flask-app-yyy
Status: Running ✅
CPU: 47m / 200m (23%)
Memory: 79Mi / 128Mi (62%)
Restarts: 0
Both pods are healthy!
Part 3: Predicting Future Resource Usage
This is where the ML models shine! Let’s ask about future resource usage.
Scenario 1: Predict Afternoon Usage
You type:
What will the CPU and memory usage be at 3 PM today?
Example response:
Confirmed prediction for 15:00 today (cluster-wide)
CPU: 58.0%
Memory: 58.6%
Model confidence: 85%
Current baseline (estimate):
- CPU: ~3.5%
- Memory: ~27.1%
Key notes:
- Big CPU jump predicted — verify this isn't a metrics-collection gap
- Important metrics to check in Prometheus:
- CPU: container_cpu_usage_seconds_total
- Memory: container_memory_working_set_bytes
Quick commands:
kubectl top nodes
kubectl top pods -A
Next steps — pick one:
1. I can run an anomaly scan for CPU/memory
2. I can calculate how many more pods you can safely schedule
Which would you like?
📊 Your predictions will differ based on your cluster’s learned patterns! The model trains on YOUR Prometheus metrics.
🎯 What happened? Lightspeed:
- Called the MCP predict-resource-usage tool
- The Coordination Engine queried Prometheus for the current 24h rolling mean
- Extracted the hour (15) and day of week from "3 PM today"
- Called the predictive-analytics KServe model
- Formatted a human-friendly response with recommendations
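To make the flow concrete, here is a minimal sketch of how a prediction request could be assembled. The feature names, payload schema, and endpoint are assumptions for illustration only; check the predictive-analytics model's actual input spec on your cluster before relying on them.

```python
import json
from datetime import datetime

def build_prediction_payload(target: datetime, cpu_mean: float, mem_mean: float) -> dict:
    """Assemble a KServe-v1-style inference request for a forecasting model.

    Feature names (hour, day_of_week, 24h rolling means) are assumed here;
    the real model may expect a different schema.
    """
    features = {
        "hour": target.hour,              # e.g. 15 for "3 PM today"
        "day_of_week": target.weekday(),  # 0 = Monday
        "cpu_24h_mean": cpu_mean,         # rolling mean pulled from Prometheus
        "memory_24h_mean": mem_mean,
    }
    # The KServe v1 protocol wraps inputs in an "instances" list
    return {"instances": [features]}

payload = build_prediction_payload(datetime(2026, 1, 12, 15, 0), 3.5, 27.1)
print(json.dumps(payload))
```

In the real platform this JSON would be POSTed to the model's predictor endpoint; here it simply shows which features the "3 PM today" question gets translated into.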
Scenario 2: Scaling Impact Prediction
You type:
If I scale sample-flask-app to 5 replicas, what will happen to memory usage?
Example response:
Scaling Impact Analysis:
Current State:
- Replicas: 2
- Memory per pod: ~80Mi
- Total memory: ~160Mi
Projected State (5 replicas):
- Replicas: 5 (+3)
- Memory per pod: ~82Mi
- Total memory: ~410Mi
- Namespace memory: 92.3% ⚠️
⚠️ WARNING: Memory usage will approach critical threshold (95%)
💡 Recommendations:
1. Scale to 4 replicas instead (predicted: 86.7%)
2. Increase namespace memory limits by 20%
3. Monitor for 30 minutes after scaling
Would you like me to proceed with scaling?
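The projection above is straightforward arithmetic: total memory scales with replica count. A minimal sketch, using the approximate per-pod figures from the transcript (~80Mi now, ~82Mi projected, both assumptions taken from the example response):

```python
def project_memory(replicas: int, mem_per_pod_mi: float) -> float:
    """Total memory a deployment would consume at the given replica count."""
    return replicas * mem_per_pod_mi

current = project_memory(2, 80)    # ~160Mi, as in the transcript
projected = project_memory(5, 82)  # ~410Mi at 5 replicas
print(f"current ~{current:.0f}Mi, projected ~{projected:.0f}Mi "
      f"(+{projected - current:.0f}Mi)")
```

The namespace-percentage and threshold warnings in the transcript additionally depend on your namespace quota, which this sketch leaves out.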
Part 4: Breaking Things (On Purpose!)
Let’s deploy a broken app to see self-healing in action.
Deploy a Memory-Starved Application
This app has a memory leak - each request allocates 10MB. With only a 96Mi limit, it will be OOMKilled after a few requests.
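The arithmetic behind the crash can be sketched in a few lines. The ~25Mi resting footprint for the Python/Flask process is an assumption; the real baseline varies, which is why the number of surviving requests is approximate:

```python
import math

def requests_before_oom(limit_mi: int, baseline_mi: int, leak_mb: int) -> int:
    """How many leaking requests fit before the container exceeds its limit.

    baseline_mi is an ASSUMED resting footprint for the interpreter and
    Flask; the true value depends on the image and Python version.
    """
    return math.floor((limit_mi - baseline_mi) / leak_mb)

# 96Mi limit, assumed ~25Mi baseline, 10MB leaked per request:
print(requests_before_oom(96, 25, 10))  # → 7, so the crash lands on request 8
```

This matches the sample output later in this part, where requests 1-7 succeed and request 8 reports "Pod crashed!".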
oc apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: broken-app
namespace: self-healing-platform
spec:
replicas: 2
selector:
matchLabels:
app: broken-app
template:
metadata:
labels:
app: broken-app
spec:
containers:
- name: web
image: registry.access.redhat.com/ubi9/python-311:latest
command: ["/bin/bash", "-c"]
args:
- |
pip install flask && python -c '
from flask import Flask
app = Flask(__name__)
data = []
@app.route("/")
def hello():
data.append("x" * 1024 * 1024 * 10) # Allocate 10MB per request
return f"Allocated {len(data)} chunks ({len(data) * 10}MB total)"
app.run(host="0.0.0.0", port=8080)
'
ports:
- containerPort: 8080
resources:
limits:
        memory: "96Mi"  # Low - will OOMKill after ~7 requests
cpu: "100m"
requests:
memory: "64Mi"
cpu: "50m"
EOF
Trigger the Memory Leak
The app won’t crash on its own - we need to trigger the memory leak by sending requests:
# Wait for pods to start
sleep 30
# Hit the endpoint repeatedly to trigger memory leak
for i in {1..20}; do
echo "Request $i:"
oc exec -n self-healing-platform deployment/broken-app -- curl -s http://localhost:8080/ 2>/dev/null || echo "Pod crashed!"
sleep 1
done
You should see output like:
Request 1: Allocated 1 chunks (10MB total)
Request 2: Allocated 2 chunks (20MB total)
...
Request 7: Allocated 7 chunks (70MB total)
Request 8: Pod crashed!
Request 9: Allocated 1 chunks (10MB total) ← Pod restarted!
Watch It Fail with Lightspeed
You type:
Are there any failing pods in the {namespace} namespace?
Example response:
⚠️ Found 2 pods with issues:
1. broken-app-xxx
Status: OOMKilled ❌
Restarts: 2
Reason: Container exceeded memory limit (96Mi)
Pattern: Crash loop - restarts when memory leak triggered
2. broken-app-yyy
Status: OOMKilled ❌
Restarts: 1
Reason: Container exceeded memory limit (96Mi)
Pattern: Memory leak causes OOMKill under load
Both pods are experiencing Out Of Memory errors. The app has a
memory leak - each request allocates 10MB that isn't released.
The 96Mi limit is exceeded after ~7 requests.
Would you like me to analyze these for anomalies?
Part 5: AI-Powered Anomaly Detection
You type:
Yes, analyze the broken-app pods for anomalies
Example response:
Summary
Scope: pods with label app=broken-app in namespace self-healing-platform
Metric: pod_restarts (last 6h)
Result: 1 minor anomaly detected OR No anomalies detected
anomaly_count: 0-1
max_score: 0.30 or 0
severity: info
recommendation: Monitor — investigate if this persists or grows
Interpretation
This is a low-confidence signal. The ML model may not flag recent OOMKilled
restarts as critical anomalies, especially if the cluster has low historical
data. You may see "no anomalies detected" even with failing pods.
Recommended next checks (run from a machine with oc and cluster-admin access):
List the broken-app pods and restart counts:
oc get pods -n self-healing-platform -l app=broken-app
Describe any pod with restarts or non-Ready status:
oc describe pod -n self-healing-platform <pod-name>
Fetch container logs (current and previous):
oc logs -n self-healing-platform <pod-name>
oc logs -n self-healing-platform <pod-name> --previous
Check namespace events:
oc get events -n self-healing-platform --sort-by='.lastTimestamp'
Known Limitation: The anomaly detection model may show low confidence or zero anomalies even when pods are actively OOMKilling. This is expected for newly deployed clusters with limited historical data. For now, you can still trigger remediation manually by asking Lightspeed to "Fix the broken-app pods" even if no anomalies are detected. Improvements to anomaly detection for this cold-start case are planned.
🎯 What happened? The anomaly-detector ML model analyzed the pod metrics from Prometheus. The model requires historical baseline data and may not flag recent issues with high confidence.
Part 6: Self-Healing in Action
You type:
Yes, fix it automatically
Example response:
I've triggered automated remediations to recreate the broken-app pods.
Details
Incident: inc-0ad6dc07
Actions triggered:
- Delete pod broken-app-56c6f7fbc9-99c4t (workflow wf-88dd7a4c) — status: pending
- Delete pod broken-app-56c6f7fbc9-stw22 (workflow wf-d9e925b2) — status: pending
Estimated duration: ~5 minutes
What to expect
The pods were deleted so their owning controller will create new pods.
You may see short disruption while new pods start.
New pod names will differ from the old ones.
Current Limitation: The remediation currently only restarts pods - it does NOT update the underlying deployment resource limits, so the OOMKilled issue will recur when the app receives more requests. Workaround: manually patch the deployment's memory limit after Lightspeed remediation (the exact oc patch command is shown in the Troubleshooting section below).
Verify the Fix
After 45 seconds:
You type:
Are the broken-app pods healthy now?
Example response:
✅ Pods Recreated!
1. broken-app-abc123
Status: Running ✅
CPU: 52m / 100m (52%)
Memory: 45Mi / 96Mi (47%)
Restarts: 0 (since remediation)
Uptime: 1m
2. broken-app-def456
Status: Running ✅
CPU: 49m / 100m (49%)
Memory: 43Mi / 96Mi (45%)
Restarts: 0 (since remediation)
Uptime: 1m
Status:
- Fresh pods are running with original limits (96Mi memory)
- No restarts yet (the memory leak hasn't been triggered)
- The underlying issue is NOT fixed - pods will OOMKill again under load
⚠️ Important: If you applied the manual patch to increase memory limits,
you'll see 256Mi limits instead of 96Mi. Otherwise, the pods will fail
again when the memory leak is triggered.
Part 7: Understanding Historical Patterns
You type:
Show me recent incidents and how they were resolved
Example response:
Incident History (Last 24 Hours):
Total: 1 incident
Average resolution: 45 seconds
📋 Incident Details:
inc-20260112-001
- Time: 15:44:38 UTC
- Type: OOMKilled (Memory Leak)
- Pods: broken-app-xxx, broken-app-yyy
- Root cause: Application memory leak (~10MB/request)
- Trigger: Memory exceeded 96Mi limit after ~7 requests
Remediation:
- Action: Updated deployment resources (temporary fix)
- Applied: YES ✅
- Resolution time: 45s
- Changes: Memory 96Mi→256Mi, CPU 100m→200m
- Outcome: 100% success (pods stable)
- Follow-up: Code review recommended for memory leak
Pattern Recognition:
ML detected this as "memory leak" pattern, not just
"insufficient resources". Recommended both immediate
fix (increase limits) and permanent fix (code review).
Part 8: More Conversations to Try
These are example conversation flows showing possible interactions. Your actual conversations will vary based on your LLM model and cluster state. Use these as inspiration for questions to ask!
Troubleshooting
Investigate slow performance:
You: The sample-flask-app feels slow
Lightspeed:
Investigating...
Pod Analysis:
- sample-flask-app-xxx: CPU 185m/200m (92%) ⚠️
- sample-flask-app-yyy: CPU 178m/200m (89%) ⚠️
Root Cause: CPU saturation (throttling)
Recommendation: Scale to 3 replicas or increase CPU limits
Should I scale now?
Proactive monitoring:
You: Will any pods have issues in the next hour?
Lightspeed:
🔮 Predictions for next 60 minutes:
Low Risk: 5 pods stable
⚠️ Medium Risk:
- sample-flask-app-xxx: CPU will reach 99% in ~45 min
- sample-flask-app-yyy: CPU will reach 97% in ~50 min
Recommendation: Scale to 3 replicas NOW to prevent issues
Should I apply preemptively?
Capacity Questions
You: What will memory usage be tomorrow at 9 AM?
Lightspeed:
Predicted memory at 09:00 tomorrow:
- 71.3% (current trend: -3.2% from today)
- Lower usage expected (morning hours)
- Confidence: 88%
You: Can I run 10 more medium-sized pods?
Lightspeed:
Capacity check for 10 medium pods (128Mi/200m):
Required resources:
- Memory: 1280Mi
- CPU: 2000m
Current availability:
- Memory: Only ~600Mi available
- CPU: ~800m available
❌ Cannot accommodate 10 pods
✅ Can safely run 4 pods
⚠️ 5-6 pods possible but approaching limits
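The capacity answer above reduces to a min over the per-resource headroom. A minimal sketch using the figures from the transcript (~600Mi and ~800m available, 128Mi/200m per pod):

```python
def max_schedulable_pods(mem_avail_mi: int, cpu_avail_m: int,
                         mem_per_pod_mi: int, cpu_per_pod_m: int) -> int:
    """Pods that fit in the remaining capacity; the scarcer resource wins."""
    return min(mem_avail_mi // mem_per_pod_mi, cpu_avail_m // cpu_per_pod_m)

# ~600Mi and ~800m available; each "medium" pod requests 128Mi / 200m
print(max_schedulable_pods(600, 800, 128, 200))  # → 4
```

Real schedulers also account for per-node spread, requests vs. limits, and quotas, so treat this as the back-of-the-envelope version of the check Lightspeed runs.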
Model Health
You: Are the ML models working properly?
Lightspeed:
ML Model Status:
1. anomaly-detector
Status: READY ✅
Endpoint: http://anomaly-detector-predictor:8080
Recent predictions: 47 (last hour)
Average latency: 23ms
Success rate: 100%
2. predictive-analytics
Status: READY ✅
Endpoint: http://predictive-analytics-predictor:8080
Recent predictions: 32 (last hour)
Average latency: 18ms
Success rate: 100%
All models healthy and performing well!
Part 9: Quick Reference
Health & Status
"What's the cluster health?"
"Are there any failing pods?"
"Show me pods in {namespace}"
"What's using the most memory?"
Predictions
"What will CPU be at 3 PM?"
"Predict memory usage tomorrow at 9 AM"
"What happens if I scale to 5 replicas?"
"How many more pods can I run?"
"Will I have capacity issues this weekend?"
Part 10: Cleanup
Before moving to the next module, let’s clean up the test deployments:
oc delete deployment sample-flask-app broken-app -n self-healing-platform
Troubleshooting
Issue: "No pods found with label app=broken-app"
Cause: The deployment was created with a different label, or you’re querying the wrong namespace.
Fix: Explicitly specify the correct label and namespace in your query:
Show me pods with app=broken-app in {namespace}
Or verify the deployment exists and check its labels:
oc get deployment broken-app -n self-healing-platform -o yaml | grep -A 3 labels
Issue: "No anomalies detected" even with OOMKilled pods
Cause: The ML model requires historical baseline data and may not flag recent issues with high confidence. Newly deployed clusters or apps may show low or zero anomaly scores.
What this means: This is expected behavior for the current platform version. The anomaly detection model learns from historical Prometheus metrics and needs time to establish baselines.
Workaround: You can still trigger remediation by directly asking:
Fix the broken-app pods in {namespace}
Status: We’re working on improving anomaly detection for cold-start scenarios and will update the workshop when improvements are deployed.
Issue: Pods still OOMKill after remediation
Cause: Current remediation only restarts pods, doesn’t update resource limits automatically.
Fix: Manually patch the deployment after Lightspeed remediation:
oc patch deployment broken-app -n self-healing-platform --type='json' -p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "256Mi"}
]'
Verify the change:
oc get deployment broken-app -n self-healing-platform -o jsonpath='{.spec.template.spec.containers[0].resources}'
Status: Intelligent resource limit adjustment is being tracked as a feature enhancement to the coordination-engine.
Issue: Lightspeed doesn’t respond or shows errors
Cause: Multiple possible issues:
- MCP Server not running: Check if the MCP server pod is healthy
- OLSConfig not created: Verify you completed Module 2
- LLM API key invalid: Check the secret in the openshift-lightspeed namespace
- Network issues: Verify Lightspeed can reach the MCP server service
Debug steps:
# Check MCP server status
oc get pods -n self-healing-platform -l app=mcp-server
# Check Lightspeed configuration
oc get olsconfig -n openshift-lightspeed
# Check Lightspeed logs
oc logs -n openshift-lightspeed deployment/lightspeed-app-server
Issue: "Permission denied" or "Unauthorized" errors
Cause: The MCP server ServiceAccount may not have sufficient RBAC permissions.
Fix: Verify the cluster-admin ClusterRoleBinding exists:
oc get clusterrolebinding mcp-server-cluster-admin
If missing, recreate it:
oc create clusterrolebinding mcp-server-cluster-admin \
--clusterrole=cluster-admin \
--serviceaccount=self-healing-platform:mcp-server
Summary
You’ve now explored:
- ✅ Chatting with your cluster using natural language
- ✅ Deploying sample workloads
- ✅ Using ML models to predict resource usage
- ✅ Detecting anomalies automatically
- ✅ Triggering self-healing remediation
- ✅ Understanding historical patterns
The Power of Natural Language Operations
Traditional way:
kubectl get pods -n {namespace} | grep -v Running
kubectl describe pod broken-app-xxx
kubectl logs broken-app-xxx
kubectl edit deployment broken-app
# ... manually update limits ...
kubectl rollout status deployment/broken-app
With Lightspeed:
"Fix the broken pods"
Key Capabilities Demonstrated
- ✅ Natural language cluster interaction via Lightspeed UI
- ✅ MCP tools provide accurate pod status and cluster data
- ✅ ML models (anomaly detection, predictive analytics) are accessible
- ✅ Automated pod restart remediation via coordination-engine
- ⚠️ Anomaly detection works but may show low confidence for new clusters
- ⚠️ Resource limit updates currently require manual intervention
Current Platform Status
What works today:
- Pod health queries and status checks
- Resource usage predictions (when historical data available)
- Automated incident tracking
- Pod restart remediation
- Historical incident analysis
What’s being improved:
- Anomaly detection confidence for newly deployed apps
- Intelligent resource limit adjustment (currently manual)
- Scaling recommendations based on ML predictions
This is an active platform - features improve as the ML models learn from your cluster’s metrics!
Next Steps
Explore more advanced topics:
- Module 4: Extra Credit - Advanced ML & Custom Models - Train custom ML models and explore advanced use cases
- Module 5: Notebook Catalog & Use Cases - Explore Jupyter notebooks and additional examples
Happy chatting with your cluster! 🚀