Module 3: End-to-End Self-Healing with Lightspeed

Overview

Imagine talking to your Kubernetes cluster the way you would talk to a colleague:

  • "Are there any failing pods?"

  • "What will my memory usage be at 3 PM?"

  • "Fix the broken pods automatically"

This is now possible with OpenShift Lightspeed connected to our self-healing platform. In this module, we’ll explore how to interact with an already-deployed platform using just natural language.

Prerequisite: Complete Module 2: Deploy MCP Server & Configure Lightspeed before starting this module.

What’s already deployed:

  • ✅ OpenShift Lightspeed (AI assistant) - Configured in Module 2

  • ✅ MCP Server (Go service - connects Lightspeed to cluster tools) - Verified in Module 2

  • ✅ Coordination Engine (Go service - orchestrates remediation)

  • ✅ KServe ML Models (anomaly detection + capacity forecasting)

Architecture Note: The MCP Server and Coordination Engine are Go services for production performance. The notebooks are Python for ML/data science. You don’t need to write Go code to use the platform!

Part 1: Your First Conversation - Health Check

Building on Module 2: In Module 2, you configured Lightspeed and tested basic queries using the MCP tools. The platform provides 12 MCP tools for cluster interaction:

Core Health & Status:

  • get-cluster-health - Check namespace pod/model health

  • list-pods - Query pods with filtering

  • get-model-status - Check KServe model health

  • list-models - List ML model catalog

ML-Powered Analysis:

  • analyze-anomalies - Run ML anomaly detection

  • predict-resource-usage - Predict future CPU/memory usage

  • analyze-scaling-impact - Analyze deployment scaling impact

  • get-remediation-recommendations - Get ML-powered remediation suggestions

Remediation & Incident Management:

  • trigger-remediation - Apply automated fixes

  • list-incidents - Query historical incidents

  • create-incident - Manually create incidents for tracking

  • calculate-pod-capacity - Calculate remaining pod capacity

Now we’ll use these tools for real troubleshooting workflows - all through natural language in the UI!

About Lightspeed Responses in This Module:

The example conversations below show sample interactions for demonstration purposes. Your actual Lightspeed responses will differ based on:

  • LLM: GPT-4o, Llama 3.2 (served with vLLM), Gemini, and other models produce different response styles

  • Your Cluster State: Actual resource usage, pod counts, and issues you encounter

  • Deployed Apps: What you deploy in Part 2 affects predictions and analysis results

  • Model Training Data: ML models learn from YOUR cluster’s Prometheus metrics

The key is successful interaction, not matching exact text. As long as Lightspeed understands your questions and provides relevant cluster information, you’re successfully using the self-healing platform!

About Response Verbosity:

The example conversations in this module show condensed responses for readability. Your actual Lightspeed responses will be more verbose and include:

  • Detailed next-step commands (oc describe, oc logs, etc.)

  • Multiple remediation options to choose from

  • Follow-up questions asking for clarification

  • Links to related OpenShift documentation

  • Technical explanations of pod states and controller behavior

This is normal! Lightspeed is designed to be helpful and educational, not just execute commands. The key is that the MCP tools work and provide accurate cluster data, even if the response format differs.

Open OpenShift Lightspeed

  1. Navigate to your OpenShift web console: https://console-openshift-console.apps.{guid}.example.com

  2. Click the Lightspeed icon (✨) in the top-right corner

  3. The chat interface opens

Ask About Cluster Health

You type:

What's the health of the {namespace} namespace?

Example response:

✅ Cluster Health Summary for {namespace}:

Healthy Components (4):
- coordination-engine: Running
- mcp-server: Running
- anomaly-detector-predictor: 2/2 Running
- predictive-analytics-predictor: 2/2 Running

ML Models Status:
- anomaly-detector: READY ✅
- predictive-analytics: READY ✅

Recent Activity:
- 0 incidents in the last hour
- All systems operational

Everything looks healthy!

🎯 What just happened? Lightspeed used the get-cluster-health MCP tool to query your cluster, check pod status, and verify ML models are ready.
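
To make the idea concrete, here is a minimal Python sketch of the kind of check a health tool might perform. It is not the MCP Server's implementation (that is a Go service and is not shown in this workshop). The function name summarize_pods and the sample data are hypothetical; the only thing assumed about the cluster is the standard Pod fields status.phase and status.containerStatuses[].ready, as returned by oc get pods -o json.

```python
import json


def summarize_pods(pod_list_json: str) -> dict:
    """Split pods into healthy/unhealthy using standard Pod status fields."""
    healthy, unhealthy = [], []
    for pod in json.loads(pod_list_json)["items"]:
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        # Healthy here means: phase is Running and every container reports ready.
        all_ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if status.get("phase") == "Running" and all_ready:
            healthy.append(name)
        else:
            unhealthy.append(name)
    return {"healthy": healthy, "unhealthy": unhealthy}


if __name__ == "__main__":
    # Hypothetical sample shaped like `oc get pods -o json` output.
    sample = json.dumps({"items": [
        {"metadata": {"name": "mcp-server-abc"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "broken-app-xyz"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": False}]}},
    ]})
    print(summarize_pods(sample))
```

You could feed it real data with oc get pods -n self-healing-platform -o json, but in this module Lightspeed does that work for you.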

Part 2: Deploy a Sample Application

Let’s deploy a simple Flask app that we’ll use throughout the rest of this module.

We use the Red Hat UBI Python image (registry.access.redhat.com/ubi9/python-311) because it’s designed for OpenShift’s security model, which runs containers as a random non-root UID.

oc apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-flask-app
  namespace: self-healing-platform
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-flask
  template:
    metadata:
      labels:
        app: sample-flask
    spec:
      containers:
      - name: web
        image: registry.access.redhat.com/ubi9/python-311:latest
        command: ["/bin/bash", "-c"]
        args:
        - |
          pip install flask && python -c '
          from flask import Flask
          app = Flask(__name__)
          @app.route("/")
          def hello():
              return "Hello from OpenShift!"
          app.run(host="0.0.0.0", port=8080)
          '
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "256Mi"
            cpu: "200m"
          requests:
            memory: "128Mi"
            cpu: "100m"
EOF

Wait about 30 seconds for the pods to start, then use Lightspeed to verify:

You type:

Show me the sample-flask-app pods in the {namespace} namespace

Example response:

Found 2 pods for sample-flask-app:

1. sample-flask-app-xxx
   Status: Running ✅
   CPU: 45m / 200m (22%)
   Memory: 82Mi / 256Mi (32%)
   Restarts: 0

2. sample-flask-app-yyy
   Status: Running ✅
   CPU: 47m / 200m (23%)
   Memory: 79Mi / 256Mi (31%)
   Restarts: 0

Both pods are healthy!

Part 3: Predicting Future Resource Usage

This is where the ML models shine! Let’s ask about future resource usage.

Scenario 1: Predict Afternoon Usage

You type:

What will the CPU and memory usage be at 3 PM today?

Example response:

Confirmed prediction for 15:00 today (cluster-wide)

CPU: 58.0%
Memory: 58.6%
Model confidence: 85%

Current baseline (estimate):
- CPU: ~3.5%
- Memory: ~27.1%

Key notes:
- Big CPU jump predicted — verify this isn't a metrics-collection gap
- Important metrics to check in Prometheus:
  - CPU: container_cpu_usage_seconds_total
  - Memory: container_memory_working_set_bytes

Quick commands:
  kubectl top nodes
  kubectl top pods -A

Next steps — pick one:
1. I can run an anomaly scan for CPU/memory
2. I can calculate how many more pods you can safely schedule

Which would you like?

📊 Your predictions will differ based on your cluster’s learned patterns! The model trains on YOUR Prometheus metrics.

🎯 What happened? Behind the scenes:

  1. Lightspeed called the MCP predict-resource-usage tool

  2. The Coordination Engine queried Prometheus for the current 24-hour rolling mean

  3. The hour (15) and day of week were extracted from "3 PM today"

  4. The predictive-analytics KServe model was called

  5. Lightspeed formatted a human-friendly response with recommendations
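
The time-extraction step can be sketched in a few lines of Python. This is an illustration only: the function name build_features and the feature order are hypothetical, and the real input format of the predictive-analytics model is not documented in this module. The 3.5 and 27.1 values are the baseline figures from the example response above.

```python
from datetime import datetime


def build_features(target: datetime, cpu_rolling_mean: float, mem_rolling_mean: float) -> list:
    """Return [hour, day_of_week, cpu_mean, mem_mean] for a forecast request.

    The feature order is a hypothetical layout, not the platform's actual schema.
    """
    return [
        target.hour,        # "3 PM" becomes 15
        target.weekday(),   # Monday = 0 ... Sunday = 6
        cpu_rolling_mean,   # for example, the 24-hour rolling mean from Prometheus
        mem_rolling_mean,
    ]


if __name__ == "__main__":
    # "3 PM today" resolved against the current date.
    three_pm = datetime.now().replace(hour=15, minute=0, second=0, microsecond=0)
    print(build_features(three_pm, cpu_rolling_mean=3.5, mem_rolling_mean=27.1))
```

Encoding the hour and weekday as plain numbers is one common way to let a model learn daily and weekly usage cycles; the platform may encode them differently.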

Scenario 2: Scaling Impact Prediction

You type:

If I scale sample-flask-app to 5 replicas, what will happen to memory usage?

Example response:

Scaling Impact Analysis:

Current State:
- Replicas: 2
- Memory per pod: ~80Mi
- Total memory: ~160Mi

Projected State (5 replicas):
- Replicas: 5 (+3)
- Memory per pod: ~82Mi
- Total memory: ~410Mi
- Namespace memory: 92.3% ⚠️

⚠️ WARNING: Memory usage will approach critical threshold (95%)

💡 Recommendations:
1. Scale to 4 replicas instead (predicted: 86.7%)
2. Increase namespace memory limits by 20%
3. Monitor for 30 minutes after scaling

Would you like me to proceed with scaling?
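
The totals in the example response are simple multiplication, which you can check yourself. The helper below is a hypothetical sketch, not part of the platform. The namespace percentages in the example depend on your namespace's total memory, which is not given here, so they are not reproduced.

```python
def total_memory_mi(replicas: int, per_pod_mi: int) -> int:
    """Total memory used by a deployment, in Mi (replicas x per-pod usage)."""
    return replicas * per_pod_mi


if __name__ == "__main__":
    current = total_memory_mi(2, 80)     # 2 replicas at ~80Mi each
    projected = total_memory_mi(5, 82)   # 5 replicas at ~82Mi each
    print(f"current={current}Mi projected={projected}Mi delta=+{projected - current}Mi")
```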

Scenario 3: Capacity Planning

You type:

How many more pods can I run before hitting resource limits?

Part 4: Breaking Things (On Purpose!)

Let’s deploy a broken app to see self-healing in action.

Deploy a Memory-Starved Application

This app has a memory leak: each request allocates 10MB that is never released. With a memory limit of only 96Mi, the container is OOMKilled after a handful of requests.

oc apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-app
  namespace: self-healing-platform
spec:
  replicas: 2
  selector:
    matchLabels:
      app: broken-app
  template:
    metadata:
      labels:
        app: broken-app
    spec:
      containers:
      - name: web
        image: registry.access.redhat.com/ubi9/python-311:latest
        command: ["/bin/bash", "-c"]
        args:
        - |
          pip install flask && python -c '
          from flask import Flask
          app = Flask(__name__)
          data = []
          @app.route("/")
          def hello():
              data.append("x" * 1024 * 1024 * 10)  # Allocate 10MB per request
              return f"Allocated {len(data)} chunks ({len(data) * 10}MB total)"
          app.run(host="0.0.0.0", port=8080)
          '
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "96Mi"    # Low - will OOMKill after ~7 requests
            cpu: "100m"
          requests:
            memory: "64Mi"
            cpu: "50m"
EOF

Trigger the Memory Leak

The app won’t crash on its own - we need to trigger the memory leak by sending requests:

# Wait for pods to start
sleep 30

# Hit the endpoint repeatedly to trigger the memory leak
for i in {1..20}; do
  printf 'Request %s: ' "$i"
  oc exec -n self-healing-platform deployment/broken-app -- curl -s http://localhost:8080/ 2>/dev/null || printf 'Pod crashed!'
  echo   # the Flask response has no trailing newline
  sleep 1
done

You should see output like:

Request 1: Allocated 1 chunks (10MB total)
Request 2: Allocated 2 chunks (20MB total)
...
Request 7: Allocated 7 chunks (70MB total)
Request 8: Pod crashed!
Request 9: Allocated 1 chunks (10MB total)  ← Pod restarted!
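
Why about seven requests? A rough back-of-the-envelope estimate, sketched in Python below. The 25Mi baseline for the Python interpreter plus Flask is an assumption chosen for illustration, not a measured value, and real container memory accounting is less tidy than this, so your crash point may shift by a request or two.

```python
def requests_before_oom(limit_mi: int, baseline_mi: int, per_request_mi: int) -> int:
    """Number of requests that fit under the limit before the next one exceeds it."""
    headroom = limit_mi - baseline_mi
    return max(headroom // per_request_mi, 0)


if __name__ == "__main__":
    # 96Mi limit from the Deployment, 10MB leaked per request,
    # and an ASSUMED 25Mi baseline for the interpreter and Flask.
    print(requests_before_oom(limit_mi=96, baseline_mi=25, per_request_mi=10))  # 7
```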

Watch It Fail with Lightspeed

You type:

Are there any failing pods in the {namespace} namespace?

Example response:

⚠️ Found 2 pods with issues:

1. broken-app-xxx
   Status: OOMKilled ❌
   Restarts: 2
   Reason: Container exceeded memory limit (96Mi)
   Pattern: Crash loop - restarts when memory leak triggered

2. broken-app-yyy
   Status: OOMKilled ❌
   Restarts: 1
   Reason: Container exceeded memory limit (96Mi)
   Pattern: Memory leak causes OOMKill under load

Both pods are experiencing Out Of Memory errors. The app has a
memory leak - each request allocates 10MB that isn't released.
The 96Mi limit is exceeded after ~7 requests.

Would you like me to analyze these for anomalies?
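
If you are curious where the OOMKilled reason comes from, it is recorded in the Pod's container status. The Python sketch below reads it from a Pod object shaped like oc get pod <pod-name> -o json output. The function name is hypothetical and this is not how the MCP Server is implemented; the field names (containerStatuses, state, lastState, terminated.reason, restartCount) are standard Kubernetes Pod status fields.

```python
def oom_killed_containers(pod: dict) -> list:
    """Return (container name, restart count) for containers terminated by OOMKilled."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # Check the current state first, then the previous one: a restarted
        # container is Running again and keeps the OOMKilled reason in lastState.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((cs["name"], cs.get("restartCount", 0)))
                break  # count each container once
    return hits


if __name__ == "__main__":
    # Hypothetical pod status for illustration.
    pod = {"status": {"containerStatuses": [{
        "name": "web",
        "restartCount": 2,
        "state": {"running": {}},
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
    }]}}
    print(oom_killed_containers(pod))
```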

Part 5: AI-Powered Anomaly Detection

You type:

Yes, analyze the broken-app pods for anomalies

Example response:

Summary

Scope: pods with label app=broken-app in namespace self-healing-platform
Metric: pod_restarts (last 6h)
Result: 1 minor anomaly detected OR No anomalies detected
anomaly_count: 0-1
max_score: 0.30 or 0
severity: info
recommendation: Monitor — investigate if this persists or grows

Interpretation

This is a low-confidence signal. The ML model may not flag recent OOMKilled
restarts as critical anomalies, especially if the cluster has low historical
data. You may see "no anomalies detected" even with failing pods.

Recommended next checks (run from a machine with oc and cluster-admin access):

List the broken-app pods and restart counts:
  oc get pods -n self-healing-platform -l app=broken-app

Describe any pod with restarts or non-Ready status:
  oc describe pod -n self-healing-platform <pod-name>

Fetch container logs (current and previous):
  oc logs -n self-healing-platform <pod-name>
  oc logs -n self-healing-platform <pod-name> --previous

Check namespace events:
  oc get events -n self-healing-platform --sort-by='.lastTimestamp'

Known Limitation: The anomaly detection model may show low confidence or zero anomalies even when pods are actively OOMKilling. This is expected for newly deployed clusters with limited historical data.

For now, you can still trigger remediation manually by asking Lightspeed to "Fix the broken-app pods" even if no anomalies are detected.

Improvements to anomaly detection are planned and being tracked.

🎯 What happened? The anomaly-detector ML model analyzed the pod metrics from Prometheus. The model requires historical baseline data and may not flag recent issues with high confidence.
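
To see why a detector can stay quiet without enough history, consider the toy example below. It uses a plain z-score as a stand-in; the platform's actual anomaly-detector model is not shown in this workshop and will differ. The function name, the min_samples threshold of 30, and the scoring rule are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def anomaly_score(history: list, latest: float, min_samples: int = 30) -> float:
    """Toy z-score: how far `latest` sits from the historical mean.

    Returns 0.0 when there is too little history to form a baseline,
    which mirrors the cold-start behavior described above.
    """
    if len(history) < min_samples:
        return 0.0
    spread = pstdev(history) or 1.0  # avoid dividing by zero on a flat baseline
    return abs(latest - mean(history)) / spread


if __name__ == "__main__":
    # New app: only three data points, so 3 restarts is not flagged.
    print(anomaly_score([0, 0, 1], latest=3))
    # Established app: 30 quiet samples, so 3 restarts stands out.
    print(anomaly_score([0] * 30, latest=3))
```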

Part 6: Self-Healing in Action

You type:

Yes, fix it automatically

Example response:

I've triggered automated remediations to recreate the broken-app pods.

Details

Incident: inc-0ad6dc07
Actions triggered:
- Delete pod broken-app-56c6f7fbc9-99c4t (workflow wf-88dd7a4c) — status: pending
- Delete pod broken-app-56c6f7fbc9-stw22 (workflow wf-d9e925b2) — status: pending
Estimated duration: ~5 minutes

What to expect

The pods were deleted so their owning controller will create new pods.
You may see short disruption while new pods start.
New pod names will differ from the old ones.

Current Limitation: The remediation currently only restarts pods. It does NOT update the resource limits in the underlying Deployment.

This means:

  • Pods are deleted and recreated with the SAME 96Mi memory limit

  • OOMKilled errors will recur as soon as the app receives more requests

  • True self-healing (resource limit adjustment) is not available yet and is being tracked as an enhancement

Workaround: Manually update the deployment after Lightspeed remediation:

oc patch deployment broken-app -n self-healing-platform --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "256Mi"}
]'
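
If you need the same patch for a different container index or memory value, you can generate the JSON Patch document instead of editing it by hand. The Python helper below is a hypothetical convenience, not part of the platform; it emits the same operation and path used in the command above.

```python
import json


def memory_limit_patch(container_index: int, new_limit: str) -> str:
    """Build a JSON Patch (RFC 6902) document that replaces one container's memory limit."""
    ops = [{
        "op": "replace",
        "path": f"/spec/template/spec/containers/{container_index}/resources/limits/memory",
        "value": new_limit,
    }]
    return json.dumps(ops)


if __name__ == "__main__":
    # Prints a value suitable for: oc patch deployment ... --type='json' -p='<output>'
    print(memory_limit_patch(0, "256Mi"))
```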

Verify the Fix

After 45 seconds:

You type:

Are the broken-app pods healthy now?

Example response:

✅ Pods Recreated!

1. broken-app-abc123
   Status: Running ✅
   CPU: 52m / 100m (52%)
   Memory: 45Mi / 96Mi (47%)
   Restarts: 0 (since remediation)
   Uptime: 1m

2. broken-app-def456
   Status: Running ✅
   CPU: 49m / 100m (49%)
   Memory: 43Mi / 96Mi (45%)
   Restarts: 0 (since remediation)
   Uptime: 1m

Status:
- Fresh pods are running with original limits (96Mi memory)
- No restarts yet (the memory leak hasn't been triggered)
- The underlying issue is NOT fixed - pods will OOMKill again under load

⚠️ Important: If you applied the manual patch to increase memory limits,
you'll see 256Mi limits instead of 96Mi. Otherwise, the pods will fail
again when the memory leak is triggered.

Part 7: Understanding Historical Patterns

You type:

Show me recent incidents and how they were resolved

Example response:

Incident History (Last 24 Hours):

Total: 1 incident
Average resolution: 45 seconds

📋 Incident Details:

inc-20260112-001
- Time: 15:44:38 UTC
- Type: OOMKilled (Memory Leak)
- Pods: broken-app-xxx, broken-app-yyy
- Root cause: Application memory leak (~10MB/request)
- Trigger: Memory exceeded 96Mi limit after ~7 requests

Remediation:
- Action: Deleted pods so the Deployment recreated them (temporary fix)
- Applied: YES ✅
- Resolution time: 45s
- Changes: None (memory limit still 96Mi)
- Outcome: Pods running again; OOMKill will recur under load
- Follow-up: Raise the memory limit; code review recommended for memory leak

Pattern Recognition:
ML detected this as "memory leak" pattern, not just
"insufficient resources". Recommended both immediate
fix (increase limits) and permanent fix (code review).

Part 8: More Conversations to Try

These are example conversation flows showing possible interactions. Your actual conversations will vary based on your LLM model and cluster state. Use these as inspiration for questions to ask!

Troubleshooting

Investigate slow performance:

You: The sample-flask-app feels slow

Lightspeed:
Investigating...

Pod Analysis:
- sample-flask-app-xxx: CPU 185m/200m (92%) ⚠️
- sample-flask-app-yyy: CPU 178m/200m (89%) ⚠️

Root Cause: CPU saturation (throttling)
Recommendation: Scale to 3 replicas or increase CPU limits

Should I scale now?

Proactive monitoring:

You: Will any pods have issues in the next hour?

Lightspeed:
🔮 Predictions for next 60 minutes:

Low Risk: 5 pods stable

⚠️ Medium Risk:
- sample-flask-app-xxx: CPU will reach 99% in ~45 min
- sample-flask-app-yyy: CPU will reach 97% in ~50 min

Recommendation: Scale to 3 replicas NOW to prevent issues

Should I apply preemptively?

Capacity Questions

You: What will memory usage be tomorrow at 9 AM?

Lightspeed:
Predicted memory at 09:00 tomorrow:
- 71.3% (current trend: -3.2% from today)
- Lower usage expected (morning hours)
- Confidence: 88%

You: Can I run 10 more medium-sized pods?

Lightspeed:
Capacity check for 10 medium pods (128Mi/200m):

Required resources:
- Memory: 1280Mi
- CPU: 2000m

Current availability:
- Memory: Only ~600Mi available
- CPU: ~800m available

❌ Cannot accommodate 10 pods
✅ Can run up to 4 pods (512Mi memory, 800m CPU)
⚠️ 4 pods would use all of the available CPU
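
The capacity arithmetic in the exchange above comes down to finding the tighter of two limits. Here is a minimal Python sketch; the function name is hypothetical, and the calculate-pod-capacity tool may take more factors into account (quotas, per-node fit, safety margins).

```python
def max_additional_pods(mem_available_mi: int, cpu_available_m: int,
                        mem_per_pod_mi: int, cpu_per_pod_m: int) -> int:
    """Pods that fit within both the memory and the CPU headroom."""
    by_memory = mem_available_mi // mem_per_pod_mi
    by_cpu = cpu_available_m // cpu_per_pod_m
    return min(by_memory, by_cpu)  # the scarcer resource decides


if __name__ == "__main__":
    # Figures from the example: ~600Mi and ~800m free, 128Mi/200m per pod.
    print(max_additional_pods(600, 800, 128, 200))  # 4
```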

Model Health

You: Are the ML models working properly?

Lightspeed:
ML Model Status:

1. anomaly-detector
   Status: READY ✅
   Endpoint: http://anomaly-detector-predictor:8080
   Recent predictions: 47 (last hour)
   Average latency: 23ms
   Success rate: 100%

2. predictive-analytics
   Status: READY ✅
   Endpoint: http://predictive-analytics-predictor:8080
   Recent predictions: 32 (last hour)
   Average latency: 18ms
   Success rate: 100%

All models healthy and performing well!

Part 9: Quick Reference

Health & Status

"What's the cluster health?"
"Are there any failing pods?"
"Show me pods in {namespace}"
"What's using the most memory?"

Predictions

"What will CPU be at 3 PM?"
"Predict memory usage tomorrow at 9 AM"
"What happens if I scale to 5 replicas?"
"How many more pods can I run?"
"Will I have capacity issues this weekend?"

Troubleshooting

"Why is pod X failing?"
"Analyze deployment Y for anomalies"
"What caused the OOMKilled errors?"
"Show me pods with high restarts"
"Will any pods fail in the next hour?"

Actions

"Fix the failing pods"
"Scale deployment X to 5 replicas"
"Increase memory for pod Y"

ML Models

"Are ML models healthy?"
"What models are available?"
"Check anomaly detector status"

History

"Show recent incidents"
"What's the average resolution time?"
"How many incidents happened today?"

Part 10: Cleanup

Before moving to the next module, let’s clean up the test deployments:

oc delete deployment sample-flask-app broken-app -n self-healing-platform

Troubleshooting

Issue: "No pods found with label app=broken-app"

Cause: The deployment was created with a different label, or you’re querying the wrong namespace.

Fix: Explicitly specify the correct label and namespace in your query:

Show me pods with app=broken-app in {namespace}

Or verify the deployment exists and check its labels:

oc get deployment broken-app -n self-healing-platform -o yaml | grep -A 3 labels

Issue: "No anomalies detected" even with OOMKilled pods

Cause: The ML model requires historical baseline data and may not flag recent issues with high confidence. Newly deployed clusters or apps may show low or zero anomaly scores.

What this means: This is expected behavior for the current platform version. The anomaly detection model learns from historical Prometheus metrics and needs time to establish baselines.

Workaround: You can still trigger remediation by directly asking:

Fix the broken-app pods in {namespace}

Status: We’re working on improving anomaly detection for cold-start scenarios and will update the workshop when improvements are deployed.

Issue: Pods still OOMKill after remediation

Cause: The current remediation only restarts pods; it does not update resource limits automatically.

Fix: Manually patch the deployment after Lightspeed remediation:

oc patch deployment broken-app -n self-healing-platform --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "256Mi"}
]'

Verify the change:

oc get deployment broken-app -n self-healing-platform -o jsonpath='{.spec.template.spec.containers[0].resources}'

Status: Intelligent resource limit adjustment is being tracked as a feature enhancement to the coordination-engine.

Issue: Lightspeed doesn’t respond or shows errors

Cause: Multiple possible issues:

  1. MCP Server not running: Check if the MCP server pod is healthy

  2. OLSConfig not created: Verify you completed Module 2

  3. LLM API key invalid: Check secret in openshift-lightspeed namespace

  4. Network issues: Verify Lightspeed can reach the MCP server service

Debug steps:

# Check MCP server status
oc get pods -n self-healing-platform -l app=mcp-server

# Check Lightspeed configuration
oc get olsconfig -n openshift-lightspeed

# Check Lightspeed logs
oc logs -n openshift-lightspeed deployment/lightspeed-app-server

Issue: "Permission denied" or "Unauthorized" errors

Cause: The MCP server ServiceAccount may not have sufficient RBAC permissions.

Fix: Verify the cluster-admin ClusterRoleBinding exists:

oc get clusterrolebinding mcp-server-cluster-admin

If missing, recreate it:

oc create clusterrolebinding mcp-server-cluster-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=self-healing-platform:mcp-server

Summary

You’ve now explored:

  • ✅ Chatting with your cluster using natural language

  • ✅ Deploying sample workloads

  • ✅ Using ML models to predict resource usage

  • ✅ Detecting anomalies automatically

  • ✅ Triggering self-healing remediation

  • ✅ Understanding historical patterns

The Power of Natural Language Operations

Traditional way:

kubectl get pods -n {namespace} | grep -v Running
kubectl describe pod broken-app-xxx
kubectl logs broken-app-xxx
kubectl edit deployment broken-app
# ... manually update limits ...
kubectl rollout status deployment/broken-app

With Lightspeed:

"Fix the broken pods"

Key Capabilities Demonstrated

  • ✅ Natural language cluster interaction via Lightspeed UI

  • ✅ MCP tools provide accurate pod status and cluster data

  • ✅ ML models (anomaly detection, predictive analytics) are accessible

  • ✅ Automated pod restart remediation via coordination-engine

  • ⚠️ Anomaly detection works but may show low confidence for new clusters

  • ⚠️ Resource limit updates currently require manual intervention

Current Platform Status

What works today:

  • Pod health queries and status checks

  • Resource usage predictions (when historical data available)

  • Automated incident tracking

  • Pod restart remediation

  • Historical incident analysis

What’s being improved:

  • Anomaly detection confidence for newly deployed apps

  • Intelligent resource limit adjustment (currently manual)

  • Scaling recommendations based on ML predictions

This is an active platform - features improve as the ML models learn from your cluster’s metrics!

Next Steps

Explore more advanced topics:


Happy chatting with your cluster! 🚀