Module 2: Deploy MCP Server & Configure Lightspeed
Overview
This module is REQUIRED before the End-to-End Self-Healing demo. You’ll deploy the MCP Server and configure OpenShift Lightspeed to enable AI-powered cluster management.
What you’ll accomplish:

- Deploy the Cluster Health MCP Server
- Configure an LLM provider (OpenAI, existing Lightspeed, or vLLM)
- Create an OLSConfig to connect Lightspeed to the MCP Server
- Explore Lightspeed capabilities with hands-on UI testing
- Learn the 7 core MCP tools available for cluster management
- Understand the complete data flow architecture
Note: This module must be completed before Module 3 (End-to-End Self-Healing), as the self-healing demo relies on Lightspeed to interact with the platform. Focus: this module emphasizes natural language interaction through the OpenShift Console chatbot. Module 3 will demonstrate practical troubleshooting workflows using these capabilities.
Prerequisites
Before proceeding, ensure:

- ✅ Completed Module 0 and Module 1
- ✅ Platform is deployed (via AgnosticD workload)
- ✅ You have an OpenAI API key OR an existing Lightspeed cluster OR a Hugging Face account
- ✅ Admin access to the cluster
Verify platform is running:
oc get pods -n self-healing-platform
oc get inferenceservices -n self-healing-platform
Part 1: Verify the MCP Server
The MCP (Model Context Protocol) Server enables OpenShift Lightspeed to interact with your self-healing platform.
Step 1.1: Verify MCP Server Deployment
The MCP Server should already be deployed by the AgnosticD workload:
oc get deployment mcp-server -n self-healing-platform
oc get service mcp-server -n self-healing-platform
Expected output:
NAME READY UP-TO-DATE AVAILABLE AGE
mcp-server 1/1 1 1 2d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mcp-server ClusterIP 172.30.118.176 <none> 8080/TCP 2d
Step 1.2: Check MCP Server Health
oc run test-curl --image=registry.access.redhat.com/ubi9/ubi-minimal:latest \
--rm -i --restart=Never -n self-healing-platform -- \
curl -s http://mcp-server:8080/health
Expected output:
OK
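If you prefer to script this health check (for example in CI), here is a minimal Python sketch. It assumes the script can reach the service — e.g. from inside the cluster, or via `oc port-forward svc/mcp-server 8080:8080 -n self-healing-platform` — and the function names are illustrative, not part of the platform:

```python
# Minimal sketch: poll the MCP Server /health endpoint until it reports OK.
# Assumes network reachability to the service (in-cluster or port-forwarded).
import time
import urllib.request

def is_healthy(body: str) -> bool:
    """The health endpoint returns the plain-text body 'OK' when ready."""
    return body.strip() == "OK"

def wait_for_mcp(url: str = "http://mcp-server:8080/health",
                 timeout_s: int = 60, interval_s: int = 5) -> bool:
    """Retry the health check until it passes or the deadline expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_healthy(resp.read().decode()):
                    return True
        except OSError:
            pass  # server not reachable yet; retry
        time.sleep(interval_s)
    return False
```

Calling `wait_for_mcp()` returns `True` as soon as the endpoint answers `OK`, mirroring the curl test above.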
Step 1.3: View MCP Server Logs
Check the MCP server logs to see registered tools and resources:
oc logs deployment/mcp-server -n self-healing-platform --tail=30
Expected log output:
2026/02/19 14:34:01 Total tools registered: 12
2026/02/19 14:34:01 Total resources registered: 4
2026/02/19 14:34:01 Total prompts registered: 6
2026/02/19 14:34:01 MCP Server initialized: openshift-cluster-health v0.1.0
2026/02/19 14:34:01 MCP Server listening on 0.0.0.0:8080
Note: The MCP Server provides 12 tools, 4 resources, and 6 prompts that Lightspeed can use for cluster management. You’ll learn about these in Part 6.
Part 2: Configure LLM Provider
OpenShift Lightspeed requires an LLM provider. Choose ONE option based on your environment:
| Option | Description | Best For | Setup Time |
|---|---|---|---|
| A: OpenAI | Use OpenAI’s cloud API (manual setup) | Quick start, reliable responses | ~5 min |
| B: Existing Lightspeed | Cluster already has Lightspeed configured | Pre-configured environments (Azure, watsonx) | ~10 min |
| C: Deploy vLLM | Self-host Llama 3.2 1B in your cluster | No external dependencies, learning experience | ~20 min |
Option A: OpenAI (Recommended for Quick Start)
Step A.1: Get OpenAI API Key
- Create a new API key
- Copy the key (starts with sk-proj-)
Step A.2: Create Secret
oc create secret generic openai-api-key \
-n openshift-lightspeed \
--from-literal=apitoken='sk-proj-YOUR_KEY_HERE'
Note: Continue to Part 3, Option A for OLSConfig creation.
Option B: Existing Lightspeed Cluster
For clusters with pre-configured Lightspeed (Azure OpenAI, IBM watsonx, etc.)
Step B.1: Verify Existing OLSConfig
# Check if OLSConfig exists
oc get olsconfig cluster -n openshift-lightspeed
# View current configuration
oc get olsconfig cluster -o yaml | less
Look for existing spec.llm.providers section showing your LLM provider.
Step B.2: Identify Provider Type
Common configurations:
Azure OpenAI:

spec:
  llm:
    providers:
      - name: Azure
        type: azure_openai
        url: 'https://your-instance.openai.azure.com/'
        credentialsSecretRef:
          name: azure-api-keys

IBM watsonx:

spec:
  llm:
    providers:
      - name: watsonx
        type: watsonx
        url: 'https://us-south.ml.cloud.ibm.com'
        credentialsSecretRef:
          name: watsonx-api-keys
Note: Your existing LLM configuration will be preserved. We’ll only add MCP Server integration in Part 3. Continue to Part 3, Option B.
Option C: Deploy vLLM (Self-Hosted LLM)
Deploy vLLM inference server in your cluster - no external API keys required!
Note: Requirements:

- 8Gi+ available memory
- Persistent storage (10Gi for model caching)
- Network access to Hugging Face Hub
- ~10-15 minutes for the initial model download
Step C.1: Accept Llama 3.2 License
- Visit https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- Sign in to Hugging Face (create an account if needed)
- Click "Agree and access repository"
- Create an access token at https://huggingface.co/settings/tokens
  - Token type: Read
  - Copy the token (starts with hf_)
Step C.2: Create Hugging Face Token Secret
# Replace with your actual token
oc create secret generic huggingface-token \
-n self-healing-platform \
--from-literal=token='hf_YOUR_TOKEN_HERE'
Step C.3: Deploy vLLM Server
cat <<EOF | oc apply -f -
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: self-healing-platform
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: self-healing-platform
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: quay.io/vllm/vllm-openai:latest
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - meta-llama/Llama-3.2-1B-Instruct
            - --port
            - "8000"
            - --host
            - "0.0.0.0"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
            - name: HF_HOME
              value: "/models"
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          volumeMounts:
            - name: model-cache
              mountPath: /models
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 320
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
  namespace: self-healing-platform
spec:
  selector:
    app: vllm-server
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: vllm-server
  namespace: self-healing-platform
spec:
  to:
    kind: Service
    name: vllm-server
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
EOF
Step C.4: Monitor vLLM Deployment
# Watch pod startup (model download takes 5-10 minutes)
oc get pods -n self-healing-platform -l app=vllm-server -w
Open another terminal and watch logs:
# Check logs for model download progress
oc logs -n self-healing-platform deployment/vllm-server -f
Expected log output:
INFO: Downloading meta-llama/Llama-3.2-1B-Instruct...
INFO: Downloaded 483 MB / 1.2 GB...
INFO: Model loaded successfully
INFO: Started server process
Step C.5: Test vLLM Server
# Get the route URL
VLLM_URL=$(oc get route vllm-server -n self-healing-platform -o jsonpath='{.spec.host}')
echo "vLLM URL: https://$VLLM_URL"
# Test with a simple completion
curl -k https://$VLLM_URL/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"prompt": "OpenShift is",
"max_tokens": 20
}'
Expected response:
{
"id": "cmpl-xxx",
"object": "text_completion",
"created": 1234567890,
"model": "meta-llama/Llama-3.2-1B-Instruct",
"choices": [{
"text": " a container orchestration platform...",
"index": 0,
"finish_reason": "length"
}]
}
✅ If you see a valid JSON response with generated text, vLLM is working!
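If you’d rather script this check than use curl, the same OpenAI-compatible endpoint can be called from Python. This is a sketch under the assumption that the route host from Step C.5 is passed in as `base_url`; the helper names (`build_completion_request`, `complete`) are illustrative, not part of the platform:

```python
# Illustrative sketch: call vLLM's OpenAI-compatible /v1/completions endpoint
# from Python instead of curl. Helper names here are ours, not platform APIs.
import json
import ssl
import urllib.request

def build_completion_request(prompt: str, max_tokens: int = 20,
                             model: str = "meta-llama/Llama-3.2-1B-Instruct") -> dict:
    """Payload shape expected by the OpenAI-compatible completions API."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(base_url: str, prompt: str) -> str:
    """POST a completion request and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_completion_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Skip certificate verification, mirroring `curl -k` for the workshop's
    # route certificate. Do not do this in production.
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, timeout=60, context=ctx) as resp:
        return json.load(resp)["choices"][0]["text"]
```

For example, `complete(f"https://{vllm_host}", "OpenShift is")` should return a short continuation, just like the curl test above.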
Note: Continue to Part 3, Option C for OLSConfig creation.

Note: Secret key name: the secret key MUST be named apitoken (see Secret Format Requirements in Part 5).
Part 3: Create OLSConfig
The OLSConfig custom resource connects OpenShift Lightspeed to your LLM provider and MCP Server.
Option A: OpenAI

For Option A (Manual OpenAI Setup)

Step A.3: Create OLSConfig

cat <<EOF | oc apply -f -
apiVersion: ols.openshift.io/v1alpha1
kind: OLSConfig
metadata:
  name: cluster
spec:
  featureGates:
    - MCPServer
  llm:
    providers:
      - name: openai
        type: openai
        url: "https://api.openai.com/v1"
        credentialsSecretRef:
          name: openai-api-key
        models:
          - name: gpt-5-mini
          - name: gpt-4o
  mcpServers:
    - name: cluster-health
      streamableHTTP:
        url: http://mcp-server.self-healing-platform.svc:8080/mcp
        timeout: 30
        enableSSE: true
  ols:
    defaultModel: gpt-5-mini
    defaultProvider: openai
    deployment:
      replicas: 1
    conversationCache:
      type: postgres
  console:
    enabled: true
EOF

Continue to the verification steps below.
Option B: Existing Lightspeed

For Option B (Existing Lightspeed Cluster)
Instead of creating a new OLSConfig, we’ll patch the existing one to add MCP Server integration.
Step B.3: Backup Current OLSConfig
oc get olsconfig cluster -o yaml > /tmp/olsconfig-backup.yaml

Step B.4: Add MCP Server Configuration

# Add the MCPServer feature gate and mcpServers section
oc patch olsconfig cluster --type=merge -p '
spec:
  featureGates:
    - MCPServer
  mcpServers:
    - name: cluster-health
      streamableHTTP:
        url: http://mcp-server.self-healing-platform.svc:8080/mcp
        timeout: 30
        enableSSE: true
'
This patch preserves your existing LLM provider configuration while adding MCP Server support.
Step B.5: Verify Merged Configuration
# View the updated OLSConfig
oc get olsconfig cluster -o yaml | less
# Verify both sections exist:
# 1. Your existing llm.providers (should be unchanged)
# 2. New mcpServers section (should show cluster-health)
# 3. featureGates should include MCPServer
Continue to verification steps below.
Option C: vLLM
For Option C (Deploy vLLM)
Step C.6: Create Dummy API Secret
vLLM doesn’t require authentication, but OLSConfig requires a credentialsSecretRef:
oc create secret generic vllm-api-key \
-n openshift-lightspeed \
--from-literal=apitoken='none'
Step C.7: Create OLSConfig for vLLM
# Get vLLM route URL
VLLM_URL=$(oc get route vllm-server -n self-healing-platform -o jsonpath='{.spec.host}')
echo "Using vLLM URL: https://$VLLM_URL"
cat <<EOF | oc apply -f -
apiVersion: ols.openshift.io/v1alpha1
kind: OLSConfig
metadata:
  name: cluster
spec:
  featureGates:
    - MCPServer
  llm:
    providers:
      - name: vllm-local
        type: rhoai_vllm
        url: "https://$VLLM_URL/v1"
        credentialsSecretRef:
          name: vllm-api-key
        models:
          - name: meta-llama/Llama-3.2-1B-Instruct
  mcpServers:
    - name: cluster-health
      streamableHTTP:
        url: http://mcp-server.self-healing-platform.svc:8080/mcp
        timeout: 30
        enableSSE: true
  ols:
    defaultModel: meta-llama/Llama-3.2-1B-Instruct
    defaultProvider: vllm-local
    deployment:
      replicas: 1
    conversationCache:
      type: postgres
  console:
    enabled: true
EOF
Provider Type: We use rhoai_vllm for OpenShift AI vLLM integration. This tells Lightspeed to use the vLLM OpenAI-compatible API format.
Continue to verification steps below.
Verify OLSConfig (All Options)
# Check OLSConfig exists
oc get olsconfig cluster
# Watch for all conditions to become Ready
oc get olsconfig cluster -o jsonpath='{.status.conditions[*].type}' && echo
oc get olsconfig cluster -o jsonpath='{.status.conditions[*].status}' && echo
Expected conditions: ConsolePluginReady, CacheReady, ApiReady = True
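To wait for readiness in a script rather than eyeballing jsonpath output, a small Python sketch can parse the same status conditions. The `oc` invocation matches the commands above; `failing_conditions` is a hypothetical helper name:

```python
# Sketch: report any OLSConfig status condition whose status is not 'True'.
# Feed it the parsed output of: oc get olsconfig cluster -o json
import json
import subprocess

def failing_conditions(olsconfig: dict) -> list[str]:
    """Return the condition types whose status is not 'True'."""
    conditions = olsconfig.get("status", {}).get("conditions", [])
    return [c["type"] for c in conditions if c.get("status") != "True"]

def check_olsconfig() -> None:
    """Shell out to oc and print a one-line readiness summary."""
    raw = subprocess.run(
        ["oc", "get", "olsconfig", "cluster", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    failing = failing_conditions(json.loads(raw))
    print("All conditions Ready" if not failing else f"Not ready: {failing}")
```

Running `check_olsconfig()` against a healthy cluster should print "All conditions Ready" once ConsolePluginReady, CacheReady, and ApiReady are all True.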
Part 4: Explore Lightspeed Capabilities
Now that Lightspeed is configured, let’s test it thoroughly with hands-on queries. This section demonstrates all 7 core MCP tools through natural language interaction.
Note: About Response Examples in This Section: the "Example Response" sections below show sample outputs for illustration purposes. Your actual Lightspeed responses will vary based on:

- The LLM provider and model you configured
- The current state of your cluster
- Natural variation in LLM-generated text

Don’t worry if your responses don’t match exactly! As long as Lightspeed provides relevant information about your cluster, the integration is working correctly.
Note: For vLLM Users (Option C): Llama 3.2 1B is a smaller model than GPT-4o. Responses may be:

- Shorter and less detailed
- Occasionally less accurate on complex questions

This is expected behavior - the model is optimized for efficiency over maximum accuracy.
Step 4.1: Access Lightspeed via Console
- Open the OpenShift Console: https://console-openshift-console.apps.{guid}.example.com
- Look for the Lightspeed chatbot icon (✨) in the top-right corner of the page
- Click the icon to open the chat interface
Note: The Lightspeed chatbot appears as a floating panel on the right side of the console. You can minimize it when not in use and expand it again by clicking the icon.
Step 4.2: Deploy Optional Sample Application (Recommended)
To make testing more interactive, let’s deploy a simple sample application that you can query with Lightspeed.
oc apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-nginx
  namespace: self-healing-platform
  labels:
    app: sample-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-nginx
  template:
    metadata:
      labels:
        app: sample-nginx
    spec:
      containers:
        - name: nginx
          image: registry.access.redhat.com/ubi9/nginx-122:latest
          ports:
            - containerPort: 8080
          resources:
            limits:
              memory: "128Mi"
              cpu: "100m"
            requests:
              memory: "64Mi"
              cpu: "50m"
---
apiVersion: v1
kind: Service
metadata:
  name: sample-nginx
  namespace: self-healing-platform
spec:
  selector:
    app: sample-nginx
  ports:
    - port: 8080
      targetPort: 8080
EOF
Wait ~30 seconds for the pods to start, then proceed to test queries.
Note: You can skip this step and still test Lightspeed with the platform components (coordination-engine, mcp-server, ML models), but having a sample app makes the queries more realistic.
Step 4.3: Test Cluster Health Queries
Let’s start with basic health checks that use the get-cluster-health MCP tool.
Query 1: Overall Cluster Health
Type in Lightspeed:
What is the overall health of the self-healing-platform namespace?
Example Response:
✅ Cluster Health Summary for self-healing-platform:
Healthy Components (6):
- coordination-engine: Running
- mcp-server: Running
- sample-nginx: 2/2 Running ← Your sample app!
- anomaly-detector-predictor: 2/2 Running
- predictive-analytics-predictor: 2/2 Running
ML Models Status:
- anomaly-detector: READY ✅
- predictive-analytics: READY ✅
Recent Activity:
- 0 incidents in the last hour
- All systems operational
Everything looks healthy!
🎯 What just happened? Lightspeed used the get-cluster-health MCP tool to query your cluster, check pod status, and verify ML models are ready.
MCP Tool Used: get-cluster-health
Query 2: Check for Issues
Are there any unhealthy pods in self-healing-platform?
Example Response:
✅ All pods in self-healing-platform are healthy!
Status: 8/8 pods Running
- coordination-engine-0: Running (uptime: 2d)
- mcp-server-xxx: Running (uptime: 2d)
- sample-nginx-xxx (2 pods): Running
- anomaly-detector-predictor-xxx (2 pods): Running
- predictive-analytics-predictor-xxx (2 pods): Running
No issues detected.
MCP Tool Used: get-cluster-health with pod filtering
Bonus Query: Check Other Namespaces
You can also query other namespaces! Try:
Are there any issues in the openshift-lightspeed namespace?
Example Response:
✅ Namespace openshift-lightspeed looks healthy!
Pods: 3/3 Running
- lightspeed-app-server-xxx: Running
- lightspeed-console-plugin-xxx: Running
- ols-postgres-xxx: Running
No issues detected.
Or check multiple namespaces at once:
Compare the health of self-healing-platform and openshift-lightspeed namespaces
🎯 Lightspeed is cluster-aware! You can ask about any namespace you have access to, not just the platform namespace.
Step 4.4: Test Pod Management Queries
These queries use the list-pods MCP tool to retrieve detailed pod information.
Query 3: List All Pods
List all pods in the {namespace} namespace with their resource usage
Example Response:
Found 8 pods in {namespace}:
1. coordination-engine-0
Status: Running ✅
CPU: 45m / 500m (9%)
Memory: 182Mi / 512Mi (35%)
Restarts: 0
Age: 2d
2. mcp-server-xxx
Status: Running ✅
CPU: 12m / 200m (6%)
Memory: 89Mi / 256Mi (35%)
Restarts: 0
Age: 2d
[... additional pods ...]
All pods are healthy!
MCP Tool Used: list-pods
Query 4: Describe Specific Pod
Show me details about the coordination-engine pod
Example Response:
Coordination Engine Pod Details:
Name: coordination-engine-0
Namespace: {namespace}
Status: Running ✅
Image: quay.io/kubeheal/coordination-engine:latest
Resources:
- CPU: 45m used / 500m limit (9%)
- Memory: 182Mi used / 512Mi limit (35%)
Networking:
- IP: 10.128.2.45
- Ports: 8080 (HTTP), 8081 (Metrics)
Health:
- Readiness: Passed ✅
- Liveness: Passed ✅
- Restarts: 0
- Uptime: 2 days
Recent Events: None (stable)
MCP Tool Used: list-pods with pod name filter
Step 4.5: Test ML Model Queries
These queries use the list-models and get-model-status MCP tools.
Query 5: List Available Models
What ML models are available in the platform?
Example Response:
ML Model Catalog ({namespace}):
1. anomaly-detector
Type: Anomaly Detection
Framework: scikit-learn
Status: READY ✅
Endpoint: http://anomaly-detector-predictor:8080/v1/models/anomaly-detector:predict
Capabilities:
- Detects unusual patterns in metrics
- Memory leak detection
- CPU spike detection
- Pod crash pattern analysis
2. predictive-analytics
Type: Time-Series Forecasting
Framework: scikit-learn
Status: READY ✅
Endpoint: http://predictive-analytics-predictor:8080/v1/models/predictive-analytics:predict
Capabilities:
- Predicts future CPU usage
- Predicts future memory usage
- Capacity planning forecasts
Both models are healthy and ready for inference!
MCP Tool Used: list-models
Query 6: Check Model Status
Show me the status of the anomaly-detector model
Example Response:
Model: anomaly-detector
Status: READY ✅
Runtime: KServe v0.13.1
Replicas: 2/2 Ready
Endpoint: http://anomaly-detector-predictor.{namespace}.svc:8080
Recent Activity:
- Predictions (last hour): 12
- Average latency: 23ms
- Success rate: 100%
Model Details:
- Framework: scikit-learn
- Model type: Isolation Forest
- Features: 8 metrics (CPU, memory, restarts, etc.)
- Trained: 2026-02-18
- Accuracy: 94.3%
The model is healthy and performing well!
MCP Tool Used: get-model-status
Step 4.6: Test Anomaly Detection Queries
These queries use the analyze-anomalies MCP tool to run ML-based detection.
Query 7: Scan for Anomalies
Are there any anomalies in the {namespace} namespace?
Example Response:
Anomaly Detection Scan Complete:
Scanned: 8 pods across {namespace}
Analysis: Last 24 hours of metrics
✅ No anomalies detected!
All pods are operating within normal parameters:
- CPU usage: Normal patterns
- Memory usage: Stable
- Restart counts: 0 (healthy)
- Network traffic: Expected ranges
ML Model: anomaly-detector (94.3% accuracy)
Confidence: 92%
Scan time: 0.34s
MCP Tool Used: analyze-anomalies
Query 8: Check for Unusual Resource Usage
Check for unusual CPU or memory usage patterns in the cluster
Example Response:
Resource Usage Pattern Analysis:
Timeframe: Last 24 hours
Scope: {namespace}
CPU Patterns:
✅ Normal - All pods within expected ranges
- coordination-engine: 9% avg (normal for orchestrator)
- mcp-server: 6% avg (low activity)
- ML predictors: 12-15% avg (inference workload)
Memory Patterns:
✅ Normal - Stable memory usage
- No memory leaks detected
- No unusual growth patterns
- Working set stable over 24h
Anomaly Score: 0.12 (threshold: 0.7)
Status: HEALTHY ✅
No action required - all systems operating normally.
MCP Tool Used: analyze-anomalies with resource focus
Step 4.7: Test Resource Analysis Queries
These queries use get-cluster-health and custom analysis tools.
Query 9: CPU Usage Analysis
Show me CPU usage across all pods in {namespace}
Example Response:
CPU Usage Report ({namespace}):
Total Cluster CPU: 2.5% (156m / 6200m)
By Pod:
1. coordination-engine-0: 45m (9% of limit)
2. mcp-server-xxx: 12m (6% of limit)
3. anomaly-detector-predictor-xxx: 32m (16% of limit)
4. anomaly-detector-predictor-yyy: 28m (14% of limit)
5. predictive-analytics-predictor-xxx: 18m (9% of limit)
6. predictive-analytics-predictor-yyy: 21m (10% of limit)
Top CPU Consumers:
1. coordination-engine: 45m (orchestration workload)
2. anomaly-detector: 60m total (ML inference)
3. predictive-analytics: 39m total (forecasting)
Status: ✅ Healthy - All pods well below limits
Headroom: 6044m (97.5%) available
MCP Tool Used: get-cluster-health with resource breakdown
Query 10: Memory Usage Analysis
Which pods are using the most memory in {namespace}?
Example Response:
Memory Usage Report ({namespace}):
Total Namespace Memory: 842Mi / 2560Mi (33%)
Top Memory Consumers:
1. coordination-engine-0: 182Mi (orchestrator + incident DB)
2. anomaly-detector-predictor-xxx: 156Mi (ML model in memory)
3. anomaly-detector-predictor-yyy: 152Mi (ML model in memory)
4. predictive-analytics-predictor-xxx: 128Mi (model + cache)
5. predictive-analytics-predictor-yyy: 124Mi (model + cache)
Total ML Models Memory: 560Mi (66% of namespace usage)
Total Infrastructure: 282Mi (34% of namespace usage)
Status: ✅ Healthy - 67% headroom available
Recommendations: No scaling needed
MCP Tool Used: get-cluster-health with memory sorting
Step 4.8: Test MCP Catalog Queries
These queries use the internal MCP catalog tools.
Query 11: List Available MCP Tools
What tools are available in the MCP server?
Example Response:
MCP Server Tool Catalog:
Total: 12 tools registered
Core Tools (7):
1. get-cluster-health - Check namespace pod/model health
2. list-pods - Query pods with filtering
3. analyze-anomalies - Run ML anomaly detection
4. trigger-remediation - Apply automated fixes
5. list-incidents - Query historical incidents
6. get-model-status - Check KServe model health
7. list-models - List ML model catalog
Extended Tools (5):
8. predict-resource-usage - Forecast CPU/memory
9. get-metrics - Query Prometheus metrics
10. check-capacity - Calculate available resources
11. get-pod-logs - Retrieve recent logs
12. describe-resource - Get Kubernetes resource details
All tools are accessible via natural language queries!
MCP Tool Used: Internal MCP catalog listing
Query 12: Explain a Specific Tool
Explain the get-cluster-health tool
Example Response:
MCP Tool: get-cluster-health
Description:
Retrieves comprehensive health status for a namespace, including
pod health, ML model readiness, and recent incident activity.
What it does:
1. Queries Kubernetes API for pod status
2. Checks KServe InferenceService readiness
3. Retrieves recent incidents from Coordination Engine
4. Aggregates health metrics
Parameters:
- namespace (optional): Target namespace (default: {namespace})
- include_metrics (optional): Include resource usage (default: true)
Returns:
- Pod status list (name, state, restarts, age)
- ML model readiness (READY/NOT_READY)
- Incident count (last 1h, 24h)
- Overall health score
Used by queries like:
- "What's the cluster health?"
- "Are there any failing pods?"
- "Show me the platform status"
Response time: ~200-500ms
MCP Tool Used: Internal tool introspection
Step 4.9: Verify MCP Server Tool Calls
As you try these queries, you can watch the MCP Server logs to see which tools are being invoked:
# In a separate terminal, watch MCP server logs
oc logs -f deployment/mcp-server -n self-healing-platform
Expected log output:
2026-02-20 10:23:45 Tool invoked: get-cluster-health (params: namespace={namespace})
2026-02-20 10:23:45 Response: 8 pods, 2 models, 0 incidents
2026-02-20 10:24:12 Tool invoked: list-pods (params: namespace={namespace}, filter=all)
2026-02-20 10:24:12 Response: 8 pods returned
2026-02-20 10:25:03 Tool invoked: list-models (params: namespace={namespace})
2026-02-20 10:25:03 Response: 2 models (anomaly-detector, predictive-analytics)
2026-02-20 10:26:34 Tool invoked: analyze-anomalies (params: namespace={namespace}, timeframe=24h)
2026-02-20 10:26:35 Response: 0 anomalies detected (score: 0.12)
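If you want a quick tally of which tools your queries actually exercised, the log format above is easy to parse. A minimal sketch — the regex assumes the `Tool invoked: <name>` line format shown above, and the script name is hypothetical:

```python
# Sketch: tally MCP tool invocations from log lines like
#   "2026-02-20 10:23:45 Tool invoked: get-cluster-health (params: ...)".
# Usage: oc logs deployment/mcp-server -n self-healing-platform | python3 tally.py
import re
import sys
from collections import Counter

TOOL_RE = re.compile(r"Tool invoked: (\S+)")

def tally_tools(lines) -> Counter:
    """Count invocations per tool name across the given log lines."""
    counts = Counter()
    for line in lines:
        m = TOOL_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

if __name__ == "__main__" and not sys.stdin.isatty():
    for tool, n in tally_tools(sys.stdin).most_common():
        print(f"{n:4d}  {tool}")
```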
🎯 What you’ve accomplished: You’ve now verified that Lightspeed can successfully use all 7 core MCP tools through natural language queries!
Step 4.10: Troubleshooting
Problem: Lightspeed doesn’t respond to queries
Check OLSConfig status:
oc get olsconfig cluster -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' && echo
Expected: True
Check Lightspeed pods:
oc get pods -n openshift-lightspeed
Expected: lightspeed-app-server-xxx is Running
Check Lightspeed logs for errors:
oc logs -n openshift-lightspeed deployment/lightspeed-app-server --tail=50
Problem: Lightspeed responds but doesn’t use MCP tools
Verify MCP Server is accessible:
oc get olsconfig cluster -o jsonpath='{.spec.mcpServers[0].streamableHTTP.url}' && echo
Test MCP Server directly:
oc run test-mcp --image=registry.access.redhat.com/ubi9/ubi-minimal:latest \
--rm -i --restart=Never -n self-healing-platform -- \
curl -s http://mcp-server:8080/health
Expected: OK
Problem: MCP tools return errors
Check MCP Server logs for errors:
oc logs deployment/mcp-server -n self-healing-platform --tail=100 | grep -i error
Verify Coordination Engine is running:
oc get pods -n self-healing-platform -l app=coordination-engine
Problem: vLLM responses are poor quality
If using vLLM (Option C), Llama 3.2 1B is a smaller model:
- Try rephrasing queries to be more specific
- Use shorter, direct questions
- Avoid complex multi-step queries
- Consider switching to OpenAI for production use
Note: If you encounter persistent issues, verify that Module 2 Part 2 (LLM Provider configuration) was completed successfully. The most common issue is an invalid or missing API key secret.
Part 5: Understanding OLSConfig
Now that Lightspeed is working, let’s understand the OLSConfig in detail.
Complete OLSConfig Reference
The OLSConfig is a cluster-scoped singleton that configures OpenShift Lightspeed. Here’s a comprehensive example showing all supported provider types:
apiVersion: ols.openshift.io/v1alpha1
kind: OLSConfig
metadata:
  name: cluster  # MUST be 'cluster' (cluster-scoped singleton)
spec:
  # Enable MCP Server integration
  featureGates:
    - MCPServer
  # LLM Provider configuration (configure one or multiple)
  llm:
    providers:
      # OpenAI (GPT-5, GPT-4o)
      - name: openai
        type: openai
        url: "https://api.openai.com/v1"
        credentialsSecretRef:
          name: openai-api-key
        models:
          - name: gpt-5-mini  # Latest efficient model (default)
          - name: gpt-4o      # Previous generation multimodal
          - name: gpt-4       # Stable baseline
      # Google Gemini (OpenAI-compatible endpoint)
      - name: google
        type: openai
        url: "https://generativelanguage.googleapis.com/v1beta/openai"
        credentialsSecretRef:
          name: google-api-key
        models:
          - name: gemini-2.0-flash-exp  # Fast & efficient
          - name: gemini-1.5-pro        # Advanced reasoning
      # Azure OpenAI
      - name: azure
        type: azure_openai
        url: "https://YOUR_RESOURCE.openai.azure.com"
        credentialsSecretRef:
          name: azure-openai-key
        models:
          - name: gpt-4o       # Requires matching deployment name
          - name: gpt-4o-mini  # High efficiency
      # IBM watsonx (BAM)
      - name: watsonx
        type: watsonx
        url: "https://us-south.ml.cloud.ibm.com"
        credentialsSecretRef:
          name: watsonx-api-key
        models:
          - name: ibm/granite-13b-chat-v2  # Enterprise model
      # vLLM (Self-hosted - RHOAI integration)
      - name: vllm-local
        type: rhoai_vllm
        url: "https://vllm-server-{namespace}.apps.cluster.example.com/v1"
        credentialsSecretRef:
          name: vllm-api-key
        models:
          - name: meta-llama/Llama-3.2-1B-Instruct  # Self-hosted
  # MCP Server configuration (integrates with self-healing platform)
  mcpServers:
    - name: cluster-health
      streamableHTTP:
        url: http://mcp-server.{namespace}.svc:8080/mcp
        timeout: 30
        enableSSE: true
  # OLS deployment configuration
  ols:
    defaultModel: gpt-5-mini  # Choose your default model
    defaultProvider: openai   # Choose your default provider
    deployment:
      replicas: 1
    conversationCache:
      type: postgres  # REQUIRED: Only 'postgres' is supported
  # Console plugin configuration
  console:
    enabled: true
View Current Configuration
# View the complete OLSConfig
oc get olsconfig cluster -o yaml
# View just the LLM providers
oc get olsconfig cluster -o jsonpath='{.spec.llm.providers[*].name}' && echo
# View MCP servers
oc get olsconfig cluster -o jsonpath='{.spec.mcpServers[*].name}' && echo
Supported LLM Provider Types
| Provider | Type Value | Authentication | Notes |
|---|---|---|---|
| OpenAI | openai | API key in secret | GPT-5-mini, GPT-4o recommended |
| Google Gemini | openai | API key in secret | Uses OpenAI-compatible endpoint |
| Azure OpenAI | azure_openai | API key + endpoint | Requires Azure deployment name |
| IBM watsonx | watsonx | API key | Granite models for enterprise |
| vLLM (RHOAI) | rhoai_vllm | Optional (dummy secret) | Self-hosted, no external API |
| IBM BAM | bam | API key | IBM Research platform |
Secret Format Requirements
All LLM provider secrets must use the key name apitoken:
# Correct format
oc create secret generic openai-api-key \
-n openshift-lightspeed \
--from-literal=apitoken='sk-proj-...'
# WRONG - will fail validation
oc create secret generic openai-api-key \
-n openshift-lightspeed \
--from-literal=api_key='sk-proj-...' # ❌ Wrong key name
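A quick scripted sanity check for this pitfall: parse the secret JSON and confirm the apitoken key exists. `has_required_key` and `check` are illustrative helper names, not part of Lightspeed:

```python
# Illustrative check: an LLM provider secret must expose its credential under
# the data key 'apitoken'. Feed this the output of:
#   oc get secret openai-api-key -n openshift-lightspeed -o json
import json

REQUIRED_KEY = "apitoken"

def has_required_key(secret: dict) -> bool:
    """True if the secret's data section contains the 'apitoken' key."""
    return REQUIRED_KEY in secret.get("data", {})

def check(raw_json: str) -> str:
    """Return a one-line verdict for a secret's JSON representation."""
    secret = json.loads(raw_json)
    name = secret.get("metadata", {}).get("name", "<unknown>")
    if has_required_key(secret):
        return f"{name}: OK"
    keys = ", ".join(secret.get("data", {})) or "none"
    return f"{name}: missing '{REQUIRED_KEY}' (found keys: {keys})"
```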
Key Configuration Points
Cluster-scoped singleton:
- OLSConfig name MUST be cluster
- Only one OLSConfig can exist per cluster
- No namespace field in metadata
MCP Server integration:
- Requires MCPServer feature gate
- Uses internal service URL for cluster-health MCP server
- Timeout of 30 seconds recommended for ML model calls
Conversation cache:
- Only postgres type is supported
- Automatically provisioned by Lightspeed operator
- Stores conversation history for context
Multiple providers:
- Configure multiple providers for redundancy
- Set defaultProvider and defaultModel for primary use
- Users can switch providers via console UI
Part 6: The 7 Core MCP Tools
The MCP Server registers 12 tools in total; these are the 7 core tools that Lightspeed calls for cluster management:
Tool Reference
| Tool | Description | Returns |
|---|---|---|
| get-cluster-health | Check namespace status, pods, ML models | Health summary with metrics |
| list-pods | Query pods with filtering | Pod list with details |
| analyze-anomalies | Call ML models for detection | Anomaly detection results + recommendations |
| trigger-remediation | Apply fixes via Coordination Engine | Remediation status and ID |
| list-incidents | Query historical incidents | Incident data with resolution info |
| get-model-status | Check KServe InferenceService health | Model status and endpoints |
| list-models | List available ML model catalog | Model names and capabilities |
How Tools Are Called
When you ask Lightspeed a question, it:

- Parses your natural language intent
- Selects the appropriate MCP tool(s)
- Calls the tool via the MCP protocol
- Receives a structured JSON response
- Formats a human-readable reply
Example Flow:

You: "What's the cluster health?"
        │
        ▼
Lightspeed: Intent = "get health status"
            Tool = "get-cluster-health"
        │
        ▼
MCP Server: POST /mcp (tools/call: get-cluster-health)
            Params: {namespace: "self-healing-platform"}
        │
        ▼
Coordination Engine: Query Kubernetes API
                     Check InferenceServices
        │
        ▼
Response: {
  "status": "healthy",
  "pods": [...],
  "models": [...]
}
        │
        ▼
Lightspeed: "✅ Cluster Health Summary for self-healing-platform:
             Healthy Components (5): ..."
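Under the hood, the streamable HTTP transport carries JSON-RPC 2.0 messages. Lightspeed builds these for you, but for the curious, here is a sketch of the tools/call envelope for the flow above. The message shape follows the MCP specification; the request id is arbitrary, and `make_tools_call` is an illustrative helper:

```python
# Illustrative only: the JSON-RPC 2.0 envelope used by MCP's tools/call
# method. Lightspeed constructs and sends these over streamable HTTP for you.
import json

def make_tools_call(tool_name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build an MCP tools/call request envelope (JSON-RPC 2.0)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

request = make_tools_call("get-cluster-health",
                          {"namespace": "self-healing-platform"})
print(json.dumps(request, indent=2))
```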
Part 7: Understanding the Architecture
Complete Data Flow
┌─────────────────────────────────────────────────────────────────┐
│ User Interface Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ OCP Console │ │ Python API │ │ Custom Integrations │ │
│ └──────┬──────┘ └──────┬──────┘ └───────────┬─────────────┘ │
│ │ │ │ │
│ └────────────────┴─────────────────────┘ │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────────┤
│ AI/LLM Layer │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ OpenShift Lightspeed (OLS Server) │ │
│ │ • Natural language understanding │ │
│ │ • MCP tool selection │ │
│ │ • Response formatting │ │
│ └────────────────────────┬────────────────────────────────┘ │
│ │ MCP Protocol │
│ ▼ │
├─────────────────────────────────────────────────────────────────┤
│ Tool Layer │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ MCP Server (Go) │ │
│ │ • 7 tools exposed via MCP protocol │ │
│ │ • Routes requests to Coordination Engine │ │
│ │ • Returns structured JSON responses │ │
│ └────────────────────────┬────────────────────────────────┘ │
│ │ REST API │
│ ▼ │
├─────────────────────────────────────────────────────────────────┤
│ Orchestration Layer │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Coordination Engine (Go) │ │
│ │ • Queries Prometheus for metrics │ │
│ │ • Calls KServe ML models │ │
│ │ • Applies remediation to cluster │ │
│ │ • Tracks incidents and history │ │
│ └─────────┬─────────────────┬─────────────────┬───────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
├─────────────────────────────────────────────────────────────────┤
│ Data/ML Layer │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────────────┐ │
│ │ Prometheus │ │ KServe Models │ │ Kubernetes API │ │
│ │ (metrics) │ │ (inference) │ │ (cluster ops) │ │
│ └───────────────┘ └───────────────┘ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Summary
In this module, you accomplished:

- ✅ Deployed MCP Server - Verified the cluster-health MCP server is running
- ✅ Configured LLM Provider - Set up OpenAI, an existing Lightspeed cluster, or self-hosted vLLM
- ✅ Created OLSConfig - Connected Lightspeed to the MCP Server via the MCPServer feature gate
- ✅ Explored Lightspeed Capabilities - Tested 12 different natural language queries covering all 7 core MCP tools
- ✅ Learned the MCP Tools - Understood the cluster health, pod management, ML model, and anomaly detection tools
- ✅ Understood the Architecture - Complete data flow from console chat to ML models and back
Note: You’re now ready for Module 3: End-to-End Self-Healing! You’ve verified that Lightspeed can successfully interact with your cluster through natural language. Module 3 will show you how to use these capabilities for real-world troubleshooting scenarios - all through conversational interaction with Lightspeed, no code required!
Resources
Advanced: Programmatic Access
For users interested in programmatic access to Lightspeed (Python API, automation scripts, integration patterns):
- Python Examples Directory - LightspeedClient, monitoring scripts, integration patterns
- Python Examples README - Installation, usage, and container image instructions
Note: The Python examples are optional and not covered in this workshop. They’re provided for advanced users who want to build custom automation or integrations. The workshop focuses on natural language interaction through the OpenShift Console.