Module 2: Deploy MCP Server & Configure Lightspeed

Overview

This module is REQUIRED before the End-to-End Self-Healing demo. You’ll verify the MCP Server deployment and configure OpenShift Lightspeed to enable AI-powered cluster management.

What you’ll accomplish:

  • Verify the Cluster Health MCP Server deployment

  • Configure LLM provider (OpenAI, existing Lightspeed, or vLLM)

  • Create OLSConfig to connect Lightspeed to MCP Server

  • Explore Lightspeed capabilities with hands-on UI testing

  • Learn the 7 core MCP tools available for cluster management

  • Understand the complete data flow architecture

This module must be completed before Module 3 (End-to-End Self-Healing), as the self-healing demo relies on Lightspeed to interact with the platform.

Focus: This module emphasizes natural language interaction through the OpenShift Console chatbot. Module 3 will demonstrate practical troubleshooting workflows using these capabilities.

Prerequisites

Before proceeding, ensure:

  • ✅ Completed Module 0 and Module 1

  • ✅ Platform is deployed (via AgnosticD workload)

  • ✅ You have an OpenAI API key OR existing Lightspeed cluster OR Hugging Face account

  • ✅ Admin access to the cluster

Verify platform is running:

oc get pods -n self-healing-platform
oc get inferenceservices -n self-healing-platform

Part 1: Verify the MCP Server

The MCP (Model Context Protocol) Server enables OpenShift Lightspeed to interact with your self-healing platform.

Step 1.1: Verify MCP Server Deployment

The MCP Server should already be deployed by the AgnosticD workload:

oc get deployment mcp-server -n self-healing-platform
oc get service mcp-server -n self-healing-platform

Expected output:

NAME         READY   UP-TO-DATE   AVAILABLE   AGE
mcp-server   1/1     1            1           2d

NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
mcp-server   ClusterIP   172.30.118.176   <none>        8080/TCP   2d

Step 1.2: Check MCP Server Health

oc run test-curl --image=registry.access.redhat.com/ubi9/ubi-minimal:latest \
  --rm -i --restart=Never -n self-healing-platform -- \
  curl -s http://mcp-server:8080/health

Expected output:

OK

Step 1.3: View MCP Server Logs

Check the MCP server logs to see registered tools and resources:

oc logs deployment/mcp-server -n self-healing-platform --tail=30

Expected log output:

2026/02/19 14:34:01 Total tools registered: 12
2026/02/19 14:34:01 Total resources registered: 4
2026/02/19 14:34:01 Total prompts registered: 6
2026/02/19 14:34:01 MCP Server initialized: openshift-cluster-health v0.1.0
2026/02/19 14:34:01 MCP Server listening on 0.0.0.0:8080

The MCP Server provides 12 tools, 4 resources, and 6 prompts that Lightspeed can use for cluster management. You’ll learn about the 7 core tools in Part 6.
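If you prefer to script this verification, the registration count can be pulled straight out of the startup log. This is a sketch that assumes the exact log format shown above:

```shell
# Extract the most recent "Total tools registered: N" count from log input.
mcp_tool_count() {
  grep -o 'Total tools registered: [0-9]*' | awk '{print $NF}' | tail -n 1
}

# Usage:
#   oc logs deployment/mcp-server -n self-healing-platform --tail=100 | mcp_tool_count
```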

Part 2: Configure LLM Provider

OpenShift Lightspeed requires an LLM provider. Choose ONE option based on your environment:

Option                   Description                                 Best For                                        Setup Time
A: OpenAI                Use OpenAI’s cloud API (manual setup)       Quick start, reliable responses                 ~5 min
B: Existing Lightspeed   Cluster already has Lightspeed configured   Pre-configured environments (Azure, watsonx)    ~10 min
C: Deploy vLLM           Self-host Llama 3.2 1B in your cluster      No external dependencies, learning experience   ~20 min

Option A: OpenAI (Manual Setup)

Step A.1: Get OpenAI API Key

  1. Visit https://platform.openai.com/api-keys

  2. Create new API key

  3. Copy the key (starts with sk-proj-)

Step A.2: Create Secret

oc create secret generic openai-api-key \
  -n openshift-lightspeed \
  --from-literal=apitoken='sk-proj-YOUR_KEY_HERE'

Continue to Part 3, Option A for OLSConfig creation.

Option B: Existing Lightspeed Cluster

For clusters with pre-configured Lightspeed (Azure OpenAI, IBM watsonx, etc.)

Step B.1: Verify Existing OLSConfig

# Check if OLSConfig exists
oc get olsconfig cluster -n openshift-lightspeed

# View current configuration
oc get olsconfig cluster -o yaml | less

Look for existing spec.llm.providers section showing your LLM provider.

Step B.2: Identify Provider Type

Common configurations:

Azure OpenAI:

spec:
  llm:
    providers:
      - name: Azure
        type: azure_openai
        url: 'https://your-instance.openai.azure.com/'
        credentialsSecretRef:
          name: azure-api-keys

IBM watsonx:

spec:
  llm:
    providers:
      - name: watsonx
        type: watsonx
        url: 'https://us-south.ml.cloud.ibm.com'
        credentialsSecretRef:
          name: watsonx-api-keys

Your existing LLM configuration will be preserved. We’ll only add MCP Server integration in Part 3. Continue to Part 3, Option B.

Option C: Deploy vLLM (Self-Hosted LLM)

Deploy vLLM inference server in your cluster - no external API keys required!

Requirements:

  • 8Gi+ available memory

  • Persistent storage (10Gi for model caching)

  • Network access to Hugging Face Hub

  • ~10-15 minutes for initial model download

Step C.1: Accept Llama 3.2 License

  1. Visit https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

  2. Sign in to Hugging Face (create account if needed)

  3. Click "Agree and access repository"

  4. Create access token: https://huggingface.co/settings/tokens

    • Token type: Read

    • Copy the token (starts with hf_)

Step C.2: Create Hugging Face Token Secret

# Replace with your actual token
oc create secret generic huggingface-token \
  -n self-healing-platform \
  --from-literal=token='hf_YOUR_TOKEN_HERE'
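As an optional pre-flight check, you can validate the token prefix before creating the secret. This is a hypothetical helper; it checks only the hf_ prefix, not whether the token is actually valid:

```shell
# Warn early if the value doesn't look like a Hugging Face access token.
check_hf_token() {
  case "$1" in
    hf_*) echo "token prefix OK" ;;
    *)    echo "WARNING: token does not start with hf_" >&2; return 1 ;;
  esac
}

# Usage:
#   check_hf_token 'hf_YOUR_TOKEN_HERE' && echo "safe to create the secret"
```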

Step C.3: Deploy vLLM Server

cat <<EOF | oc apply -f -
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: self-healing-platform
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: self-healing-platform
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: quay.io/vllm/vllm-openai:latest
        command:
          - python3
          - -m
          - vllm.entrypoints.openai.api_server
          - --model
          - meta-llama/Llama-3.2-1B-Instruct
          - --port
          - "8000"
          - --host
          - "0.0.0.0"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
        - name: HF_HOME
          value: "/models"
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: model-cache
          mountPath: /models
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 300
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 320
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: vllm-model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
  namespace: self-healing-platform
spec:
  selector:
    app: vllm-server
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
    name: http
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: vllm-server
  namespace: self-healing-platform
spec:
  to:
    kind: Service
    name: vllm-server
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
EOF

Step C.4: Monitor vLLM Deployment

# Watch pod startup (model download takes 5-10 minutes)
oc get pods -n self-healing-platform -l app=vllm-server -w

Open another terminal and watch logs:

# Check logs for model download progress
oc logs -n self-healing-platform deployment/vllm-server -f

Expected log output:

INFO: Downloading meta-llama/Llama-3.2-1B-Instruct...
INFO: Downloaded 483 MB / 1.2 GB...
INFO: Model loaded successfully
INFO: Started server process

Step C.5: Test vLLM Server

# Get the route URL
VLLM_URL=$(oc get route vllm-server -n self-healing-platform -o jsonpath='{.spec.host}')
echo "vLLM URL: https://$VLLM_URL"

# Test with a simple completion
curl -k https://$VLLM_URL/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "OpenShift is",
    "max_tokens": 20
  }'

Expected response:

{
  "id": "cmpl-xxx",
  "object": "text_completion",
  "created": 1234567890,
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "choices": [{
    "text": " a container orchestration platform...",
    "index": 0,
    "finish_reason": "length"
  }]
}

✅ If you see a valid JSON response with generated text, vLLM is working!
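vLLM also serves the OpenAI-compatible chat endpoint at /v1/chat/completions. Here is a sketch that builds the request body in a helper so you can inspect it before sending; it does no JSON escaping, so keep the message free of quotes:

```shell
# Build an OpenAI-style chat completion request body.
build_chat_request() {
  # $1 = model name, $2 = user message (no quotes/backslashes)
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":30}' "$1" "$2"
}

BODY=$(build_chat_request "meta-llama/Llama-3.2-1B-Instruct" "What is OpenShift?")

# Send it to the route from Step C.5:
#   curl -k "https://$VLLM_URL/v1/chat/completions" \
#     -H "Content-Type: application/json" -d "$BODY"
```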

Continue to Part 3, Option C for OLSConfig creation.

Secret key name: The secret key MUST be named apitoken for OpenAI (Option A). For vLLM (Option C), the Hugging Face token uses key name token.

Part 3: Create OLSConfig

The OLSConfig custom resource connects OpenShift Lightspeed to your LLM provider and MCP Server.

Option A: OpenAI

For Option A (Manual OpenAI Setup)

Step A.3: Create OLSConfig

cat <<EOF | oc apply -f -
apiVersion: ols.openshift.io/v1alpha1
kind: OLSConfig
metadata:
  name: cluster
spec:
  featureGates:
    - MCPServer

  llm:
    providers:
      - name: openai
        type: openai
        url: "https://api.openai.com/v1"
        credentialsSecretRef:
          name: openai-api-key
        models:
          - name: gpt-5-mini
          - name: gpt-4o

  mcpServers:
    - name: cluster-health
      streamableHTTP:
        url: http://mcp-server.self-healing-platform.svc:8080/mcp
        timeout: 30
        enableSSE: true

  ols:
    defaultModel: gpt-5-mini
    defaultProvider: openai
    deployment:
      replicas: 1
    conversationCache:
      type: postgres

  console:
    enabled: true
EOF

Continue to verification steps below.

Option B: Existing Lightspeed

For Option B (Existing Lightspeed Cluster)

Instead of creating a new OLSConfig, we’ll patch the existing one to add MCP Server integration.

Step B.3: Backup Current OLSConfig

oc get olsconfig cluster -o yaml > /tmp/olsconfig-backup.yaml

Step B.4: Add MCP Server Configuration

# Add MCPServer feature gate and mcpServers section
oc patch olsconfig cluster --type=merge -p '
spec:
  featureGates:
    - MCPServer
  mcpServers:
    - name: cluster-health
      streamableHTTP:
        url: http://mcp-server.self-healing-platform.svc:8080/mcp
        timeout: 30
        enableSSE: true
'

This patch preserves your existing LLM provider configuration while adding MCP Server support.

Step B.5: Verify Merged Configuration

# View the updated OLSConfig
oc get olsconfig cluster -o yaml | less

# Verify both sections exist:
# 1. Your existing llm.providers (should be unchanged)
# 2. New mcpServers section (should show cluster-health)
# 3. featureGates should include MCPServer

Continue to verification steps below.

Option C: vLLM

For Option C (Deploy vLLM)

Step C.6: Create Dummy API Secret

vLLM doesn’t require authentication, but OLSConfig requires a credentialsSecretRef:

oc create secret generic vllm-api-key \
  -n openshift-lightspeed \
  --from-literal=apitoken='none'

Step C.7: Create OLSConfig for vLLM

# Get vLLM route URL
VLLM_URL=$(oc get route vllm-server -n self-healing-platform -o jsonpath='{.spec.host}')
echo "Using vLLM URL: https://$VLLM_URL"

cat <<EOF | oc apply -f -
apiVersion: ols.openshift.io/v1alpha1
kind: OLSConfig
metadata:
  name: cluster
spec:
  featureGates:
    - MCPServer

  llm:
    providers:
      - name: vllm-local
        type: rhoai_vllm
        url: "https://$VLLM_URL/v1"
        credentialsSecretRef:
          name: vllm-api-key
        models:
          - name: meta-llama/Llama-3.2-1B-Instruct

  mcpServers:
    - name: cluster-health
      streamableHTTP:
        url: http://mcp-server.self-healing-platform.svc:8080/mcp
        timeout: 30
        enableSSE: true

  ols:
    defaultModel: meta-llama/Llama-3.2-1B-Instruct
    defaultProvider: vllm-local
    deployment:
      replicas: 1
    conversationCache:
      type: postgres

  console:
    enabled: true
EOF

Provider Type: We use rhoai_vllm for OpenShift AI vLLM integration. This tells Lightspeed to use the vLLM OpenAI-compatible API format.

Continue to verification steps below.

Verify OLSConfig (All Options)

# Check OLSConfig exists
oc get olsconfig cluster

# Watch for all conditions to become Ready
oc get olsconfig cluster -o jsonpath='{.status.conditions[*].type}' && echo
oc get olsconfig cluster -o jsonpath='{.status.conditions[*].status}' && echo

Expected conditions: ConsolePluginReady, CacheReady, ApiReady = True
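To block a script until Lightspeed is ready, you can wrap the jsonpath output above in a small gate. A sketch; it treats any False, Unknown, or empty status list as not ready:

```shell
# Succeed only when every condition status on stdin is "True".
all_conditions_true() {
  case "$(cat)" in
    *False*|*Unknown*|"") return 1 ;;
    *) return 0 ;;
  esac
}

# Usage:
#   oc get olsconfig cluster -o jsonpath='{.status.conditions[*].status}' \
#     | all_conditions_true && echo "Lightspeed ready"
```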

Verify Lightspeed Pods

oc get pods -n openshift-lightspeed

Wait for lightspeed-app-server to be Running (~2-3 minutes).
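To script the wait instead of watching, a helper can count pods that are not yet running. A sketch; it assumes STATUS is the third column of the standard `oc get pods` table:

```shell
# Count pods whose STATUS is neither Running nor Completed; 0 means ready.
not_ready_count() {
  awk 'NR>1 && $3!="Running" && $3!="Completed" {n++} END {print n+0}'
}

# Usage:
#   oc get pods -n openshift-lightspeed | not_ready_count
```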

Part 4: Explore Lightspeed Capabilities

Now that Lightspeed is configured, let’s thoroughly test it with hands-on queries. This section demonstrates all 7 core MCP tools through natural language interaction.

About Response Examples in This Section:

The "Example Response" sections below show sample outputs for illustration purposes. Your actual Lightspeed responses will vary based on:

  • LLM Model Used: GPT-4o, vLLM (Llama 3.2), Gemini, etc. produce different response formats

  • Cluster State: Your actual pod counts, resource usage, and component status

  • Components Deployed: Optional sample app (Step 4.2) affects pod listings

  • Response Style: Different models use different formatting, tone, and detail levels

Don’t worry if your responses don’t match exactly! As long as Lightspeed provides relevant information about your cluster, the integration is working correctly.

For vLLM Users (Option C): Llama 3.2 1B is a smaller model than GPT-4o. Responses may be:

  • More concise

  • Occasionally less accurate for complex queries

  • Faster (lower latency)

This is expected behavior - the model is optimized for efficiency over maximum accuracy.

Step 4.1: Access Lightspeed via Console

  1. Open OpenShift Console: https://console-openshift-console.apps.{guid}.example.com

  2. Look for the Lightspeed chatbot icon (✨) in the top-right corner of the page

  3. Click the icon to open the chat interface

The Lightspeed chatbot appears as a floating panel on the right side of the console. You can minimize it when not in use and expand it again by clicking the icon.

Step 4.2: Deploy a Sample Application (Optional)

To make testing more interactive, let’s deploy a simple sample application that you can query with Lightspeed.

oc apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-nginx
  namespace: self-healing-platform
  labels:
    app: sample-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-nginx
  template:
    metadata:
      labels:
        app: sample-nginx
    spec:
      containers:
      - name: nginx
        image: registry.access.redhat.com/ubi9/nginx-122:latest
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "128Mi"
            cpu: "100m"
          requests:
            memory: "64Mi"
            cpu: "50m"
---
apiVersion: v1
kind: Service
metadata:
  name: sample-nginx
  namespace: self-healing-platform
spec:
  selector:
    app: sample-nginx
  ports:
  - port: 8080
    targetPort: 8080
EOF

Wait ~30 seconds for the pods to start (or run oc rollout status deployment/sample-nginx -n self-healing-platform to block until they are ready), then proceed to test queries.

You can skip this step and still test Lightspeed with the platform components (coordination-engine, mcp-server, ML models), but having a sample app makes the queries more realistic.

Step 4.3: Test Cluster Health Queries

Let’s start with basic health checks that use the get-cluster-health MCP tool.

Query 1: Overall Cluster Health

Type in Lightspeed:

What is the overall health of the self-healing-platform namespace?

Example Response:

✅ Cluster Health Summary for self-healing-platform:

Healthy Components (5):
- coordination-engine: Running
- mcp-server: Running
- sample-nginx: 2/2 Running  ← Your sample app!
- anomaly-detector-predictor: 2/2 Running
- predictive-analytics-predictor: 2/2 Running

ML Models Status:
- anomaly-detector: READY ✅
- predictive-analytics: READY ✅

Recent Activity:
- 0 incidents in the last hour
- All systems operational

Everything looks healthy!

🎯 What just happened? Lightspeed used the get-cluster-health MCP tool to query your cluster, check pod status, and verify ML models are ready.

MCP Tool Used: get-cluster-health

Query 2: Check for Issues

Are there any unhealthy pods in self-healing-platform?

Example Response:

✅ All pods in self-healing-platform are healthy!

Status: 8/8 pods Running
- coordination-engine-0: Running (uptime: 2d)
- mcp-server-xxx: Running (uptime: 2d)
- sample-nginx-xxx (2 pods): Running
- anomaly-detector-predictor-xxx (2 pods): Running
- predictive-analytics-predictor-xxx (2 pods): Running

No issues detected.

MCP Tool Used: get-cluster-health with pod filtering

Query 3: Check Other Namespaces

You can also query other namespaces! Try:

Are there any issues in the openshift-lightspeed namespace?

Example Response:

✅ Namespace openshift-lightspeed looks healthy!

Pods: 3/3 Running
- lightspeed-app-server-xxx: Running
- lightspeed-console-plugin-xxx: Running
- ols-postgres-xxx: Running

No issues detected.

Or check multiple namespaces at once:

Compare the health of self-healing-platform and openshift-lightspeed namespaces

🎯 Lightspeed is cluster-aware! You can ask about any namespace you have access to, not just the platform namespace.

Step 4.4: Test Pod Management Queries

These queries use the list-pods MCP tool to retrieve detailed pod information.

Query 4: List All Pods

List all pods in the {namespace} namespace with their resource usage

Example Response:

Found 8 pods in {namespace}:

1. coordination-engine-0
   Status: Running ✅
   CPU: 45m / 500m (9%)
   Memory: 182Mi / 512Mi (35%)
   Restarts: 0
   Age: 2d

2. mcp-server-xxx
   Status: Running ✅
   CPU: 12m / 200m (6%)
   Memory: 89Mi / 256Mi (35%)
   Restarts: 0
   Age: 2d

[... additional pods ...]

All pods are healthy!

MCP Tool Used: list-pods

Query 5: Describe Specific Pod

Show me details about the coordination-engine pod

Example Response:

Coordination Engine Pod Details:

Name: coordination-engine-0
Namespace: {namespace}
Status: Running ✅
Image: quay.io/kubeheal/coordination-engine:latest

Resources:
- CPU: 45m used / 500m limit (9%)
- Memory: 182Mi used / 512Mi limit (35%)

Networking:
- IP: 10.128.2.45
- Ports: 8080 (HTTP), 8081 (Metrics)

Health:
- Readiness: Passed ✅
- Liveness: Passed ✅
- Restarts: 0
- Uptime: 2 days

Recent Events: None (stable)

MCP Tool Used: list-pods with pod name filter

Step 4.5: Test ML Model Queries

These queries use the list-models and get-model-status MCP tools.

Query 6: List Available Models

What ML models are available in the platform?

Example Response:

ML Model Catalog ({namespace}):

1. anomaly-detector
   Type: Anomaly Detection
   Framework: scikit-learn
   Status: READY ✅
   Endpoint: http://anomaly-detector-predictor:8080/v1/models/anomaly-detector:predict
   Capabilities:
   - Detects unusual patterns in metrics
   - Memory leak detection
   - CPU spike detection
   - Pod crash pattern analysis

2. predictive-analytics
   Type: Time-Series Forecasting
   Framework: scikit-learn
   Status: READY ✅
   Endpoint: http://predictive-analytics-predictor:8080/v1/models/predictive-analytics:predict
   Capabilities:
   - Predicts future CPU usage
   - Predicts future memory usage
   - Capacity planning forecasts

Both models are healthy and ready for inference!

MCP Tool Used: list-models

Query 7: Check Model Status

Show me the status of the anomaly-detector model

Example Response:

Model: anomaly-detector

Status: READY ✅
Runtime: KServe v0.13.1
Replicas: 2/2 Ready
Endpoint: http://anomaly-detector-predictor.{namespace}.svc:8080

Recent Activity:
- Predictions (last hour): 12
- Average latency: 23ms
- Success rate: 100%

Model Details:
- Framework: scikit-learn
- Model type: Isolation Forest
- Features: 8 metrics (CPU, memory, restarts, etc.)
- Trained: 2026-02-18
- Accuracy: 94.3%

The model is healthy and performing well!

MCP Tool Used: get-model-status

Step 4.6: Test Anomaly Detection Queries

These queries use the analyze-anomalies MCP tool to run ML-based detection.

Query 8: Scan for Anomalies

Are there any anomalies in the {namespace} namespace?

Example Response:

Anomaly Detection Scan Complete:

Scanned: 8 pods across {namespace}
Analysis: Last 24 hours of metrics

✅ No anomalies detected!

All pods are operating within normal parameters:
- CPU usage: Normal patterns
- Memory usage: Stable
- Restart counts: 0 (healthy)
- Network traffic: Expected ranges

ML Model: anomaly-detector (94.3% accuracy)
Confidence: 92%
Scan time: 0.34s

MCP Tool Used: analyze-anomalies

Query 9: Check for Unusual Resource Usage

Check for unusual CPU or memory usage patterns in the cluster

Example Response:

Resource Usage Pattern Analysis:

Timeframe: Last 24 hours
Scope: {namespace}

CPU Patterns:
✅ Normal - All pods within expected ranges
- coordination-engine: 9% avg (normal for orchestrator)
- mcp-server: 6% avg (low activity)
- ML predictors: 12-15% avg (inference workload)

Memory Patterns:
✅ Normal - Stable memory usage
- No memory leaks detected
- No unusual growth patterns
- Working set stable over 24h

Anomaly Score: 0.12 (threshold: 0.7)
Status: HEALTHY ✅

No action required - all systems operating normally.

MCP Tool Used: analyze-anomalies with resource focus

Step 4.7: Test Resource Analysis Queries

These queries use get-cluster-health and custom analysis tools.

Query 10: CPU Usage Analysis

Show me CPU usage across all pods in {namespace}

Example Response:

CPU Usage Report ({namespace}):

Total Cluster CPU: 2.5% (156m / 6200m)

By Pod:
1. coordination-engine-0: 45m (9% of limit)
2. mcp-server-xxx: 12m (6% of limit)
3. anomaly-detector-predictor-xxx: 32m (16% of limit)
4. anomaly-detector-predictor-yyy: 28m (14% of limit)
5. predictive-analytics-predictor-xxx: 18m (9% of limit)
6. predictive-analytics-predictor-yyy: 21m (10% of limit)

Top CPU Consumers:
1. coordination-engine: 45m (orchestration workload)
2. anomaly-detector: 60m total (ML inference)
3. predictive-analytics: 39m total (forecasting)

Status: ✅ Healthy - All pods well below limits
Headroom: 6044m (97.5%) available

MCP Tool Used: get-cluster-health with resource breakdown

Query 11: Memory Usage Analysis

Which pods are using the most memory in {namespace}?

Example Response:

Memory Usage Report ({namespace}):

Total Namespace Memory: 842Mi / 2560Mi (33%)

Top Memory Consumers:
1. coordination-engine-0: 182Mi (orchestrator + incident DB)
2. anomaly-detector-predictor-xxx: 156Mi (ML model in memory)
3. anomaly-detector-predictor-yyy: 152Mi (ML model in memory)
4. predictive-analytics-predictor-xxx: 128Mi (model + cache)
5. predictive-analytics-predictor-yyy: 124Mi (model + cache)

Total ML Models Memory: 560Mi (66% of namespace usage)
Total Infrastructure: 282Mi (34% of namespace usage)

Status: ✅ Healthy - 67% headroom available
Recommendations: No scaling needed

MCP Tool Used: get-cluster-health with memory sorting

Step 4.8: Test MCP Catalog Queries

These queries use the internal MCP catalog tools.

Query 12: List Available MCP Tools

What tools are available in the MCP server?

Example Response:

MCP Server Tool Catalog:

Total: 12 tools registered

Core Tools (7):
1. get-cluster-health - Check namespace pod/model health
2. list-pods - Query pods with filtering
3. analyze-anomalies - Run ML anomaly detection
4. trigger-remediation - Apply automated fixes
5. list-incidents - Query historical incidents
6. get-model-status - Check KServe model health
7. list-models - List ML model catalog

Extended Tools (5):
8. predict-resource-usage - Forecast CPU/memory
9. get-metrics - Query Prometheus metrics
10. check-capacity - Calculate available resources
11. get-pod-logs - Retrieve recent logs
12. describe-resource - Get Kubernetes resource details

All tools are accessible via natural language queries!

MCP Tool Used: Internal MCP catalog listing

Query 13: Explain a Specific Tool

Explain the get-cluster-health tool

Example Response:

MCP Tool: get-cluster-health

Description:
Retrieves comprehensive health status for a namespace, including
pod health, ML model readiness, and recent incident activity.

What it does:
1. Queries Kubernetes API for pod status
2. Checks KServe InferenceService readiness
3. Retrieves recent incidents from Coordination Engine
4. Aggregates health metrics

Parameters:
- namespace (optional): Target namespace (default: {namespace})
- include_metrics (optional): Include resource usage (default: true)

Returns:
- Pod status list (name, state, restarts, age)
- ML model readiness (READY/NOT_READY)
- Incident count (last 1h, 24h)
- Overall health score

Used by queries like:
- "What's the cluster health?"
- "Are there any failing pods?"
- "Show me the platform status"

Response time: ~200-500ms

MCP Tool Used: Internal tool introspection

Step 4.9: Verify MCP Server Tool Calls

As you try these queries, you can watch the MCP Server logs to see which tools are being invoked:

# In a separate terminal, watch MCP server logs
oc logs -f deployment/mcp-server -n self-healing-platform

Expected log output:

2026-02-20 10:23:45 Tool invoked: get-cluster-health (params: namespace={namespace})
2026-02-20 10:23:45 Response: 8 pods, 2 models, 0 incidents
2026-02-20 10:24:12 Tool invoked: list-pods (params: namespace={namespace}, filter=all)
2026-02-20 10:24:12 Response: 8 pods returned
2026-02-20 10:25:03 Tool invoked: list-models (params: namespace={namespace})
2026-02-20 10:25:03 Response: 2 models (anomaly-detector, predictive-analytics)
2026-02-20 10:26:34 Tool invoked: analyze-anomalies (params: namespace={namespace}, timeframe=24h)
2026-02-20 10:26:35 Response: 0 anomalies detected (score: 0.12)

🎯 What you’ve accomplished: You’ve now verified that Lightspeed can successfully use all 7 core MCP tools through natural language queries!

Step 4.10: Troubleshooting

Problem: Lightspeed doesn’t respond to queries

Check OLSConfig status:

oc get olsconfig cluster -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' && echo

Expected: True

Check Lightspeed pods:

oc get pods -n openshift-lightspeed

Expected: lightspeed-app-server-xxx is Running

Check Lightspeed logs for errors:

oc logs -n openshift-lightspeed deployment/lightspeed-app-server --tail=50

Problem: Lightspeed responds but doesn’t use MCP tools

Verify MCP Server is accessible:

oc get olsconfig cluster -o jsonpath='{.spec.mcpServers[0].streamableHTTP.url}' && echo

Test MCP Server directly:

oc run test-mcp --image=registry.access.redhat.com/ubi9/ubi-minimal:latest \
  --rm -i --restart=Never -n self-healing-platform -- \
  curl -s http://mcp-server:8080/health

Expected: OK

Problem: MCP tools return errors

Check MCP Server logs for errors:

oc logs deployment/mcp-server -n self-healing-platform --tail=100 | grep -i error

Verify Coordination Engine is running:

oc get pods -n self-healing-platform -l app=coordination-engine

Problem: vLLM responses are poor quality

If using vLLM (Option C), Llama 3.2 1B is a smaller model:

  • Try rephrasing queries to be more specific

  • Use shorter, direct questions

  • Avoid complex multi-step queries

  • Consider switching to OpenAI for production use

If you encounter persistent issues, verify that Module 2 Part 2 (LLM Provider configuration) was completed successfully. The most common issue is an invalid or missing API key secret.

Part 5: Understanding OLSConfig

Now that Lightspeed is working, let’s understand the OLSConfig in detail.

Complete OLSConfig Reference

The OLSConfig is a cluster-scoped singleton that configures OpenShift Lightspeed. Here’s a comprehensive example showing all supported provider types:

apiVersion: ols.openshift.io/v1alpha1
kind: OLSConfig
metadata:
  name: cluster  # MUST be 'cluster' (cluster-scoped singleton)
spec:
  # Enable MCP Server integration
  featureGates:
    - MCPServer

  # LLM Provider configuration (configure one or multiple)
  llm:
    providers:
      # OpenAI (GPT-5, GPT-4o)
      - name: openai
        type: openai
        url: "https://api.openai.com/v1"
        credentialsSecretRef:
          name: openai-api-key
        models:
          - name: gpt-5-mini       # Latest efficient model (default)
          - name: gpt-4o           # Previous generation multimodal
          - name: gpt-4            # Stable baseline

      # Google Gemini (OpenAI-compatible endpoint)
      - name: google
        type: openai
        url: "https://generativelanguage.googleapis.com/v1beta/openai"
        credentialsSecretRef:
          name: google-api-key
        models:
          - name: gemini-2.0-flash-exp  # Fast & efficient
          - name: gemini-1.5-pro        # Advanced reasoning

      # Azure OpenAI
      - name: azure
        type: azure_openai
        url: "https://YOUR_RESOURCE.openai.azure.com"
        credentialsSecretRef:
          name: azure-openai-key
        models:
          - name: gpt-4o           # Requires matching deployment name
          - name: gpt-4o-mini      # High efficiency

      # IBM watsonx (BAM)
      - name: watsonx
        type: watsonx
        url: "https://us-south.ml.cloud.ibm.com"
        credentialsSecretRef:
          name: watsonx-api-key
        models:
          - name: ibm/granite-13b-chat-v2  # Enterprise model

      # vLLM (Self-hosted - RHOAI integration)
      - name: vllm-local
        type: rhoai_vllm
        url: "https://vllm-server-{namespace}.apps.cluster.example.com/v1"
        credentialsSecretRef:
          name: vllm-api-key
        models:
          - name: meta-llama/Llama-3.2-1B-Instruct  # Self-hosted

  # MCP Server configuration (integrates with self-healing platform)
  mcpServers:
    - name: cluster-health
      streamableHTTP:
        url: http://mcp-server.{namespace}.svc:8080/mcp
        timeout: 30
        enableSSE: true

  # OLS deployment configuration
  ols:
    defaultModel: gpt-5-mini        # Choose your default model
    defaultProvider: openai         # Choose your default provider
    deployment:
      replicas: 1
    conversationCache:
      type: postgres                # REQUIRED: Only 'postgres' is supported

  # Console plugin configuration
  console:
    enabled: true

View Current Configuration

# View the complete OLSConfig
oc get olsconfig cluster -o yaml

# View just the LLM providers
oc get olsconfig cluster -o jsonpath='{.spec.llm.providers[*].name}' && echo

# View MCP servers
oc get olsconfig cluster -o jsonpath='{.spec.mcpServers[*].name}' && echo

Supported LLM Provider Types

Provider        Type Value     Authentication            Notes
OpenAI          openai         API key in secret         GPT-5-mini, GPT-4o recommended
Google Gemini   openai         API key in secret         Uses OpenAI-compatible endpoint
Azure OpenAI    azure_openai   API key + endpoint        Requires Azure deployment name
IBM watsonx     watsonx        API key                   Granite models for enterprise
vLLM (RHOAI)    rhoai_vllm     Optional (dummy secret)   Self-hosted, no external API
IBM BAM         bam            API key                   IBM Research platform

Secret Format Requirements

All LLM provider secrets must use the key name apitoken:

# Correct format
oc create secret generic openai-api-key \
  -n openshift-lightspeed \
  --from-literal=apitoken='sk-proj-...'

# WRONG - will fail validation
oc create secret generic openai-api-key \
  -n openshift-lightspeed \
  --from-literal=api_key='sk-proj-...'  # ❌ Wrong key name
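To double-check an existing secret, you can confirm the apitoken key is present in its data. This is a sketch that does a plain string match on the jsonpath output, not real JSON parsing:

```shell
# Check that a secret's data map (as printed by jsonpath) contains "apitoken".
has_apitoken_key() {
  if grep -q '"apitoken"'; then
    echo "OK: apitoken key present"
  else
    echo "MISSING: apitoken key not found" >&2
    return 1
  fi
}

# Usage:
#   oc get secret openai-api-key -n openshift-lightspeed \
#     -o jsonpath='{.data}' | has_apitoken_key
```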

Key Configuration Points

Cluster-scoped singleton:

  • OLSConfig name MUST be cluster

  • Only one OLSConfig can exist per cluster

  • No namespace field in metadata

MCP Server integration:

  • Requires the MCPServer feature gate

  • Uses the internal service URL for the cluster-health MCP server

  • A timeout of 30 seconds is recommended for ML model calls

Conversation cache:

  • Only the postgres type is supported

  • Automatically provisioned by the Lightspeed operator

  • Stores conversation history for context

Multiple providers:

  • Configure multiple providers for redundancy

  • Set defaultProvider and defaultModel for primary use

  • Users can switch providers via the console UI
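These constraints can be sanity-checked locally before applying a manifest. A minimal sketch, assuming the ols.openshift.io/v1alpha1 API group used by the Lightspeed operator (the file path is illustrative):

```shell
# Minimal OLSConfig skeleton mirroring the example above, used only to
# illustrate the singleton rules: name must be "cluster", and there is
# no metadata.namespace field because the resource is cluster-scoped.
cat > /tmp/olsconfig-check.yaml <<'EOF'
apiVersion: ols.openshift.io/v1alpha1
kind: OLSConfig
metadata:
  name: cluster
spec:
  ols:
    defaultProvider: openai
    defaultModel: gpt-5-mini
    conversationCache:
      type: postgres
EOF

# Verify the singleton constraints before running oc apply.
grep -q '^  name: cluster$' /tmp/olsconfig-check.yaml \
  && ! grep -q '^  namespace:' /tmp/olsconfig-check.yaml \
  && echo "OK: cluster-scoped singleton constraints satisfied"
```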

Part 6: The 7 MCP Tools

The MCP Server exposes these tools that Lightspeed can call:

Tool Reference

| Tool | Description | Returns |
|------|-------------|---------|
| get-cluster-health | Check namespace status, pods, ML models | Health summary with metrics |
| list-pods | Query pods with filtering | Pod list with details |
| analyze-anomalies | Call ML models for detection | Anomaly detection results + recommendations |
| trigger-remediation | Apply fixes via Coordination Engine | Remediation status and ID |
| list-incidents | Query historical incidents | Incident data with resolution info |
| get-model-status | Check KServe InferenceService health | Model status and endpoints |
| list-models | List available ML model catalog | Model names and capabilities |
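Under the MCP protocol, this catalog is what the server returns for a tools/list request. A sketch of the JSON-RPC 2.0 body, shown for illustration only (Lightspeed issues this call for you; the commented curl target assumes the in-cluster service URL configured earlier):

```shell
# JSON-RPC 2.0 body for MCP's "tools/list" method, which returns the
# tool catalog shown above.
TOOLS_LIST_REQUEST=$(cat <<'EOF'
{ "jsonrpc": "2.0", "id": 1, "method": "tools/list" }
EOF
)
echo "$TOOLS_LIST_REQUEST"

# Manual inspection from inside the cluster (illustrative):
# curl -s -X POST http://mcp-server.self-healing-platform.svc:8080/mcp \
#   -H 'Content-Type: application/json' -d "$TOOLS_LIST_REQUEST"
```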

How Tools Are Called

When you ask Lightspeed a question, it:

  1. Parses your natural language intent

  2. Selects the appropriate MCP tool(s)

  3. Calls the tool via MCP protocol

  4. Receives structured JSON response

  5. Formats a human-readable reply

Example Flow:

You: "What's the cluster health?"
     │
     ▼
Lightspeed: Intent = "get health status"
            Tool = "get-cluster-health"
     │
     ▼
MCP Server: tools/call → get-cluster-health
            Params: {namespace: "self-healing-platform"}
     │
     ▼
Coordination Engine: Query Kubernetes API
                     Check InferenceServices
     │
     ▼
Response: {
  "status": "healthy",
  "pods": [...],
  "models": [...]
}
     │
     ▼
Lightspeed: "✅ Cluster Health Summary for self-healing-platform:
            Healthy Components (4): ..."
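The MCP Server step in this flow corresponds to a JSON-RPC 2.0 tools/call request over the streamable-HTTP endpoint. A sketch of the body Lightspeed sends on your behalf (the commented curl line assumes the in-cluster service URL):

```shell
# "tools/call" request for the get-cluster-health tool, matching the
# flow above. Lightspeed constructs and sends this automatically.
MCP_REQUEST=$(cat <<'EOF'
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get-cluster-health",
    "arguments": { "namespace": "self-healing-platform" }
  }
}
EOF
)
echo "$MCP_REQUEST"

# Manual invocation from inside the cluster (illustrative):
# curl -s -X POST http://mcp-server.self-healing-platform.svc:8080/mcp \
#   -H 'Content-Type: application/json' -d "$MCP_REQUEST"
```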

Part 7: Understanding the Architecture

Complete Data Flow

┌─────────────────────────────────────────────────────────────────┐
│                     User Interface Layer                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ OCP Console │  │ Python API  │  │ Custom Integrations     │  │
│  └──────┬──────┘  └──────┬──────┘  └───────────┬─────────────┘  │
│         │                │                     │                 │
│         └────────────────┴─────────────────────┘                 │
│                          │                                       │
│                          ▼                                       │
├─────────────────────────────────────────────────────────────────┤
│                     AI/LLM Layer                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │           OpenShift Lightspeed (OLS Server)             │   │
│  │  • Natural language understanding                       │   │
│  │  • MCP tool selection                                   │   │
│  │  • Response formatting                                  │   │
│  └────────────────────────┬────────────────────────────────┘   │
│                           │ MCP Protocol                        │
│                           ▼                                     │
├─────────────────────────────────────────────────────────────────┤
│                     Tool Layer                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │              MCP Server (Go)                            │   │
│  │  • 7 tools exposed via MCP protocol                     │   │
│  │  • Routes requests to Coordination Engine               │   │
│  │  • Returns structured JSON responses                    │   │
│  └────────────────────────┬────────────────────────────────┘   │
│                           │ REST API                            │
│                           ▼                                     │
├─────────────────────────────────────────────────────────────────┤
│                     Orchestration Layer                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │           Coordination Engine (Go)                       │   │
│  │  • Queries Prometheus for metrics                       │   │
│  │  • Calls KServe ML models                               │   │
│  │  • Applies remediation to cluster                       │   │
│  │  • Tracks incidents and history                         │   │
│  └─────────┬─────────────────┬─────────────────┬───────────┘   │
│            │                 │                 │                 │
│            ▼                 ▼                 ▼                 │
├─────────────────────────────────────────────────────────────────┤
│                     Data/ML Layer                               │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────────────┐ │
│  │   Prometheus  │ │ KServe Models │ │   Kubernetes API      │ │
│  │   (metrics)   │ │ (inference)   │ │   (cluster ops)       │ │
│  └───────────────┘ └───────────────┘ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Summary

In this module, you accomplished:

  • Deployed MCP Server - Verified the cluster-health MCP server is running

  • Configured LLM Provider - Set up OpenAI, existing Lightspeed cluster, or self-hosted vLLM

  • Created OLSConfig - Connected Lightspeed to MCP Server via MCPServer feature gate

  • Explored Lightspeed Capabilities - Tested 12 different natural language queries covering all 7 MCP tools

  • Learned the MCP Tools - Understood the cluster health, pod management, ML model, and anomaly detection tools

  • Understood Architecture - Complete data flow from console chat to ML models and back

You’re now ready for Module 3: End-to-End Self-Healing!

You’ve verified that Lightspeed can successfully interact with your cluster through natural language. Module 3 will show you how to use these capabilities for real-world troubleshooting scenarios:

  • Deploy sample applications

  • Use ML models to predict resource usage

  • Detect anomalies automatically

  • Trigger self-healing remediation

  • Track historical incidents

All through conversational interaction with Lightspeed - no code required!

Resources

Advanced: Programmatic Access

For users interested in programmatic access to Lightspeed (Python API, automation scripts, integration patterns): the Python examples are optional and not covered in this workshop. They're provided for advanced users who want to build custom automation or integrations. The workshop focuses on natural language interaction through the OpenShift Console.