# Module 1: ML Model Training with Tekton

## Overview

This module demonstrates how to train and deploy machine learning models using Tekton pipelines. Automated model training keeps models current with cluster behavior, improving prediction accuracy and anomaly detection reliability.
**What you’ll learn:**

- Train models manually with custom time windows
- Schedule automated weekly retraining
- Integrate real Prometheus metrics with synthetic data
- Validate model health before deployment
- Add your own custom models
## Quick Start: Train a Model

### Manual Training with Default Settings (24h Data)

Train the anomaly detector with 24 hours of recent data:
```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: train-anomaly-detector-
  namespace: self-healing-platform
spec:
  pipelineRef:
    name: model-training-pipeline
  params:
    - name: model-name
      value: "anomaly-detector"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb"
    - name: data-source
      value: "prometheus"
    - name: training-hours
      value: "24"
    - name: inference-service-name
      value: "anomaly-detector"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 30m
EOF
```
Monitor the training progress:

```bash
# Watch pipeline execution
tkn pipelinerun logs -f -n self-healing-platform

# Check training job status
oc get notebookvalidationjobs -n self-healing-platform

# View model file
oc exec -n self-healing-platform deployment/anomaly-detector-predictor -- \
  ls -lh /mnt/models/anomaly-detector/
```
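If you’re scripting this step, you can block until the newest PipelineRun finishes instead of tailing logs. A minimal sketch, assuming the pipeline labels its runs with `model-name=anomaly-detector` (as the monitoring commands later in this module do); it defaults to a dry run that only prints the `oc wait` command, so set `APPLY=1` on a real cluster:

```bash
# Wait for the newest anomaly-detector PipelineRun to finish.
# Defaults to a dry run (prints commands); set APPLY=1 to execute for real.
NS=self-healing-platform
LABEL=model-name=anomaly-detector

run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

if [ "${APPLY:-0}" = "1" ]; then
  # Newest run by creation time; strip the pipelinerun.tekton.dev/ prefix.
  PR=$(oc get pipelinerun -n "$NS" -l "$LABEL" \
    --sort-by=.metadata.creationTimestamp -o name | tail -1 | cut -d/ -f2)
else
  PR=train-anomaly-detector-example   # placeholder name for the dry run
fi

# Block (up to 30 minutes) on the Tekton Succeeded condition.
run oc wait --for=condition=Succeeded=True "pipelinerun/$PR" -n "$NS" --timeout=30m
```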
### Manual Training with Custom Time Window (GPU Pipeline)

Train the predictive analytics model with 30 days of data to capture seasonal patterns.

> **New Architecture:** As of ADR-053, the platform uses two separate pipelines: `model-training-pipeline` (CPU) and `model-training-pipeline-gpu` (GPU). This separation provides better resource isolation and removes the `sed` patching issues of the old unified pipeline.
```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: train-predictive-analytics-
  namespace: self-healing-platform
spec:
  pipelineRef:
    name: model-training-pipeline-gpu
  params:
    - name: model-name
      value: "predictive-analytics"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/05-predictive-analytics-kserve.ipynb"
    - name: data-source
      value: "prometheus"
    - name: training-hours
      value: "720"
    - name: inference-service-name
      value: "predictive-analytics"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 45m
EOF
```
**Requirements:**

- GPU nodes with the `nvidia.com/gpu.present: "true"` label
- GPU tolerations configured (automatically handled by the GPU pipeline)
- Uses `model-storage-gpu-pvc` (GP3 storage) instead of `model-storage-pvc` (CephFS)
> **Don’t have GPU nodes?** Use the CPU pipeline (`model-training-pipeline`) with the `anomaly-detector` model for all examples.
### GPU-Accelerated Training with NotebookValidationJob

For faster training, use GPU resources:
```bash
oc create -f - <<'EOF'
apiVersion: mlops.mlops.dev/v1alpha1
kind: NotebookValidationJob
metadata:
  name: train-predictive-gpu
  namespace: self-healing-platform
  labels:
    model-name: predictive-analytics
spec:
  notebook:
    git:
      ref: main
      url: https://github.com/KubeHeal/openshift-aiops-platform.git
    path: notebooks/02-anomaly-detection/05-predictive-analytics-kserve.ipynb
  podConfig:
    containerImage: image-registry.openshift-image-registry.svc:5000/self-healing-platform/notebook-validator:latest
    env:
      - name: DATA_SOURCE
        value: prometheus
      - name: PROMETHEUS_URL
        value: https://prometheus-k8s.openshift-monitoring.svc:9091
      - name: TRAINING_HOURS
        value: "168"
      - name: MODEL_NAME
        value: predictive-analytics
    nodeSelector:
      nvidia.com/gpu.present: "true"
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: "1"
    serviceAccountName: self-healing-workbench
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
    volumeMounts:
      - mountPath: /mnt/models
        name: model-storage
    volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-storage-gpu-pvc
  timeout: 45m
EOF
```
> **Note:** GPU nodes may not have CephFS drivers. Use `model-storage-gpu-pvc` (GP3 storage) for GPU training jobs.
## Training Time Windows

Choose the appropriate time window based on your use case:

| Duration | Hours | Use Case | Example |
|---|---|---|---|
| 1 day | 24 | Quick iteration, development, testing | Testing notebook changes |
| 1 week | 168 | Weekly retraining, production anomaly detection | Anomaly detector scheduled training |
| 30 days | 720 | Initial training, seasonal patterns, forecasting | Predictive analytics scheduled training |
**Recommended defaults:**

- Anomaly Detector (`model-training-pipeline`): 168h (1 week). Captures weekly patterns without excessive noise.
- Predictive Analytics (`model-training-pipeline-gpu`): 720h (30 days). Captures monthly trends and seasonality.
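The `training-hours` values above are just day counts multiplied by 24. If you parameterize PipelineRuns in scripts, deriving the hours keeps the intent obvious (`days_to_hours` is a hypothetical helper, not part of the platform):

```bash
# Derive the training-hours parameter from a day count.
days_to_hours() { echo $(( $1 * 24 )); }

TRAINING_HOURS_DEV=$(days_to_hours 1)        # 24  - quick iteration
TRAINING_HOURS_WEEKLY=$(days_to_hours 7)     # 168 - weekly retraining
TRAINING_HOURS_SEASONAL=$(days_to_hours 30)  # 720 - seasonal patterns

echo "dev=${TRAINING_HOURS_DEV} weekly=${TRAINING_HOURS_WEEKLY} seasonal=${TRAINING_HOURS_SEASONAL}"
```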
**Pipeline Selection Guide:**

| Model | Pipeline | Requirements |
|---|---|---|
| anomaly-detector | `model-training-pipeline` | CPU-only, works on any node |
| predictive-analytics | `model-training-pipeline-gpu` | Requires GPU nodes, uses GP3 storage |
| Custom CPU models | `model-training-pipeline` | Any custom notebook for CPU training |
| Custom GPU models | `model-training-pipeline-gpu` | Custom notebooks requiring GPU acceleration |
> **Training time:** Both pipelines complete in ~60 seconds regardless of the time window (24h, 168h, or 720h); the Prometheus query and data-collection time dominates, not the number of hours requested. **GPU availability:** If your cluster doesn’t have GPU nodes, use the CPU pipeline (`model-training-pipeline`) with the `anomaly-detector` model.
## Data Sources

The platform supports three data source modes for model training:

### Synthetic Data (DATA_SOURCE=synthetic)

**Use case:** Development, testing, CI/CD, when Prometheus is unavailable

- ✅ Fast and reproducible
- ✅ Known anomaly labels for validation
- ✅ No external dependencies
- ⚠️ May not capture real cluster patterns
```yaml
params:
  - name: data-source
    value: "synthetic"
```
### Prometheus Data (DATA_SOURCE=prometheus)

**Use case:** Production training with real cluster metrics

- ✅ Real cluster behavior patterns
- ✅ Adapts to actual workload characteristics
- ✅ Improves model accuracy
- ⚠️ Requires Prometheus access
- ⚠️ Real anomalies are rare (<1%)
```yaml
params:
  - name: data-source
    value: "prometheus"
```
Training notebooks automatically:

- Fetch real metrics from Prometheus (80% of the data)
- Inject synthetic anomalies (20% of the data) for balanced training
- Combine the datasets for robust model training
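The 80/20 mix described above comes down to simple sample-count arithmetic. A sketch of the split (the notebook internals aren’t shown in this module; this only illustrates the ratio, and `TOTAL_SAMPLES` is an arbitrary example value):

```bash
# Illustrate the 80% Prometheus / 20% synthetic split used in prometheus mode.
TOTAL_SAMPLES=1000
PROM_SAMPLES=$(( TOTAL_SAMPLES * 80 / 100 ))
SYNTH_SAMPLES=$(( TOTAL_SAMPLES - PROM_SAMPLES ))

echo "prometheus=${PROM_SAMPLES} synthetic=${SYNTH_SAMPLES}"
```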
### Hybrid Data (DATA_SOURCE=hybrid)

**Use case:** Staging, validation, best of both worlds

- ✅ 50% Prometheus + 50% synthetic
- ✅ Balanced representation
- ✅ Good for validation environments
```yaml
params:
  - name: data-source
    value: "hybrid"
```
> **Recommendation:** Use `prometheus` for production training, `synthetic` for development and CI/CD, and `hybrid` for staging and validation environments.
## Choosing the Right Pipeline

The platform provides two dedicated training pipelines optimized for different resource requirements:

### CPU Pipeline (model-training-pipeline)

**Use for:** anomaly-detector and custom CPU-based models

**Features:**

- Runs on any node (no GPU required)
- Uses `model-storage-pvc` (CephFS shared storage)
- Faster startup (no GPU node scheduling)
- Suitable for Isolation Forest, XGBoost, and traditional ML algorithms
**Example:**

```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: train-cpu-model-
  namespace: self-healing-platform
spec:
  pipelineRef:
    name: model-training-pipeline
  params:
    - name: model-name
      value: "anomaly-detector"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb"
    - name: data-source
      value: "prometheus"
    - name: training-hours
      value: "168"
    - name: inference-service-name
      value: "anomaly-detector"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 30m
EOF
```
### GPU Pipeline (model-training-pipeline-gpu)

**Use for:** predictive-analytics and custom GPU-accelerated models

**Features:**

- Requires GPU nodes with the `nvidia.com/gpu.present: "true"` label
- Uses `model-storage-gpu-pvc` (GP3 storage; GPU nodes may not have CephFS drivers)
- Automatic GPU tolerations and nodeSelector
- Suitable for LSTM, neural networks, and deep learning models
**Example:**

```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: train-gpu-model-
  namespace: self-healing-platform
spec:
  pipelineRef:
    name: model-training-pipeline-gpu
  params:
    - name: model-name
      value: "predictive-analytics"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/05-predictive-analytics-kserve.ipynb"
    - name: data-source
      value: "prometheus"
    - name: training-hours
      value: "720"
    - name: inference-service-name
      value: "predictive-analytics"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 45m
EOF
```
> **Pipeline Mismatch:** Using the wrong pipeline for a model will cause failures (for example, GPU scheduling errors or missing storage drivers). See ADR-053 for the architectural decision: https://github.com/KubeHeal/openshift-aiops-platform/blob/main/docs/adrs/053-tekton-model-training-pipelines.md
## Automated Scheduled Training

The platform automatically retrains models weekly via CronJobs.

> **Automated Training Schedule:** The CronJobs trigger automatically on schedule. You can view the configuration and past runs with the commands below.
### Anomaly Detector (Weekly, Sunday 2 AM UTC)

```bash
# View CronJob configuration
oc get cronjob weekly-anomaly-detector-training -n self-healing-platform -o yaml

# View recent training runs
oc get pipelineruns -n self-healing-platform -l model-name=anomaly-detector

# Check latest training run (strip the resource/ prefix that -o name adds,
# since tkn expects a bare PipelineRun name)
tkn pipelinerun logs -n self-healing-platform $(oc get pipelinerun -n self-healing-platform \
  -l model-name=anomaly-detector --sort-by=.metadata.creationTimestamp \
  -o name | tail -1 | cut -d/ -f2)
```
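To test the scheduled training without waiting for Sunday, you can trigger a one-off Job from the CronJob using the standard `oc create job --from=cronjob/...` mechanism (the CronJob name comes from the commands above; the generated Job name here is an arbitrary convention). The sketch defaults to printing the command; set `APPLY=1` to run it against a cluster:

```bash
# Trigger an ad-hoc run of the weekly training CronJob.
NS=self-healing-platform
CRONJOB=weekly-anomaly-detector-training
JOB="manual-${CRONJOB}-$(date +%s)"   # unique one-off Job name

CMD="oc create job $JOB --from=cronjob/$CRONJOB -n $NS"
if [ "${APPLY:-0}" = "1" ]; then
  $CMD
else
  echo "+ $CMD"   # dry run: print only
fi
```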
## Monitoring Training Runs

### Check Pipeline Status

```bash
# List all pipeline runs
tkn pipelinerun list -n self-healing-platform

# Watch a specific run (replace with the actual name)
tkn pipelinerun logs train-anomaly-detector-abc123 -f -n self-healing-platform
```
### Verify Model Deployment

```bash
# Check InferenceService status
oc get inferenceservice anomaly-detector -n self-healing-platform

# Check predictor pod status
oc get pods -l serving.kserve.io/inferenceservice=anomaly-detector \
  -n self-healing-platform

# View model file details
oc exec -n self-healing-platform deployment/anomaly-detector-predictor -- \
  ls -lh /mnt/models/anomaly-detector/model.pkl
```
### Test Model Endpoint

Test the anomaly detector using the coordination-engine API:

```bash
# Test anomaly detection with sample data (45 features required)
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s -X POST http://localhost:8080/api/v1/detect \
  -H 'Content-Type: application/json' \
  -d '{"model": "anomaly-detector", "instances": [[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]}'
```

Expected response:

```json
{"predictions":[-1],"model_name":"anomaly-detector"}
```
> **Note:** The anomaly detector returns `-1` for anomalous samples and `1` for normal samples, following the scikit-learn Isolation Forest convention.
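Hand-writing the 45-zero instance vector is error-prone. A small sketch can generate the request body instead (the feature count and field names come from the `/api/v1/detect` example above):

```bash
# Build a /api/v1/detect payload with a 45-feature zero vector.
FEATURES=45
VECTOR=0
i=1
while [ "$i" -lt "$FEATURES" ]; do
  VECTOR="${VECTOR},0"   # append one zero feature per iteration
  i=$((i + 1))
done

PAYLOAD="{\"model\": \"anomaly-detector\", \"instances\": [[${VECTOR}]]}"
echo "$PAYLOAD"
```

Pass `$PAYLOAD` to the `curl -d` flag in the command above instead of the hand-written JSON.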
## Coordination Engine API Validation

The coordination-engine provides a unified API for validating and interacting with ML models. This section demonstrates the key validation endpoints.

### Platform Health Check

Check the overall platform health and dependencies:

```bash
# Detailed health check with dependencies
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s http://localhost:8080/api/v1/health
```

This returns:

- `status`: Overall health (`ok` or `degraded`)
- `dependencies`: Kubernetes and ML service connectivity
- `rbac`: RBAC permissions validation
- `uptime`: Time since the coordination-engine started
### Resource Usage Prediction

Test the predictive analytics model to forecast future resource usage:

```bash
# Predict future CPU and memory usage
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s -X POST http://localhost:8080/api/v1/predict \
  -H 'Content-Type: application/json' \
  -d '{"model": "predictive-analytics", "instances": [[0.5, 0.5, 0.5, 0.5, 0.5]]}'
```

Expected response:

```json
{
  "status": "success",
  "predictions": {
    "cpu_percent": 2.06,
    "memory_percent": 38.34
  },
  "current_metrics": {
    "cpu_rolling_mean": 3.44,
    "memory_rolling_mean": 26.92,
    "timestamp": "2026-02-17T20:54:01Z"
  },
  "model_info": {
    "name": "predictive-analytics",
    "confidence": 0.85
  }
}
```
> **Note:** The coordination-engine API abstracts away KServe implementation details. You don’t need to know pod IPs, service names, or model endpoint URLs; the coordination-engine handles all model communication internally.
## Pre-Training Health Check

Before training, let’s check the current state of the ML models. During platform deployment, the NotebookValidationJobs automatically executed the training notebooks and saved model artifacts to the shared storage. However, due to deployment timing, the predictor pods may have started before the models were written, meaning the models exist on disk but aren’t loaded by the serving containers.
### Step 1: Check Platform Health and Model Registry

Check the coordination-engine health and registered models:

```bash
# Check platform health
echo "=== Platform Health ==="
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s http://localhost:8080/health

# List registered models
echo -e "\n=== Registered Models ==="
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s http://localhost:8080/api/v1/models
```

Expected output:

```
=== Platform Health ===
{"status":"ok","version":"ocp-4.18-eada2fc"}

=== Registered Models ===
{"models":["predictive-analytics","anomaly-detector"],"count":2}
```
If the models list is empty, the predictor pods started before the training notebooks finished writing the model files. This is expected on a fresh deployment.
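The same check is easy to script. A sketch that asserts a model is present in the registry response, shown here against the canned output above (on a live cluster you would set `RESPONSE` from the `curl` command instead):

```bash
# Verify a model name appears in the /api/v1/models response.
# On a cluster, capture the live body instead of this canned example:
#   RESPONSE=$(oc exec -n self-healing-platform deployment/coordination-engine -- \
#     curl -s http://localhost:8080/api/v1/models)
RESPONSE='{"models":["predictive-analytics","anomaly-detector"],"count":2}'

model_registered() {
  # Quoted match avoids substring false positives against other fields.
  echo "$RESPONSE" | grep -q "\"$1\""
}

if model_registered anomaly-detector; then
  echo "anomaly-detector: registered"
else
  echo "anomaly-detector: MISSING - predictors may need a restart"
fi
```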
### Step 2: Verify Model Files Exist on Storage

Even if the predictors haven’t loaded the models, the files should exist on the shared PVC:

```bash
# Check model artifacts on the shared PVC
for model in anomaly-detector predictive-analytics; do
  echo "=== ${model} ==="
  oc exec -n self-healing-platform \
    $(oc get pod -n self-healing-platform \
      -l serving.kserve.io/inferenceservice=${model} \
      -o jsonpath='{.items[0].metadata.name}') -- \
    ls -lh /mnt/models/${model}/ 2>/dev/null || echo "Directory not found"
done
```

You should see `model.pkl` files in each directory. These were trained on synthetic data by the NotebookValidationJobs during deployment.
### Step 3: Check Predictor Logs

If the models aren’t loaded, the predictor logs will show why:

```bash
oc logs -n self-healing-platform \
  -l serving.kserve.io/inferenceservice=anomaly-detector \
  --tail=5
```

A common message is `failed to locate model file`, which means the predictor attempted to load the model at startup before the file was written.
> **Why does this happen?** The InferenceService predictor pods (sync-wave 2) start before the NotebookValidationJobs (sync-wave 3+) complete training. The KServe sklearn server loads models only at startup and does not retry. After the training notebooks finish and write the model files, the predictors need to be restarted to pick them up. In the next section, you’ll retrain the models using Tekton pipelines, which writes fresh model files and restarts the predictors automatically.
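If you hit this state manually, the predictor restarts can be wrapped in a small helper (a sketch; the deployment names follow the `<model>-predictor` pattern used elsewhere in this module). It defaults to printing the commands; set `APPLY=1` to execute them:

```bash
# Restart predictor deployments so they reload model files from the PVC.
NS=self-healing-platform

restart_predictor() {
  CMD="oc rollout restart deployment/$1-predictor -n $NS"
  if [ "${APPLY:-0}" = "1" ]; then $CMD; else echo "+ $CMD"; fi
}

for model in anomaly-detector predictive-analytics; do
  restart_predictor "$model"
done
```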
## Hands-On Exercise: Train Your First Model

Now that you’ve seen the current model state, let’s train the anomaly detector with a short training window.

### Step 1: Start Training
```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: workshop-train-
  namespace: self-healing-platform
  labels:
    workshop: self-healing
spec:
  pipelineRef:
    name: model-training-pipeline
  params:
    - name: model-name
      value: "anomaly-detector"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb"
    - name: data-source
      value: "synthetic"
    - name: training-hours
      value: "24"
    - name: inference-service-name
      value: "anomaly-detector"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 15m
EOF
```
## Troubleshooting

### Model Training Fails

**Symptoms:** Pipeline run fails, NotebookValidationJob shows an error

**Diagnosis:**

```bash
# Check pipeline logs
tkn pipelinerun logs <pipelinerun-name> -f -n self-healing-platform

# Check NotebookValidationJob status
oc get notebookvalidationjobs -n self-healing-platform
oc describe notebookvalidationjob <job-name> -n self-healing-platform
```

**Common causes:**

- Insufficient memory (increase `memoryLimit`)
- Prometheus unavailable (check connectivity)
- Git repository inaccessible (verify the URL)
- Notebook syntax errors (test locally first)
### Model Not Loaded (Empty Model Registry)

**Symptoms:** Predictor pod is Running, InferenceService shows Ready, but `GET /v1/models` returns `{"models":[]}` and predictions return `ModelNotFound`

**Diagnosis:**

```bash
# Check model registry
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s http://localhost:8080/api/v1/models

# Check predictor logs for startup errors
oc logs -n self-healing-platform \
  -l serving.kserve.io/inferenceservice=anomaly-detector \
  --tail=10
```

**Cause:** The predictor pod started before model files were written to the shared PVC. The KServe sklearn server loads models once at startup and does not retry.

**Fix:** Retrain the model (which writes a new model file and triggers a predictor restart), or manually restart the predictors:

```bash
oc rollout restart deployment/anomaly-detector-predictor -n self-healing-platform
oc rollout restart deployment/predictive-analytics-predictor -n self-healing-platform
```
### Model Won’t Load

**Symptoms:** Predictor pod crashes, OOMKilled, CrashLoopBackOff

**Diagnosis:**

```bash
# Check predictor pod logs
oc logs -n self-healing-platform \
  -l serving.kserve.io/inferenceservice=anomaly-detector

# Check that the model file exists
oc exec -n self-healing-platform deployment/anomaly-detector-predictor -- \
  ls -lh /mnt/models/anomaly-detector/
```

**Common causes:**

- Model file corrupted (retrain the model)
- Model too large (increase predictor memory)
- Incompatible sklearn version (check the runtime image)
### Prometheus Data Issues

**Symptoms:** Training falls back to synthetic data

> **Note:** OpenShift Prometheus requires bearer token authentication over HTTPS on port 9091.

**Diagnosis:**

```bash
# Check Prometheus accessibility with a bearer token
oc exec -n self-healing-platform deployment/coordination-engine -- sh -c '
  TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
  curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/status/config" | head -c 200
'
```
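The Prometheus HTTP API wraps every successful response in a `{"status":"success",...}` envelope, so a scripted check can distinguish an authenticated, healthy endpoint from an auth failure. A sketch against a canned body (on a cluster, set `BODY` from the `curl` output in the diagnosis step above):

```bash
# Classify a Prometheus API response body: success envelope vs anything else.
# On a cluster, set BODY from the diagnosis curl command instead.
BODY='{"status":"success","data":{"yaml":"global: ..."}}'

if echo "$BODY" | grep -q '"status":"success"'; then
  echo "prometheus: reachable and authenticated"
else
  echo "prometheus: check the bearer token and HTTPS port 9091"
fi
```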
## Summary

In this module, you learned:

- ✅ How to train models manually with PipelineRuns
- ✅ Different training time windows and when to use them
- ✅ Data sources: synthetic, prometheus, hybrid
- ✅ How to monitor training and verify deployments
- ✅ Troubleshooting common training issues