# Module 1: ML Model Training with Tekton

## Overview

This module demonstrates how to train and deploy machine learning models using Tekton pipelines. Automated model training keeps models current with cluster behavior, improving prediction accuracy and anomaly detection reliability.
**What you’ll learn:**

- Train models manually with custom time windows
- Schedule automated weekly retraining
- Integrate real Prometheus metrics with synthetic data
- Validate model health before deployment
- Add your own custom models
## Quick Start: Train a Model

### Manual Training with Default Settings (24h Data)

Train the anomaly detector with 24 hours of recent data:
```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: train-anomaly-detector-
  namespace: self-healing-platform
spec:
  pipelineRef:
    name: model-training-pipeline
  params:
    - name: model-name
      value: "anomaly-detector"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb"
    - name: data-source
      value: "prometheus"
    - name: training-hours
      value: "24"
    - name: inference-service-name
      value: "anomaly-detector"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 30m
EOF
```
Monitor the training progress:

```bash
# Watch pipeline execution
tkn pipelinerun logs -f -n self-healing-platform

# Check training job status
oc get notebookvalidationjobs -n self-healing-platform

# View model file
oc exec -n self-healing-platform deployment/anomaly-detector-predictor -- \
  ls -lh /mnt/models/anomaly-detector/
```
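If you’re scripting this step, you can block until the newest PipelineRun finishes instead of tailing logs. A minimal sketch, assuming the pipeline labels its runs with `model-name=anomaly-detector` (as the monitoring commands later in this module do); it defaults to a dry run that only prints the `oc wait` command, so set `APPLY=1` on a real cluster:

```bash
# Wait for the newest anomaly-detector PipelineRun to finish.
# Defaults to a dry run (prints commands); set APPLY=1 to execute for real.
NS=self-healing-platform
LABEL=model-name=anomaly-detector

run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

if [ "${APPLY:-0}" = "1" ]; then
  # Newest run by creation time; strip the pipelinerun.tekton.dev/ prefix.
  PR=$(oc get pipelinerun -n "$NS" -l "$LABEL" \
    --sort-by=.metadata.creationTimestamp -o name | tail -1 | cut -d/ -f2)
else
  PR=train-anomaly-detector-example   # placeholder name for the dry run
fi

# Block (up to 30 minutes) on the Tekton Succeeded condition.
run oc wait --for=condition=Succeeded=True "pipelinerun/$PR" -n "$NS" --timeout=30m
```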
### Manual Training with Custom Time Window (GPU Pipeline)

Train the predictive analytics model with 30 days of data to capture seasonal patterns.

> **New Architecture:** As of ADR-053, the platform uses two separate pipelines: `model-training-pipeline` (CPU) and `model-training-pipeline-gpu` (GPU). This separation provides better resource isolation and removes the `sed` patching issues of the old unified pipeline.
```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: train-predictive-analytics-
  namespace: self-healing-platform
spec:
  pipelineRef:
    name: model-training-pipeline-gpu
  params:
    - name: model-name
      value: "predictive-analytics"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/05-predictive-analytics-kserve.ipynb"
    - name: data-source
      value: "prometheus"
    - name: training-hours
      value: "720"
    - name: inference-service-name
      value: "predictive-analytics"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 45m
EOF
```
**Requirements:**

- GPU nodes with the `nvidia.com/gpu.present: "true"` label
- GPU tolerations configured (automatically handled by the GPU pipeline)
- Uses `model-storage-gpu-pvc` (GP3 storage) instead of `model-storage-pvc` (CephFS)
> **Don’t have GPU nodes?** Use the CPU pipeline (`model-training-pipeline`) with the `anomaly-detector` model for all examples.
### GPU-Accelerated Training with NotebookValidationJob

For faster training, use GPU resources:
```bash
oc create -f - <<'EOF'
apiVersion: mlops.mlops.dev/v1alpha1
kind: NotebookValidationJob
metadata:
  name: train-predictive-gpu
  namespace: self-healing-platform
  labels:
    model-name: predictive-analytics
spec:
  notebook:
    git:
      ref: main
      url: https://github.com/KubeHeal/openshift-aiops-platform.git
    path: notebooks/02-anomaly-detection/05-predictive-analytics-kserve.ipynb
  podConfig:
    containerImage: image-registry.openshift-image-registry.svc:5000/self-healing-platform/notebook-validator:latest
    env:
      - name: DATA_SOURCE
        value: prometheus
      - name: PROMETHEUS_URL
        value: https://prometheus-k8s.openshift-monitoring.svc:9091
      - name: TRAINING_HOURS
        value: "168"
      - name: MODEL_NAME
        value: predictive-analytics
    nodeSelector:
      nvidia.com/gpu.present: "true"
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: "1"
    serviceAccountName: self-healing-workbench
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
    volumeMounts:
      - mountPath: /mnt/models
        name: model-storage
    volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-storage-gpu-pvc
  timeout: 45m
EOF
```
> **Note:** GPU nodes may not have CephFS drivers. Use `model-storage-gpu-pvc` (GP3 storage) for GPU training jobs.
## Training Time Windows

Choose the appropriate time window based on your use case:

| Duration | Hours | Use Case | Example |
|---|---|---|---|
| 1 day | 24 | Quick iteration, development, testing | Testing notebook changes |
| 1 week | 168 | Weekly retraining, production anomaly detection | Anomaly detector scheduled training |
| 30 days | 720 | Initial training, seasonal patterns, forecasting | Predictive analytics scheduled training |
**Recommended defaults:**

- Anomaly Detector (`model-training-pipeline`): 168h (1 week). Captures weekly patterns without excessive noise.
- Predictive Analytics (`model-training-pipeline-gpu`): 720h (30 days). Captures monthly trends and seasonality.
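The `training-hours` values above are just day counts multiplied by 24. If you parameterize PipelineRuns in scripts, deriving the hours keeps the intent obvious (`days_to_hours` is a hypothetical helper, not part of the platform):

```bash
# Derive the training-hours parameter from a day count.
days_to_hours() { echo $(( $1 * 24 )); }

TRAINING_HOURS_DEV=$(days_to_hours 1)        # 24  - quick iteration
TRAINING_HOURS_WEEKLY=$(days_to_hours 7)     # 168 - weekly retraining
TRAINING_HOURS_SEASONAL=$(days_to_hours 30)  # 720 - seasonal patterns

echo "dev=${TRAINING_HOURS_DEV} weekly=${TRAINING_HOURS_WEEKLY} seasonal=${TRAINING_HOURS_SEASONAL}"
```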
**Pipeline Selection Guide:**

| Model | Pipeline | Requirements |
|---|---|---|
| anomaly-detector | `model-training-pipeline` | CPU-only, works on any node |
| predictive-analytics | `model-training-pipeline-gpu` | Requires GPU nodes, uses GP3 storage |
| Custom CPU models | `model-training-pipeline` | Any custom notebook for CPU training |
| Custom GPU models | `model-training-pipeline-gpu` | Custom notebooks requiring GPU acceleration |
> **Training time:** Both pipelines complete in ~60 seconds regardless of the time window (24h, 168h, or 720h); the Prometheus query and data-collection time dominates, not the number of hours requested. **GPU availability:** If your cluster doesn’t have GPU nodes, use the CPU pipeline (`model-training-pipeline`) with the `anomaly-detector` model.
## Data Sources

The platform supports three data source modes for model training:

### Synthetic Data (DATA_SOURCE=synthetic)

**Use case:** Development, testing, CI/CD, when Prometheus is unavailable

- ✅ Fast and reproducible
- ✅ Known anomaly labels for validation
- ✅ No external dependencies
- ⚠️ May not capture real cluster patterns
```yaml
params:
  - name: data-source
    value: "synthetic"
```
### Prometheus Data (DATA_SOURCE=prometheus)

**Use case:** Production training with real cluster metrics

- ✅ Real cluster behavior patterns
- ✅ Adapts to actual workload characteristics
- ✅ Improves model accuracy
- ⚠️ Requires Prometheus access
- ⚠️ Real anomalies are rare (<1%)
```yaml
params:
  - name: data-source
    value: "prometheus"
```
Training notebooks automatically:

- Fetch real metrics from Prometheus (80% of the data)
- Inject synthetic anomalies (20% of the data) for balanced training
- Combine the datasets for robust model training
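The 80/20 mix described above comes down to simple sample-count arithmetic. A sketch of the split (the notebook internals aren’t shown in this module; this only illustrates the ratio, and `TOTAL_SAMPLES` is an arbitrary example value):

```bash
# Illustrate the 80% Prometheus / 20% synthetic split used in prometheus mode.
TOTAL_SAMPLES=1000
PROM_SAMPLES=$(( TOTAL_SAMPLES * 80 / 100 ))
SYNTH_SAMPLES=$(( TOTAL_SAMPLES - PROM_SAMPLES ))

echo "prometheus=${PROM_SAMPLES} synthetic=${SYNTH_SAMPLES}"
```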
### Hybrid Data (DATA_SOURCE=hybrid)

**Use case:** Staging, validation, best of both worlds

- ✅ 50% Prometheus + 50% synthetic
- ✅ Balanced representation
- ✅ Good for validation environments
```yaml
params:
  - name: data-source
    value: "hybrid"
```
> **Recommendation:** Use `prometheus` for production training, `synthetic` for development and CI/CD, and `hybrid` for staging and validation environments.
## Choosing the Right Pipeline

The platform provides two dedicated training pipelines optimized for different resource requirements:

### CPU Pipeline (model-training-pipeline)

**Use for:** anomaly-detector and custom CPU-based models

**Features:**

- Runs on any node (no GPU required)
- Uses `model-storage-pvc` (CephFS shared storage)
- Faster startup (no GPU node scheduling)
- Suitable for Isolation Forest, XGBoost, and traditional ML algorithms
**Example:**

```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: train-cpu-model-
  namespace: self-healing-platform
spec:
  pipelineRef:
    name: model-training-pipeline
  params:
    - name: model-name
      value: "anomaly-detector"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb"
    - name: data-source
      value: "prometheus"
    - name: training-hours
      value: "168"
    - name: inference-service-name
      value: "anomaly-detector"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 30m
EOF
```
### GPU Pipeline (model-training-pipeline-gpu)

**Use for:** predictive-analytics and custom GPU-accelerated models

**Features:**

- Requires GPU nodes with the `nvidia.com/gpu.present: "true"` label
- Uses `model-storage-gpu-pvc` (GP3 storage; GPU nodes may not have CephFS drivers)
- Automatic GPU tolerations and nodeSelector
- Suitable for LSTM, neural networks, and deep learning models
**Example:**

```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: train-gpu-model-
  namespace: self-healing-platform
spec:
  pipelineRef:
    name: model-training-pipeline-gpu
  params:
    - name: model-name
      value: "predictive-analytics"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/05-predictive-analytics-kserve.ipynb"
    - name: data-source
      value: "prometheus"
    - name: training-hours
      value: "720"
    - name: inference-service-name
      value: "predictive-analytics"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 45m
EOF
```
> **Pipeline Mismatch:** Using the wrong pipeline for a model will cause failures (for example, GPU scheduling errors or missing storage drivers). See ADR-053 for the architectural decision: https://github.com/KubeHeal/openshift-aiops-platform/blob/main/docs/adrs/053-tekton-model-training-pipelines.md
## Automated Scheduled Training

The platform automatically retrains models weekly via CronJobs.

> **Automated Training Schedule:** The CronJobs trigger automatically on schedule. You can view the configuration and past runs with the commands below.
### Anomaly Detector (Weekly, Sunday 2 AM UTC)

```bash
# View CronJob configuration
oc get cronjob weekly-anomaly-detector-training -n self-healing-platform -o yaml

# View recent training runs
oc get pipelineruns -n self-healing-platform -l model-name=anomaly-detector

# Check latest training run (strip the resource/ prefix that -o name adds,
# since tkn expects a bare PipelineRun name)
tkn pipelinerun logs -n self-healing-platform $(oc get pipelinerun -n self-healing-platform \
  -l model-name=anomaly-detector --sort-by=.metadata.creationTimestamp \
  -o name | tail -1 | cut -d/ -f2)
```
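To test the scheduled training without waiting for Sunday, you can trigger a one-off Job from the CronJob using the standard `oc create job --from=cronjob/...` mechanism (the CronJob name comes from the commands above; the generated Job name here is an arbitrary convention). The sketch defaults to printing the command; set `APPLY=1` to run it against a cluster:

```bash
# Trigger an ad-hoc run of the weekly training CronJob.
NS=self-healing-platform
CRONJOB=weekly-anomaly-detector-training
JOB="manual-${CRONJOB}-$(date +%s)"   # unique one-off Job name

CMD="oc create job $JOB --from=cronjob/$CRONJOB -n $NS"
if [ "${APPLY:-0}" = "1" ]; then
  $CMD
else
  echo "+ $CMD"   # dry run: print only
fi
```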
## Monitoring Training Runs

### Check Pipeline Status

```bash
# List all pipeline runs
tkn pipelinerun list -n self-healing-platform

# Watch a specific run (replace with the actual name)
tkn pipelinerun logs train-anomaly-detector-abc123 -f -n self-healing-platform
```
### Verify Model Deployment

```bash
# Check InferenceService status
oc get inferenceservice anomaly-detector -n self-healing-platform

# Check predictor pod status
oc get pods -l serving.kserve.io/inferenceservice=anomaly-detector \
  -n self-healing-platform

# View model file details
oc exec -n self-healing-platform deployment/anomaly-detector-predictor -- \
  ls -lh /mnt/models/anomaly-detector/model.pkl
```
### Test Model Endpoint

Test the anomaly detector using the coordination-engine API:

```bash
# Test anomaly detection with sample data (45 features required)
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s -X POST http://localhost:8080/api/v1/detect \
  -H 'Content-Type: application/json' \
  -d '{"model": "anomaly-detector", "instances": [[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]}'
```

Expected response:

```json
{"predictions":[-1],"model_name":"anomaly-detector"}
```
> **Note:** The anomaly detector returns `-1` for anomalous samples and `1` for normal samples, following the scikit-learn Isolation Forest convention.
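Hand-writing the 45-zero instance vector is error-prone. A small sketch can generate the request body instead (the feature count and field names come from the `/api/v1/detect` example above):

```bash
# Build a /api/v1/detect payload with a 45-feature zero vector.
FEATURES=45
VECTOR=0
i=1
while [ "$i" -lt "$FEATURES" ]; do
  VECTOR="${VECTOR},0"   # append one zero feature per iteration
  i=$((i + 1))
done

PAYLOAD="{\"model\": \"anomaly-detector\", \"instances\": [[${VECTOR}]]}"
echo "$PAYLOAD"
```

Pass `$PAYLOAD` to the `curl -d` flag in the command above instead of the hand-written JSON.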
## Coordination Engine API Validation

The coordination-engine provides a unified API for validating and interacting with ML models. This section demonstrates the key validation endpoints.

### Platform Health Check

Check the overall platform health and dependencies:

```bash
# Detailed health check with dependencies
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s http://localhost:8080/api/v1/health
```

This returns:

- `status`: Overall health (`ok` or `degraded`)
- `dependencies`: Kubernetes and ML service connectivity
- `rbac`: RBAC permissions validation
- `uptime`: Time since the coordination-engine started
### Resource Usage Prediction

Test the predictive analytics model to forecast future resource usage:

```bash
# Predict future CPU and memory usage
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s -X POST http://localhost:8080/api/v1/predict \
  -H 'Content-Type: application/json' \
  -d '{"model": "predictive-analytics", "instances": [[0.5, 0.5, 0.5, 0.5, 0.5]]}'
```

Expected response:

```json
{
  "status": "success",
  "predictions": {
    "cpu_percent": 2.06,
    "memory_percent": 38.34
  },
  "current_metrics": {
    "cpu_rolling_mean": 3.44,
    "memory_rolling_mean": 26.92,
    "timestamp": "2026-02-17T20:54:01Z"
  },
  "model_info": {
    "name": "predictive-analytics",
    "confidence": 0.85
  }
}
```
> **Note:** The coordination-engine API abstracts away KServe implementation details. You don’t need to know pod IPs, service names, or model endpoint URLs; the coordination-engine handles all model communication internally.
## Pre-Training Health Check

Before training, let’s check the current state of the ML models. During platform deployment, the NotebookValidationJobs automatically executed the training notebooks and saved model artifacts to the shared storage. However, due to deployment timing, the predictor pods may have started before the models were written, meaning the models exist on disk but aren’t loaded by the serving containers.
### Step 1: Check Platform Health and Model Registry

Check the coordination-engine health and registered models:

```bash
# Check platform health
echo "=== Platform Health ==="
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s http://localhost:8080/health

# List registered models
echo -e "\n=== Registered Models ==="
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s http://localhost:8080/api/v1/models
```

Expected output:

```
=== Platform Health ===
{"status":"ok","version":"ocp-4.18-eada2fc"}

=== Registered Models ===
{"models":["predictive-analytics","anomaly-detector"],"count":2}
```
If the models list is empty, the predictor pods started before the training notebooks finished writing the model files. This is expected on a fresh deployment.
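The same check is easy to script. A sketch that asserts a model is present in the registry response, shown here against the canned output above (on a live cluster you would set `RESPONSE` from the `curl` command instead):

```bash
# Verify a model name appears in the /api/v1/models response.
# On a cluster, capture the live body instead of this canned example:
#   RESPONSE=$(oc exec -n self-healing-platform deployment/coordination-engine -- \
#     curl -s http://localhost:8080/api/v1/models)
RESPONSE='{"models":["predictive-analytics","anomaly-detector"],"count":2}'

model_registered() {
  # Quoted match avoids substring false positives against other fields.
  echo "$RESPONSE" | grep -q "\"$1\""
}

if model_registered anomaly-detector; then
  echo "anomaly-detector: registered"
else
  echo "anomaly-detector: MISSING - predictors may need a restart"
fi
```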
### Step 2: Verify Model Files Exist on Storage

Even if the predictors haven’t loaded the models, the files should exist on the shared PVC:

```bash
# Check model artifacts on the shared PVC
for model in anomaly-detector predictive-analytics; do
  echo "=== ${model} ==="
  oc exec -n self-healing-platform \
    $(oc get pod -n self-healing-platform \
      -l serving.kserve.io/inferenceservice=${model} \
      -o jsonpath='{.items[0].metadata.name}') -- \
    ls -lh /mnt/models/${model}/ 2>/dev/null || echo "Directory not found"
done
```

You should see `model.pkl` files in each directory. These were trained on synthetic data by the NotebookValidationJobs during deployment.
### Step 3: Check Predictor Logs

If the models aren’t loaded, the predictor logs will show why:

```bash
oc logs -n self-healing-platform \
  -l serving.kserve.io/inferenceservice=anomaly-detector \
  --tail=5
```

A common message is `failed to locate model file`, which means the predictor attempted to load the model at startup before the file was written.
> **Why does this happen?** The InferenceService predictor pods (sync-wave 2) start before the NotebookValidationJobs (sync-wave 3+) complete training. The KServe sklearn server loads models only at startup and does not retry. After the training notebooks finish and write the model files, the predictors need to be restarted to pick them up. In the next section, you’ll retrain the models using Tekton pipelines, which writes fresh model files and restarts the predictors automatically.
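If you hit this state manually, the predictor restarts can be wrapped in a small helper (a sketch; the deployment names follow the `<model>-predictor` pattern used elsewhere in this module). It defaults to printing the commands; set `APPLY=1` to execute them:

```bash
# Restart predictor deployments so they reload model files from the PVC.
NS=self-healing-platform

restart_predictor() {
  CMD="oc rollout restart deployment/$1-predictor -n $NS"
  if [ "${APPLY:-0}" = "1" ]; then $CMD; else echo "+ $CMD"; fi
}

for model in anomaly-detector predictive-analytics; do
  restart_predictor "$model"
done
```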
## Hands-On Exercise: Train Your First Model

Now that you’ve seen the current model state, let’s train the anomaly detector with a short training window.

### Step 1: Start Training
```bash
oc create -f - <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: workshop-train-
  namespace: self-healing-platform
  labels:
    workshop: self-healing
spec:
  pipelineRef:
    name: model-training-pipeline
  params:
    - name: model-name
      value: "anomaly-detector"
    - name: notebook-path
      value: "notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb"
    - name: data-source
      value: "synthetic"
    - name: training-hours
      value: "24"
    - name: inference-service-name
      value: "anomaly-detector"
    - name: git-url
      value: "https://github.com/KubeHeal/openshift-aiops-platform.git"
    - name: git-ref
      value: "main"
  timeout: 15m
EOF
```
## Troubleshooting

### Model Training Fails

**Symptoms:** Pipeline run fails, NotebookValidationJob shows an error

**Diagnosis:**

```bash
# Check pipeline logs
tkn pipelinerun logs <pipelinerun-name> -f -n self-healing-platform

# Check NotebookValidationJob status
oc get notebookvalidationjobs -n self-healing-platform
oc describe notebookvalidationjob <job-name> -n self-healing-platform
```

**Common causes:**

- Insufficient memory (increase `memoryLimit`)
- Prometheus unavailable (check connectivity)
- Git repository inaccessible (verify the URL)
- Notebook syntax errors (test locally first)
### Model Not Loaded (Empty Model Registry)

**Symptoms:** Predictor pod is Running, InferenceService shows Ready, but `GET /v1/models` returns `{"models":[]}` and predictions return `ModelNotFound`

**Diagnosis:**

```bash
# Check model registry
oc exec -n self-healing-platform deployment/coordination-engine -- \
  curl -s http://localhost:8080/api/v1/models

# Check predictor logs for startup errors
oc logs -n self-healing-platform \
  -l serving.kserve.io/inferenceservice=anomaly-detector \
  --tail=10
```

**Cause:** The predictor pod started before model files were written to the shared PVC. The KServe sklearn server loads models once at startup and does not retry.

**Fix:** Retrain the model (which writes a new model file and triggers a predictor restart), or manually restart the predictors:

```bash
oc rollout restart deployment/anomaly-detector-predictor -n self-healing-platform
oc rollout restart deployment/predictive-analytics-predictor -n self-healing-platform
```
### Model Won’t Load

**Symptoms:** Predictor pod crashes, OOMKilled, CrashLoopBackOff

**Diagnosis:**

```bash
# Check predictor pod logs
oc logs -n self-healing-platform \
  -l serving.kserve.io/inferenceservice=anomaly-detector

# Check that the model file exists
oc exec -n self-healing-platform deployment/anomaly-detector-predictor -- \
  ls -lh /mnt/models/anomaly-detector/
```

**Common causes:**

- Model file corrupted (retrain the model)
- Model too large (increase predictor memory)
- Incompatible sklearn version (check the runtime image)
### Prometheus Data Issues

**Symptoms:** Training falls back to synthetic data

> **Note:** OpenShift Prometheus requires bearer token authentication over HTTPS on port 9091.

**Diagnosis:**

```bash
# Check Prometheus accessibility with a bearer token
oc exec -n self-healing-platform deployment/coordination-engine -- sh -c '
  TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
  curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/status/config" | head -c 200
'
```
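The Prometheus HTTP API wraps every successful response in a `{"status":"success",...}` envelope, so a scripted check can distinguish an authenticated, healthy endpoint from an auth failure. A sketch against a canned body (on a cluster, set `BODY` from the `curl` output in the diagnosis step above):

```bash
# Classify a Prometheus API response body: success envelope vs anything else.
# On a cluster, set BODY from the diagnosis curl command instead.
BODY='{"status":"success","data":{"yaml":"global: ..."}}'

if echo "$BODY" | grep -q '"status":"success"'; then
  echo "prometheus: reachable and authenticated"
else
  echo "prometheus: check the bearer token and HTTPS port 9091"
fi
```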
## Summary

In this module, you learned:

- ✅ How to train models manually with PipelineRuns
- ✅ Different training time windows and when to use them
- ✅ Data sources: synthetic, prometheus, hybrid
- ✅ How to monitor training and verify deployments
- ✅ Troubleshooting common training issues