Module 4: Extra Credit - Advanced ML & Custom Models

Overview

Congratulations on completing the core workshop! 🎉

This extra credit module is for participants who want to go deeper into the ML capabilities of the platform. You’ll work directly in the OpenShift AI Jupyter environment with hands-on notebooks.

What you’ll explore:

  • Advanced ML techniques: LSTM neural networks, ensemble methods

  • Building and deploying your own custom anomaly detection models

  • MLOps best practices for model versioning and lifecycle management

This module requires access to the Jupyter Workbench. Ensure you can access:

  • OpenShift AI Dashboard

  • Jupyter Notebook environment in self-healing-platform

Accessing the Jupyter Environment

Option 1: Via OpenShift AI Dashboard

  1. Open the OpenShift Console: https://console-openshift-console.apps.{guid}.example.com

  2. Navigate to Applications → Red Hat OpenShift AI

  3. Click Data Science Projects

  4. Select self-healing-platform project

  5. Click Workbenches → self-healing-workbench

  6. Click Open to launch JupyterLab

Option 2: Direct URL

Access the workbench directly at:

{jupyter_url}

Option 3: Port Forward (CLI)

oc port-forward self-healing-workbench-0 8888:8888 -n self-healing-platform
# Open http://localhost:8888

Part 1: Advanced ML Techniques

Exercise 1.1: LSTM Neural Networks

Notebook: notebooks/02-anomaly-detection/03-lstm-based-prediction.ipynb

LSTM (Long Short-Term Memory) networks excel at learning patterns in time series data, which makes them well suited to predicting cluster behavior.

What you’ll learn:

  • How LSTM networks capture temporal dependencies

  • Building sequence-to-sequence prediction models

  • Training on Prometheus metrics time series

  • Comparing LSTM vs. traditional methods

Key concepts:

Time Series Data → LSTM Encoder → Hidden State → LSTM Decoder → Predictions
     [t-n...t]                                                    [t+1...t+m]
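The encoder/decoder framing above starts with windowing: slicing the metric history into (input window, target window) pairs. A minimal NumPy sketch (illustrative only; the notebook's actual preprocessing may differ):

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Split a 1-D time series into (input window, target window) pairs.

    series: 1-D array of metric samples ordered by time
    n_in:   encoder length, samples [t-n ... t]
    n_out:  prediction length, samples [t+1 ... t+m]
    """
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i : i + n_in])
        y.append(series[i + n_in : i + n_in + n_out])
    return np.array(X), np.array(y)

# 100 samples of a synthetic CPU-usage-like signal
series = np.sin(np.linspace(0, 10, 100))
X, y = make_windows(series, n_in=12, n_out=3)
print(X.shape, y.shape)  # (86, 12) (86, 3)
```

Larger `n_in` gives the model more context but fewer training pairs — the sequence-length/window-size tradeoff you will see in the notebook.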

Steps:

  1. In Jupyter, navigate to notebooks/02-anomaly-detection/

  2. Open 03-lstm-based-prediction.ipynb

  3. Run all cells sequentially

  4. Observe how the model learns temporal patterns

  5. Compare predictions with actual values

Expected outcomes:

  • Trained LSTM model for resource prediction

  • Understanding of sequence length and window size tradeoffs

  • Comparison metrics: MAE, RMSE, R²
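The comparison metrics listed above can be computed with scikit-learn. A sketch with hypothetical values (not the notebook's actual results):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted resource usage (fractions of capacity)
actual = np.array([0.42, 0.55, 0.61, 0.58, 0.70])
predicted = np.array([0.40, 0.57, 0.59, 0.60, 0.66])

mae = mean_absolute_error(actual, predicted)          # average absolute error
rmse = np.sqrt(mean_squared_error(actual, predicted)) # penalizes large misses
r2 = r2_score(actual, predicted)                      # variance explained

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R²={r2:.3f}")
```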

Exercise 1.2: Ensemble Anomaly Detection

Notebook: notebooks/02-anomaly-detection/04-ensemble-anomaly-methods.ipynb

Ensemble methods combine multiple detection algorithms for more robust anomaly detection.

What you’ll learn:

  • Voting classifiers for anomaly consensus

  • Stacking multiple algorithms (Isolation Forest + One-Class SVM + LOF)

  • Weighted ensembles based on algorithm confidence

  • When to use ensemble vs. single models

Ensemble architecture:

                    ┌─────────────────────┐
                    │  Isolation Forest   │──┐
                    └─────────────────────┘  │
                    ┌─────────────────────┐  │
Input Metrics ──────│   One-Class SVM     │──┼──→ Voting → Final Decision
                    └─────────────────────┘  │      (majority/weighted)
                    ┌─────────────────────┐  │
                    │  Local Outlier      │──┘
                    │  Factor (LOF)       │
                    └─────────────────────┘
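A minimal majority-vote version of this architecture can be sketched on synthetic data (this is an illustration, not the notebook's exact code):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# A normal metrics cluster plus a few injected extreme outliers
normal = rng.normal(0, 1, size=(200, 2))
outliers = rng.uniform(6, 8, size=(5, 2))
X = np.vstack([normal, outliers])

detectors = [
    IsolationForest(contamination=0.05, random_state=42),
    OneClassSVM(nu=0.05, gamma="scale"),
    LocalOutlierFactor(contamination=0.05),
]

# Each detector labels every sample +1 (normal) or -1 (anomaly)
votes = np.array([d.fit_predict(X) for d in detectors])

# Majority vote: flag as anomaly if at least 2 of 3 detectors agree
anomaly = (votes == -1).sum(axis=0) >= 2
print("flagged:", np.where(anomaly)[0][-5:])  # should include indices 200-204
```

For a weighted ensemble, you would replace the `>= 2` majority rule with a per-detector weight on each vote, as explored in the challenge exercise below.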

Steps:

  1. Open notebooks/02-anomaly-detection/04-ensemble-anomaly-methods.ipynb

  2. Execute the notebook cells

  3. Observe how different algorithms vote on anomalies

  4. Compare precision/recall of ensemble vs. individual models

Challenge exercise:

  • Modify the voting weights to favor algorithms with higher precision

  • Add a fourth algorithm (DBSCAN) to the ensemble

  • Test with injected synthetic anomalies

Part 2: Building Custom Models

Exercise 2.1: KServe Model Onboarding

Notebook: notebooks/00-setup/01-kserve-model-onboarding.ipynb

Learn the complete process of taking a trained model and deploying it to KServe for real-time inference.

What you’ll learn:

  • Model serialization formats (joblib, pickle, ONNX)

  • KServe InferenceService specification

  • Storage configuration (PVC, S3)

  • Health probes and scaling

Steps:

  1. Open notebooks/00-setup/01-kserve-model-onboarding.ipynb

  2. Follow the guided onboarding process

  3. Deploy a sample model to KServe

  4. Test the inference endpoint
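Testing the endpoint means POSTing JSON to the KServe v1 protocol's :predict route. A sketch — the route host and feature layout here are assumptions; check the notebook for the real values:

```python
import json
import urllib.request

# Assumed external route; KServe v1 protocol: POST /v1/models/<name>:predict
url = ("https://my-custom-detector-self-healing-platform"
       ".apps.{guid}.example.com/v1/models/my-custom-detector:predict")

# One inner list per sample; feature order must match training
payload = {"instances": [[0.42, 0.61, 0.05, 120.0]]}
body = json.dumps(payload).encode()

req = urllib.request.Request(
    url, data=body, headers={"Content-Type": "application/json"}
)

# Uncomment once the route is reachable from your environment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))  # e.g. {"predictions": [...]}
print(body.decode())
```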

Exercise 2.2: Deploy Your Own Model

Notebook: notebooks/04-model-serving/kserve-model-deployment.ipynb

Now deploy a model you’ve trained to the platform!

Challenge: Create a custom anomaly detector

  1. Choose your algorithm: Use one from Exercise 1 or create your own

  2. Train on your data: Use Prometheus metrics from your cluster

  3. Package the model: Save using joblib

  4. Deploy to KServe: Create InferenceService

  5. Integrate with Coordination Engine: Update model registry

Model template:

import joblib
from sklearn.ensemble import IsolationForest

# Train your custom model.
# your_training_data: array of shape (n_samples, n_features),
# e.g. rows of Prometheus metric values collected from your cluster
model = IsolationForest(
    n_estimators=200,
    contamination=0.05,  # expected fraction of anomalies
    random_state=42
)
model.fit(your_training_data)

# Save the model where the KServe storageUri expects it
joblib.dump(model, '/mnt/models/my-custom-detector/model.pkl')

print("✅ Model saved!")

InferenceService template:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-custom-detector
  namespace: {namespace}
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: pvc://model-storage-pvc/my-custom-detector/
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"

Exercise 2.3: MLOps Model Versioning

Notebook: notebooks/04-model-serving/model-versioning-mlops.ipynb

Learn production MLOps practices for managing model versions.

What you’ll learn:

  • Model versioning strategies

  • A/B testing with canary deployments

  • Rollback procedures

  • Performance monitoring and drift detection

Key MLOps concepts:

| Concept | Description |
|---|---|
| Model Registry | Central catalog of all model versions with metadata |
| Canary Deployment | Route small % of traffic to new model version |
| Shadow Mode | New model runs alongside production, results compared |
| Drift Detection | Monitor for data/concept drift that degrades performance |
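KServe supports canary rollouts directly on the InferenceService: setting canaryTrafficPercent on the predictor routes that share of traffic to the newest revision while the rest stays on the current one. A sketch (values are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-custom-detector
  namespace: {namespace}
spec:
  predictor:
    canaryTrafficPercent: 10   # 10% of traffic to the latest revision
    model:
      modelFormat:
        name: sklearn
      storageUri: pvc://model-storage-pvc/my-custom-detector-v2/
```

Promoting the canary is then a matter of raising the percentage (or removing the field); rolling back means pointing storageUri back at the previous version.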

Part 3: Integration Challenges

Challenge 3.1: End-to-End Custom Model Pipeline

Create a complete pipeline that:

  1. Collects metrics from Prometheus (last 7 days)

  2. Trains a custom LSTM model

  3. Deploys to KServe

  4. Registers with Coordination Engine

  5. Tests via Lightspeed query
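Step 1 can be sketched against the Prometheus HTTP API's query_range endpoint. The service URL and metric name below are assumptions — adjust them for your cluster:

```python
import time
import urllib.parse

# Assumed in-cluster Prometheus service URL
PROM_URL = "http://prometheus-k8s.openshift-monitoring.svc:9090"

def range_query_params(query, days=7, step="5m"):
    """Build query_range parameters covering the last `days` days."""
    end = time.time()
    start = end - days * 24 * 3600
    return {"query": query, "start": start, "end": end, "step": step}

params = range_query_params(
    'sum(rate(container_cpu_usage_seconds_total[5m]))'
)
url = f"{PROM_URL}/api/v1/query_range?" + urllib.parse.urlencode(params)
print(url[:80], "...")

# Inside the cluster you would then fetch and unpack the samples, e.g.:
# import json, urllib.request
# samples = json.load(urllib.request.urlopen(url))["data"]["result"]
# ...and feed the values into your LSTM training windows
```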

Success criteria:

  • Model deployed and READY in KServe

  • Lightspeed can query your model: "Use my-custom-detector to analyze the cluster"

  • Predictions return within 100ms

Challenge 3.2: Scheduled Retraining

Set up automated weekly retraining for your custom model:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: retrain-my-custom-detector
  namespace: {namespace}
spec:
  schedule: "0 3 * * 0"  # Sundays 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: trainer
            image: image-registry.openshift-image-registry.svc:5000/{namespace}/notebook-validator:latest
            env:
            - name: NOTEBOOK_PATH
              value: "notebooks/02-anomaly-detection/my-custom-training.ipynb"
            - name: MODEL_NAME
              value: "my-custom-detector"

Notebook Reference

| Notebook | Purpose | Difficulty |
|---|---|---|
| 03-lstm-based-prediction.ipynb | LSTM neural network for time series | ⭐⭐⭐ |
| 04-ensemble-anomaly-methods.ipynb | Ensemble anomaly detection | ⭐⭐⭐ |
| 01-kserve-model-onboarding.ipynb | Model onboarding to KServe | ⭐⭐ |
| kserve-model-deployment.ipynb | Full deployment workflow | ⭐⭐ |
| model-versioning-mlops.ipynb | MLOps best practices | ⭐⭐⭐ |
| synthetic-anomaly-generation.ipynb | Generate test anomalies | |
| model-performance-monitoring.ipynb | Monitor deployed models | ⭐⭐ |

Summary

In this extra credit module, you explored:

  • LSTM Networks - Deep learning for time series prediction

  • Ensemble Methods - Combining algorithms for robust detection

  • Custom Model Deployment - Full KServe deployment workflow

  • MLOps Practices - Versioning, canary deployments, monitoring

Next Steps

Want to go even further?

  • Contribute: Add your custom model to the platform repository

  • Blog: Write about your experience with the workshop

  • Extend: Build MCP tools that expose your custom model to Lightspeed

Congratulations on completing the Extra Credit! 🏆

You now have the skills to extend the Self-Healing Platform with your own custom ML models.