Module 5: Notebook Catalog & Use Cases
Overview
The OpenShift AI Ops Platform includes 33+ Jupyter notebooks covering every aspect of the self-healing pipeline. This module provides a comprehensive catalog of all notebooks, organized by use case.
Purpose of this module:
- Understand what each notebook category does
- Learn when to use which notebook
- Explore notebooks not covered in the main workshop
- Find the right notebook for your specific use case
Accessing Notebooks
Via OpenShift AI Dashboard
1. Open the OpenShift Console: https://console-openshift-console.apps.{guid}.example.com
2. Navigate to Applications → Red Hat OpenShift AI
3. Click Data Science Projects → self-healing-platform
4. Click Workbenches → self-healing-workbench → Open
Category 00: Setup & Validation
| Notebook | Purpose | When to Use |
|---|---|---|
| | Validates cluster prerequisites, operators, GPU availability, storage | First notebook to run - before any other work |
| | Step-by-step guide to deploying a model to KServe | When adding a new model to the platform |
| | Configures Python environment, installs packages | When setting up a new workbench |
Key Concepts:
- Platform readiness validation ensures all operators and storage are configured
- KServe onboarding covers model formats (joblib, ONNX, TensorFlow)
- Environment setup is idempotent - safe to run multiple times
Category 01: Data Collection
| Notebook | Purpose | When to Use |
|---|---|---|
| prometheus-metrics-collection.ipynb | Queries Prometheus for CPU, memory, and network metrics | Building training datasets for ML models |
| openshift-events-analysis.ipynb | Extracts Kubernetes events (pod crashes, scaling, failures) | Correlating events with anomalies |
| | Parses container logs, extracts error patterns | Root cause analysis workflows |
| feature-store-demo.ipynb | Demonstrates feature engineering for ML | Preparing data for model training |
| synthetic-anomaly-generation.ipynb | Generates synthetic anomalies for testing | Testing anomaly detection without breaking production |
Use Case: Building a Training Dataset
1. prometheus-metrics-collection.ipynb → Collect 7 days of metrics
2. openshift-events-analysis.ipynb → Extract failure events
3. feature-store-demo.ipynb → Engineer features
4. synthetic-anomaly-generation.ipynb → Add labeled anomalies
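Steps 1 and 3 above can be sketched in a few lines of Python. The query uses Prometheus's standard `/api/v1/query_range` endpoint, but the service URL and the `window_features` helper are illustrative assumptions, not the platform's actual code:

```python
import requests

# Assumption: in-cluster Prometheus service; adjust for your environment.
PROM_URL = "http://prometheus-k8s.openshift-monitoring.svc:9090"

def query_range(promql, start, end, step="5m"):
    """Pull raw samples via Prometheus's standard query_range API."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": promql, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def window_features(values, size=3):
    """Turn a list of samples into rolling mean/max feature rows."""
    rows = []
    for i in range(len(values) - size + 1):
        window = values[i:i + size]
        rows.append({"mean": sum(window) / size, "max": max(window)})
    return rows

# window_features([1.0, 2.0, 6.0, 2.0]) produces one row per window:
# [1, 2, 6] and [2, 6, 2]
```

In practice the feature-store notebook adds more signals (rates, percentiles, event counts), but the shape of the output - one feature row per time window - is the same.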
Category 02: Anomaly Detection
| Notebook | Purpose | When to Use |
|---|---|---|
| 01-isolation-forest-implementation.ipynb | Implements Isolation Forest for point anomaly detection | General-purpose anomaly detection (fast, explainable) |
| | Time series methods (ARIMA, Prophet-style) | Detecting anomalies in metric trends |
| | LSTM neural network for sequence prediction | Complex temporal patterns (requires GPU) |
| | Combines multiple algorithms via voting | High-precision detection (reduces false positives) |
| | Deploys prediction model to KServe | Making models available for real-time inference |
Algorithm Selection Guide:
| Scenario | Recommended Notebook | Why |
|---|---|---|
| Quick start, simple anomalies | 01-isolation-forest-implementation.ipynb | Fast training, no GPU needed, explainable |
| Time-based patterns (daily cycles) | | Captures seasonality and trends |
| Complex multi-variate patterns | | Deep learning captures complex relationships |
| Production deployment | | Combines models for robust detection |
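As a rough illustration of the "quick start" row, a minimal Isolation Forest detector with scikit-learn; the feature values below are synthetic, not platform data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.5, scale=0.05, size=(500, 2))  # typical utilization ratios
spikes = np.array([[0.98, 0.97], [0.99, 0.95]])          # injected obvious outliers
X = np.vstack([normal, spikes])

# contamination = expected fraction of anomalies in the data
model = IsolationForest(contamination=0.01, random_state=42).fit(X)
labels = model.predict(X)  # 1 = normal, -1 = anomaly

print("anomalies flagged:", int((labels == -1).sum()))
print("injected spikes caught:", list(labels[-2:]))
```

No GPU, sub-second training, and per-point anomaly scores via `model.score_samples(X)` make this a sensible first detector before moving to LSTM or ensemble approaches.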
Category 03: Self-Healing Logic
| Notebook | Purpose | When to Use |
|---|---|---|
| | Implements deterministic remediation rules | Known issues with known fixes |
| | ML-based action selection | Novel issues requiring intelligent decisions |
| | Combines rules and AI into complete workflows | Production self-healing pipelines |
The Hybrid Approach:
```
Incoming Anomaly
        │
        ▼
┌──────────────────┐
│   Rule Matcher   │───→ Known Issue? ───→ Apply Rule-Based Fix
│ (Deterministic)  │                              │
└──────────────────┘                              │
        │ No Match                                │
        ▼                                         │
┌──────────────────┐                              │
│   AI Decision    │───→ Novel Issue? ───→ AI-Recommended Fix
│   (ML-Based)     │                              │
└──────────────────┘                              │
        │                                         │
        └─────────────────────────────────────────┘
                          │
                          ▼
                 Coordination Engine
               (Conflict Resolution)
```
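The dispatch step of this diagram can be sketched as a rules-first function with an ML fallback. The rule table and recommender below are illustrative stand-ins, not the platform's actual implementations:

```python
# Illustrative rule table: anomaly type -> deterministic remediation.
RULES = {
    "CrashLoopBackOff": "restart_pod",
    "OOMKilled": "increase_memory_limit",
}

def ml_recommend(anomaly):
    """Stand-in for a trained action-selection model."""
    return "scale_deployment"

def choose_action(anomaly):
    """Rules first, ML fallback. Returns (action, source) so the
    coordination engine can audit where each decision came from."""
    if anomaly["type"] in RULES:          # known issue -> rule-based fix
        return RULES[anomaly["type"]], "rule"
    return ml_recommend(anomaly), "ml"    # novel issue -> AI-recommended fix

print(choose_action({"type": "CrashLoopBackOff"}))  # ('restart_pod', 'rule')
print(choose_action({"type": "latency-drift"}))     # ('scale_deployment', 'ml')
```

Tagging each decision with its source ("rule" vs "ml") is what lets the coordination engine resolve conflicts and report which path handled each incident.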
Category 04: Model Serving
| Notebook | Purpose | When to Use |
|---|---|---|
| kserve-model-deployment.ipynb | Full KServe deployment workflow | Deploying trained models to production |
| | Creates inference pipelines with pre/post processing | Complex inference workflows |
| model-versioning-mlops.ipynb | Implements model versioning and A/B testing | Production MLOps practices |
Deployment Workflow:
```shell
# Train the model (from 02-anomaly-detection)
jupyter nbconvert --to notebook --execute \
  notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb

# Deploy to KServe
jupyter nbconvert --to notebook --execute \
  notebooks/04-model-serving/kserve-model-deployment.ipynb

# Set up versioning
jupyter nbconvert --to notebook --execute \
  notebooks/04-model-serving/model-versioning-mlops.ipynb
```
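Once a model is deployed, clients call it over KServe's v1 REST protocol (`/v1/models/<name>:predict` with an `instances` payload). The service URL and model name below are assumptions for illustration:

```python
import requests

# Assumptions for illustration: adjust to your InferenceService route and name.
MODEL_URL = "https://anomaly-detector-self-healing-platform.apps.example.com"
MODEL_NAME = "anomaly-detector"

def build_v1_payload(rows):
    """KServe's v1 protocol wraps feature rows in an 'instances' list."""
    return {"instances": [[float(v) for v in row] for row in rows]}

def predict(rows):
    """POST feature rows to the predictor and return its predictions."""
    resp = requests.post(
        f"{MODEL_URL}/v1/models/{MODEL_NAME}:predict",
        json=build_v1_payload(rows),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["predictions"]
```

For example, `predict([[0.92, 0.88]])` would send `{"instances": [[0.92, 0.88]]}` and return the model's label for that sample.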
Category 05: End-to-End Scenarios
| Notebook | Purpose | When to Use |
|---|---|---|
| complete-platform-demo.ipynb | Full platform demonstration | Demos, presentations, learning |
| | Detects and remediates CrashLoopBackOff | Specific use case: pod failures |
| | Detects CPU/memory pressure before OOM | Specific use case: resource issues |
| | Detects and responds to network anomalies | Specific use case: network issues |
Scenario Selection:
- Demos: Start with complete-platform-demo.ipynb
- Learning: Work through each scenario notebook
- Production: Use the scenarios as templates for your specific needs
Category 06: MCP & Lightspeed Integration
| Notebook | Purpose | When to Use |
|---|---|---|
| mcp-server-integration.ipynb | Tests MCP server functionality | Debugging MCP issues |
| openshift-lightspeed-integration.ipynb | Demonstrates Lightspeed API usage | Programmatic Lightspeed access |
| llamastack-integration.ipynb | Integrates with LlamaStack for local LLMs | Running with local models (no cloud API) |
| end-to-end-troubleshooting-workflow.ipynb | Complete AI-driven troubleshooting workflow | Advanced AI-assisted debugging |
When to use each:
- mcp-server-integration.ipynb: MCP tools not working? Start here
- openshift-lightspeed-integration.ipynb: Want Python API access? Use this
- llamastack-integration.ipynb: Air-gapped environment? Use local LLMs
- end-to-end-troubleshooting-workflow.ipynb: Complex issues? Get AI-guided resolution
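For programmatic access, a hedged sketch of querying the Lightspeed service over HTTP. The route hostname, the `/v1/query` path, and the payload shape are all assumptions based on the upstream lightspeed-service API; verify them against your deployment's API documentation:

```python
import requests

LIGHTSPEED_URL = "https://lightspeed.apps.example.com"  # assumption: your route

def build_query(question, provider=None):
    """Assemble the request body; 'provider' is optional."""
    body = {"query": question}
    if provider:
        body["provider"] = provider
    return body

def ask(question, token):
    """POST a question to the Lightspeed query endpoint (path is an assumption)."""
    resp = requests.post(
        f"{LIGHTSPEED_URL}/v1/query",
        json=build_query(question),
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```

The openshift-lightspeed-integration.ipynb notebook covers the authoritative request and response shapes for your installed version.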
Category 07: Monitoring & Operations
| Notebook | Purpose | When to Use |
|---|---|---|
| | Sets up custom Prometheus metrics | Adding platform observability |
| | Tracks model accuracy, drift detection | Production ML monitoring |
| | Measures self-healing effectiveness | Reporting and continuous improvement |
Operational Metrics:
- Model Performance: Accuracy, latency, throughput
- Healing Success: MTTR, success rate, false positive rate
- Platform Health: Pod restarts, resource usage, error rates
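A sketch of how the healing-success numbers above might be computed from a log of remediation attempts; the event fields are illustrative assumptions, not the platform's actual schema:

```python
# Illustrative remediation log: detection/resolution timestamps in seconds.
events = [
    {"detected_at": 100.0, "resolved_at": 160.0, "success": True},
    {"detected_at": 300.0, "resolved_at": 420.0, "success": True},
    {"detected_at": 500.0, "resolved_at": 590.0, "success": True},
    {"detected_at": 700.0, "resolved_at": 700.0, "success": False},
]

def healing_metrics(events):
    """MTTR over successful remediations, plus overall success rate."""
    done = [e for e in events if e["success"]]
    mttr = sum(e["resolved_at"] - e["detected_at"] for e in done) / len(done)
    return {"mttr_seconds": mttr, "success_rate": len(done) / len(events)}

print(healing_metrics(events))  # {'mttr_seconds': 90.0, 'success_rate': 0.75}
```

Tracking these per anomaly type (rather than one global number) makes it easier to see which remediation paths need improvement.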
Category 08: Advanced Scenarios
| Notebook | Purpose | When to Use |
|---|---|---|
| | Automates security incident detection and response | Security operations teams |
| | Predicts future capacity needs | Capacity planning and cost optimization |
| | Identifies resource waste and right-sizing opportunities | FinOps and efficiency improvements |
Advanced Use Cases:
- Security Teams: Automate detection of suspicious activity and unauthorized access
- Platform Teams: Predict scaling needs before peak load
- FinOps Teams: Identify over-provisioned resources and optimize costs
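In its simplest form, the capacity-forecasting idea reduces to fitting a trend to historical peaks and projecting it forward. A toy sketch with synthetic data (the real notebook would use actual metric history and a richer model):

```python
import numpy as np

days = np.arange(14)              # two weeks of history
peak_cpu = 40.0 + 1.5 * days      # daily peak usage in cores (synthetic, linear)

slope, intercept = np.polyfit(days, peak_cpu, 1)  # least-squares linear trend
forecast_day = 30
projected = slope * forecast_day + intercept

print(f"projected peak on day {forecast_day}: {projected:.1f} cores")
```

A linear fit is only a starting point; workloads with daily or weekly cycles need the seasonal models from Category 02 instead.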
Quick Reference: Finding the Right Notebook
| I want to… | Use this notebook |
|---|---|
| Validate my cluster is ready | |
| Collect metrics for training | prometheus-metrics-collection.ipynb |
| Build a quick anomaly detector | 01-isolation-forest-implementation.ipynb |
| Deploy a model to production | kserve-model-deployment.ipynb |
| See a complete demo | complete-platform-demo.ipynb |
| Debug Lightspeed issues | openshift-lightspeed-integration.ipynb |
| Monitor model performance | |
| Plan for future capacity | |
Notebook Development Tips
Running Notebooks
```shell
# Run interactively in JupyterLab:
# open the notebook and click "Run All"

# Run from the command line
jupyter nbconvert --to notebook --execute notebook.ipynb --output output.ipynb

# Run via NotebookValidationJob (automated)
oc apply -f - <<EOF
apiVersion: notebooks.kubeflow.org/v1alpha1
kind: NotebookValidationJob
metadata:
  name: run-anomaly-detection
spec:
  notebookPath: /opt/app-root/src/notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb
EOF
```
Modifying Notebooks
1. Open in JupyterLab
2. Modify cells as needed
3. Test by running all cells
4. Clear outputs before committing:

```shell
jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace notebook.ipynb
```
Creating New Notebooks
Follow the standard structure:
```python
# ============================================================
# HEADER SECTION
# ============================================================
# Title: [Descriptive Title]
# Purpose: [What this notebook does]
# Prerequisites: [Required setup]
# Expected Outcomes: [What you'll achieve]

# ============================================================
# SETUP SECTION
# ============================================================
import sys
sys.path.append('../utils')
from common_functions import setup_environment

env = setup_environment()

# ============================================================
# IMPLEMENTATION SECTION
# ============================================================
# Your code here...

# ============================================================
# VALIDATION SECTION
# ============================================================
# Verify results...

# ============================================================
# CLEANUP SECTION
# ============================================================
# Resource cleanup...
```
Summary
In this module, you explored:
- ✅ 8 Notebook Categories - From setup to advanced scenarios
- ✅ 33+ Notebooks - Complete catalog with use cases
- ✅ Quick Reference - Finding the right notebook for your task
- ✅ Development Tips - Running, modifying, and creating notebooks