23  Diabetes Dataset Utilities

SpotOptim provides convenient utilities for working with the sklearn diabetes dataset, including PyTorch Dataset and DataLoader implementations. These utilities simplify data loading, preprocessing, and model training for regression tasks.

23.1 Overview

The diabetes dataset contains 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for 442 diabetes patients. The target is a quantitative measure of disease progression one year after baseline.
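
The shapes and feature names can be checked directly with sklearn before touching any SpotOptim code:

from sklearn.datasets import load_diabetes

# Inspect the raw sklearn dataset
diabetes = load_diabetes()
print(diabetes.data.shape)     # (442, 10): 442 patients, 10 features
print(diabetes.target.shape)   # (442,): disease progression scores
print(diabetes.feature_names)  # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']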

Module: spotoptim.data.diabetes

Key Components:

  • DiabetesDataset: PyTorch Dataset class
  • get_diabetes_dataloaders(): Convenience function for complete data pipeline

23.2 Quick Start

23.2.1 Basic Usage

from spotoptim.data import get_diabetes_dataloaders
from sklearn.datasets import load_diabetes
from spotoptim.data.diabetes import DiabetesDataset

# Load data
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target.reshape(-1, 1)

# Create a Dataset manually from the raw arrays
dataset = DiabetesDataset(X, y, transform=None, target_transform=None)

# Or skip the manual steps and load ready-made DataLoaders with defaults
train_loader, test_loader, scaler = get_diabetes_dataloaders()

# Iterate through batches
for batch_X, batch_y in train_loader:
    print(f"Batch features: {batch_X.shape}")  # (32, 10)
    print(f"Batch targets: {batch_y.shape}")   # (32, 1)
    break
Batch features: torch.Size([32, 10])
Batch targets: torch.Size([32, 1])

23.2.2 Training a Model

import torch
import torch.nn as nn
from spotoptim.data import get_diabetes_dataloaders
from spotoptim.nn.linear_regressor import LinearRegressor

# Load data
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    test_size=0.2,
    batch_size=32,
    scale_features=True,
    random_state=42
)

# Create model
model = LinearRegressor(
    input_dim=10,
    output_dim=1,
    l1=64,
    num_hidden_layers=2,
    activation="ReLU"
)

# Setup training
criterion = nn.MSELoss()
optimizer = model.get_optimizer("Adam", lr=0.01)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    
    for batch_X, batch_y in train_loader:
        # Forward pass
        predictions = model(batch_X)
        loss = criterion(predictions, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
    
    avg_train_loss = train_loss / len(train_loader)
    
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}: Loss = {avg_train_loss:.4f}")

# Evaluation
model.eval()
test_loss = 0.0

with torch.no_grad():
    for batch_X, batch_y in test_loader:
        predictions = model(batch_X)
        loss = criterion(predictions, batch_y)
        test_loss += loss.item()

avg_test_loss = test_loss / len(test_loader)
print(f"Test MSE: {avg_test_loss:.4f}")
Epoch 20/100: Loss = 31837.4149
Epoch 40/100: Loss = 27657.6397
Epoch 60/100: Loss = 30533.8840
Epoch 80/100: Loss = 35510.9040
Epoch 100/100: Loss = 27727.0009
Test MSE: 26488.8945
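
MSE values in the tens of thousands are easier to interpret as RMSE, which lives in the units of the target (disease progression scores, roughly 25 to 346):

import math

# Convert the reported test MSE into target units
test_mse = 26488.8945
print(f"Test RMSE: {math.sqrt(test_mse):.2f}")  # ~162.75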

23.3 Function Reference

23.3.1 get_diabetes_dataloaders()

Loads the sklearn diabetes dataset and returns configured PyTorch DataLoaders.

Signature:

get_diabetes_dataloaders(
    test_size=0.2,
    batch_size=32,
    shuffle_train=True,
    shuffle_test=False,
    random_state=42,
    scale_features=True,
    num_workers=0,
    pin_memory=False
)
(<torch.utils.data.dataloader.DataLoader at 0x11973d0f0>,
 <torch.utils.data.dataloader.DataLoader at 0x1197d9eb0>,
 StandardScaler())

Parameters:

Parameter       Type   Default  Description
test_size       float  0.2      Proportion of dataset for testing (0.0 to 1.0)
batch_size      int    32       Number of samples per batch
shuffle_train   bool   True     Whether to shuffle training data
shuffle_test    bool   False    Whether to shuffle test data
random_state    int    42       Random seed for train/test split
scale_features  bool   True     Whether to standardize features
num_workers     int    0        Number of subprocesses for data loading
pin_memory      bool   False    Whether to pin memory (useful for GPU)

Returns:

  • train_loader (DataLoader): Training data loader
  • test_loader (DataLoader): Test data loader
  • scaler (StandardScaler or None): Fitted scaler if scale_features=True, else None

Example:

from spotoptim.data import get_diabetes_dataloaders

# Custom configuration
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    test_size=0.3,
    batch_size=64,
    shuffle_train=True,
    scale_features=True,
    random_state=123
)

print(f"Training batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")
print(f"Scaler mean: {scaler.mean_[:3]}")  # First 3 features
Training batches: 5
Test batches: 3
Scaler mean: [-0.00056537  0.00132258  0.00027836]

23.4 DiabetesDataset Class

PyTorch Dataset implementation for the diabetes dataset.

Signature:

DiabetesDataset(X, y, transform=None, target_transform=None)
<spotoptim.data.diabetes.DiabetesDataset at 0x119713550>

Parameters:

  • X (np.ndarray): Feature matrix of shape (n_samples, n_features)
  • y (np.ndarray): Target values of shape (n_samples,) or (n_samples, 1)
  • transform (callable, optional): Transform to apply to features
  • target_transform (callable, optional): Transform to apply to targets

Attributes:

  • X (torch.Tensor): Feature tensor (n_samples, n_features)
  • y (torch.Tensor): Target tensor (n_samples, 1)
  • n_features (int): Number of features (10 for diabetes)
  • n_samples (int): Number of samples

Methods:

  • __len__(): Returns number of samples
  • __getitem__(idx): Returns tuple (features, target) for given index
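
For orientation, here is a minimal sketch of a Dataset with the documented behavior. It is an illustration, not the actual SpotOptim implementation:

import torch
from torch.utils.data import Dataset

class DiabetesDatasetSketch(Dataset):
    """Hypothetical re-implementation of the documented DiabetesDataset behavior."""

    def __init__(self, X, y, transform=None, target_transform=None):
        self.X = torch.tensor(X, dtype=torch.float32)
        # Targets are stored as an (n_samples, 1) column vector
        self.y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
        self.n_samples, self.n_features = self.X.shape
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        features, target = self.X[idx], self.y[idx]
        # Transforms are applied lazily, on item access
        if self.transform:
            features = self.transform(features)
        if self.target_transform:
            target = self.target_transform(target)
        return features, target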

23.4.1 Manual Dataset Creation

from spotoptim.data import DiabetesDataset
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader

# Load raw data
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create datasets
train_dataset = DiabetesDataset(X_train, y_train)
test_dataset = DiabetesDataset(X_test, y_test)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Inspect dataset
print(f"Dataset size: {len(train_dataset)}")
print(f"Features shape: {train_dataset.X.shape}")
print(f"Targets shape: {train_dataset.y.shape}")

# Get a sample
features, target = train_dataset[0]
print(f"Sample features: {features.shape}")  # (10,)
print(f"Sample target: {target.shape}")      # (1,)
Dataset size: 353
Features shape: torch.Size([353, 10])
Targets shape: torch.Size([353, 1])
Sample features: torch.Size([10])
Sample target: torch.Size([1])

23.5 Advanced Usage

23.5.1 Custom Transforms

from spotoptim.data import DiabetesDataset
from sklearn.datasets import load_diabetes
import torch

# Define custom transforms
def add_noise(x):
    """Add Gaussian noise to features."""
    return x + torch.randn_like(x) * 0.01

def log_transform(y):
    """Apply log transform to target."""
    return torch.log1p(y)

# Load data
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Create dataset with transforms
dataset = DiabetesDataset(
    X, y,
    transform=add_noise,
    target_transform=log_transform
)

# Transforms are applied when accessing items
features, target = dataset[0]
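
Since the stored tensors themselves stay untouched, you can verify that the transform runs on access (this assumes, as documented in Section 23.4, that dataset.X holds the raw feature tensor):

raw_features = dataset.X[0]      # stored tensor: no transform applied
features, target = dataset[0]    # __getitem__: transform applied on access
print(torch.allclose(raw_features, features))  # False: Gaussian noise was added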

23.5.2 Different Train/Test Splits

from spotoptim.data import get_diabetes_dataloaders

# 70/30 split
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    test_size=0.3,
    random_state=42
)
print(f"Training samples: {len(train_loader.dataset)}")  # ~310
print(f"Test samples: {len(test_loader.dataset)}")       # ~132

# 90/10 split
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    test_size=0.1,
    random_state=42
)
print(f"Training samples: {len(train_loader.dataset)}")  # ~398
print(f"Test samples: {len(test_loader.dataset)}")       # ~44
Training samples: 309
Test samples: 133
Training samples: 397
Test samples: 45

23.5.3 Without Feature Scaling

from spotoptim.data import get_diabetes_dataloaders

# Load without scaling (useful for tree-based models)
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=False
)

print(f"Scaler: {scaler}")  # None

# Without scaling, features keep sklearn's representation (load_diabetes()
# already mean-centers each feature, so batch means are still near zero)
for batch_X, batch_y in train_loader:
    print(f"Mean: {batch_X.mean(dim=0)[:3]}")  # Near zero: sklearn pre-centers
    break
Scaler: None
Mean: tensor([-0.0097, -0.0029, -0.0085])

23.5.4 Larger Batch Sizes

from spotoptim.data import get_diabetes_dataloaders

# Larger batches for faster training (if memory allows)
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=128
)
print(f"Batches per epoch: {len(train_loader)}")  # Fewer batches

# Smaller batches for more gradient updates
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=8
)
print(f"Batches per epoch: {len(train_loader)}")  # More batches
Batches per epoch: 3
Batches per epoch: 45

23.5.5 GPU Training with Pin Memory

import torch
from spotoptim.data import get_diabetes_dataloaders
from spotoptim.nn.linear_regressor import LinearRegressor

# Enable pin_memory for faster host-to-GPU transfer
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=32,
    pin_memory=True  # Set to True when using a CUDA GPU
)

# Create the model and move it to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LinearRegressor(
    input_dim=10, output_dim=1, l1=64,
    num_hidden_layers=2, activation="ReLU"
).to(device)

# Training loop with GPU
for batch_X, batch_y in train_loader:
    # Batches come from pinned memory, enabling fast, asynchronous transfer
    batch_X = batch_X.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)

    # ... training code ...
/Users/bartz/workspace/spotoptim-cookbook/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:692: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, device pinned memory won't be used.

23.6 Complete Training Example

Here’s a complete example showing data loading, model training, and evaluation:

import torch
import torch.nn as nn
from spotoptim.data import get_diabetes_dataloaders
from spotoptim.nn.linear_regressor import LinearRegressor

def train_diabetes_model():
    """Train a neural network on the diabetes dataset."""
    
    # Load data
    train_loader, test_loader, scaler = get_diabetes_dataloaders(
        test_size=0.2,
        batch_size=32,
        scale_features=True,
        random_state=42
    )
    
    # Create model
    model = LinearRegressor(
        input_dim=10,
        output_dim=1,
        l1=128,
        num_hidden_layers=3,
        activation="ReLU"
    )
    
    # Setup training
    criterion = nn.MSELoss()
    optimizer = model.get_optimizer("Adam", lr=0.001, weight_decay=1e-5)
    
    # Training configuration
    num_epochs = 200
    best_test_loss = float('inf')
    
    print("Starting training...")
    print(f"Training samples: {len(train_loader.dataset)}")
    print(f"Test samples: {len(test_loader.dataset)}")
    print(f"Batches per epoch: {len(train_loader)}")
    print("-" * 60)
    
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        
        for batch_X, batch_y in train_loader:
            # Forward pass
            predictions = model(batch_X)
            loss = criterion(predictions, batch_y)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        avg_train_loss = train_loss / len(train_loader)
        
        # Evaluation phase
        model.eval()
        test_loss = 0.0
        
        with torch.no_grad():
            for batch_X, batch_y in test_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                test_loss += loss.item()
        
        avg_test_loss = test_loss / len(test_loader)
        
        # Track best model
        if avg_test_loss < best_test_loss:
            best_test_loss = avg_test_loss
            # Could save model here: torch.save(model.state_dict(), 'best_model.pt')
        
        # Print progress
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1:3d}/{num_epochs}: "
                  f"Train Loss = {avg_train_loss:.4f}, "
                  f"Test Loss = {avg_test_loss:.4f}")
    
    print("-" * 60)
    print(f"Training complete!")
    print(f"Best test loss: {best_test_loss:.4f}")
    
    return model, best_test_loss

# Run training
if __name__ == "__main__":
    model, best_loss = train_diabetes_model()
Starting training...
Training samples: 353
Test samples: 89
Batches per epoch: 12
------------------------------------------------------------
Epoch  20/200: Train Loss = 31260.0397, Test Loss = 26633.8405
Epoch  40/200: Train Loss = 30873.6089, Test Loss = 26629.1842
Epoch  60/200: Train Loss = 29560.2482, Test Loss = 26624.4290
Epoch  80/200: Train Loss = 27745.5570, Test Loss = 26619.4974
Epoch 100/200: Train Loss = 28170.9631, Test Loss = 26614.3301
Epoch 120/200: Train Loss = 28036.1819, Test Loss = 26608.9928
Epoch 140/200: Train Loss = 31515.8442, Test Loss = 26603.3861
Epoch 160/200: Train Loss = 32822.3197, Test Loss = 26597.5514
Epoch 180/200: Train Loss = 32395.4159, Test Loss = 26591.4531
Epoch 200/200: Train Loss = 29824.4967, Test Loss = 26585.1758
------------------------------------------------------------
Training complete!
Best test loss: 26585.1758

23.7 Integration with SpotOptim

Use the diabetes dataset for hyperparameter optimization with SpotOptim:

import numpy as np
import torch
import torch.nn as nn
from spotoptim import SpotOptim
from spotoptim.data import get_diabetes_dataloaders
from spotoptim.nn.linear_regressor import LinearRegressor

def evaluate_model(X):
    """Objective function for SpotOptim.
    
    Args:
        X: Array of hyperparameters [lr, l1, num_hidden_layers]
        
    Returns:
        Array of validation losses
    """
    results = []
    
    for params in X:
        lr, l1, num_hidden_layers = params
        lr = 10 ** lr  # Log scale for learning rate
        l1 = int(l1)
        num_hidden_layers = int(num_hidden_layers)
        
        # Load data
        train_loader, test_loader, _ = get_diabetes_dataloaders(
            test_size=0.2,
            batch_size=32,
            random_state=42
        )
        
        # Create model
        model = LinearRegressor(
            input_dim=10,
            output_dim=1,
            l1=l1,
            num_hidden_layers=num_hidden_layers,
            activation="ReLU"
        )
        
        # Train briefly
        criterion = nn.MSELoss()
        optimizer = model.get_optimizer("Adam", lr=lr)
        
        num_epochs = 50
        for epoch in range(num_epochs):
            model.train()
            for batch_X, batch_y in train_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        
        # Evaluate
        model.eval()
        test_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in test_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                test_loss += loss.item()
        
        results.append(test_loss / len(test_loader))
    
    return np.array(results)

# Optimize hyperparameters
optimizer = SpotOptim(
    fun=evaluate_model,
    bounds=[
        (-4, -2),   # log10(lr): 0.0001 to 0.01
        (16, 128),  # l1: number of neurons
        (0, 4)      # num_hidden_layers
    ],
    var_type=["float", "int", "int"],
    max_iter=30,
    n_initial=10,
    seed=42,
    verbose=True
)

result = optimizer.optimize()
print(f"Best hyperparameters found:")
print(f"  Learning rate: {10**result.x[0]:.6f}")
print(f"  Hidden neurons (l1): {int(result.x[1])}")
print(f"  Hidden layers: {int(result.x[2])}")
print(f"  Best MSE: {result.fun:.4f}")
TensorBoard logging disabled
Initial best: f(x) = 26543.408203
Iteration 1: f(x) = 26580.779948
Iteration 2: f(x) = 26630.720703
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 3: f(x) = 26630.039062
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 4: f(x) = 26623.877604
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 5: f(x) = 26555.312500
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 6: f(x) = 26592.140625
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 7: f(x) = 26565.575521
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 8: f(x) = 26633.102865
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 9: f(x) = 26633.738281
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 10: f(x) = 26575.945964
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 11: f(x) = 26690.525391
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 12: New best f(x) = 26533.076823
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 13: f(x) = 26625.047526
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 14: f(x) = 26702.458984
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 15: f(x) = 26578.651042
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 16: f(x) = 26622.776693
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 17: f(x) = 26628.791667
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 18: f(x) = 26672.611979
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 19: f(x) = 26624.714844
  Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 20: f(x) = 26646.041016
Best hyperparameters found:
  Learning rate: 0.007645
  Hidden neurons (l1): 96
  Hidden layers: 3
  Best MSE: 26533.0768

23.8 Best Practices

23.8.1 1. Always Use Feature Scaling

# Good: Features are standardized
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True
)

Neural networks typically perform better with normalized inputs.
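
A quick sanity check is to confirm the scaled training features are roughly standardized (a sketch that reuses the train_loader from above):

import torch

# Stack all training batches and inspect per-feature statistics
X_all = torch.cat([batch_X for batch_X, _ in train_loader])
print(f"Mean: {X_all.mean(dim=0)[:3]}")  # ~0 per feature
print(f"Std:  {X_all.std(dim=0)[:3]}")   # ~1 per feature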

23.8.2 2. Set Random Seeds for Reproducibility

# Reproducible train/test splits
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    random_state=42
)

# Also set PyTorch seed
import torch
torch.manual_seed(42)
<torch._C.Generator at 0x113059b70>

23.8.3 3. Don’t Shuffle Test Data

# Good: Test data in consistent order
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    shuffle_train=True,   # Shuffle training data
    shuffle_test=False    # Don't shuffle test data
)

This ensures consistent evaluation metrics across runs.
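
Because shuffle_test=False keeps batches in dataset order, per-batch predictions can be concatenated into one array that stays aligned with the targets (a sketch; it assumes a trained model is in scope):

import torch

model.eval()
with torch.no_grad():
    # Batches arrive in dataset order, so the concatenated tensors stay aligned
    preds = torch.cat([model(batch_X) for batch_X, _ in test_loader])
    targets = torch.cat([batch_y for _, batch_y in test_loader])
print(f"Predictions: {preds.shape}, targets: {targets.shape}")  # both (n_test, 1)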

23.8.4 4. Choose Appropriate Batch Size

# Small dataset (442 samples) - moderate batch size works well
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=32  # Good balance for this dataset
)

  • Too large: fewer gradient updates per epoch
  • Too small: noisy gradients, slower training
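
The trade-off is easy to quantify: the number of gradient updates per epoch is ceil(n_train / batch_size). For the default 80/20 split (353 training samples):

import math

n_train = 353  # training samples with test_size=0.2
for batch_size in (8, 32, 128):
    print(f"batch_size={batch_size:3d}: "
          f"{math.ceil(n_train / batch_size)} updates per epoch")
# batch_size=  8: 45 updates per epoch
# batch_size= 32: 12 updates per epoch
# batch_size=128: 3 updates per epoch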

23.8.5 5. Save the Scaler for Production

import pickle
import numpy as np
from spotoptim.data import get_diabetes_dataloaders

# Train with scaling
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True
)

# Save scaler for production use
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Later: Load and use on new data
with open('scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)

# Create some example new data (same shape as the diabetes features)
new_data = np.random.randn(5, 10)  # 5 samples, 10 features of random noise
new_data_scaled = loaded_scaler.transform(new_data)

print(f"Original data shape: {new_data.shape}")
print(f"Scaled data shape: {new_data_scaled.shape}")
# Means are far from 0 here: the scaler was fitted on the small-valued sklearn
# diabetes features, and this random noise is not drawn from that distribution
print(f"Scaled data mean: {new_data_scaled.mean(axis=0)[:3]}")
Original data shape: (5, 10)
Scaled data shape: (5, 10)
Scaled data mean: [-2.34235239  4.19590764 -9.5169612 ]

23.9 Troubleshooting

23.9.1 Issue: Out of Memory

Solution: Reduce batch size or disable pin_memory

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=16,      # Smaller batches
    pin_memory=False    # Disable if not using GPU
)

23.9.2 Issue: Different Data Ranges

Symptom: Model not converging, loss is NaN

Solution: Ensure feature scaling is enabled

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True  # Strongly recommended for neural networks
)

23.9.3 Issue: Non-Reproducible Results

Solution: Set all random seeds

import torch
import numpy as np

# Set all seeds
torch.manual_seed(42)
np.random.seed(42)

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    random_state=42,
    shuffle_train=False  # Disable shuffle for full reproducibility
)

23.9.4 Issue: Slow Data Loading

Solution: Use multiple workers (if not on Windows)

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    num_workers=4,      # Use 4 subprocesses
    pin_memory=True     # Enable for GPU
)

Note: On Windows, set num_workers=0 to avoid multiprocessing issues.
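
One portable pattern is to pick the settings from the runtime environment (a sketch; the worker count of 4 is an arbitrary choice):

import os
import sys

import torch
from spotoptim.data import get_diabetes_dataloaders

# No worker subprocesses on Windows; a few elsewhere
workers = 0 if sys.platform == "win32" else min(4, os.cpu_count() or 1)

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    num_workers=workers,
    pin_memory=torch.cuda.is_available(),  # pin only when a CUDA GPU is present
)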

23.10 Summary

The diabetes dataset utilities in SpotOptim provide:

  • Easy data loading: One function call gets complete data pipeline
  • PyTorch integration: Native Dataset and DataLoader support
  • Preprocessing included: Automatic feature scaling and train/test splitting
  • Flexible configuration: Control batch size, splitting, scaling, and more
  • Production ready: Save scalers and ensure reproducibility

23.11 Jupyter Notebook

Note