SpotOptim provides convenient utilities for working with the sklearn diabetes dataset, including PyTorch Dataset and DataLoader implementations. These utilities simplify data loading, preprocessing, and model training for regression tasks.
28.1 Overview
The diabetes dataset contains 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for 442 diabetes patients. The target is a quantitative measure of disease progression one year after baseline.
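The shapes and feature names can be verified directly with sklearn, independent of SpotOptim:

from sklearn.datasets import load_diabetes

# Inspect the raw sklearn dataset described above
diabetes = load_diabetes()
print(diabetes.data.shape)     # (442, 10)
print(diabetes.target.shape)   # (442,)
print(diabetes.feature_names)  # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']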
Module: spotoptim.data.diabetes
Key Components:
DiabetesDataset: PyTorch Dataset class
get_diabetes_dataloaders(): Convenience function for complete data pipeline
28.2 Quick Start
28.2.1 Basic Usage
from spotoptim.data import get_diabetes_dataloaders
from sklearn.datasets import load_diabetes
from spotoptim.data.diabetes import DiabetesDataset
import numpy as np

# Load data
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target.reshape(-1, 1)

# Now create the dataset
dataset = DiabetesDataset(X, y, transform=None, target_transform=None)

# Load data with default settings
train_loader, test_loader, scaler = get_diabetes_dataloaders()

# Iterate through batches
for batch_X, batch_y in train_loader:
    print(f"Batch features: {batch_X.shape}")  # (32, 10)
    print(f"Batch targets: {batch_y.shape}")   # (32, 1)
    break
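The epoch losses and test MSE below come from training a small model on these loaders. A minimal sketch that would produce output of this shape, reusing the LinearRegressor and get_optimizer API from Section 28.6 (the layer sizes and learning rate here are illustrative assumptions):

import torch
import torch.nn as nn
from spotoptim.nn.linear_regressor import LinearRegressor

# Small regression model (architecture as in Section 28.6)
model = LinearRegressor(input_dim=10, output_dim=1, l1=128,
                        num_hidden_layers=3, activation="ReLU")
criterion = nn.MSELoss()
optimizer = model.get_optimizer("Adam", lr=0.001)

for epoch in range(100):
    model.train()
    epoch_loss = 0.0
    for batch_X, batch_y in train_loader:
        loss = criterion(model(batch_X), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch + 1}/100: Loss = {epoch_loss / len(train_loader):.4f}")

# Final evaluation on the held-out test set
model.eval()
test_loss = 0.0
with torch.no_grad():
    for batch_X, batch_y in test_loader:
        test_loss += criterion(model(batch_X), batch_y).item()
print(f"Test MSE: {test_loss / len(test_loader):.4f}")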
Epoch 20/100: Loss = 27779.8639
Epoch 40/100: Loss = 29823.1112
Epoch 60/100: Loss = 32081.4837
Epoch 80/100: Loss = 27363.1013
Epoch 100/100: Loss = 29269.0827
Test MSE: 26495.5527
28.3 Function Reference
28.3.1 get_diabetes_dataloaders()
Loads the sklearn diabetes dataset and returns configured PyTorch DataLoaders.
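The sample counts below illustrate the effect of the test_size argument (used throughout this chapter) on the split of the 442 samples. A sketch that reproduces them:

from spotoptim.data import get_diabetes_dataloaders

# Hold out 30% of the 442 samples for testing
train_loader, test_loader, scaler = get_diabetes_dataloaders(test_size=0.3)
print(f"Training samples: {len(train_loader.dataset)}")
print(f"Test samples: {len(test_loader.dataset)}")

# Hold out 10% for testing
train_loader, test_loader, scaler = get_diabetes_dataloaders(test_size=0.1)
print(f"Training samples: {len(train_loader.dataset)}")
print(f"Test samples: {len(test_loader.dataset)}")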
Training samples: 309
Test samples: 133
Training samples: 397
Test samples: 45
28.5.3 Without Feature Scaling
from spotoptim.data import get_diabetes_dataloaders

# Load without scaling (useful for tree-based models)
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=False
)
print(f"Scaler: {scaler}")  # None

# Data is in original scale
for batch_X, batch_y in train_loader:
    print(f"Mean: {batch_X.mean(dim=0)[:3]}")  # Non-zero values
    break
28.5.4 Custom Batch Size
from spotoptim.data import get_diabetes_dataloaders

# Larger batches for faster training (if memory allows)
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=128
)
print(f"Batches per epoch: {len(train_loader)}")  # Fewer batches

# Smaller batches for more gradient updates
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=8
)
print(f"Batches per epoch: {len(train_loader)}")  # More batches
Batches per epoch: 3
Batches per epoch: 45
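These counts are consistent with a default split that leaves 353 training samples (as in Section 28.6): ceil(353 / 128) = 3 batches and ceil(353 / 8) = 45 batches per epoch.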
28.5.5 GPU Training with Pin Memory
import torch
from spotoptim.data import get_diabetes_dataloaders

# Enable pin_memory for faster GPU transfer
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=32,
    pin_memory=True  # Set to True when using GPU
)

# Move model to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Training loop with GPU
for batch_X, batch_y in train_loader:
    # Data is already pinned, faster transfer to GPU
    batch_X = batch_X.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)
    # ... training code ...
/Users/bartz/workspace/spotoptim-cookbook/.venv/lib/python3.14/site-packages/torch/utils/data/dataloader.py:692: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, device pinned memory won't be used.
warnings.warn(warn_msg)
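On Apple Silicon the MPS backend does not support pinned memory, hence the warning above. One way to avoid it is to request pinning only when CUDA is actually available; a minimal sketch:

import torch
from spotoptim.data import get_diabetes_dataloaders

# Pin memory only for CUDA; MPS and CPU-only setups ignore it anyway
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=32,
    pin_memory=torch.cuda.is_available()
)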
28.6 Complete Training Example
Here’s a complete example showing data loading, model training, and evaluation:
import torch
import torch.nn as nn
from spotoptim.data import get_diabetes_dataloaders
from spotoptim.nn.linear_regressor import LinearRegressor

def train_diabetes_model():
    """Train a neural network on the diabetes dataset."""
    # Load data
    train_loader, test_loader, scaler = get_diabetes_dataloaders(
        test_size=0.2,
        batch_size=32,
        scale_features=True,
        random_state=42
    )

    # Create model
    model = LinearRegressor(
        input_dim=10,
        output_dim=1,
        l1=128,
        num_hidden_layers=3,
        activation="ReLU"
    )

    # Setup training
    criterion = nn.MSELoss()
    optimizer = model.get_optimizer("Adam", lr=0.001, weight_decay=1e-5)

    # Training configuration
    num_epochs = 200
    best_test_loss = float('inf')

    print("Starting training...")
    print(f"Training samples: {len(train_loader.dataset)}")
    print(f"Test samples: {len(test_loader.dataset)}")
    print(f"Batches per epoch: {len(train_loader)}")
    print("-" * 60)

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for batch_X, batch_y in train_loader:
            # Forward pass
            predictions = model(batch_X)
            loss = criterion(predictions, batch_y)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        avg_train_loss = train_loss / len(train_loader)

        # Evaluation phase
        model.eval()
        test_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in test_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                test_loss += loss.item()
        avg_test_loss = test_loss / len(test_loader)

        # Track best model
        if avg_test_loss < best_test_loss:
            best_test_loss = avg_test_loss
            # Could save model here: torch.save(model.state_dict(), 'best_model.pt')

        # Print progress
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1:3d}/{num_epochs}: "
                  f"Train Loss = {avg_train_loss:.4f}, "
                  f"Test Loss = {avg_test_loss:.4f}")

    print("-" * 60)
    print("Training complete!")
    print(f"Best test loss: {best_test_loss:.4f}")
    return model, best_test_loss

# Run training
if __name__ == "__main__":
    model, best_loss = train_diabetes_model()
Starting training...
Training samples: 353
Test samples: 89
Batches per epoch: 12
------------------------------------------------------------
Epoch 20/200: Train Loss = 28524.1568, Test Loss = 26619.8444
Epoch 40/200: Train Loss = 31027.1209, Test Loss = 26615.1947
Epoch 60/200: Train Loss = 32874.5234, Test Loss = 26610.5618
Epoch 80/200: Train Loss = 29759.0776, Test Loss = 26605.9232
Epoch 100/200: Train Loss = 27922.8774, Test Loss = 26601.1465
Epoch 120/200: Train Loss = 29921.5124, Test Loss = 26596.2734
Epoch 140/200: Train Loss = 30215.6427, Test Loss = 26591.2500
Epoch 160/200: Train Loss = 28893.3213, Test Loss = 26586.0690
Epoch 180/200: Train Loss = 28117.9351, Test Loss = 26580.8014
Epoch 200/200: Train Loss = 33354.7769, Test Loss = 26575.3841
------------------------------------------------------------
Training complete!
Best test loss: 26575.3841
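The trained model returned by train_diabetes_model() can then be used for inference; a short sketch, recreating the test loader with the same settings as in the example:

import torch
from spotoptim.data import get_diabetes_dataloaders

# Same split and scaling as in train_diabetes_model()
_, test_loader, _ = get_diabetes_dataloaders(
    test_size=0.2, batch_size=32, scale_features=True, random_state=42
)

model.eval()
with torch.no_grad():
    batch_X, batch_y = next(iter(test_loader))
    predictions = model(batch_X)
print(predictions[:3].squeeze())  # first three predicted progression scores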
28.7 Integration with SpotOptim
Use the diabetes dataset for hyperparameter optimization with SpotOptim:
import numpy as np
import torch
import torch.nn as nn
from spotoptim import SpotOptim
from spotoptim.data import get_diabetes_dataloaders
from spotoptim.nn.linear_regressor import LinearRegressor

def evaluate_model(X):
    """Objective function for SpotOptim.

    Args:
        X: Array of hyperparameters [lr, l1, num_hidden_layers]

    Returns:
        Array of validation losses
    """
    results = []
    for params in X:
        lr, l1, num_hidden_layers = params
        lr = 10 ** lr  # Log scale for learning rate
        l1 = int(l1)
        num_hidden_layers = int(num_hidden_layers)

        # Load data
        train_loader, test_loader, _ = get_diabetes_dataloaders(
            test_size=0.2,
            batch_size=32,
            random_state=42
        )

        # Create model
        model = LinearRegressor(
            input_dim=10,
            output_dim=1,
            l1=l1,
            num_hidden_layers=num_hidden_layers,
            activation="ReLU"
        )

        # Train briefly
        criterion = nn.MSELoss()
        optimizer = model.get_optimizer("Adam", lr=lr)
        num_epochs = 50
        for epoch in range(num_epochs):
            model.train()
            for batch_X, batch_y in train_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # Evaluate
        model.eval()
        test_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in test_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                test_loss += loss.item()
        results.append(test_loss / len(test_loader))
    return np.array(results)

# Optimize hyperparameters
optimizer = SpotOptim(
    fun=evaluate_model,
    bounds=[
        (-4, -2),   # log10(lr): 0.0001 to 0.01
        (16, 128),  # l1: number of neurons
        (0, 4)      # num_hidden_layers
    ],
    var_type=["float", "int", "int"],
    max_iter=30,
    n_initial=10,
    seed=42,
    verbose=True
)
result = optimizer.optimize()

print("Best hyperparameters found:")
print(f"  Learning rate: {10**result.x[0]:.6f}")
print(f"  Hidden neurons (l1): {int(result.x[1])}")
print(f"  Hidden layers: {int(result.x[2])}")
print(f"  Best MSE: {result.fun:.4f}")
28.8 Best Practices
28.8.1 1. Scale Features for Neural Networks
from spotoptim.data import get_diabetes_dataloaders

# Good: Features are standardized
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True
)
Neural networks typically perform better with normalized inputs.
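Assuming the returned scaler standardizes the features, this can be verified on a training batch; per-batch statistics will be approximately (not exactly) zero mean and unit variance:

# Check the statistics of a scaled training batch
for batch_X, _ in train_loader:
    print(batch_X.mean(dim=0)[:3])  # roughly 0
    print(batch_X.std(dim=0)[:3])   # roughly 1
    break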
28.8.2 2. Set Random Seeds for Reproducibility
from spotoptim.data import get_diabetes_dataloaders

# Reproducible train/test splits
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    random_state=42
)

# Also set PyTorch seed
import torch
torch.manual_seed(42)
28.8.3 3. Don’t Shuffle Test Data
from spotoptim.data import get_diabetes_dataloaders

# Good: Test data in consistent order
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    shuffle_train=True,   # Shuffle training data
    shuffle_test=False    # Don't shuffle test data
)
This ensures consistent evaluation metrics across runs.
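A quick way to confirm that the test order is stable is to iterate the loader twice and compare the targets:

import torch

# Two passes over an unshuffled test loader yield the same order
ys_first = torch.cat([batch_y for _, batch_y in test_loader])
ys_second = torch.cat([batch_y for _, batch_y in test_loader])
print(torch.equal(ys_first, ys_second))  # True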
28.8.4 4. Choose Appropriate Batch Size
from spotoptim.data import get_diabetes_dataloaders

# Small dataset (442 samples) - moderate batch size works well
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=32  # Good balance for this dataset
)
Too large: Fewer gradient updates per epoch, which can slow convergence
Too small: Noisy gradients and higher per-batch overhead, which can slow training
28.8.5 5. Save the Scaler for Production
import pickle
import numpy as np
from spotoptim.data import get_diabetes_dataloaders

# Train with scaling
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True
)

# Save scaler for production use
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Later: Load and use on new data
with open('scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)

# Create some example new data (same shape as diabetes features)
new_data = np.random.randn(5, 10)  # 5 samples, 10 features
new_data_scaled = loaded_scaler.transform(new_data)
print(f"Original data shape: {new_data.shape}")
print(f"Scaled data shape: {new_data_scaled.shape}")
print(f"Scaled data mean: {new_data_scaled.mean(axis=0)[:3]}")  # Not near 0: scaler was fitted on diabetes features
Original data shape: (5, 10)
Scaled data shape: (5, 10)
Scaled data mean: [7.03934676 5.01490242 3.47123618]
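Note that the means are far from zero: the scaler was fitted on the diabetes features, whose ranges differ from this synthetic random data. Applied to held-out diabetes samples, the transformed means would be close to zero.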
28.9 Troubleshooting
28.9.1 Issue: Out of Memory
Solution: Reduce batch size or disable pin_memory
from spotoptim.data import get_diabetes_dataloaders

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=16,     # Smaller batches
    pin_memory=False   # Disable if not using GPU
)
28.9.2 Issue: Different Data Ranges
Symptom: Model not converging, loss is NaN
Solution: Ensure feature scaling is enabled
from spotoptim.data import get_diabetes_dataloaders

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True  # Must be True for neural networks
)
28.9.3 Issue: Non-Reproducible Results
Solution: Set all random seeds
import torch
import numpy as np
from spotoptim.data import get_diabetes_dataloaders

# Set all seeds
torch.manual_seed(42)
np.random.seed(42)

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    random_state=42,
    shuffle_train=False  # Disable shuffle for full reproducibility
)
28.9.4 Issue: Slow Data Loading
Solution: Use multiple workers (if not on Windows)
from spotoptim.data import get_diabetes_dataloaders

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    num_workers=4,    # Use 4 subprocesses
    pin_memory=True   # Enable for GPU
)
Note: On Windows, set num_workers=0 to avoid multiprocessing issues.
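A portable way to apply this is to choose the worker count from the platform; a minimal sketch using os.name:

import os
from spotoptim.data import get_diabetes_dataloaders

# Multiprocessing workers are problematic on Windows; fall back to 0 there
workers = 0 if os.name == "nt" else 4
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    num_workers=workers
)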
28.10 Summary
The diabetes dataset utilities in SpotOptim provide:
Easy data loading: One function call gets complete data pipeline
PyTorch integration: Native Dataset and DataLoader support
Preprocessing included: Automatic feature scaling and train/test splitting
Flexible configuration: Control batch size, splitting, scaling, and more
Production ready: Save scalers and ensure reproducibility