SpotOptim provides convenient utilities for working with the sklearn diabetes dataset, including PyTorch Dataset and DataLoader implementations. These utilities simplify data loading, preprocessing, and model training for regression tasks.
23.1 Overview
The diabetes dataset contains 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for 442 diabetes patients. The target is a quantitative measure of disease progression one year after baseline.
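These dimensions are easy to verify directly against sklearn:

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
print(diabetes.data.shape)     # (442, 10): ten baseline features
print(diabetes.target.shape)   # (442,): disease progression after one year
print(diabetes.feature_names)  # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']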
Module: spotoptim.data.diabetes
Key Components:
DiabetesDataset: PyTorch Dataset class
get_diabetes_dataloaders(): Convenience function for the complete data pipeline
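Because DiabetesDataset is a PyTorch Dataset, individual samples can also be indexed directly. A minimal sketch, assuming the standard __len__/__getitem__ protocol returns one (features, target) pair per index:

from sklearn.datasets import load_diabetes
from spotoptim.data.diabetes import DiabetesDataset

diabetes = load_diabetes()
dataset = DiabetesDataset(diabetes.data, diabetes.target.reshape(-1, 1))

print(len(dataset))      # 442 samples
x, y = dataset[0]        # one (features, target) pair
print(x.shape, y.shape)  # expected: 10 features, 1 target value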
23.2 Quick Start
23.2.1 Basic Usage
from spotoptim.data import get_diabetes_dataloaders
from sklearn.datasets import load_diabetes
from spotoptim.data.diabetes import DiabetesDataset
import numpy as np

# Load data
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target.reshape(-1, 1)

# Now create the dataset
dataset = DiabetesDataset(X, y, transform=None, target_transform=None)

# Load data with default settings
train_loader, test_loader, scaler = get_diabetes_dataloaders()

# Iterate through batches
for batch_X, batch_y in train_loader:
    print(f"Batch features: {batch_X.shape}")  # (32, 10)
    print(f"Batch targets: {batch_y.shape}")   # (32, 1)
    break
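The epoch losses and test MSE shown below come from a short training run whose code is not part of the Quick Start block above. A minimal sketch that produces output in this format, assuming a plain nn.Linear model in place of whatever model the original run used:

import torch
import torch.nn as nn
from spotoptim.data import get_diabetes_dataloaders

train_loader, test_loader, _ = get_diabetes_dataloaders()

model = nn.Linear(10, 1)  # plain linear model, for illustration only
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(100):
    model.train()
    epoch_loss = 0.0
    for batch_X, batch_y in train_loader:
        loss = criterion(model(batch_X), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/100: Loss = {epoch_loss / len(train_loader):.4f}")

# Evaluate on the held-out test set
model.eval()
test_loss = 0.0
with torch.no_grad():
    for batch_X, batch_y in test_loader:
        test_loss += criterion(model(batch_X), batch_y).item()
print(f"Test MSE: {test_loss / len(test_loader):.4f}")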
Epoch 20/100: Loss = 31837.4149
Epoch 40/100: Loss = 27657.6397
Epoch 60/100: Loss = 30533.8840
Epoch 80/100: Loss = 35510.9040
Epoch 100/100: Loss = 27727.0009
Test MSE: 26488.8945
23.3 Function Reference
23.3.1 get_diabetes_dataloaders()
Loads the sklearn diabetes dataset and returns configured PyTorch DataLoaders.
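The split sizes printed below come from varying the test_size argument. A minimal sketch, assuming test_size=0.3 and test_size=0.1, which match the 309/133 and 397/45 splits of the 442 samples:

from spotoptim.data import get_diabetes_dataloaders

# Vary the held-out fraction; 442 samples total
for test_size in (0.3, 0.1):
    train_loader, test_loader, _ = get_diabetes_dataloaders(test_size=test_size)
    print(f"Training samples: {len(train_loader.dataset)}")
    print(f"Test samples: {len(test_loader.dataset)}")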
Training samples: 309
Test samples: 133
Training samples: 397
Test samples: 45
23.5.3 Without Feature Scaling
from spotoptim.data import get_diabetes_dataloaders

# Load without scaling (useful for tree-based models)
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=False
)
print(f"Scaler: {scaler}")  # None

# Data is in the original scale
for batch_X, batch_y in train_loader:
    print(f"Mean: {batch_X.mean(dim=0)[:3]}")  # Non-zero values
    break
23.5.4 Custom Batch Size

from spotoptim.data import get_diabetes_dataloaders

# Larger batches for faster training (if memory allows)
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=128
)
print(f"Batches per epoch: {len(train_loader)}")  # Fewer batches

# Smaller batches for more gradient updates
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=8
)
print(f"Batches per epoch: {len(train_loader)}")  # More batches
Batches per epoch: 3
Batches per epoch: 45

With the default 80/20 split (353 training samples), batch_size=128 yields ceil(353/128) = 3 batches per epoch and batch_size=8 yields ceil(353/8) = 45.
23.5.5 GPU Training with Pin Memory
import torch
from spotoptim.data import get_diabetes_dataloaders

# Enable pin_memory for faster GPU transfer
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=32,
    pin_memory=True  # Set to True when using GPU
)

# Move model to GPU (assumes `model` was created earlier,
# e.g. the LinearRegressor from the complete example below)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Training loop with GPU
for batch_X, batch_y in train_loader:
    # Data is already pinned, faster transfer to GPU
    batch_X = batch_X.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)
    # ... training code ...
/Users/bartz/workspace/spotoptim-cookbook/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:692: UserWarning:
'pin_memory' argument is set as true but not supported on MPS now, device pinned memory won't be used.

This warning is expected on Apple Silicon: pinned (page-locked) memory is a CUDA feature, so on the MPS backend the flag is simply ignored.
23.6 Complete Training Example
Here’s a complete example showing data loading, model training, and evaluation:
import torch
import torch.nn as nn
from spotoptim.data import get_diabetes_dataloaders
from spotoptim.nn.linear_regressor import LinearRegressor

def train_diabetes_model():
    """Train a neural network on the diabetes dataset."""
    # Load data
    train_loader, test_loader, scaler = get_diabetes_dataloaders(
        test_size=0.2,
        batch_size=32,
        scale_features=True,
        random_state=42
    )

    # Create model
    model = LinearRegressor(
        input_dim=10,
        output_dim=1,
        l1=128,
        num_hidden_layers=3,
        activation="ReLU"
    )

    # Setup training
    criterion = nn.MSELoss()
    optimizer = model.get_optimizer("Adam", lr=0.001, weight_decay=1e-5)

    # Training configuration
    num_epochs = 200
    best_test_loss = float('inf')

    print("Starting training...")
    print(f"Training samples: {len(train_loader.dataset)}")
    print(f"Test samples: {len(test_loader.dataset)}")
    print(f"Batches per epoch: {len(train_loader)}")
    print("-" * 60)

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for batch_X, batch_y in train_loader:
            # Forward pass
            predictions = model(batch_X)
            loss = criterion(predictions, batch_y)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        avg_train_loss = train_loss / len(train_loader)

        # Evaluation phase
        model.eval()
        test_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in test_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                test_loss += loss.item()
        avg_test_loss = test_loss / len(test_loader)

        # Track best model
        if avg_test_loss < best_test_loss:
            best_test_loss = avg_test_loss
            # Could save model here: torch.save(model.state_dict(), 'best_model.pt')

        # Print progress
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1:3d}/{num_epochs}: "
                  f"Train Loss = {avg_train_loss:.4f}, "
                  f"Test Loss = {avg_test_loss:.4f}")

    print("-" * 60)
    print("Training complete!")
    print(f"Best test loss: {best_test_loss:.4f}")

    return model, best_test_loss

# Run training
if __name__ == "__main__":
    model, best_loss = train_diabetes_model()
Starting training...
Training samples: 353
Test samples: 89
Batches per epoch: 12
------------------------------------------------------------
Epoch 20/200: Train Loss = 31260.0397, Test Loss = 26633.8405
Epoch 40/200: Train Loss = 30873.6089, Test Loss = 26629.1842
Epoch 60/200: Train Loss = 29560.2482, Test Loss = 26624.4290
Epoch 80/200: Train Loss = 27745.5570, Test Loss = 26619.4974
Epoch 100/200: Train Loss = 28170.9631, Test Loss = 26614.3301
Epoch 120/200: Train Loss = 28036.1819, Test Loss = 26608.9928
Epoch 140/200: Train Loss = 31515.8442, Test Loss = 26603.3861
Epoch 160/200: Train Loss = 32822.3197, Test Loss = 26597.5514
Epoch 180/200: Train Loss = 32395.4159, Test Loss = 26591.4531
Epoch 200/200: Train Loss = 29824.4967, Test Loss = 26585.1758
------------------------------------------------------------
Training complete!
Best test loss: 26585.1758
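The commented-out checkpoint line in the loop above can be expanded into a standard save/restore pattern. A minimal sketch using PyTorch's state_dict API (the filename best_model.pt is illustrative):

import torch

# When the test loss improves, persist the weights:
torch.save(model.state_dict(), "best_model.pt")

# Later: rebuild the same architecture, then restore the weights
model.load_state_dict(torch.load("best_model.pt"))
model.eval()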
23.7 Integration with SpotOptim
Use the diabetes dataset for hyperparameter optimization with SpotOptim:
import numpy as np
import torch
import torch.nn as nn
from spotoptim import SpotOptim
from spotoptim.data import get_diabetes_dataloaders
from spotoptim.nn.linear_regressor import LinearRegressor

def evaluate_model(X):
    """Objective function for SpotOptim.

    Args:
        X: Array of hyperparameters [lr, l1, num_hidden_layers]

    Returns:
        Array of validation losses
    """
    results = []
    for params in X:
        lr, l1, num_hidden_layers = params
        lr = 10 ** lr  # Log scale for learning rate
        l1 = int(l1)
        num_hidden_layers = int(num_hidden_layers)

        # Load data
        train_loader, test_loader, _ = get_diabetes_dataloaders(
            test_size=0.2,
            batch_size=32,
            random_state=42
        )

        # Create model
        model = LinearRegressor(
            input_dim=10,
            output_dim=1,
            l1=l1,
            num_hidden_layers=num_hidden_layers,
            activation="ReLU"
        )

        # Train briefly
        criterion = nn.MSELoss()
        optimizer = model.get_optimizer("Adam", lr=lr)
        num_epochs = 50
        for epoch in range(num_epochs):
            model.train()
            for batch_X, batch_y in train_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # Evaluate
        model.eval()
        test_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in test_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                test_loss += loss.item()
        results.append(test_loss / len(test_loader))

    return np.array(results)

# Optimize hyperparameters
optimizer = SpotOptim(
    fun=evaluate_model,
    bounds=[
        (-4, -2),   # log10(lr): 0.0001 to 0.01
        (16, 128),  # l1: number of neurons
        (0, 4)      # num_hidden_layers
    ],
    var_type=["float", "int", "int"],
    max_iter=30,
    n_initial=10,
    seed=42,
    verbose=True
)
result = optimizer.optimize()

print("Best hyperparameters found:")
print(f"  Learning rate: {10**result.x[0]:.6f}")
print(f"  Hidden neurons (l1): {int(result.x[1])}")
print(f"  Hidden layers: {int(result.x[2])}")
print(f"  Best MSE: {result.fun:.4f}")
TensorBoard logging disabled
Initial best: f(x) = 26543.408203
Iteration 1: f(x) = 26580.779948
Iteration 2: f(x) = 26630.720703
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 3: f(x) = 26630.039062
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 4: f(x) = 26623.877604
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 5: f(x) = 26555.312500
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 6: f(x) = 26592.140625
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 7: f(x) = 26565.575521
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 8: f(x) = 26633.102865
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 9: f(x) = 26633.738281
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 10: f(x) = 26575.945964
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 11: f(x) = 26690.525391
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 12: New best f(x) = 26533.076823
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 13: f(x) = 26625.047526
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 14: f(x) = 26702.458984
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 15: f(x) = 26578.651042
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 16: f(x) = 26622.776693
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 17: f(x) = 26628.791667
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 18: f(x) = 26672.611979
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 19: f(x) = 26624.714844
Attempt 2/10: Previous point was duplicate after rounding, trying fallback
Acquisition failure: Using random space-filling design as fallback.
Iteration 20: f(x) = 26646.041016
Best hyperparameters found:
Learning rate: 0.007645
Hidden neurons (l1): 96
Hidden layers: 3
Best MSE: 26533.0768
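The repeated "duplicate after rounding" messages in the log likely stem from l1 and num_hidden_layers being rounded to integers: distinct continuous proposals can collapse onto an already-evaluated configuration, and the optimizer then substitutes a random space-filling point.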
23.8 Best Practices
23.8.1 1. Always Use Feature Scaling
from spotoptim.data import get_diabetes_dataloaders

# Good: Features are standardized
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True
)
Neural networks typically perform better with normalized inputs.
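A quick sanity check on one training batch confirms this. A sketch, assuming the scaler standardizes to zero mean and unit variance:

from spotoptim.data import get_diabetes_dataloaders

train_loader, _, scaler = get_diabetes_dataloaders(scale_features=True)
batch_X, _ = next(iter(train_loader))
print(batch_X.mean(dim=0)[:3])  # roughly 0 per feature
print(batch_X.std(dim=0)[:3])   # roughly 1 per feature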
23.8.2 2. Set Random Seeds for Reproducibility
from spotoptim.data import get_diabetes_dataloaders

# Reproducible train/test splits
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    random_state=42
)

# Also set PyTorch seed
import torch
torch.manual_seed(42)
23.8.3 3. Don’t Shuffle Test Data
from spotoptim.data import get_diabetes_dataloaders

# Good: Test data in consistent order
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    shuffle_train=True,   # Shuffle training data
    shuffle_test=False    # Don't shuffle test data
)
This ensures consistent evaluation metrics across runs.
23.8.4 4. Choose Appropriate Batch Size
from spotoptim.data import get_diabetes_dataloaders

# Small dataset (442 samples) - moderate batch size works well
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=32  # Good balance for this dataset
)
Too large: Fewer gradient updates per epoch
Too small: Noisy gradients, slower training
23.8.5 5. Save the Scaler for Production
import pickle
import numpy as np
from spotoptim.data import get_diabetes_dataloaders

# Train with scaling
train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True
)

# Save scaler for production use
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Later: Load and use on new data
with open('scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)

# Create some example new data (same shape as diabetes features)
new_data = np.random.randn(5, 10)  # 5 samples, 10 features
new_data_scaled = loaded_scaler.transform(new_data)

print(f"Original data shape: {new_data.shape}")
print(f"Scaled data shape: {new_data_scaled.shape}")
# Note: the means need not be near 0 here; the scaler was fit on the
# diabetes features, and this random data follows a different distribution.
print(f"Scaled data mean: {new_data_scaled.mean(axis=0)[:3]}")
Original data shape: (5, 10)
Scaled data shape: (5, 10)
Scaled data mean: [-2.34235239 4.19590764 -9.5169612 ]
23.9 Troubleshooting
23.9.1 Issue: Out of Memory
Solution: Reduce batch size or disable pin_memory
from spotoptim.data import get_diabetes_dataloaders

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    batch_size=16,     # Smaller batches
    pin_memory=False   # Disable if not using GPU
)
23.9.2 Issue: Different Data Ranges
Symptom: Model not converging, loss is NaN
Solution: Ensure feature scaling is enabled
from spotoptim.data import get_diabetes_dataloaders

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    scale_features=True  # Must be True for neural networks
)
23.9.3 Issue: Non-Reproducible Results
Solution: Set all random seeds
import torch
import numpy as np
from spotoptim.data import get_diabetes_dataloaders

# Set all seeds
torch.manual_seed(42)
np.random.seed(42)

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    random_state=42,
    shuffle_train=False  # Disable shuffle for full reproducibility
)
23.9.4 Issue: Slow Data Loading
Solution: Use multiple workers (if not on Windows)
from spotoptim.data import get_diabetes_dataloaders

train_loader, test_loader, scaler = get_diabetes_dataloaders(
    num_workers=4,   # Use 4 subprocesses
    pin_memory=True  # Enable for GPU
)
Note: On Windows, set num_workers=0 to avoid multiprocessing issues.
23.10 Summary
The diabetes dataset utilities in SpotOptim provide:
Easy data loading: One function call gets complete data pipeline
PyTorch integration: Native Dataset and DataLoader support
Preprocessing included: Automatic feature scaling and train/test splitting
Flexible configuration: Control batch size, splitting, scaling, and more
Production ready: Save scalers and ensure reproducibility