This tutorial demonstrates the usage of TorchStandardScaler and its integration with TorchObjective in SpotOptim.
28.1 Introduction
Scaling input features is a crucial step in machine learning, especially for models trained with gradient-based optimization, such as neural networks. SpotOptim provides a convenient TorchStandardScaler (mimicking sklearn’s StandardScaler) that operates correctly on PyTorch tensors.
Additionally, TorchObjective can automatically apply this scaler to your data when use_scaler=True is set during initialization. This ensures that:
Scaling is fit only on the training set (preventing data leakage); a sketch of this pattern follows the list.
The same scaling transformation is applied to validation and test data.
The process is seamless and integrated into the objective evaluation.
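Conceptually, the pattern behind the first point is simple: the scaling statistics (per-feature mean and standard deviation) are computed on the training split only and then reused to transform the validation or test split. The following minimal sketch illustrates this with plain PyTorch tensors; it does not use SpotOptim classes and only shows what fit-on-train/transform-on-val means.

import torch

# Toy data: 80 training rows and 20 validation rows with two features
torch.manual_seed(0)
x_train = torch.rand(80, 2) * torch.tensor([1.0, 1000.0])
x_val = torch.rand(20, 2) * torch.tensor([1.0, 1000.0])

# "Fit" on the training split only: compute per-feature statistics
mean = x_train.mean(dim=0)
std = x_train.std(dim=0)

# "Transform" both splits with the *training* statistics
x_train_scaled = (x_train - mean) / std
x_val_scaled = (x_val - mean) / std   # no validation statistics leak into training

print(x_train_scaled.mean(dim=0))  # approximately [0, 0]
print(x_val_scaled.mean(dim=0))    # close to, but not exactly, [0, 0]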
28.2 Using TorchStandardScaler Directly
You can use the scaler independently for any PyTorch data-processing task.
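Because TorchStandardScaler mimics sklearn’s StandardScaler, the natural way to use it directly is via fit, transform, and fit_transform. The sketch below assumes this sklearn-style API and a top-level import; the exact import path and method names may differ in your SpotOptim version, so treat both as assumptions to adapt.

import torch
# NOTE: the import path below is an assumption; adjust it to where
# TorchStandardScaler lives in your SpotOptim installation.
from spotoptim import TorchStandardScaler

torch.manual_seed(0)
x = torch.rand(100, 2) * torch.tensor([1.0, 1000.0])   # two features on very different scales

scaler = TorchStandardScaler()
x_scaled = scaler.fit_transform(x)   # assumed sklearn-style API: fit / transform / fit_transform

print(x_scaled.mean(dim=0))  # close to zero for both features
print(x_scaled.std(dim=0))   # close to one for both features

# New data is transformed with the statistics learned from x
x_new = torch.rand(5, 2) * torch.tensor([1.0, 1000.0])
print(scaler.transform(x_new))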
28.3 Automatic Scaling in TorchObjective
The most convenient usage, however, is within TorchObjective. When you initialize TorchObjective with use_scaler=True, it automatically intercepts the data preparation phase and scales the features provided in the SpotData object before they are loaded into the PyTorch DataLoader.
28.3.1 Example: Scaling Effect
The following fully self-contained example demonstrates the effect of scaling. We will train a simple Linear Regression model on a dataset with vastly different feature scales.
Without Scaling: The model might struggle to converge or require a very carefully tuned learning rate.
With Scaling: The model typically converges faster and more reliably.
import torch
import torch.nn as nn
import numpy as np
import random
from spotoptim.core.experiment import ExperimentControl
from spotoptim.function.torch_objective import TorchObjective
from spotoptim.hyperparameters import ParameterSet
from spotoptim.core.data import SpotDataFromArray

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# 1. Create a Dataset with Disparate Scales
# Feature 1: range [0, 1]
# Feature 2: range [0, 1000]
set_seed(42)
n_samples = 200
X1 = np.random.rand(n_samples, 1)
X2 = np.random.rand(n_samples, 1) * 1000.0
X = np.hstack([X1, X2])

# Target: y = 2*x1 + 0.005*x2 + noise
# Note: Coefficient for x2 is small because x2 is large
y = 2 * X1 + 0.005 * X2 + np.random.normal(0, 0.1, size=(n_samples, 1))

# Manual split for validation
split_idx = int(n_samples * 0.8)
X_train, X_val = X[:split_idx], X[split_idx:]
y_train, y_val = y[:split_idx], y[split_idx:]

dataset = SpotDataFromArray(x_train=X_train, y_train=y_train, x_val=X_val, y_val=y_val)

# 2. Define a Simple Model
class LinearReg(nn.Module):
    def __init__(self, input_dim, output_dim, **kwargs):
        super().__init__()
        self.fc = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.fc(x)

# 3. Setup Experiment
# We will use SGD which is sensitive to feature scaling
params = ParameterSet()
params.add_float("lr", 1e-5, 1e-1, default=1e-3)

exp = ExperimentControl(
    experiment_name="scaling_test",
    model_class=LinearReg,
    dataset=dataset,
    hyperparameters=params,
    metrics=["val_loss"],
    epochs=50,
    batch_size=32,
    device="cpu",  # Force CPU for simplicity in example
    loss_function=nn.MSELoss()
)

# 4. Run Without Scaling
print("--- No Scaling ---")
objective_no_scale = TorchObjective(exp, seed=42, use_scaler=False)
# Evaluate with a learning rate that might be tricky for unscaled data
# lr=0.001 is often too high for feature x2 (scale 1000) -> gradients will be huge
res_no_scale = objective_no_scale(np.array([[1e-4]]))  # Use small LR to avoid explosion immediately
print(f"Val Loss (No Scale): {res_no_scale[0, 0]:.4f}")

# 5. Run With Scaling
print("\n--- With Scaling ---")
objective_scaled = TorchObjective(exp, seed=42, use_scaler=True)
# With scaling, standard LRs like 0.01 or 0.001 work fine
res_scaled = objective_scaled(np.array([[1e-2]]))  # Can use larger LR
print(f"Val Loss (Scaled): {res_scaled[0, 0]:.4f}")

# Comparison with same LR (if stable)
# A very small LR is needed for unscaled data to not diverge,
# but that makes convergence slow for the small feature x1.
# Scaled data allows balanced learning.
--- No Scaling ---
Val Loss (No Scale): 127379.6953
--- With Scaling ---
Val Loss (Scaled): 2.3018
28.3.2 Explanation
In the example above, X2 has values up to 1000.
Without Scaling: The gradients with respect to the weights for X2 are roughly 1000x larger than those for X1. To prevent divergence, the learning rate must be very small (e.g., 1e-4 or 1e-5), but such a small learning rate makes learning the weight for X1 extremely slow.
With Scaling: Both X1 and X2 are transformed to have approximately zero mean and unit variance. The error surface becomes more spherical (isotropic), allowing standard learning rates (e.g., 1e-2) to train the weights for both features efficiently at the same time. The sketch below illustrates the gradient imbalance.
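The gradient imbalance is easy to verify with a few lines of plain PyTorch, independent of SpotOptim. For a linear model trained with MSE loss, the gradient of the loss with respect to each weight is proportional to the values of the corresponding feature, so the weight attached to X2 receives gradients roughly 1000x larger than the weight attached to X1; after standardization the two gradients are of comparable size.

import torch

torch.manual_seed(0)
n = 200
x1 = torch.rand(n, 1)            # scale ~1
x2 = torch.rand(n, 1) * 1000.0   # scale ~1000
x = torch.cat([x1, x2], dim=1)
y = 2 * x1 + 0.005 * x2

def grad_magnitudes(features):
    """Per-weight gradient magnitudes of the MSE loss at initialization."""
    model = torch.nn.Linear(2, 1)
    loss = torch.nn.functional.mse_loss(model(features), y)
    loss.backward()
    return model.weight.grad.abs().squeeze()

print(grad_magnitudes(x))        # the gradient for x2 dwarfs the one for x1

x_scaled = (x - x.mean(dim=0)) / x.std(dim=0)
print(grad_magnitudes(x_scaled)) # both gradients are on a comparable scale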
By simply adding use_scaler=True, TorchObjective handles this best practice for you automatically.