model_selection.utils_common
Common validation and initialization utilities for model selection.
Functions
| Name | Description |
|---|---|
| check_backtesting_input | This is a helper function to check most inputs of backtesting functions in the model_selection module. |
| check_one_step_ahead_input | This is a helper function to check most inputs of hyperparameter tuning functions in the model_selection module when using a OneStepAheadFold. |
| initialize_lags_grid | Initialize lags grid and lags label for model selection. |
| select_n_jobs_backtesting | Select the optimal number of jobs to use in the backtesting process. This selection is based on heuristics and is not guaranteed to be optimal. |
check_backtesting_input
model_selection.utils_common.check_backtesting_input(
forecaster,
cv,
metric,
add_aggregated_metric=True,
y=None,
series=None,
exog=None,
interval=None,
interval_method='bootstrapping',
alpha=None,
n_boot=250,
use_in_sample_residuals=True,
use_binned_residuals=True,
random_state=123,
return_predictors=False,
freeze_params=True,
n_jobs='auto',
show_progress=True,
suppress_warnings=False,
)
This is a helper function to check most inputs of backtesting functions in the model_selection module.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| forecaster | object | Forecaster model. | required |
| cv | object | TimeSeriesFold object with the information needed to split the data into folds. | required |
| metric | str | Callable | list[str | Callable] | Metric used to quantify the goodness of fit of the model. | required |
| add_aggregated_metric | bool | If True, the aggregated metrics (average, weighted average and pooling) over all levels are also returned (only multiseries). | True |
| y | pd.Series | None | Training time series for uni-series forecasters. | None |
| series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | Training time series for multi-series forecasters. | None |
| exog | pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None | Exogenous variables. | None |
| interval | float | list[float] | tuple[float] | str | object | None | Specifies whether probabilistic predictions should be estimated and the method to use. The following options are supported: - If float, represents the nominal (expected) coverage (between 0 and 1). For instance, interval=0.95 corresponds to [2.5, 97.5] percentiles. - If list or tuple: sequence of percentiles to compute, each value must be between 0 and 100 inclusive. For example, a 95% confidence interval can be specified as interval = [2.5, 97.5], or multiple percentiles (e.g. 10, 50 and 90) as interval = [10, 50, 90]. - If 'bootstrapping' (str): n_boot bootstrapping predictions will be generated. - If scipy.stats distribution object, the distribution parameters will be estimated for each prediction. - If None, no probabilistic predictions are estimated. | None |
| interval_method | str | Technique used to estimate prediction intervals. Available options: - ‘bootstrapping’: Bootstrapping is used to generate prediction intervals. - ‘conformal’: Employs the conformal prediction split method for interval estimation. | 'bootstrapping' |
| alpha | float | None | The confidence intervals used in ForecasterStats are (1 - alpha) %. | None |
| n_boot | int | Number of bootstrapping iterations to perform when estimating prediction intervals. | 250 |
| use_in_sample_residuals | bool | If True, residuals from the training data are used as proxy of prediction error to create prediction intervals. If False, out_sample_residuals are used if they are already stored inside the forecaster. | True |
| use_binned_residuals | bool | If True, residuals are selected based on the predicted values (binned selection). If False, residuals are selected randomly. | True |
| random_state | int | Seed for the random number generator to ensure reproducibility. | 123 |
| return_predictors | bool | If True, the predictors used to make the predictions are also returned. | False |
| n_jobs | int | str | The number of jobs to run in parallel. If -1, then the number of jobs is set to the number of cores. If 'auto', n_jobs is set using the function select_n_jobs_backtesting. | 'auto' |
| freeze_params | bool | Determines whether to freeze the model parameters after the first fit for estimators that perform automatic model selection. - If True, the model parameters found during the first fit (e.g., order and seasonal_order for Arima, or smoothing parameters for Ets) are reused in all subsequent refits. This avoids re-running the automatic selection procedure in each fold and reduces runtime. - If False, automatic model selection is performed independently in each refit, allowing parameters to adapt across folds. This increases runtime and adds a params column to the output with the parameters selected per fold. | True |
| show_progress | bool | Whether to show a progress bar. | True |
| suppress_warnings | bool | If True, spotforecast warnings will be suppressed during the backtesting process. | False |
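The conversion from a nominal coverage to symmetric percentiles described for the `interval` parameter (e.g. interval=0.95 corresponds to [2.5, 97.5]) can be illustrated with a small plain-Python sketch. This is independent of spotforecast2 and the helper name `coverage_to_percentiles` is hypothetical, not part of the library:

```python
def coverage_to_percentiles(coverage: float) -> list[float]:
    """Map a nominal coverage in (0, 1) to the two symmetric percentiles
    of the equal-tailed interval, e.g. 0.95 -> [2.5, 97.5]."""
    if not 0 < coverage < 1:
        raise ValueError("coverage must be between 0 and 1")
    # round to guard against binary floating-point error (1 - 0.95 != 0.05 exactly)
    lower = round(100 * (1 - coverage) / 2, 10)
    return [lower, round(100 - lower, 10)]

print(coverage_to_percentiles(0.95))  # [2.5, 97.5]
```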
Returns
| Name | Type | Description |
|---|---|---|
| None | None | Nothing is returned; the function only validates its inputs. |
Examples
>>> import pandas as pd
>>> from spotforecast2.model_selection.utils_common import check_backtesting_input
>>> from spotforecast2_safe.forecaster.recursive import ForecasterRecursive
>>> from spotforecast2.model_selection import TimeSeriesFold
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import mean_squared_error
>>> y = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> forecaster = ForecasterRecursive(LinearRegression(), lags=2)
>>> cv = TimeSeriesFold(
... steps=3,
... initial_train_size=5,
... gap=0,
... refit=False,
... fixed_train_size=False,
... allow_incomplete_fold=True
... )
>>> check_backtesting_input(
... forecaster=forecaster,
... cv=cv,
... metric=mean_squared_error,
... y=y
... )
check_one_step_ahead_input
model_selection.utils_common.check_one_step_ahead_input(
forecaster,
cv,
metric,
y=None,
series=None,
exog=None,
show_progress=True,
suppress_warnings=False,
)
This is a helper function to check most inputs of hyperparameter tuning functions in the model_selection module when using a OneStepAheadFold.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| forecaster | object | Forecaster model. | required |
| cv | object | OneStepAheadFold object with the information needed to split the data into folds. | required |
| metric | str | Callable | list[str | Callable] | Metric used to quantify the goodness of fit of the model. | required |
| y | pd.Series | None | Training time series for uni-series forecasters. | None |
| series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | Training time series for multi-series forecasters. | None |
| exog | pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None | Exogenous variables. | None |
| show_progress | bool | Whether to show a progress bar. | True |
| suppress_warnings | bool | If True, spotforecast warnings will be suppressed during the hyperparameter search. | False |
Returns
| Name | Type | Description |
|---|---|---|
| None | None | Nothing is returned; the function only validates its inputs. |
Examples
>>> import pandas as pd
>>> from spotforecast2.model_selection.utils_common import check_one_step_ahead_input
>>> from spotforecast2_safe.forecaster.recursive import ForecasterRecursive
>>> from spotforecast2.model_selection import OneStepAheadFold
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import mean_squared_error
>>> y = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> forecaster = ForecasterRecursive(LinearRegression(), lags=2)
>>> cv = OneStepAheadFold(
... initial_train_size=5,
... return_all_predictions=False
... )
>>> check_one_step_ahead_input(
... forecaster=forecaster,
... cv=cv,
... metric=mean_squared_error,
... y=y
... )
initialize_lags_grid
model_selection.utils_common.initialize_lags_grid(forecaster, lags_grid=None)
Initialize lags grid and lags label for model selection.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| forecaster | object | Forecaster model. ForecasterRecursive, ForecasterDirect, ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate. | required |
| lags_grid | list[int | list[int] | np.ndarray[int] | range[int]] | dict[str, list[int | list[int] | np.ndarray[int] | range[int]]] | None | Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If dict, the keys are used as labels in the results DataFrame, and the values are used as the lists of lags to try. | None |
Returns
| Name | Type | Description |
|---|---|---|
| tuple | tuple[dict[str, int], str] | (lags_grid, lags_label) - lags_grid (dict): Dictionary with lags configuration for each iteration. - lags_label (str): Label for lags representation in the results object. |
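The labeling behaviour described above can be sketched in plain Python. This is a simplified reimplementation for illustration only, not the library code; it assumes labels are the string form of each lags entry for list input, that user-provided keys become the labels for dict input, and that the corresponding lags_label values are 'values' and 'keys':

```python
def initialize_lags_grid_sketch(lags_grid, default_lags=2):
    """Simplified illustration of building a labeled lags grid.

    - list input: labels are str(entry), lags_label is 'values'
    - dict input: the dict's own keys are the labels, lags_label is 'keys'
    - None: fall back to a single default configuration
    """
    if lags_grid is None:
        return {str(default_lags): default_lags}, "values"
    if isinstance(lags_grid, dict):
        return dict(lags_grid), "keys"
    return {str(lags): lags for lags in lags_grid}, "values"

grid, label = initialize_lags_grid_sketch([2, 4])
print(grid, label)  # {'2': 2, '4': 4} values
```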
Examples
>>> from spotforecast2.model_selection.utils_common import initialize_lags_grid
>>> from spotforecast2_safe.forecaster.recursive import ForecasterRecursive
>>> from sklearn.linear_model import LinearRegression
>>> forecaster = ForecasterRecursive(LinearRegression(), lags=2)
>>> lags_grid = [2, 4]
>>> lags_grid, lags_label = initialize_lags_grid(forecaster, lags_grid)
>>> print(lags_grid)
{'2': 2, '4': 4}
>>> print(lags_label)
values
select_n_jobs_backtesting
model_selection.utils_common.select_n_jobs_backtesting(forecaster, refit)
Select the optimal number of jobs to use in the backtesting process. This selection is based on heuristics and is not guaranteed to be optimal.
The number of jobs is chosen as follows:
- If `refit` is an integer, then `n_jobs = 1`. This is because parallelization doesn't work with intermittent refit.
- If forecaster is 'ForecasterRecursive' and the estimator is a linear estimator, then `n_jobs = 1`.
- If forecaster is 'ForecasterRecursive' and the estimator is not a linear estimator, then `n_jobs = cpu_count() - 1`.
- If forecaster is 'ForecasterDirect' or 'ForecasterDirectMultiVariate' and `refit = True`, then `n_jobs = cpu_count() - 1`.
- If forecaster is 'ForecasterDirect' or 'ForecasterDirectMultiVariate' and `refit = False`, then `n_jobs = 1`.
- If forecaster is 'ForecasterRecursiveMultiSeries', then `n_jobs = cpu_count() - 1`.
- If forecaster is 'ForecasterStats' or 'ForecasterEquivalentDate', then `n_jobs = 1`.
- If the estimator is a `LGBMRegressor(n_jobs=1)`, then `n_jobs = cpu_count() - 1`.
- If the estimator is a `LGBMRegressor` with internal n_jobs != 1, then `n_jobs = 1`. This is because `lightgbm` is highly optimized for gradient boosting and parallelizes operations at a very fine-grained level, making additional parallelization unnecessary and potentially harmful due to resource contention.
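The rules above can be condensed into a plain-Python sketch. This is a simplified reimplementation for illustration only, not the library's code: it dispatches on class names passed as strings, the `LINEAR_ESTIMATORS` set is an illustrative subset, the precedence given to the LGBMRegressor rules is an assumption, and the function name `select_n_jobs_sketch` is hypothetical:

```python
import multiprocessing

# illustrative subset; the real heuristic inspects the estimator itself
LINEAR_ESTIMATORS = {"LinearRegression", "Ridge", "Lasso", "ElasticNet"}

def select_n_jobs_sketch(forecaster_name, estimator_name, refit, estimator_n_jobs=1):
    """Condensed version of the n_jobs heuristics listed above."""
    n_cores = max(multiprocessing.cpu_count() - 1, 1)
    # intermittent refit (integer, not bool) does not parallelize
    if isinstance(refit, int) and not isinstance(refit, bool):
        return 1
    if estimator_name == "LGBMRegressor":
        # lightgbm parallelizes internally; only parallelize folds
        # when its own n_jobs is pinned to 1
        return n_cores if estimator_n_jobs == 1 else 1
    if forecaster_name in ("ForecasterStats", "ForecasterEquivalentDate"):
        return 1
    if forecaster_name == "ForecasterRecursiveMultiSeries":
        return n_cores
    if forecaster_name in ("ForecasterDirect", "ForecasterDirectMultiVariate"):
        return n_cores if refit else 1
    if forecaster_name == "ForecasterRecursive":
        return 1 if estimator_name in LINEAR_ESTIMATORS else n_cores
    return 1  # conservative default

print(select_n_jobs_sketch("ForecasterRecursive", "LinearRegression", refit=True))  # 1
```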
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| forecaster | object | Forecaster model. | required |
| refit | bool | int | If the forecaster is refitted during the backtesting process. | required |
Returns
| Name | Type | Description |
|---|---|---|
| int | int | The number of jobs to run in parallel. |
Examples
>>> from spotforecast2.model_selection.utils_common import select_n_jobs_backtesting
>>> from spotforecast2_safe.forecaster.recursive import ForecasterRecursive
>>> from sklearn.linear_model import LinearRegression
>>> forecaster = ForecasterRecursive(LinearRegression(), lags=2)
>>> select_n_jobs_backtesting(forecaster, refit=True)
1