model_selection.split_ts_cv

Time series cross-validation splitting.

Classes

Name Description
TimeSeriesFold Class to split time series data into train and test folds.

TimeSeriesFold

model_selection.split_ts_cv.TimeSeriesFold(
    steps,
    initial_train_size=None,
    fold_stride=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Class to split time series data into train and test folds.

When used within a backtesting or hyperparameter search, the arguments ‘initial_train_size’, ‘window_size’ and ‘differentiation’ are not required as they are automatically set by the backtesting or hyperparameter search functions.

Parameters

Name Type Description Default
steps int Number of observations to predict in each fold. This is also commonly referred to as the forecast horizon or test size. required
initial_train_size int | str | pd.Timestamp | None Number of observations used for initial training. - If None or 0, the initial forecaster is not trained in the first fold. - If an integer, the number of observations used for initial training. - If a date string or pandas Timestamp, it is the last date included in the initial training set. Defaults to None. None
fold_stride int | None Number of observations that the start of the test set advances between consecutive folds. - If None, it defaults to the same value as steps, meaning that folds are placed back-to-back without overlap. - If fold_stride < steps, test sets overlap and multiple forecasts will be generated for the same observations. - If fold_stride > steps, gaps are left between consecutive test sets. Defaults to None. None
window_size int | None Number of observations needed to generate the autoregressive predictors. Defaults to None. None
differentiation int | None Order of differentiation. It is used to extend the last_window by as many observations as the differentiation order. Defaults to None. None
refit bool | int Whether to refit the forecaster in each fold. - If True, the forecaster is refitted in each fold. - If False, the forecaster is trained only in the first fold. - If an integer, the forecaster is trained in the first fold and then refitted every refit folds. Defaults to False. False
fixed_train_size bool Whether the training size is fixed or increases in each fold. Defaults to True. True
gap int Number of observations between the end of the training set and the start of the test set. Defaults to 0. 0
skip_folds int | list[int] | None Folds to skip. - If an integer, every ‘skip_folds’-th fold is returned. - If a list, the indexes of the folds to skip. For example, if skip_folds=3 and there are 10 folds, the returned folds are 0, 3, 6, and 9. If skip_folds=[1, 2, 3], the returned folds are 0, 4, 5, 6, 7, 8, and 9. Defaults to None. None
allow_incomplete_fold bool Whether to allow the last fold to include fewer observations than steps. If False, the last fold is excluded if it is incomplete. Defaults to True. True
return_all_indexes bool Whether to return all indexes or only the start and end indexes of each fold. Defaults to False. False
verbose bool Whether to print information about generated folds. Defaults to True. True
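
The refit and skip_folds rules above can be illustrated with a small standalone sketch. It does not call the library: the helper names `sketch_refit_flags` and `sketch_kept_folds` are hypothetical, and the integer `refit` rule (fit when the fold number is a multiple of `refit`) is one reading of the description, not a confirmed implementation detail.

```python
def sketch_refit_flags(n_folds, refit):
    """Return one boolean per fold: True where the forecaster would be fitted."""
    if isinstance(refit, bool):
        # True: fit in every fold. False: fit only in the first fold.
        return [True if i == 0 else refit for i in range(n_folds)]
    # Positive integer (assumed): fit in the first fold, then every `refit` folds.
    return [i % refit == 0 for i in range(n_folds)]


def sketch_kept_folds(n_folds, skip_folds):
    """Return the indexes of the folds that remain after applying skip_folds."""
    if skip_folds is None:
        return list(range(n_folds))
    if isinstance(skip_folds, int):
        # Integer: keep every skip_folds-th fold, starting at fold 0.
        return list(range(0, n_folds, skip_folds))
    # List: drop the listed fold indexes.
    skipped = set(skip_folds)
    return [i for i in range(n_folds) if i not in skipped]


print(sketch_refit_flags(5, False))      # [True, False, False, False, False]
print(sketch_refit_flags(5, 2))          # [True, False, True, False, True]
print(sketch_kept_folds(10, 3))          # [0, 3, 6, 9]
print(sketch_kept_folds(10, [1, 2, 3]))  # [0, 4, 5, 6, 7, 8, 9]
```

The two skip_folds calls reproduce the examples given in the parameter description.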

Attributes

Name Type Description
steps Number of observations to predict in each fold.
initial_train_size Number of observations used for initial training. If None or 0, the initial forecaster is not trained in the first fold.
fold_stride Number of observations that the start of the test set advances between consecutive folds.
overlapping_folds Whether the folds overlap.
window_size Number of observations needed to generate the autoregressive predictors.
differentiation Order of differentiation. It is used to extend the last_window by as many observations as the differentiation order.
refit Whether to refit the forecaster in each fold.
fixed_train_size Whether the training size is fixed or increases in each fold.
gap Number of observations between the end of the training set and the start of the test set.
skip_folds Number of folds to skip.
allow_incomplete_fold Whether to allow the last fold to include fewer observations than steps.
return_all_indexes Whether to return all indexes or only the start and end indexes of each fold.
verbose Whether to print information about generated folds.

Examples

Basic usage with fixed train size:

>>> import pandas as pd
>>> import numpy as np
>>> from spotforecast2_safe.model_selection import TimeSeriesFold
>>> # Create sample time series data
>>> dates = pd.date_range('2020-01-01', periods=100, freq='D')
>>> y = pd.Series(np.arange(100), index=dates)
>>> # Create fold splitter
>>> cv = TimeSeriesFold(
...     steps=10,
...     initial_train_size=50,
...     refit=True,
...     fixed_train_size=True
... )
>>> # Get folds
>>> folds = cv.split(y)
>>> print(f"Number of folds: {len(folds)}")
Number of folds: 5

Overlapping folds with custom stride:

>>> cv = TimeSeriesFold(
...     steps=30,
...     initial_train_size=50,
...     fold_stride=7,
...     fixed_train_size=False
... )
>>> folds = cv.split(y)
>>> # First test fold covers [50, 80), second [57, 87), etc.

Return as pandas DataFrame:

>>> cv = TimeSeriesFold(steps=10, initial_train_size=50)
>>> folds_df = cv.split(y, as_pandas=True)
>>> print(folds_df.columns.tolist())
['fold', 'train_start', 'train_end', 'last_window_start', 'last_window_end', 'test_start', 'test_end', 'test_start_with_gap', 'test_end_with_gap', 'fit_forecaster']

Skip folds for faster evaluation:

>>> cv = TimeSeriesFold(
...     steps=5,
...     initial_train_size=50,
...     skip_folds=2
... )
>>> folds = cv.split(y)
>>> # Returns folds 0, 2, 4, 6, ...

Note

Returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. For example, if the input series is X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], the initial_train_size = 3, window_size = 2, steps = 4, and gap = 1, the output of the first fold will be: [0, [0, 3], [1, 3], [3, 8], [4, 8], True].

The first element is the fold number. The first list [0, 3] indicates that the training set goes from the first to the third observation. The second list [1, 3] indicates that the last window seen by the forecaster during training goes from the second to the third observation. The third list [3, 8] indicates that the test set goes from the fourth to the eighth observation (gap + steps = 5 positions). The fourth list [4, 8] indicates that the test set including the gap goes from the fifth to the eighth observation (steps = 4 positions). The boolean True indicates that the forecaster should be trained in this fold.

Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
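
As a quick check of the positions described above, the example fold can be applied to X with plain list slicing. This is only an illustration: the helper `take` is hypothetical, and with a pandas Series the same bounds would be passed to iloc.

```python
X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
fold = [0, [0, 3], [1, 3], [3, 8], [4, 8], True]

# Unpack the fold in the order described in the note.
fold_number, train, last_window, test, test_with_gap, fit_forecaster = fold


def take(values, bounds):
    """Slice `values` with a [start, end] pair; start inclusive, end exclusive."""
    start, end = bounds
    return values[start:end]


print(take(X, train))          # [10, 11, 12]
print(take(X, last_window))    # [11, 12]
print(take(X, test))           # [13, 14, 15, 16, 17]
print(take(X, test_with_gap))  # [14, 15, 16, 17]
```

Because the end position is exclusive, position 8 (value 18) does not appear in either test slice.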

As an example, with initial_train_size=50, steps=30, and fold_stride=7, the first test fold will cover observations [50, 80), the second fold [57, 87), and the third fold [64, 94). This configuration produces multiple forecasts for the same observations, which is often desirable in rolling-origin evaluation.
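
The overlap in this configuration can be made concrete with a short sketch that does not use the library. The helper `sketch_complete_test_windows` is hypothetical; it assumes gap=0, keeps only complete folds, and uses a series of 100 observations as in the Examples section.

```python
from collections import Counter


def sketch_complete_test_windows(n_obs, initial_train_size, steps, fold_stride=None):
    """Return (start, end) positions of complete test windows, end exclusive."""
    # A missing fold_stride falls back to steps (back-to-back folds).
    stride = steps if fold_stride is None else fold_stride
    windows = []
    start = initial_train_size
    while start + steps <= n_obs:
        windows.append((start, start + steps))
        start += stride
    return windows


windows = sketch_complete_test_windows(100, 50, 30, fold_stride=7)
print(windows)  # [(50, 80), (57, 87), (64, 94)]

# Count how many test windows contain each position.
coverage = Counter(pos for start, end in windows for pos in range(start, end))
print(coverage[50], coverage[64], coverage[86])  # 1 3 2
```

Position 64 falls inside all three windows, so under these assumptions it would receive three forecasts, each made from a different origin.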

Methods

Name Description
split Split the time series data into train and test folds.

split

model_selection.split_ts_cv.TimeSeriesFold.split(X, as_pandas=False)

Split the time series data into train and test folds.

Parameters

Name Type Description Default
X pd.Series | pd.DataFrame | pd.Index | dict[str, pd.Series | pd.DataFrame] Time series data or index to split. Can be a pandas Series, DataFrame, Index, or a dictionary of Series/DataFrames. required
as_pandas bool If True, the folds are returned as a DataFrame. This is useful to visualize the folds in a more interpretable way. Defaults to False. False
Returns

Name Type Description
list | pd.DataFrame A list of lists containing the positions for each fold, or a DataFrame if as_pandas=True. Each inner list contains the fold number, four lists, and a boolean, in the following order:
- fold: fold number.
- [train_start, train_end]: list with the start and end positions of the training set.
- [last_window_start, last_window_end]: list with the start and end positions of the last window seen by the forecaster during training. The last window is used to generate the lags used as predictors. If differentiation is included, the interval is extended by as many observations as the differentiation order. If the argument window_size is None, this list is empty.
- [test_start, test_end]: list with the start and end positions of the test set. These are the observations used to evaluate the forecaster.
- [test_start_with_gap, test_end_with_gap]: list with the start and end positions of the test set including the gap. The gap is the number of observations between the end of the training set and the start of the test set.
- fit_forecaster: boolean indicating whether the forecaster should be fitted in this fold.

Note

The returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc.

If as_pandas is True, the folds are returned as a DataFrame with the following columns: ‘fold’, ‘train_start’, ‘train_end’, ‘last_window_start’, ‘last_window_end’, ‘test_start’, ‘test_end’, ‘test_start_with_gap’, ‘test_end_with_gap’, ‘fit_forecaster’.

Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
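
To show how the list format lines up with the DataFrame columns listed above, the following sketch flattens one fold into a dictionary keyed by column name. The helper `sketch_fold_to_record` is hypothetical and is not part of the library; mapping an empty last-window list to None is an assumption of the sketch.

```python
COLUMNS = [
    "fold", "train_start", "train_end",
    "last_window_start", "last_window_end",
    "test_start", "test_end",
    "test_start_with_gap", "test_end_with_gap",
    "fit_forecaster",
]


def sketch_fold_to_record(fold):
    """Flatten one list-format fold into a dict keyed by column name."""
    fold_number, train, last_window, test, test_with_gap, fit_forecaster = fold
    # Assumption: an empty last-window list (window_size=None) becomes None, None.
    lw_start, lw_end = last_window if last_window else (None, None)
    values = [fold_number, *train, lw_start, lw_end, *test, *test_with_gap, fit_forecaster]
    return dict(zip(COLUMNS, values))


record = sketch_fold_to_record([0, [0, 3], [1, 3], [3, 8], [4, 8], True])
print(record["train_end"], record["test_start_with_gap"], record["fit_forecaster"])
# 3 4 True
```

Each two-element list contributes a start and an end column, which is why one fold of six elements expands to ten columns.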