model_selection.split_ts_cv

Time series cross-validation splitting.

Classes

Name	Description
TimeSeriesFold	Class to split time series data into train and test folds.

TimeSeriesFold

model_selection.split_ts_cv.TimeSeriesFold(
    steps,
    initial_train_size=None,
    fold_stride=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Class to split time series data into train and test folds.

When used within a backtesting or hyperparameter search function, the arguments `initial_train_size`, `window_size` and `differentiation` do not need to be provided, because those functions set them automatically.

Parameters

Name	Type	Description	Default
steps	int	Number of observations to predict in each fold. Also commonly referred to as the forecast horizon or test size.	required
initial_train_size	int \| str \| pd.Timestamp \| None	Number of observations used for initial training. - If `None` or 0, the initial forecaster is not trained in the first fold. - If an integer, the number of observations used for initial training. - If a date string or pandas Timestamp, it is the last date included in the initial training set. Defaults to None.	`None`
fold_stride	int \| None	Number of observations that the start of the test set advances between consecutive folds. - If `None`, it defaults to the same value as `steps`, meaning that folds are placed back-to-back without overlap. - If `fold_stride < steps`, test sets overlap and multiple forecasts will be generated for the same observations. - If `fold_stride > steps`, gaps are left between consecutive test sets. Defaults to None.	`None`
window_size	int \| None	Number of observations needed to generate the autoregressive predictors. Defaults to None.	`None`
differentiation	int \| None	Differentiation order. The `last_window` is extended by as many observations as this order. Defaults to None.	`None`
refit	bool \| int	Whether to refit the forecaster in each fold. - If `True`, the forecaster is refitted in each fold. - If `False`, the forecaster is trained only in the first fold. - If an integer, the forecaster is trained in the first fold and then refitted every `refit` folds. Defaults to False.	`False`
fixed_train_size	bool	Whether the training size is fixed or increases in each fold. Defaults to True.	`True`
gap	int	Number of observations between the end of the training set and the start of the test set. Defaults to 0.	`0`
skip_folds	int \| list[int] \| None	Number of folds to skip. - If an integer, every `skip_folds`-th fold is returned. - If a list, the indexes of the folds to skip. For example, if `skip_folds=3` and there are 10 folds, the returned folds are 0, 3, 6, and 9. If `skip_folds=[1, 2, 3]`, the returned folds are 0, 4, 5, 6, 7, 8, and 9. Defaults to None.	`None`
allow_incomplete_fold	bool	Whether to allow the last fold to include fewer observations than `steps`. If `False`, the last fold is excluded if it is incomplete. Defaults to True.	`True`
return_all_indexes	bool	Whether to return all indexes or only the start and end indexes of each fold. Defaults to False.	`False`
verbose	bool	Whether to print information about generated folds. Defaults to True.	`True`
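
The `refit` and `skip_folds` rules in the table above can be sketched in plain Python. The helpers `fit_flags` and `kept_folds` are hypothetical names used only for illustration; they are not part of the library and encode one reading of the parameter descriptions, not the actual implementation.

```python
def fit_flags(n_folds, refit):
    """Return, for each fold, whether the forecaster would be fitted."""
    if refit is True:
        return [True] * n_folds
    if refit is False:
        # Trained only in the first fold.
        return [i == 0 for i in range(n_folds)]
    # Integer refit: fit in fold 0, then again every `refit` folds.
    return [i % refit == 0 for i in range(n_folds)]


def kept_folds(n_folds, skip_folds=None):
    """Return the indexes of the folds that remain after skipping."""
    if skip_folds is None:
        return list(range(n_folds))
    if isinstance(skip_folds, int):
        # Keep every `skip_folds`-th fold, starting at fold 0.
        return list(range(0, n_folds, skip_folds))
    skipped = set(skip_folds)
    return [i for i in range(n_folds) if i not in skipped]


print(kept_folds(10, 3))          # [0, 3, 6, 9]
print(kept_folds(10, [1, 2, 3]))  # [0, 4, 5, 6, 7, 8, 9]
print(fit_flags(5, 2))            # [True, False, True, False, True]
```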

Attributes

Name	Type	Description
steps		Number of observations to predict in each fold.
initial_train_size		Number of observations used for initial training. If `None` or 0, the initial forecaster is not trained in the first fold.
fold_stride		Number of observations that the start of the test set advances between consecutive folds.
overlapping_folds		Whether the folds overlap.
window_size		Number of observations needed to generate the autoregressive predictors.
differentiation		Differentiation order. The `last_window` is extended by as many observations as this order.
refit		Whether to refit the forecaster in each fold.
fixed_train_size		Whether the training size is fixed or increases in each fold.
gap		Number of observations between the end of the training set and the start of the test set.
skip_folds		Number of folds to skip.
allow_incomplete_fold		Whether to allow the last fold to include fewer observations than `steps`.
return_all_indexes		Whether to return all indexes or only the start and end indexes of each fold.
verbose		Whether to print information about generated folds.

Examples

Basic usage with fixed train size:

>>> import pandas as pd
>>> import numpy as np
>>> from spotforecast2_safe.model_selection import TimeSeriesFold
>>> # Create sample time series data
>>> dates = pd.date_range('2020-01-01', periods=100, freq='D')
>>> y = pd.Series(np.arange(100), index=dates)
>>> # Create fold splitter
>>> cv = TimeSeriesFold(
...     steps=10,
...     initial_train_size=50,
...     refit=True,
...     fixed_train_size=True
... )
>>> # Get folds
>>> folds = cv.split(y)
>>> print(f"Number of folds: {len(folds)}")
Number of folds: 5

Overlapping folds with custom stride:

>>> cv = TimeSeriesFold(
...     steps=30,
...     initial_train_size=50,
...     fold_stride=7,
...     fixed_train_size=False
... )
>>> folds = cv.split(y)
>>> # First test fold covers [50, 80), second [57, 87), etc.

Return as pandas DataFrame:

>>> cv = TimeSeriesFold(steps=10, initial_train_size=50)
>>> folds_df = cv.split(y, as_pandas=True)
>>> print(folds_df.columns.tolist())
['fold', 'train_start', 'train_end', 'last_window_start', 'last_window_end', 'test_start', 'test_end', 'test_start_with_gap', 'test_end_with_gap', 'fit_forecaster']

Skip folds for faster evaluation:

>>> cv = TimeSeriesFold(
...     steps=5,
...     initial_train_size=50,
...     skip_folds=2
... )
>>> folds = cv.split(y)
>>> # Returns folds 0, 2, 4, 6, ...

Note

Returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. For example, if the input series is X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], the initial_train_size = 3, window_size = 2, steps = 4, and gap = 1, the output of the first fold will be: [0, [0, 3], [1, 3], [3, 8], [4, 8], True].

The first element is the fold number. The first list [0, 3] indicates that the training set goes from the first to the third observation. The second list [1, 3] indicates that the last window seen by the forecaster during training goes from the second to the third observation. The third list [3, 8] indicates that the test set goes from the fourth to the eighth observation. The fourth list [4, 8] indicates that the test set including the gap goes from the fifth to the eighth observation. The boolean True indicates that the forecaster should be trained in this fold.

Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
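
As a small illustration of the positional, half-open convention, the fold from the example above can be used to slice the series with `iloc`. The fold is written out by hand here rather than produced by `split`, so only pandas is needed.

```python
import pandas as pd

X = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

# First fold from the note above, copied by hand (not computed by the library).
fold = [0, [0, 3], [1, 3], [3, 8], [4, 8], True]
fold_number, train, last_window, test, test_with_gap, fit_forecaster = fold

# Each [start, end] pair is positional and end-exclusive, like a Python slice.
train_values = X.iloc[train[0]:train[1]].tolist()
last_window_values = X.iloc[last_window[0]:last_window[1]].tolist()
test_with_gap_values = X.iloc[test_with_gap[0]:test_with_gap[1]].tolist()

print(train_values)          # [10, 11, 12]
print(last_window_values)    # [11, 12]
print(test_with_gap_values)  # [14, 15, 16, 17]
```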

As an example, with initial_train_size=50, steps=30, and fold_stride=7, the first test fold will cover observations [50, 80), the second fold [57, 87), and the third fold [64, 94). This configuration produces multiple forecasts for the same observations, which is often desirable in rolling-origin evaluation.
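
The test ranges quoted above follow from simple arithmetic on `initial_train_size`, `steps` and `fold_stride`. The sketch below reproduces them with a hypothetical helper, `fold_test_windows`, which is not part of the library. It assumes `gap=0` and clips the last windows at the end of the series, which may differ from how the class handles incomplete folds.

```python
def fold_test_windows(n_obs, initial_train_size, steps, fold_stride=None):
    """Return (start, end) test positions per fold, end-exclusive, gap=0."""
    # When fold_stride is None, folds are placed back-to-back.
    stride = steps if fold_stride is None else fold_stride
    windows = []
    start = initial_train_size
    while start < n_obs:
        # Assumption: clip at the series end; the class may treat
        # incomplete folds differently (see allow_incomplete_fold).
        windows.append((start, min(start + steps, n_obs)))
        start += stride
    return windows


windows = fold_test_windows(n_obs=100, initial_train_size=50, steps=30, fold_stride=7)
print(windows[:3])  # [(50, 80), (57, 87), (64, 94)]
```

Because `fold_stride` (7) is smaller than `steps` (30), consecutive windows share 23 positions, which is why several forecasts exist for the same observation.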

Methods

Name	Description
split	Split the time series data into train and test folds.

split

model_selection.split_ts_cv.TimeSeriesFold.split(X, as_pandas=False)

Split the time series data into train and test folds.

Parameters

Name	Type	Description	Default
X	pd.Series \| pd.DataFrame \| pd.Index \| dict[str, pd.Series \| pd.DataFrame]	Time series data or index to split. Can be a pandas Series, DataFrame, Index, or a dictionary of Series/DataFrames.	required
as_pandas	bool	If True, the folds are returned as a DataFrame. This is useful to visualize the folds in a more interpretable way. Defaults to False.	`False`

Returns

Name	Type	Description
	list \| pd.DataFrame	A list of lists containing the positions for each fold, or a DataFrame if `as_pandas=True`. Each inner list contains the fold number, four `[start, end]` lists and a boolean, in this order: - fold: fold number. - [train_start, train_end]: list with the start and end positions of the training set. - [last_window_start, last_window_end]: list with the start and end positions of the last window seen by the forecaster during training. The last window is used to generate the lags used as predictors. If `differentiation` is included, the interval is extended by as many observations as the differentiation order. If the argument `window_size` is `None`, this list is empty. - [test_start, test_end]: list with the start and end positions of the test set. These are the observations used to evaluate the forecaster. - [test_start_with_gap, test_end_with_gap]: list with the start and end positions of the test set including the gap. The gap is the number of observations between the end of the training set and the start of the test set. - fit_forecaster: boolean indicating whether the forecaster should be fitted in this fold.

Note

The returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc.

If as_pandas is True, the folds are returned as a DataFrame with the following columns: ‘fold’, ‘train_start’, ‘train_end’, ‘last_window_start’, ‘last_window_end’, ‘test_start’, ‘test_end’, ‘test_start_with_gap’, ‘test_end_with_gap’, ‘fit_forecaster’.

Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
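
To show how the nested list form relates to the column names listed above, the following sketch flattens a hand-written fold into a DataFrame using pandas only. It is an illustration of the layout, not the code `split` runs when `as_pandas=True`; the single fold is the one from the class-level note.

```python
import pandas as pd

columns = [
    "fold", "train_start", "train_end",
    "last_window_start", "last_window_end",
    "test_start", "test_end",
    "test_start_with_gap", "test_end_with_gap",
    "fit_forecaster",
]

# One fold in list form, copied by hand (not computed by the library).
folds = [
    [0, [0, 3], [1, 3], [3, 8], [4, 8], True],
]

# Unpack each [start, end] pair into two adjacent columns.
rows = [
    [fold, *train, *last_window, *test, *test_with_gap, fit]
    for fold, train, last_window, test, test_with_gap, fit in folds
]
folds_df = pd.DataFrame(rows, columns=columns)

print(folds_df.columns.tolist() == columns)  # True
```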