model_selection.split_ts_cv
Time series cross-validation splitting.
Classes
| Name | Description |
|---|---|
| TimeSeriesFold | Class to split time series data into train and test folds. |
TimeSeriesFold
model_selection.split_ts_cv.TimeSeriesFold(
steps,
initial_train_size=None,
fold_stride=None,
window_size=None,
differentiation=None,
refit=False,
fixed_train_size=True,
gap=0,
skip_folds=None,
allow_incomplete_fold=True,
return_all_indexes=False,
verbose=True,
)
Class to split time series data into train and test folds.
When used within a backtesting or hyperparameter search, the arguments ‘initial_train_size’, ‘window_size’ and ‘differentiation’ are not required as they are automatically set by the backtesting or hyperparameter search functions.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| steps | int | Number of observations to predict in each fold. This is also commonly referred to as the forecast horizon or test size. | required |
| initial_train_size | int \| str \| pd.Timestamp \| None | Number of observations used for initial training. - If None or 0, the initial forecaster is not trained in the first fold. - If an integer, the number of observations used for initial training. - If a date string or pandas Timestamp, it is the last date included in the initial training set. Defaults to None. | None |
| fold_stride | int \| None | Number of observations that the start of the test set advances between consecutive folds. - If None, it defaults to the same value as steps, meaning that folds are placed back-to-back without overlap. - If fold_stride < steps, test sets overlap and multiple forecasts will be generated for the same observations. - If fold_stride > steps, gaps are left between consecutive test sets. Defaults to None. | None |
| window_size | int \| None | Number of observations needed to generate the autoregressive predictors. Defaults to None. | None |
| differentiation | int \| None | Number of observations to use for differentiation. It extends the last_window by as many observations as the differentiation order. Defaults to None. | None |
| refit | bool \| int | Whether to refit the forecaster in each fold. - If True, the forecaster is refitted in each fold. - If False, the forecaster is trained only in the first fold. - If an integer, the forecaster is trained in the first fold and then refitted every refit folds. Defaults to False. | False |
| fixed_train_size | bool | Whether the training size is fixed or increases in each fold. Defaults to True. | True |
| gap | int | Number of observations between the end of the training set and the start of the test set. Defaults to 0. | 0 |
| skip_folds | int \| list[int] \| None | Number of folds to skip. - If an integer, every skip_folds-th fold is returned. - If a list, the indexes of the folds to skip. For example, if skip_folds=3 and there are 10 folds, the returned folds are 0, 3, 6, and 9. If skip_folds=[1, 2, 3], the returned folds are 0, 4, 5, 6, 7, 8, and 9. Defaults to None. | None |
| allow_incomplete_fold | bool | Whether to allow the last fold to include fewer observations than steps. If False, the last fold is excluded if it is incomplete. Defaults to True. | True |
| return_all_indexes | bool | Whether to return all indexes or only the start and end indexes of each fold. Defaults to False. | False |
| verbose | bool | Whether to print information about generated folds. Defaults to True. | True |
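The two forms of skip_folds described in the table can be illustrated with a small stand-alone sketch. The helper `select_folds` is hypothetical and is not part of the library; it only mirrors the documented behaviour (an integer keeps every skip_folds-th fold, a list names the folds to drop).

```python
def select_folds(n_folds, skip_folds=None):
    """Return the indexes of the folds kept after applying skip_folds.

    Hypothetical helper that mirrors the documented semantics only.
    """
    all_folds = list(range(n_folds))
    if skip_folds is None:
        return all_folds
    if isinstance(skip_folds, int):
        # Keep every skip_folds-th fold, starting with fold 0.
        return all_folds[::skip_folds]
    # A list names the indexes of the folds to drop.
    to_skip = set(skip_folds)
    return [i for i in all_folds if i not in to_skip]


kept_int = select_folds(10, skip_folds=3)
kept_list = select_folds(10, skip_folds=[1, 2, 3])
print(kept_int)   # [0, 3, 6, 9]
print(kept_list)  # [0, 4, 5, 6, 7, 8, 9]
```

Both results match the example given in the skip_folds description.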
Attributes
| Name | Type | Description |
|---|---|---|
| steps | | Number of observations to predict in each fold. |
| initial_train_size | | Number of observations used for initial training. If None or 0, the initial forecaster is not trained in the first fold. |
| fold_stride | | Number of observations that the start of the test set advances between consecutive folds. |
| overlapping_folds | | Whether the folds overlap. |
| window_size | | Number of observations needed to generate the autoregressive predictors. |
| differentiation | | Number of observations to use for differentiation. It extends the last_window by as many observations as the differentiation order. |
| refit | | Whether to refit the forecaster in each fold. |
| fixed_train_size | | Whether the training size is fixed or increases in each fold. |
| gap | | Number of observations between the end of the training set and the start of the test set. |
| skip_folds | | Number of folds to skip. |
| allow_incomplete_fold | | Whether to allow the last fold to include fewer observations than steps. |
| return_all_indexes | | Whether to return all indexes or only the start and end indexes of each fold. |
| verbose | | Whether to print information about generated folds. |
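As a rough illustration of the three refit modes, the sketch below computes which folds would trigger a fit. The function `fit_schedule` is hypothetical and assumes that the forecaster is trained in the first fold (that is, initial_train_size is set); the library's actual logic may differ in edge cases.

```python
def fit_schedule(n_folds, refit):
    """Return one boolean per fold: True where the forecaster is (re)fitted.

    Hypothetical helper; assumes the first fold always fits the forecaster.
    """
    # bool must be checked first because bool is a subclass of int.
    if isinstance(refit, bool):
        # True: fit in every fold. False: fit only in the first fold.
        return [refit or i == 0 for i in range(n_folds)]
    # Integer: fit in the first fold, then again every `refit` folds.
    return [i % refit == 0 for i in range(n_folds)]


always = fit_schedule(4, refit=True)
once = fit_schedule(4, refit=False)
every_two = fit_schedule(6, refit=2)
print(always)     # [True, True, True, True]
print(once)       # [True, False, False, False]
print(every_two)  # [True, False, True, False, True, False]
```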
Examples
Basic usage with fixed train size:
>>> import pandas as pd
>>> import numpy as np
>>> from spotforecast2_safe.model_selection import TimeSeriesFold
>>> # Create sample time series data
>>> dates = pd.date_range('2020-01-01', periods=100, freq='D')
>>> y = pd.Series(np.arange(100), index=dates)
>>> # Create fold splitter
>>> cv = TimeSeriesFold(
... steps=10,
... initial_train_size=50,
... refit=True,
... fixed_train_size=True
... )
>>> # Get folds
>>> folds = cv.split(y)
>>> print(f"Number of folds: {len(folds)}")
Number of folds: 4
Overlapping folds with custom stride:
>>> cv = TimeSeriesFold(
... steps=30,
... initial_train_size=50,
... fold_stride=7,
... fixed_train_size=False
... )
>>> folds = cv.split(y)
>>> # First test fold covers [50, 80), second [57, 87), etc.
Return as pandas DataFrame:
>>> cv = TimeSeriesFold(steps=10, initial_train_size=50)
>>> folds_df = cv.split(y, as_pandas=True)
>>> print(folds_df.columns.tolist())
['fold', 'train_start', 'train_end', 'last_window_start', 'last_window_end', 'test_start', 'test_end', 'test_start_with_gap', 'test_end_with_gap', 'fit_forecaster']
Skip folds for faster evaluation:
>>> cv = TimeSeriesFold(
... steps=5,
... initial_train_size=50,
... skip_folds=2
... )
>>> folds = cv.split(y)
>>> # Returns folds 0, 2, 4, 6, ...
Note
Returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. For example, if the input series is X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], the initial_train_size = 3, window_size = 2, steps = 4, and gap = 1, the output of the first fold will be: [0, [0, 3], [1, 3], [3, 8], [4, 8], True].
The first element is the fold number. The first list [0, 3] indicates that the training set goes from the first to the third observation. The second list [1, 3] indicates that the last window seen by the forecaster during training goes from the second to the third observation. The third list [3, 8] indicates that the test set goes from the fourth to the eighth observation. The fourth list [4, 8] indicates that the test set including the gap goes from the fifth to the eighth observation. The boolean True indicates that the forecaster should be trained in this fold.
Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
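To make the positional convention concrete, the following sketch applies the fold from the example above to the same series with iloc. The fold is written out by hand rather than produced by the splitter, so only pandas is needed; the variable names are illustrative.

```python
import pandas as pd

X = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

# Fold copied from the example above (not generated by the splitter here).
fold = [0, [0, 3], [1, 3], [3, 8], [4, 8], True]
fold_number, train, last_window, test, test_with_gap, fit_forecaster = fold

# Start positions are inclusive and end positions exclusive, like Python slices.
train_values = X.iloc[train[0]:train[1]].tolist()
last_window_values = X.iloc[last_window[0]:last_window[1]].tolist()
test_values = X.iloc[test[0]:test[1]].tolist()
test_with_gap_values = X.iloc[test_with_gap[0]:test_with_gap[1]].tolist()

print(train_values)          # [10, 11, 12]
print(last_window_values)    # [11, 12]
print(test_values)           # [13, 14, 15, 16, 17]
print(test_with_gap_values)  # [14, 15, 16, 17]
```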
As an example, with initial_train_size=50, steps=30, and fold_stride=7, the first test fold will cover observations [50, 80), the second fold [57, 87), and the third fold [64, 94). This configuration produces multiple forecasts for the same observations, which is often desirable in rolling-origin evaluation.
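The intervals quoted above can be reproduced with a few lines of arithmetic. The helper `rolling_test_windows` below is a hypothetical sketch, not the library's implementation: it ignores gap and simply truncates a trailing window that runs past the end of the data.

```python
def rolling_test_windows(n_obs, initial_train_size, steps, fold_stride=None):
    """Return (start, end) test positions, end exclusive.

    Hypothetical helper: gap is assumed to be 0 and incomplete trailing
    windows are truncated at n_obs.
    """
    if fold_stride is None:
        # Default stride equals steps: back-to-back test sets, no overlap.
        fold_stride = steps
    windows = []
    start = initial_train_size
    while start < n_obs:
        windows.append((start, min(start + steps, n_obs)))
        start += fold_stride
    return windows


windows = rolling_test_windows(
    n_obs=100, initial_train_size=50, steps=30, fold_stride=7
)
print(windows[:3])  # [(50, 80), (57, 87), (64, 94)]
```

Because fold_stride is smaller than steps, consecutive windows share 23 observations, which is why several forecasts are produced for the same positions.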
Methods
| Name | Description |
|---|---|
| split | Split the time series data into train and test folds. |
split
model_selection.split_ts_cv.TimeSeriesFold.split(X, as_pandas=False)
Split the time series data into train and test folds.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| X | pd.Series \| pd.DataFrame \| pd.Index \| dict[str, pd.Series \| pd.DataFrame] | Time series data or index to split. Can be a pandas Series, DataFrame, Index, or a dictionary of Series/DataFrames. | required |
| as_pandas | bool | If True, the folds are returned as a DataFrame. This is useful to visualize the folds in a more interpretable way. Defaults to False. | False |
Returns
| Name | Type | Description |
|---|---|---|
| | list \| pd.DataFrame | A list of lists containing the indices (positions) for each fold, or a DataFrame if as_pandas=True. Each inner list contains the fold number, 4 lists, and a boolean, with the following information: - fold: fold number. - [train_start, train_end]: list with the start and end positions of the training set. - [last_window_start, last_window_end]: list with the start and end positions of the last window seen by the forecaster during training. The last window is used to generate the lags used as predictors. If differentiation is included, the interval is extended by as many observations as the differentiation order. If the argument window_size is None, this list is empty. - [test_start, test_end]: list with the start and end positions of the test set. These are the observations used to evaluate the forecaster. - [test_start_with_gap, test_end_with_gap]: list with the start and end positions of the test set including the gap. The gap is the number of observations between the end of the training set and the start of the test set. - fit_forecaster: boolean indicating whether the forecaster should be fitted in this fold. |
Note
The returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc.
If as_pandas is True, the folds are returned as a DataFrame with the following columns: ‘fold’, ‘train_start’, ‘train_end’, ‘last_window_start’, ‘last_window_end’, ‘test_start’, ‘test_end’, ‘test_start_with_gap’, ‘test_end_with_gap’, ‘fit_forecaster’.
Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
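The relationship between the list output and the DataFrame columns can be sketched by flattening a fold by hand. This is only an illustration of the layout, not how split builds its result; it assumes window_size is set, so the last-window list is not empty.

```python
import pandas as pd

columns = [
    "fold", "train_start", "train_end",
    "last_window_start", "last_window_end",
    "test_start", "test_end",
    "test_start_with_gap", "test_end_with_gap",
    "fit_forecaster",
]

# One hand-written fold in list form: fold number, four [start, end]
# pairs, and the fit flag.
folds = [
    [0, [0, 3], [1, 3], [3, 8], [4, 8], True],
]

# Flatten each fold into a single row that matches the column order.
rows = [[f[0], *f[1], *f[2], *f[3], *f[4], f[5]] for f in folds]
folds_df = pd.DataFrame(rows, columns=columns)
print(folds_df.T)
```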