Class to split time series data into train and test folds.
When used within a backtesting or hyperparameter search, the arguments ‘initial_train_size’, ‘window_size’ and ‘differentiation’ are not required as they are automatically set by the backtesting or hyperparameter search functions.
Number of observations used for initial training.
- If None or 0, the initial forecaster is not trained in the first fold.
- If an integer, the number of observations used for initial training.
- If a date string or pandas Timestamp, it is the last date included in the initial training set.
Defaults to None.
Number of observations that the start of the test set advances between consecutive folds.
- If None, it defaults to the same value as steps, meaning that folds are placed back-to-back without overlap.
- If fold_stride < steps, test sets overlap and multiple forecasts will be generated for the same observations.
- If fold_stride > steps, gaps are left between consecutive test sets.
Defaults to None.
Number of observations to use for differentiation. This is used to extend the last_window by as many observations as the differentiation order. Defaults to None.
Whether to refit the forecaster in each fold.
- If True, the forecaster is refitted in each fold.
- If False, the forecaster is trained only in the first fold.
- If an integer, the forecaster is trained in the first fold and then refitted every ‘refit’ folds.
Defaults to False.
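The refit rule can be illustrated with a small sketch. The helper `refit_schedule` is hypothetical and is not part of the library. It assumes that the forecaster is fitted in the first fold and that an integer value is positive and means "fit on folds 0, refit, 2 * refit, ...".

```python
def refit_schedule(n_folds, refit=False):
    """Return one boolean per fold: True if the forecaster is fitted in that fold.

    Hypothetical helper for illustration only. Assumes the first fold is
    always fitted and that an integer `refit` is positive.
    """
    if n_folds == 0:
        return []
    # Check booleans first, because bool is a subclass of int in Python.
    if refit is True:
        return [True] * n_folds
    if refit is False:
        return [True] + [False] * (n_folds - 1)
    # Integer case: fit on fold 0 and then on every `refit`-th fold.
    return [i % refit == 0 for i in range(n_folds)]


schedule_every_fold = refit_schedule(5, refit=True)
schedule_first_only = refit_schedule(5, refit=False)
schedule_every_two = refit_schedule(5, refit=2)
print(schedule_every_two)  # [True, False, True, False, True]
```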
Number of folds to skip.
- If an integer, every ‘skip_folds’-th fold is returned.
- If a list, the indexes of the folds to skip.
For example, if skip_folds=3 and there are 10 folds, the returned folds are 0, 3, 6, and 9. If skip_folds=[1, 2, 3], the returned folds are 0, 4, 5, 6, 7, 8, and 9. Defaults to None.
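The two forms of skip_folds can be reproduced with a short sketch. The helper `kept_folds` is hypothetical and only mirrors the behaviour described above. It assumes that an integer value is positive.

```python
def kept_folds(n_folds, skip_folds=None):
    """Return the indexes of the folds that remain after applying skip_folds.

    Hypothetical helper for illustration only, not the library's code.
    """
    all_folds = list(range(n_folds))
    if skip_folds is None:
        return all_folds
    if isinstance(skip_folds, int):
        # Keep every skip_folds-th fold, starting with fold 0.
        return all_folds[::skip_folds]
    # Otherwise treat skip_folds as a collection of fold indexes to drop.
    to_skip = set(skip_folds)
    return [i for i in all_folds if i not in to_skip]


print(kept_folds(10, 3))          # [0, 3, 6, 9]
print(kept_folds(10, [1, 2, 3]))  # [0, 4, 5, 6, 7, 8, 9]
```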
Returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. For example, if the input series is X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], the initial_train_size = 3, window_size = 2, steps = 4, and gap = 1, the output of the first fold will be: [0, [0, 3], [1, 3], [3, 8], [4, 8], True].
The first element, 0, is the fold number. The first list [0, 3] indicates that the training set goes from the first to the third observation. The second list [1, 3] indicates that the last window seen by the forecaster during training goes from the second to the third observation. The third list [3, 8] indicates that the test set goes from the fourth to the eighth observation. The fourth list [4, 8] indicates that the test set including the gap goes from the fifth to the eighth observation. The boolean True indicates that the forecaster should be fitted in this fold.
Following the python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
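As a minimal sketch of how these positions are consumed, the fold from the example above can be applied to X with ordinary Python slicing. A plain list is used here to keep the example dependency-free; the same start and end positions would be passed to iloc on a pandas object. The variable names are chosen for illustration.

```python
X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

# First fold from the example above.
fold = [0, [0, 3], [1, 3], [3, 8], [4, 8], True]
fold_number, train, last_window, test, test_with_gap, fit_forecaster = fold

# Start positions are inclusive and end positions are exclusive.
train_values = X[train[0]:train[1]]
last_window_values = X[last_window[0]:last_window[1]]
test_values = X[test[0]:test[1]]
test_with_gap_values = X[test_with_gap[0]:test_with_gap[1]]

print(train_values)          # [10, 11, 12]
print(last_window_values)    # [11, 12]
print(test_values)           # [13, 14, 15, 16, 17]
print(test_with_gap_values)  # [14, 15, 16, 17]
```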
As an example, with initial_train_size=50, steps=30, and fold_stride=7, the first test fold will cover observations [50, 80), the second fold [57, 87), and the third fold [64, 94). This configuration produces multiple forecasts for the same observations, which is often desirable in rolling-origin evaluation.
- [last_window_start, last_window_end]: list with the start and end positions of the last window seen by the forecaster during training. The last window is used to generate the lags used as predictors. If differentiation is included, the interval is extended by as many observations as the differentiation order. If the argument window_size is None, this list is empty.
- [test_start_with_gap, test_end_with_gap]: list with the start and end positions of the test set including the gap. The gap is the number of observations between the end of the training set and the start of the test set.
- fit_forecaster: boolean indicating whether the forecaster should be fitted in this fold.
Note
The returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc.
If as_pandas is True, the folds are returned as a DataFrame with the following columns: ‘fold’, ‘train_start’, ‘train_end’, ‘last_window_start’, ‘last_window_end’, ‘test_start’, ‘test_end’, ‘test_start_with_gap’, ‘test_end_with_gap’, ‘fit_forecaster’.
Following the python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
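To show how the nested list form relates to the column names listed above, here is a rough sketch that flattens one fold into a dictionary keyed by those columns. The helper `fold_to_row` is hypothetical, and it assumes that window_size is not None so that the last window list has two elements. A list of such dictionaries could then be passed to pandas.DataFrame to obtain a table.

```python
COLUMNS = [
    "fold", "train_start", "train_end",
    "last_window_start", "last_window_end",
    "test_start", "test_end",
    "test_start_with_gap", "test_end_with_gap",
    "fit_forecaster",
]


def fold_to_row(fold):
    """Flatten one fold into a dict keyed by the column names.

    Hypothetical helper for illustration only. Assumes the last window
    list is not empty (window_size is not None).
    """
    fold_number, train, last_window, test, test_with_gap, fit_forecaster = fold
    values = [fold_number, *train, *last_window, *test, *test_with_gap, fit_forecaster]
    return dict(zip(COLUMNS, values))


row = fold_to_row([0, [0, 3], [1, 3], [3, 8], [4, 8], True])
print(row["train_end"], row["test_start_with_gap"])  # 3 4
```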