splitter.split_ts_cv

Time series cross-validation splitting.

Classes

Name	Description
TimeSeriesFold	Class to split time series data into train and test folds.

TimeSeriesFold

splitter.split_ts_cv.TimeSeriesFold(
    steps,
    initial_train_size=None,
    fold_stride=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Class to split time series data into train and test folds.

When used within a backtesting or hyperparameter search, the arguments ‘initial_train_size’, ‘window_size’ and ‘differentiation’ are not required as they are automatically set by the backtesting or hyperparameter search functions.

Parameters

Name	Type	Description	Default
steps	int	Number of observations to predict in each fold. This is also commonly referred to as the forecast horizon or test size.	required
initial_train_size	int \| str \| pd.Timestamp \| None	Number of observations used for initial training. - If `None` or 0, the initial forecaster is not trained in the first fold. - If an integer, the number of observations used for initial training. - If a date string or pandas Timestamp, it is the last date included in the initial training set. Defaults to None.	`None`
fold_stride	int \| None	Number of observations that the start of the test set advances between consecutive folds. - If `None`, it defaults to the same value as `steps`, meaning that folds are placed back-to-back without overlap. - If `fold_stride < steps`, test sets overlap and multiple forecasts will be generated for the same observations. - If `fold_stride > steps`, gaps are left between consecutive test sets. Defaults to None.	`None`
window_size	int \| None	Number of observations needed to generate the autoregressive predictors. Defaults to None.	`None`
differentiation	int \| None	Number of observations to use for differentiation. It extends the `last_window` by as many observations as the differentiation order. Defaults to None.	`None`
refit	bool \| int	Whether to refit the forecaster in each fold. - If `True`, the forecaster is refitted in each fold. - If `False`, the forecaster is trained only in the first fold. - If an integer, the forecaster is trained in the first fold and then refitted every `refit` folds. Defaults to False.	`False`
fixed_train_size	bool	Whether the training size is fixed or increases in each fold. Defaults to True.	`True`
gap	int	Number of observations between the end of the training set and the start of the test set. Defaults to 0.	`0`
skip_folds	int \| list[int] \| None	Folds to skip. - If an integer, only every `skip_folds`-th fold is returned. - If a list, the indexes of the folds to skip. For example, if `skip_folds=3` and there are 10 folds, the returned folds are 0, 3, 6, and 9. If `skip_folds=[1, 2, 3]`, the returned folds are 0, 4, 5, 6, 7, 8, and 9. Defaults to None.	`None`
allow_incomplete_fold	bool	Whether to allow the last fold to include fewer observations than `steps`. If `False`, the last fold is excluded if it is incomplete. Defaults to True.	`True`
return_all_indexes	bool	Whether to return all indexes or only the start and end indexes of each fold. Defaults to False.	`False`
verbose	bool	Whether to print information about generated folds. Defaults to True.	`True`
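The selection rules described for `skip_folds` and `refit` can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation: `select_folds` and `fit_flags` are hypothetical helper names, and the sketch assumes `initial_train_size` is set, so the first fold is always fitted.

```python
def select_folds(n_folds, skip_folds=None):
    """Hypothetical helper: indexes of the folds that are returned."""
    all_folds = list(range(n_folds))
    if skip_folds is None:
        return all_folds
    if isinstance(skip_folds, int):
        # Integer: keep every `skip_folds`-th fold, starting at fold 0.
        return all_folds[::skip_folds]
    # List: drop exactly the listed fold indexes.
    skipped = set(skip_folds)
    return [i for i in all_folds if i not in skipped]


def fit_flags(n_folds, refit):
    """Hypothetical helper: whether the forecaster is fitted in each fold."""
    if isinstance(refit, bool):  # check bool first: bool is a subclass of int
        flags = [refit] * n_folds
    else:
        # Integer: refit every `refit` folds.
        flags = [i % refit == 0 for i in range(n_folds)]
    if flags:
        flags[0] = True  # assumption: the first fold is always trained
    return flags


print(select_folds(10, 3))          # [0, 3, 6, 9]
print(select_folds(10, [1, 2, 3]))  # [0, 4, 5, 6, 7, 8, 9]
print(fit_flags(6, 2))              # [True, False, True, False, True, False]
```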

Attributes

Name	Type	Description
steps		Number of observations to predict in each fold.
initial_train_size		Number of observations used for initial training. If `None` or 0, the initial forecaster is not trained in the first fold.
fold_stride		Number of observations that the start of the test set advances between consecutive folds.
overlapping_folds		Whether the folds overlap.
window_size		Number of observations needed to generate the autoregressive predictors.
differentiation		Number of observations to use for differentiation. It extends the `last_window` by as many observations as the differentiation order.
refit		Whether to refit the forecaster in each fold.
fixed_train_size		Whether the training size is fixed or increases in each fold.
gap		Number of observations between the end of the training set and the start of the test set.
skip_folds		Number of folds to skip.
allow_incomplete_fold		Whether to allow the last fold to include fewer observations than `steps`.
return_all_indexes		Whether to return all indexes or only the start and end indexes of each fold.
verbose		Whether to print information about generated folds.

Examples

import warnings

import numpy as np
import pandas as pd

from spotforecast2_safe.exceptions import IgnoredArgumentWarning
from spotforecast2_safe.splitter import TimeSeriesFold

dates = pd.date_range("2020-01-01", periods=100, freq="D")
y = pd.Series(np.arange(100), index=dates)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", IgnoredArgumentWarning)
    cv = TimeSeriesFold(
        steps=10,
        initial_train_size=50,
        refit=True,
        fixed_train_size=True,
        verbose=False,
    )
    folds = cv.split(y)
print(f"Number of folds: {len(folds)}")
assert len(folds) == 5

Number of folds: 5

import warnings

import numpy as np
import pandas as pd

from spotforecast2_safe.exceptions import IgnoredArgumentWarning
from spotforecast2_safe.splitter import TimeSeriesFold

dates = pd.date_range("2020-01-01", periods=100, freq="D")
y = pd.Series(np.arange(100), index=dates)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", IgnoredArgumentWarning)
    cv = TimeSeriesFold(
        steps=30,
        initial_train_size=50,
        fold_stride=7,
        fixed_train_size=False,
        verbose=False,
    )
    folds = cv.split(y)
# First test fold covers [50, 80), second [57, 87), etc.
assert len(folds) == 8

import warnings

import numpy as np
import pandas as pd

from spotforecast2_safe.exceptions import IgnoredArgumentWarning
from spotforecast2_safe.splitter import TimeSeriesFold

dates = pd.date_range("2020-01-01", periods=100, freq="D")
y = pd.Series(np.arange(100), index=dates)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", IgnoredArgumentWarning)
    cv = TimeSeriesFold(steps=10, initial_train_size=50, verbose=False)
    folds_df = cv.split(y, as_pandas=True)
print(folds_df.columns.tolist())
expected_cols = [
    "fold", "train_start", "train_end",
    "last_window_start", "last_window_end",
    "test_start", "test_end",
    "test_start_with_gap", "test_end_with_gap",
    "fit_forecaster",
]
assert folds_df.columns.tolist() == expected_cols

['fold', 'train_start', 'train_end', 'last_window_start', 'last_window_end', 'test_start', 'test_end', 'test_start_with_gap', 'test_end_with_gap', 'fit_forecaster']

import warnings

import numpy as np
import pandas as pd

from spotforecast2_safe.exceptions import IgnoredArgumentWarning
from spotforecast2_safe.splitter import TimeSeriesFold

dates = pd.date_range("2020-01-01", periods=100, freq="D")
y = pd.Series(np.arange(100), index=dates)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", IgnoredArgumentWarning)
    cv = TimeSeriesFold(
        steps=5,
        initial_train_size=50,
        skip_folds=2,
        verbose=False,
    )
    folds = cv.split(y)
# Returns every second fold: 0, 2, 4, ...
assert len(folds) == 5

Note

Returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. For example, if the input series is X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], the initial_train_size = 3, window_size = 2, steps = 4, and gap = 1, the output of the first fold will be: [0, [0, 3], [1, 3], [3, 8], [4, 8], True].

The first element is the fold number. The first list [0, 3] indicates that the training set goes from the first to the third observation. The second list [1, 3] indicates that the last window seen by the forecaster during training goes from the second to the third observation. The third list [3, 8] indicates that the test set goes from the fourth to the eighth observation. The fourth list [4, 8] indicates that the test set including the gap goes from the fifth to the eighth observation. The boolean True indicates that the forecaster should be trained in this fold.

Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
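As a small illustration of the half-open convention, the fold from the example above can be used to slice the data directly. The sketch uses a plain Python list so that it runs without dependencies; the variable names are illustrative only.

```python
# Series and first fold taken from the example in this note.
X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
fold = [0, [0, 3], [1, 3], [3, 8], [4, 8], True]

fold_idx, train, last_window, test, test_with_gap, fit_forecaster = fold

# Each pair is [start, end) and can be used as slice bounds as-is.
train_values = X[train[0]:train[1]]
last_window_values = X[last_window[0]:last_window[1]]
test_values = X[test[0]:test[1]]
test_with_gap_values = X[test_with_gap[0]:test_with_gap[1]]

print(train_values)          # [10, 11, 12]
print(last_window_values)    # [11, 12]
print(test_values)           # [13, 14, 15, 16, 17]
print(test_with_gap_values)  # [14, 15, 16, 17]
```

With a pandas Series or DataFrame, the same bounds are passed to `.iloc`, for example `y.iloc[train[0]:train[1]]`.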

As an example, with initial_train_size=50, steps=30, and fold_stride=7, the first test fold will cover observations [50, 80), the second fold [57, 87), and the third fold [64, 94). This configuration produces multiple forecasts for the same observations, which is often desirable in rolling-origin evaluation.

Methods

Name	Description
n_folds	Return the number of folds this splitter produces for `X`.
set_params	Set the parameters of the Fold object.
split	Split the time series data into train and test folds.

n_folds

splitter.split_ts_cv.TimeSeriesFold.n_folds(X)

Return the number of folds this splitter produces for X.

Convenience wrapper around len(self.split(X)). Use it to report the true fold count, which is distinct from len(metrics_df) returned by backtesting_forecaster (that frame holds a single aggregated row).

Parameters

Name	Type	Description	Default
X	pd.Series \| pd.DataFrame \| pd.Index \| dict	The series, frame, index, or dict of series the backtest runs on.	required

Returns

Name	Type	Description
	int	The number of folds.

Examples

import pandas as pd
from spotforecast2_safe.splitter import TimeSeriesFold

y = pd.Series(
    range(120), index=pd.date_range("2025-01-01", periods=120, freq="h")
)
cv = TimeSeriesFold(steps=24, initial_train_size=48, verbose=False)
print(cv.n_folds(y))

╭─────────────────────────────── IgnoredArgumentWarning ───────────────────────────────╮
│ Last window cannot be calculated because `window_size` is None.                      │
│                                                                                      │
│ Category : spotforecast2.exceptions.IgnoredArgumentWarning                           │
│ Location :                                                                           │
│ /home/runner/work/spotforecast2-safe/spotforecast2-safe/src/spotforecast2_safe/split │
│ ter/split_ts_cv.py:496                                                               │
│ Suppress : warnings.simplefilter('ignore', category=IgnoredArgumentWarning)          │
╰──────────────────────────────────────────────────────────────────────────────────────╯

set_params

splitter.split_ts_cv.TimeSeriesFold.set_params(params)

Set the parameters of the Fold object. Before overwriting the current parameters, the input parameters are validated to ensure correctness.

Parameters

Name	Type	Description	Default
params	dict	Dictionary with the parameters to set.	required

Examples

from spotforecast2_safe.splitter import TimeSeriesFold

cv = TimeSeriesFold(steps=1)
cv.set_params({
    "steps": 2,
    "initial_train_size": 10,
    "fold_stride": 2,
    "window_size": 5,
    "differentiation": 1,
    "refit": True,
    "fixed_train_size": False,
    "gap": 1,
    "skip_folds": 2,
    "allow_incomplete_fold": False,
    "return_all_indexes": True,
    "verbose": False,
})
assert cv.initial_train_size == 10
assert cv.window_size == 5
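The validate-before-overwrite behaviour described above can be pictured with a minimal stand-in class. Everything here is hypothetical (`FoldSketch`, its two parameters, and its validation rules are not the library's); the sketch only shows why validating the merged parameters first leaves the object unchanged when an invalid value is passed.

```python
class FoldSketch:
    """Hypothetical stand-in used only to illustrate the pattern."""

    def __init__(self, steps, gap=0):
        self.steps = steps
        self.gap = gap

    @staticmethod
    def _validate(params):
        # Illustrative rules only; the real checks may differ.
        if not isinstance(params["steps"], int) or params["steps"] < 1:
            raise ValueError("`steps` must be an integer >= 1.")
        if not isinstance(params["gap"], int) or params["gap"] < 0:
            raise ValueError("`gap` must be an integer >= 0.")

    def set_params(self, params):
        # Merge current and new values, validate, and only then assign.
        candidate = {"steps": self.steps, "gap": self.gap, **params}
        self._validate(candidate)
        self.steps = candidate["steps"]
        self.gap = candidate["gap"]


cv = FoldSketch(steps=1)
cv.set_params({"steps": 2, "gap": 1})

try:
    cv.set_params({"steps": 5, "gap": -1})  # invalid gap: nothing is changed
except ValueError as err:
    print(f"rejected: {err}")

print(cv.steps, cv.gap)  # 2 1
```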

split

splitter.split_ts_cv.TimeSeriesFold.split(X, as_pandas=False)

Split the time series data into train and test folds.

Parameters

Name	Type	Description	Default
X	pd.Series \| pd.DataFrame \| pd.Index \| dict[str, pd.Series \| pd.DataFrame]	Time series data or index to split. Can be a pandas Series, DataFrame, Index, or a dictionary of Series/DataFrames.	required
as_pandas	bool	If True, the folds are returned as a DataFrame. This is useful to visualize the folds in a more interpretable way. Defaults to False.	`False`

Returns

Name	Type	Description
	list \| pd.DataFrame	A list of lists containing the positions (not index values) for each fold, or a DataFrame if `as_pandas=True`. Each inner list contains the fold number, four lists, and a boolean: - fold: fold number. - [train_start, train_end]: start and end positions of the training set. - [last_window_start, last_window_end]: start and end positions of the last window seen by the forecaster during training. The last window is used to generate the lags used as predictors. If `differentiation` is included, the interval is extended by as many observations as the differentiation order. If the argument `window_size` is `None`, this list is empty. - [test_start, test_end]: start and end positions of the test set. These are the observations used to evaluate the forecaster. - [test_start_with_gap, test_end_with_gap]: start and end positions of the test set including the gap. The gap is the number of observations between the end of the training set and the start of the test set. - fit_forecaster: boolean indicating whether the forecaster should be fitted in this fold.

Note

The returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc.

If as_pandas is True, the folds are returned as a DataFrame with the following columns: ‘fold’, ‘train_start’, ‘train_end’, ‘last_window_start’, ‘last_window_end’, ‘test_start’, ‘test_end’, ‘test_start_with_gap’, ‘test_end_with_gap’, ‘fit_forecaster’.

Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
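To make the relationship between the two return formats concrete, the sketch below flattens one list-style fold into a mapping keyed by the DataFrame column names listed above. `fold_to_row` is a hypothetical helper, not part of the library, and representing an empty last window as `None` values is an assumption of this sketch.

```python
COLUMNS = [
    "fold", "train_start", "train_end",
    "last_window_start", "last_window_end",
    "test_start", "test_end",
    "test_start_with_gap", "test_end_with_gap",
    "fit_forecaster",
]


def fold_to_row(fold):
    """Hypothetical helper: flatten one fold descriptor into a dict."""
    fold_idx, train, last_window, test, test_with_gap, fit = fold
    # Assumption: an empty last window (window_size=None) becomes None values.
    last_window = last_window if last_window else [None, None]
    values = [fold_idx, *train, *last_window, *test, *test_with_gap, fit]
    return dict(zip(COLUMNS, values))


row = fold_to_row([0, [0, 3], [1, 3], [3, 8], [4, 8], True])
print(row["train_start"], row["train_end"])  # 0 3
print(row["test_start_with_gap"])            # 4
```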

Examples

import pandas as pd
from spotforecast2_safe.splitter import TimeSeriesFold

y = pd.Series(
    range(60), index=pd.date_range("2024-01-01", periods=60, freq="h")
)
cv = TimeSeriesFold(steps=5, initial_train_size=40, verbose=False)

# Default: returns a list of fold descriptors
folds = cv.split(y)
print(f"Number of folds: {len(folds)}")
# Each fold is [fold_idx, [train_start, train_end],
#               [last_window_start, last_window_end],
#               [test_start, test_end],
#               [test_start_with_gap, test_end_with_gap],
#               fit_forecaster]
first = folds[0]
assert first[0] == 0           # fold index
assert first[1] == [0, 40]     # training slice
assert first[-1] is True       # first fold always fits
assert len(folds) == 4

Number of folds: 4

import pandas as pd
from spotforecast2_safe.splitter import TimeSeriesFold

y = pd.Series(
    range(60), index=pd.date_range("2024-01-01", periods=60, freq="h")
)
cv = TimeSeriesFold(steps=5, initial_train_size=40, verbose=False)

# as_pandas=True returns a DataFrame with named columns
df = cv.split(y, as_pandas=True)
print(df[["fold", "train_start", "train_end", "test_start", "test_end"]])
assert list(df.columns[:3]) == ["fold", "train_start", "train_end"]
assert df.shape[0] == 4

   fold  train_start  train_end  test_start  test_end
0     0            0         40          40        45
1     1            0         40          45        50
2     2            0         40          50        55
3     3            0         40          55        60