n2n_predict_with_covariates: A Beginner’s Walkthrough
A step-by-step explanation of the N-to-N covariate forecasting pipeline for readers new to time-series analysis.
What this pipeline does, in one paragraph
The function n2n_predict_with_covariates takes a table of historical measurements (one or more numeric columns indexed by time), augments it with side information that is known both for the past and for the near future (calendar effects, daylight, weather forecasts, public holidays), and then trains one machine-learning forecaster per target column. Each forecaster is asked to predict the next forecast_horizon time steps. The function returns three things: a DataFrame of predictions, a dictionary of metadata that records how the run was configured, and the fitted forecaster objects themselves (so a caller can re-use them later without retraining). The function is deliberately one long pipeline rather than many small calls, because every stage depends tightly on the previous one — and a single entry point makes the whole pipeline easy to embed in a daily batch job.
Vocabulary
The walkthrough below uses a handful of technical terms. Their definitions are kept brief here so that the rest of the page can stay focused on the mechanics of the pipeline. Each definition is cross-referenceable — later sections link back with @def-… where it helps the reader.
Definition 1 (Time series) A sequence of numeric measurements indexed by timestamps at a regular interval (here: one row per hour).
Definition 2 (Target variable) The quantity we want to predict — in this pipeline, any column of the input DataFrame.
Definition 3 (Exogenous variable (covariate)) Any other feature known at the same timestamps as the target and available for the future window — for example, the hour of the day or a weather forecast.
Definition 4 (Lag) A past value of the same series, e.g. lag 1 = “the value one hour ago”, lag 24 = “the value at the same hour yesterday”.
Definition 5 (Forecast horizon) How many time steps ahead the model has to predict (e.g. forecast_horizon=24 means “predict the next 24 hours”).
Definition 6 (Recursive multi-step forecaster) A model that predicts one step ahead, then feeds its own prediction back in as the most recent lag, and repeats. We use it because we want long horizons (24+ steps) without training a separate model per step.
Definition 7 (Train / validation / test split) The historical data is sliced into three temporally ordered chunks: training (used to fit the model), validation (used to compare configurations), and test (held back as unseen data for the final check).
Definition 8 (Outlier) A value that is much larger or smaller than the rest of the series — often a sensor glitch or a one-off event we do not want the model to learn from.
Definition 9 (Imputation) Filling in missing entries so the model sees a complete series.
Definition 10 (Sample weight) A non-negative number attached to each training row that tells the model how much that row should contribute to the loss. Weight 0 effectively ignores the row.
Definition 11 (Cyclical encoding) Replacing a periodic integer like “hour of day” by its sine and cosine on a unit circle, so the model sees that hour 23 and hour 0 are neighbors.
Definition 12 (Persistence (serialization)) Saving a trained model to disk so the next run can load it instead of retraining.
Input validation
Before any computation begins, n2n_predict_with_covariates checks that the five numerical parameters make sense:
Parameter
Allowed range
Error if violated
forecast_horizon
> 0
ValueError: forecast_horizon must be positive, got …
contamination
0 ≤ x ≤ 0.5
ValueError: contamination must be between 0 and 0.5, got …
window_size
> 0
ValueError: window_size must be positive, got …
lags
> 0
ValueError: lags must be positive, got …
train_ratio
0 < x < 1
ValueError: train_ratio must be between 0 and 1, got …
The pipeline follows a fail-fast philosophy: rather than silently coercing a bad value (and producing wrong predictions hours later), it stops at the front door with an informative message. The most consequential check is the one on forecast_horizon: a non-positive value would cause downstream array-slicing to return empty results, which the model would happily train on without complaint.
Stage 1 — Loading and preparing the target series
Why this step exists. A forecasting model can only learn from a clean, regularly spaced time series with a known time zone. The first stage’s job is to produce exactly that, regardless of whether the caller passed a DataFrame or relied on the default demo data set.
When the caller supplies a data argument, the series is forwarded to fetch_data(dataframe=data, timezone=timezone). When no data is supplied, fetch_data reads the bundled demonstration file demo10.csv (located via get_package_data_home()). Either way, the result is a DataFrame with a timezone-aware DatetimeIndex and numeric columns for each target. The column names are immediately stored in target_columns, because they will determine the names of the saved model files and the columns of the prediction DataFrame.
Next, get_start_end(data, forecast_horizon) returns four boundary timestamps:
start, end — the historical window used for training,
cov_start, cov_end — the same window extended forward by forecast_horizon steps, so that covariates can be aligned with the prediction period.
basic_ts_checks(data) then verifies that the index is a DatetimeIndex, strictly monotonically increasing, and free of gaps. Finally, agg_and_resample_data(data) enforces an hourly grid: sub-hourly data is aggregated by mean over each hour bin, and the result is a uniformly spaced table that the lag-based forecaster can consume safely.
Why a strict, gap-free hourly index matters
A recursive forecaster predicts lag 1, lag 2, lag 24, etc. by positional lookup. If the index skips an hour, “lag 24” no longer means “24 hours ago” — it means “24 rows ago”, which might be 25 wall-clock hours. The fail-fast checks here prevent that silent error.
Stage 2 — Outlier detection and removal
Why this step exists. Real-world sensors occasionally produce values that are not signal but noise (a spike from a calibration error, a dip from a brief power loss). Letting the model train on these outliers (see Definition 8) contaminates its idea of “normal” behaviour. We replace them with NaN so the next stage can treat them the same way it treats genuine gaps.
mark_outliers(data, contamination=contamination, random_state=1234) applies an Isolation Forest: the algorithm builds random partitioning trees, and points that are isolated in only a few splits are flagged as anomalous. The contamination parameter is the expected fraction of anomalies per column — 0.01 means “about 1 % of rows should be marked”. The function returns a modified DataFrame (outlier positions set to NaN) and an outlier-label array (Isolation Forest’s convention: -1 for outliers, +1 for inliers). The label array is kept aside so the metadata dictionary at the end of the pipeline can report the number of detected outliers.
The fixed random_state=1234 ensures that two runs on identical input produce identical outlier flags — a core requirement of the reproducible artefact design.
Time-series outliers come in flavours: a one-off spike (instrumentation error), a short burst of “wrong” values (a sensor reboot), or a sustained shift (a real but rare event like a heat wave). The Isolation Forest defaults used here are tuned to catch the first kind. Treat the contamination parameter as a guess at how rare such spikes are — if the data is famously clean, set it low; if it is famously messy, set it higher.
Stage 3 — Imputation and sample weighting
Why this step exists. After stage 2 the series has NaNs wherever it originally had missing rows or where stage 2 just removed an outlier. Lag-based models cannot accept NaN in the input. We fill the gaps via imputation (see Definition 9), but we also remember which rows were imputed so the loss function can pay less attention to them by way of sample weights (see Definition 10).
get_missing_weights(data, window_size=window_size) does two things:
It fills the gaps using both forward-fill and backward-fill so no NaN survives.
It builds a weight series: a binary contamination mask (1 where the value was imputed, 0 otherwise) is propagated forward by window_size positions with a rolling maximum. Any training row whose lag window touches an imputed value receives weight 0. All other rows receive weight 1. The downstream forecaster will then see the filled-in values but down-weight their gradient contribution to zero.
The resulting weights_series is wrapped in a WeightFunction object, not passed as a plain array. This indirection solves three problems at once:
The training matrix produced by ForecasterRecursive._create_train_X_y is shorter than the original series by max(lags) rows. A positional array would misalign.
A callable that receives X_train.index does a label-based lookup into the Series and is immune to that offset regardless of lags.
WeightFunction is a plain class, so it is picklable and gets saved to disk together with the forecaster (a closure-based weight function would not be).
We could delete every row near a gap, but that breaks the “regular hourly grid” invariant the recursive forecaster needs. By keeping the rows but giving them weight 0, the time axis stays uniform and the loss function still ignores the bad data.
Stage 4 — Exogenous feature engineering
Why this step exists. A target series alone often does not contain enough signal to forecast accurately — but combining it with calendar and weather context typically does. Stage 4 produces four feature DataFrames, each indexed on the extended timeline [start, cov_end] so that the feature matrix is defined both over the training window and over the prediction window.
The four feature categories, in the order they will be concatenated in stage 5, are:
Calendar features — get_calendar_features(start, cov_end, freq="h", timezone=timezone) uses feature_engine.DatetimeFeatures to extract month, week, day_of_week, and hour from the index alone. Because these are derived from the index, they are always complete — no imputation needed.
Day/night features — get_day_night_features(start, cov_end, location, ...) computes sunrise and sunset times for the configured geographic location using the astral library. Sunrise and sunset hours are rounded to the nearest hour, and an is_daylight flag equals 1 during the hours between sunrise and sunset. Per-day sunrise/sunset values are cached so multi-year series do not recompute the solar position for every hourly row.
Weather features — get_weather_features(...) fetches weather data for the configured latitude/longitude and applies a sliding window transform that computes rolling mean, max, and min over one-day and seven-day windows for each numeric weather column. The helper itself resolves missing values (backward fill + curate step) before returning, so callers can treat the result as gap-free.
Holiday features — get_holiday_features(...) calls the holidays library for the configured country_code and state, validates the returned dates with an internal curate step, and reindexes the binary flag to the hourly grid with fill_value=0. The model sees a clean 0 / 1 column it can split on directly.
Example 3 (Daylight flag for a June day)
import pandas as pdimport numpy as npdates = pd.date_range("2020-06-01", periods=24, freq="h", tz="UTC")sunrise_hour =4# approximate June sunrise at 51°Nsunset_hour =21# approximate June sunset at 51°Nis_daylight = np.where( (dates.hour >= sunrise_hour) & (dates.hour < sunset_hour), 1, 0)df_sun = pd.DataFrame({"hour": dates.hour, "is_daylight": is_daylight}, index=dates)print(df_sun.groupby("is_daylight")["hour"].apply(list).to_string())
Stage 5 — Combining and encoding exogenous features
Why this step exists. The four DataFrames from stage 4 need to be joined into one matrix and then transformed so the model can learn from the periodic structure of calendar variables.
The pipeline concatenates the four feature DataFrames column-wise in this exact order:
It then runs a single hard check: if any NaN survived into exogenous_features, a ValueError is raised reporting the count of missing entries. This guard is the catch-all for any silent gap left by a feature helper.
Two transformations follow:
Cyclical encoding — apply_cyclical_encoding(data, drop_original=False) replaces each periodic integer column (month, week, day_of_week, hour, sunrise_hour, sunset_hour) by its sine and cosine on a unit circle scaled to the natural period of the column. Because drop_original=False, the original integer columns are kept alongside the new sine/cosine columns — the model can use either representation depending on which is more informative.
Interaction terms — create_interaction_features(exogenous_features, weather_aligned) uses sklearn.preprocessing.PolynomialFeatures with interaction_only=True to produce pairwise products of the calendar cyclical columns with weather-window columns, raw weather columns, and (if present) the holiday column. These columns are prefixed with poly_ and capture joint effects such as “cold morning and weekday peak” that neither feature can describe on its own.
Why sine and cosine instead of just the hour number?
If you feed an integer hour 0–23 directly to a tree-based model, hour 23 and hour 0 look very far apart numerically — even though they are neighbours on a clock. Sine and cosine wrap the integer onto a unit circle, so adjacent hours stay adjacent in the encoded space and the model does not learn a spurious discontinuity at midnight.
Stage 6 — Feature selection
Why this step exists. The combined matrix from stage 5 contains many columns the caller may not want. Stage 6 builds the final list of column names that will be passed as exog to the forecaster.
select_exogenous_features(exogenous_features, weather_aligned, include_weather_windows, include_holiday_features, include_poly_features) applies these rules:
Always included: the cyclical sine/cosine columns (matched by the regular expression _sin$|_cos$) and the raw weather columns (which are exactly the columns of weather_aligned).
Included only if include_weather_windows=True: the rolling-window weather features.
Included only if include_holiday_features=True: the binary is_holiday column.
Included only if include_poly_features=True: the poly_-prefixed interaction features.
The final list is deduplicated with the order-preserving idiom list(dict.fromkeys(exog_features)) so that no column name appears twice (a duplicate column would crash the forecaster’s lag-matrix construction).
Stage 7 — Merging target and exogenous data
Why this step exists. The forecaster expects one DataFrame with the target column and a separate DataFrame of exogenous features, both indexed on the same training window — and a separate exogenous slice covering the future prediction window.
data_with_exog — an inner join of the target columns and the selected exogenous columns over [start, end]. The inner join is the belt-and-braces guarantee that only rows where both sides are defined survive; any residual misalignment is dropped here.
exo_tmp — the slice of the exogenous feature matrix that covers [start, end]. Useful for inspection.
exo_pred — the slice that covers (end, cov_end], the future window used at prediction time.
Every column of data_with_exog is cast to float32 to halve memory consumption during lag-matrix construction without measurable loss in predictive accuracy for this class of model.
Stage 8 — Train / validation / test split
Why this step exists. The model must be evaluated on data it has never seen. Random shuffling is dangerous in time series — it lets the model “learn from the future”. The split here is purely temporal — see Definition 7.
split_rel_train_val_test(data_with_exog, perc_train=train_ratio, perc_val=1.0 - train_ratio) slices the rows in order: the first train_ratio fraction goes to training, the next half of the remainder to validation, the rest to test. With the default train_ratio=0.8, that is roughly 80 % training and 10 % each for validation and test.
The boundary timestamp end_validation — the last index of the combined train + validation segment — is the cutoff used for forecaster.fit(...), so the test set stays genuinely unseen during training.
Total rows : 100
Train rows : 80 (80%)
Val rows : 9
Test rows : 11
Train ends : 2020-01-04 07:00:00
Why temporal, not random?
If you shuffled rows before splitting, the model might be trained on hour 14 of 1 March 2024 and tested on hour 13 of the same day — i.e. it would know the “answer” sits one step before the “question”. A temporal split forces the model to forecast a real future from a real past.
Stage 9 — Training or loading recursive forecasters
Why this step exists. Retraining a forecaster takes time. For repeated calls with the same configuration (a typical operational pattern), we want to load previously trained models from disk instead.
The decision is governed by two parameters:
force_train (default True): if True, all forecasters are retrained from scratch and any cached models are overwritten. Set force_train=False to skip training when cached models exist.
model_dir: directory where the joblib .pkl files live. If None (the default), the pipeline uses get_cache_home() / "forecasters" — i.e. ~/spotforecast2_cache/forecasters unless overridden by the SPOTFORECAST2_CACHE environment variable.
When force_train=Falseand the model directory exists, load_forecasters(target_columns, model_dir) attempts to deserialise one ForecasterRecursive object per target column. Targets whose model file is missing are collected into targets_to_train and trained below; the others are reused.
When training is required, a ForecasterRecursive is constructed for each target in targets_to_train (see Definition 6 for the underlying idea):
forecaster = ForecasterRecursive( estimator=estimator, # default: LGBMRegressor(random_state=1234, verbose=-1) lags=lags, # how many past values to use as features window_features=RollingFeatures( stats=["mean"], window_sizes=window_size, ), # adds a rolling-mean feature alongside the lags weight_func=weight_func, # the WeightFunction from stage 3)forecaster.fit( y=data_with_exog[target].loc[:end_validation].squeeze(), exog=data_with_exog[exog_features].loc[:end_validation],)
Note that the fit uses :end_validation, not the end of the full dataset — the test portion is held out so a downstream evaluation step can score the model on truly unseen data.
After training, save_forecasters(forecasters, model_dir) serialises each fitted object with joblib, one file per target column. Because WeightFunction is a regular class (not a closure), it survives the pickle/unpickle round trip together with the rest of the model.
A show_progress=True flag wraps the per-target loop in a tqdm progress bar — useful for runs with many target columns.
What a recursive forecaster does, one step at a time
At step 1, the forecaster sees the most recent lags values of the target plus the exogenous features for step 1 and predicts the value for step 1. At step 2, it shifts the lag window by one — using the just-predicted step-1 value as the most recent lag — and predicts step 2. It keeps doing this for forecast_horizon steps. That is why the exogenous matrix for the future window must be fully known before prediction starts: the forecaster cannot ask “what will the temperature be at step 5?” mid-way through.
Stage 10 — Prediction
Why this step exists. This is the payoff: turn the trained forecasters and the future exogenous matrix into actual forecasts.
predict_multivariate(recursive_forecasters, steps_ahead=forecast_horizon,exog=exo_pred[exog_features],show_progress=show_progress) iterates over the trained forecasters dictionary and calls .predict(steps=horizon,exog=...) on each. The exogenous matrix passed to every forecaster is the same slice exo_pred[exog_features] covering exactly forecast_horizon steps after end_validation. The result is assembled into a predictions DataFrame with one column per target and forecast_horizon rows.
Predictions shape : (24, 11)
First timestamp : 2023-01-02 00:00:00
Last timestamp : 2023-01-02 23:00:00
col_1
col_2
col_3
col_4
col_5
col_6
col_7
col_8
col_9
col_10
col_11
2023-01-02 00:00:00
0.3047
-1.0400
0.7505
0.9406
-1.9510
-1.3022
0.1278
-0.3162
-0.0168
-0.8530
0.8794
2023-01-02 01:00:00
0.7778
0.0660
1.1272
0.4675
-0.8593
0.3688
-0.9589
0.8785
-0.0499
-0.1849
-0.6809
2023-01-02 02:00:00
1.2225
-0.1545
-0.4283
-0.3521
0.5323
0.3654
0.4127
0.4308
2.1416
-0.4064
-0.5122
Metadata and return values
The function returns a three-element tuple. The first element is the predictions DataFrame described above. The second element is a metadata dictionary that records every parameter and intermediate shape of the run:
forecast_horizon
target_columns
exog_features
n_exog_features
train_size, val_size, test_size
data_shape_original, data_shape_merged
training_end
prediction_start, prediction_end
lags, window_size, contamination
n_outliers (computed as outliers.sum() for the Series case, or len(outliers) otherwise)
This dictionary is a self-contained audit record — it lets a future reader reconstruct the run’s configuration without rerunning the pipeline.
The third element is the recursive_forecasters dictionary keyed by target column name. Returning the fitted objects lets the caller inspect internal state (feature names, lag matrices, fitted estimator parameters) or call .predict again with a different horizon, without retraining.
End-to-end example
The cell below runs the entire pipeline with very small parameters (forecast_horizon=2, lags=4, window_size=8), using tempfile.mkdtemp() as the model directory so the run does not pollute the user’s persistent cache.
The same call with force_train=False and the same model_dir would load the just-trained forecasters from disk instead of retraining — the core caching mechanism behind the pipeline’s speedup on repeated runs.
The complete execution flow
A single call to n2n_predict_with_covariates follows this invariant sequence:
Input validation runs first.
The target data is loaded and cleaned (Stage 1).
Outlier positions are replaced with NaN (Stage 2).
Missing values are imputed and sample weights are computed (Stage 3).
Four categories of exogenous features are constructed over the full extended time window (Stage 4).
The features are concatenated, validated for completeness, cyclically encoded, and augmented with interaction terms (Stage 5).
A feature-selection step reduces the matrix to the columns relevant to the configured pipeline variant (Stage 6).
The selected features are merged with the target data over the historical window (Stage 7).
The merged data is split temporally into train, validation, and test segments (Stage 8).
Forecasters are either loaded from disk or trained from scratch and then persisted (Stage 9).
Predictions are generated for the forecast_horizon steps beyond the training end (Stage 10).
Every public parameter is validated at entry, every exogenous feature matrix is checked for completeness before it enters the model, and every fitted object is serialised in a form that survives process boundaries. These invariants make the pipeline safe to embed in automated batch jobs where a silent failure would not be discovered until long after the prediction window has closed.