manager.features.select_top_poly_features

manager.features.select_top_poly_features(
    poly_features,
    y,
    max_poly_features=10,
    random_state=123,
    n_jobs=-1,
    mi_sample_size=4000,
)

Rank polynomial interaction columns by mutual information and keep the top K.

Polynomial expansion (create_interaction_features with degree >= 2) can emit hundreds or thousands of poly_* columns. This helper caps that set: it scores each candidate column by its mutual information with the target and returns the names of the max_poly_features highest-scoring columns. Mutual information is estimated with mutual_info_regression, seeded by random_state so the selection is reproducible.
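The ranking step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the helper name rank_by_mutual_info is hypothetical, and the inputs are assumed to be already aligned and free of missing values.

```python
from typing import List

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def rank_by_mutual_info(
    X: pd.DataFrame, y: pd.Series, k: int, random_state: int = 123
) -> List[str]:
    """Hypothetical helper: names of the k columns with the highest MI with y."""
    # No capping needed: return every column in its original order.
    if k <= 0 or X.shape[1] <= k:
        return list(X.columns)
    # One MI score per column; the seed makes the k-NN estimate repeatable.
    scores = mutual_info_regression(
        X.to_numpy(), y.to_numpy(), random_state=random_state
    )
    ranked = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    return list(ranked.index[:k])


rng = np.random.default_rng(0)
signal = rng.normal(0, 1, 500)
X = pd.DataFrame(
    {
        "poly_a": signal + rng.normal(0, 0.01, 500),  # near copy of the target
        "poly_b": rng.normal(0, 1, 500),  # noise
        "poly_c": rng.normal(0, 1, 500),  # noise
    }
)
y = pd.Series(signal)
top = rank_by_mutual_info(X, y, k=1)
```

With a fixed random_state, repeated calls on the same data return the same names, which is the reproducibility property the description refers to.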

The k-nearest-neighbour estimator behind mutual_info_regression is the dominant cost of the whole exogenous-feature pipeline on realistic inputs (thousands of candidate columns over years of hourly data). Two knobs keep it fast: the scoring runs in parallel across candidate columns (n_jobs), and long series are scored on a reproducible row subsample (mi_sample_size) instead of every observation.
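A seeded row subsample of the kind described above might look like the sketch below. The helper name subsample_rows and the use of numpy's default_rng are assumptions made for illustration; the library may draw its sample differently. Parallel scoring via n_jobs is not shown.

```python
import numpy as np
import pandas as pd


def subsample_rows(X, y, sample_size, random_state=123):
    """Hypothetical helper: cap the number of rows passed to the MI estimator."""
    # None, or a frame that is already short enough, means "score every row".
    if sample_size is None or len(X) <= sample_size:
        return X, y
    rng = np.random.default_rng(random_state)
    # Uniform draw without replacement; sorting keeps chronological order.
    pos = np.sort(rng.choice(len(X), size=sample_size, replace=False))
    return X.iloc[pos], y.iloc[pos]


# Roughly 2.3 years of hourly data, far more rows than the default cap of 4000.
idx = pd.date_range("2020-01-01", periods=20_000, freq="h", tz="UTC")
X = pd.DataFrame({"poly_a": np.arange(20_000, dtype=float)}, index=idx)
y = pd.Series(np.arange(20_000, dtype=float), index=idx)

X_s, y_s = subsample_rows(X, y, sample_size=4000)
X_again, _ = subsample_rows(X, y, sample_size=4000)
```

Because the draw is seeded, X_s and X_again contain the same rows, so the scores computed on the subsample are repeatable from run to run.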

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| poly_features | pd.DataFrame | DataFrame containing only the candidate poly_* interaction columns to rank. | required |
| y | pd.Series | Target series. It is inner-joined to poly_features on the index, and rows with missing values are dropped before scoring. | required |
| max_poly_features | int | Maximum number of columns to keep. When this is <= 0 or the candidate count does not exceed it, all columns are returned unchanged. | 10 |
| random_state | int | Seed forwarded to mutual_info_regression (and to the row subsampling, see mi_sample_size) for a deterministic estimate. | 123 |
| n_jobs | Optional[int] | Number of parallel jobs forwarded to mutual_info_regression, which scores candidate columns independently. -1 uses all cores; None runs single-threaded. Parallelism does not change the scores, so the selected columns are identical for every n_jobs value. | -1 |
| mi_sample_size | Optional[int] | Maximum number of rows used for the mutual-information estimate. When the joined frame is longer, a uniform random subsample of this size (drawn without replacement, seeded by random_state) is scored instead, which is a large speed-up on multi-year hourly series. The subsampled estimate can rank borderline columns differently from a full-data estimate, so the kept set may differ; pass None to score every row (the pre-15.8 behaviour). Must be a positive integer or None. | 4000 |

Returns

| Name | Type | Description |
| --- | --- | --- |
|  | List[str] | Names of the selected poly_* columns, ordered from highest to lowest mutual information. Returns every input column (in its original order) when no capping is required. |

Raises

| Name | Type | Description |
| --- | --- | --- |
|  | ValueError | If poly_features and y share no overlapping, non-missing rows, or if mi_sample_size is neither None nor a positive integer. |
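The alignment and validation rules documented above can be sketched as below. The helper name align_for_scoring, the temporary column name "target", and the error messages are hypothetical; only the documented behaviour (inner join on the index, dropping rows with missing values, ValueError on an empty result or an invalid mi_sample_size) comes from this page.

```python
import numpy as np
import pandas as pd


def align_for_scoring(poly_features, y, mi_sample_size=4000):
    """Hypothetical helper mirroring the documented input checks."""
    # mi_sample_size must be None or a positive integer.
    if mi_sample_size is not None and (
        not isinstance(mi_sample_size, (int, np.integer)) or mi_sample_size <= 0
    ):
        raise ValueError("mi_sample_size must be a positive integer or None")
    # Inner join on the index, then drop rows with any missing value.
    joined = poly_features.join(y.rename("target"), how="inner").dropna()
    if joined.empty:
        raise ValueError("poly_features and y share no overlapping, non-missing rows")
    return joined.drop(columns="target"), joined["target"]


idx = pd.date_range("2024-01-01", periods=4, freq="h")
poly = pd.DataFrame({"poly_a": [1.0, 2.0, np.nan, 4.0]}, index=idx)
y = pd.Series([0.1, 0.2, 0.3, 0.4], index=idx, name="load")

# The row holding the NaN is dropped, so three rows survive.
X_ok, y_ok = align_for_scoring(poly, y)

# Shifting the target by one day leaves no shared timestamps.
try:
    align_for_scoring(poly, y.shift(freq="D"))
    message = None
except ValueError as exc:
    message = str(exc)
```

Catching ValueError around the call is one way for a pipeline to skip polynomial selection cleanly when the target and the candidate columns cover disjoint periods.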

Examples

import numpy as np
import pandas as pd
from spotforecast2_safe.manager.features import select_top_poly_features

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=500, freq="h", tz="UTC")
signal = rng.normal(0, 1, 500)
y = pd.Series(signal, index=idx, name="target")
poly = pd.DataFrame(
    {
        "poly_a": signal + rng.normal(0, 0.01, 500),  # informative
        "poly_b": signal * 0.5 + rng.normal(0, 0.1, 500),
        "poly_c": rng.normal(0, 1, 500),  # noise
        "poly_d": rng.normal(0, 1, 500),  # noise
    },
    index=idx,
)
top = select_top_poly_features(poly, y, max_poly_features=2)
print("kept:", top)
assert top[0] == "poly_a"
assert len(top) == 2
# Prints: kept: ['poly_a', 'poly_b']