manager.features.select_top_poly_features

manager.features.select_top_poly_features(
    poly_features,
    y,
    max_poly_features=10,
    random_state=123,
    n_jobs=-1,
    mi_sample_size=4000,
)

Rank polynomial interaction columns by mutual information and keep the top K.

Polynomial expansion (create_interaction_features with degree >= 2) can emit hundreds or thousands of poly_* columns. This helper caps that set: it scores each candidate column by its mutual information with the target and returns the names of the max_poly_features highest-scoring columns. Mutual information is estimated with mutual_info_regression, seeded by random_state so the selection is reproducible.
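The ranking step described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `rank_by_mutual_info` and the temporary `__target__` column name are hypothetical, and only `mutual_info_regression` from scikit-learn is taken from the description.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def rank_by_mutual_info(candidates, y, k, random_state=123):
    """Hypothetical helper: names of the k columns with the highest MI."""
    # Align candidates and target on the shared index, drop incomplete rows.
    joined = candidates.join(y.rename("__target__"), how="inner").dropna()
    X = joined.drop(columns="__target__")
    # Seeding the estimator makes repeated calls return the same scores.
    scores = mutual_info_regression(
        X, joined["__target__"], random_state=random_state
    )
    # Sort by descending score and keep the first k column names.
    order = np.argsort(-scores, kind="stable")[:k]
    return [X.columns[i] for i in order]


rng = np.random.default_rng(0)
signal = rng.normal(size=300)
y = pd.Series(signal)
candidates = pd.DataFrame(
    {
        "poly_a": signal + rng.normal(0, 0.01, 300),  # near copy of the target
        "poly_b": rng.normal(size=300),  # noise
        "poly_c": rng.normal(size=300),  # noise
    }
)
top = rank_by_mutual_info(candidates, y, k=2)
print(top)
```

The near copy of the target ranks first; which noise column takes second place depends on the estimate and is not meaningful.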

The k-nearest-neighbour estimator behind mutual_info_regression is the dominant cost of the whole exogenous-feature pipeline on realistic inputs (thousands of candidate columns over years of hourly data). Two knobs keep it fast: the scoring runs in parallel across candidate columns (n_jobs), and long series are scored on a reproducible row subsample (mi_sample_size) instead of every observation.
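The row-subsampling knob can be illustrated with a short sketch. It assumes the subsample is drawn with pandas' `DataFrame.sample`; the library may draw it differently, and `subsample_rows` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd


def subsample_rows(frame, mi_sample_size, random_state=123):
    """Hypothetical helper: cap the rows passed to the MI estimator."""
    # None means "score every row"; short frames are left untouched.
    if mi_sample_size is None or len(frame) <= mi_sample_size:
        return frame
    # Uniform draw without replacement, reproducible through the seed.
    return frame.sample(n=mi_sample_size, random_state=random_state)


# Roughly three years of hourly observations (3 * 8760 rows).
rng = np.random.default_rng(0)
frame = pd.DataFrame({"poly_a": rng.normal(size=3 * 8760)})
sub = subsample_rows(frame, 4000)
print(len(frame), "->", len(sub))
```

Because the draw is seeded, two calls with the same `random_state` score the same rows, which is what keeps the selection reproducible.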

Parameters

Name	Type	Description	Default
poly_features	pd.DataFrame	DataFrame containing only the candidate `poly_*` interaction columns to rank.	required
y	pd.Series	Target series. It is inner-joined to poly_features on the index and rows with missing values are dropped before scoring.	required
max_poly_features	int	Maximum number of columns to keep. When this is `<= 0` or the candidate count does not exceed it, all columns are returned unchanged. Defaults to `10`.	`10`
random_state	int	Seed forwarded to `mutual_info_regression` (and to the row subsampling, see mi_sample_size) for a deterministic estimate. Defaults to `123`.	`123`
n_jobs	Optional[int]	Number of parallel jobs forwarded to `mutual_info_regression`, which scores candidate columns independently. `-1` (the default) uses all cores; `None` runs single-threaded. Parallelism does not change the scores, so the selected columns are identical for every n_jobs value.	`-1`
mi_sample_size	Optional[int]	Maximum number of rows used for the mutual-information estimate. When the joined frame is longer, a uniform random subsample of this size (drawn without replacement, seeded by random_state) is scored instead, which is a large speed-up on multi-year hourly series. The subsampled estimate can rank borderline columns differently from a full-data estimate, so the kept set may differ; pass `None` to score every row (the pre-15.8 behaviour). Must be a positive integer or `None`. Defaults to `4000`.	`4000`

Returns

Name	Type	Description
	List[str]	Names of the selected `poly_*` columns, ordered from highest to lowest mutual information. Returns every input column (in its original order) when no capping is required.

Raises

Name	Type	Description
	ValueError	If poly_features and y share no overlapping, non-missing rows, or if mi_sample_size is neither `None` nor a positive integer.
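
The two error conditions can be pictured with a small sketch. `check_inputs` and its messages are hypothetical; the real function's checks and wording may differ.

```python
import pandas as pd


def check_inputs(poly_features, y, mi_sample_size):
    """Hypothetical validator mirroring the documented ValueError cases."""
    valid_size = mi_sample_size is None or (
        isinstance(mi_sample_size, int) and mi_sample_size > 0
    )
    if not valid_size:
        raise ValueError("mi_sample_size must be a positive integer or None")
    # Inner join on the index, then drop rows with missing values.
    joined = poly_features.join(y.rename("__target__"), how="inner").dropna()
    if joined.empty:
        raise ValueError("poly_features and y share no overlapping, non-missing rows")
    return joined


poly = pd.DataFrame({"poly_a": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
y_ok = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
y_disjoint = pd.Series([1.0, 2.0], index=[10, 11])

errors = []
for args in [(poly, y_disjoint, 4000), (poly, y_ok, 0)]:
    try:
        check_inputs(*args)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```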

Examples

import numpy as np
import pandas as pd
from spotforecast2_safe.manager.features import select_top_poly_features

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=500, freq="h", tz="UTC")
signal = rng.normal(0, 1, 500)
y = pd.Series(signal, index=idx, name="target")
poly = pd.DataFrame(
    {
        "poly_a": signal + rng.normal(0, 0.01, 500),  # informative
        "poly_b": signal * 0.5 + rng.normal(0, 0.1, 500),
        "poly_c": rng.normal(0, 1, 500),  # noise
        "poly_d": rng.normal(0, 1, 500),  # noise
    },
    index=idx,
)
top = select_top_poly_features(poly, y, max_poly_features=2)
print("kept:", top)
assert top[0] == "poly_a"
assert len(top) == 2

kept: ['poly_a', 'poly_b']