preprocessing.outlier.mark_outliers

preprocessing.outlier.mark_outliers(
    data,
    contamination=0.1,
    random_state=1234,
    verbose=False,
)

Detects outliers in the dataset with an Isolation Forest and replaces them with NaN.

Parameters

Name Type Description Default
data pd.DataFrame The input dataset. required
contamination float The (estimated) proportion of outliers in the dataset. 0.1
random_state int Random seed for reproducibility. 1234
verbose bool Whether to print additional information. False

Returns

Name Type Description
tuple[pd.DataFrame, np.ndarray] A tuple containing the modified dataset with outliers marked as NaN and the outlier labels.

Examples

import numpy as np
import pandas as pd

from spotforecast2_safe.preprocessing.outlier import mark_outliers

rng = np.random.default_rng(0)
# 50 normal values plus two clear outliers (1000, -1000)
values = np.concatenate([rng.normal(loc=10.0, scale=1.0, size=50), [1000.0, -1000.0]])
data = pd.DataFrame({"load": values})

cleaned_data, outlier_labels = mark_outliers(
    data, contamination=0.05, random_state=42, verbose=True
)
n_nan = cleaned_data["load"].isna().sum()
print(f"Outliers marked as NaN: {n_nan}")
assert n_nan >= 2, "Expected at least the two injected extreme outliers to be marked"

Column 'load': Marked 5.7692% of data points as outliers.
Outliers marked as NaN: 3
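
To illustrate the underlying technique, the following is a minimal sketch of marking outliers as NaN with scikit-learn's IsolationForest. It is not the implementation of mark_outliers: the helper name mark_outliers_sketch, the choice to fit one forest per column, and the label convention (-1 for outliers, 1 for inliers, as in scikit-learn) are assumptions made for this example. It also assumes the input contains no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def mark_outliers_sketch(data, contamination=0.1, random_state=1234):
    """Hypothetical helper: fit one Isolation Forest per column (an assumption)."""
    cleaned = data.copy()
    # scikit-learn convention: 1 = inlier, -1 = outlier
    labels = np.ones(len(data), dtype=int)
    for col in data.columns:
        forest = IsolationForest(
            contamination=contamination, random_state=random_state
        )
        # fit_predict expects 2-D input, hence the double brackets
        col_labels = forest.fit_predict(data[[col]])
        is_outlier = col_labels == -1
        cleaned.loc[is_outlier, col] = np.nan
        labels[is_outlier] = -1
    return cleaned, labels


rng = np.random.default_rng(0)
# Same setup as the example above: 50 normal values plus two extreme values
values = np.concatenate([rng.normal(loc=10.0, scale=1.0, size=50), [1000.0, -1000.0]])
data = pd.DataFrame({"load": values})

cleaned, labels = mark_outliers_sketch(data, contamination=0.05, random_state=42)
print(cleaned["load"].isna().sum(), "values marked as NaN")
print(cleaned.tail(2))
```

Because contamination sets the fraction of points the forest flags, a few ordinary values may be marked along with the injected extremes. The sketch works on a copy, so the input DataFrame is left unchanged.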