preprocessing.outlier.manual_outlier_removal

preprocessing.outlier.manual_outlier_removal(
    data,
    column,
    lower_threshold=None,
    upper_threshold=None,
    verbose=False,
)

Mark values in one column that fall below a lower threshold or above an upper threshold as outliers by replacing them with NaN.

Parameters

Name Type Description Default
data pd.DataFrame The input dataset. required
column str Name of the column in which to mark outliers. required
lower_threshold float | None The lower threshold below which values are considered outliers. If None, no lower threshold is applied. None
upper_threshold float | None The upper threshold above which values are considered outliers. If None, no upper threshold is applied. None
verbose bool Whether to print additional information. False

Returns

Name Type Description
tuple[pd.DataFrame, int] A tuple containing the modified dataset, with outliers marked as NaN, and the number of outliers marked.

Examples

import numpy as np
import pandas as pd

from spotforecast2_safe.preprocessing.outlier import manual_outlier_removal

rng = np.random.default_rng(0)
# 20 normal values with two injected boundary violations
values = np.concatenate([rng.uniform(low=100.0, high=600.0, size=20), [10.0, 800.0]])
data = pd.DataFrame({"ABC": values})

cleaned_data, n_outliers = manual_outlier_removal(
    data,
    column="ABC",
    lower_threshold=50,
    upper_threshold=700,
    verbose=True,
)
print(f"Outliers removed: {n_outliers}")
assert n_outliers >= 2, "Expected the two injected boundary violations to be removed"
assert cleaned_data["ABC"].isna().sum() == n_outliers
Manually marked 2 values > 700 or < 50 as outliers in ABC.
Outliers removed: 2
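
The threshold semantics described under Parameters can be sketched in plain pandas. This is only an illustration, not the library's implementation: the helper name mark_outside_thresholds is hypothetical, the strict comparisons are inferred from the verbose message above, and whether the real function copies its input is an assumption made here.

```python
import numpy as np
import pandas as pd


def mark_outside_thresholds(data, column, lower_threshold=None, upper_threshold=None):
    """Hypothetical helper illustrating the documented behaviour; not the library code."""
    # Start with an all-False mask and add each bound that is actually set.
    mask = pd.Series(False, index=data.index)
    if lower_threshold is not None:
        mask |= data[column] < lower_threshold
    if upper_threshold is not None:
        mask |= data[column] > upper_threshold

    # Assumption: work on a copy so the caller's DataFrame is left untouched.
    result = data.copy()
    result.loc[mask, column] = np.nan
    return result, int(mask.sum())


frame = pd.DataFrame({"ABC": [10.0, 120.0, 450.0, 800.0]})

# Only an upper bound is set: 800.0 is marked, 10.0 is kept.
upper_only, n_upper = mark_outside_thresholds(frame, "ABC", upper_threshold=700)

# Both bounds are set: 10.0 and 800.0 are marked.
both, n_both = mark_outside_thresholds(
    frame, "ABC", lower_threshold=50, upper_threshold=700
)
```

Leaving a threshold as None skips that side of the check, so a one-sided filter needs only the bound of interest.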