preprocessing.outlier.manual_outlier_removal

preprocessing.outlier.manual_outlier_removal(
    data,
    column,
    lower_threshold=None,
    upper_threshold=None,
    verbose=False,
)

Mark values in a single column that fall outside user-supplied thresholds as outliers by replacing them with NaN.

Parameters

Name	Type	Description	Default
data	pd.DataFrame	The input dataset.	required
column	str	Name of the column to check for outliers.	required
lower_threshold	float | None	The lower threshold below which values are considered outliers. If None, no lower threshold is applied.	`None`
upper_threshold	float | None	The upper threshold above which values are considered outliers. If None, no upper threshold is applied.	`None`
verbose	bool	Whether to print a summary of how many values were marked as outliers.	`False`

Returns

Name	Type	Description
	tuple[pd.DataFrame, int]	A tuple containing the modified dataset with outliers marked as NaN and the number of outliers marked.

Examples

import numpy as np
import pandas as pd

from spotforecast2_safe.preprocessing.outlier import manual_outlier_removal

rng = np.random.default_rng(0)
# 20 normal values with two injected boundary violations
values = np.concatenate([rng.uniform(low=100.0, high=600.0, size=20), [10.0, 800.0]])
data = pd.DataFrame({"ABC": values})

cleaned_data, n_outliers = manual_outlier_removal(
    data,
    column="ABC",
    lower_threshold=50,
    upper_threshold=700,
    verbose=True,
)
print(f"Outliers removed: {n_outliers}")
assert n_outliers >= 2, "Expected the two injected boundary violations to be removed"
assert cleaned_data["ABC"].isna().sum() == n_outliers

Manually marked 2 values > 700 or < 50 as outliers in ABC.
Outliers removed: 2
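
To make the thresholding rule concrete, the following is a minimal sketch of the behaviour described above, written in plain pandas. It is not the library's implementation: the helper name `mark_outliers_as_nan` is hypothetical, the strict comparisons (`<` and `>`) are inferred from the verbose message shown above, and returning a copy rather than modifying the input in place is an assumption.

```python
import pandas as pd


def mark_outliers_as_nan(data, column, lower_threshold=None, upper_threshold=None):
    """Hypothetical illustration of the documented behaviour, not the library code."""
    result = data.copy()  # assumption: the input frame is left untouched

    # Start with "nothing is an outlier" and add each threshold that is set.
    mask = pd.Series(False, index=result.index)
    if lower_threshold is not None:
        mask |= result[column] < lower_threshold  # strictly below the lower bound
    if upper_threshold is not None:
        mask |= result[column] > upper_threshold  # strictly above the upper bound

    n_outliers = int(mask.sum())
    result.loc[mask, column] = float("nan")  # mark outliers, keep the rows
    return result, n_outliers


frame = pd.DataFrame({"ABC": [10.0, 120.0, 350.0, 800.0]})

# Both thresholds: 10.0 and 800.0 are marked.
both, n_both = mark_outliers_as_nan(frame, "ABC", lower_threshold=50, upper_threshold=700)

# Upper threshold only: lower_threshold=None leaves 10.0 untouched.
upper_only, n_upper = mark_outliers_as_nan(frame, "ABC", upper_threshold=700)

print(n_both, n_upper)
```

Because outliers are replaced with NaN rather than dropped, the row count and index of the result match the input, which matters when the column belongs to a regularly spaced time series.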