preprocessing.outlier.get_outliers(
    data,
    data_original=None,
    contamination=0.01,
    random_state=1234,
)
Detect outliers in each column using Isolation Forest.
This function uses scikit-learn’s IsolationForest algorithm to detect outliers in each column of the input DataFrame. The original data (before any NaN values were introduced) can be provided to identify which values were marked as NaN due to outlier detection.
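To make the per-column approach concrete, here is a minimal sketch of how one Isolation Forest per column could be fitted with scikit-learn. It is not the library's implementation: the helper name sketch_column_outliers, the dropping of NaN values before fitting, and the exact error message are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def sketch_column_outliers(data, contamination=0.01, random_state=1234):
    """Hypothetical helper: fit one IsolationForest per column."""
    if data.empty:
        raise ValueError("data must contain at least one row and one column")
    result = {}
    for col in data.columns:
        # Assumption: missing values are skipped before fitting.
        values = data[col].dropna()
        if values.empty:
            result[col] = values
            continue
        model = IsolationForest(
            contamination=contamination, random_state=random_state
        )
        # fit_predict expects a 2-D array and returns -1 for outliers, 1 for inliers.
        labels = model.fit_predict(values.to_numpy().reshape(-1, 1))
        result[col] = values[labels == -1]
    return result


rng = np.random.default_rng(0)
frame = pd.DataFrame(
    {"A": np.concatenate([rng.normal(size=100), [10.0, 11.0, 12.0]])}
)
flagged = sketch_column_outliers(frame, contamination=0.03)
```

Because each column is fitted on its own, a value is judged only against other values in the same column, not against the row as a whole.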
Parameters
data : pd.DataFrame, required
    The input DataFrame to check for outliers.
data_original : Optional[pd.DataFrame], default None
    Original DataFrame before outlier marking. If provided, helps identify which values became NaN due to outlier detection.
contamination : float, default 0.01
    The estimated proportion of outliers in the dataset.
random_state : int, default 1234
    Random seed for reproducibility.
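The role of data_original can be illustrated with a small comparison. The sketch below assumes that outliers were previously replaced by NaN in data, and that the values of interest are those present in data_original but missing in data. The helper values_masked_as_nan is hypothetical and is not part of the package.

```python
import numpy as np
import pandas as pd

data_original = pd.DataFrame(
    {"A": [1.0, 2.0, 50.0, 3.0], "B": [5.0, np.nan, 6.0, 7.0]}
)
data = data_original.copy()
data.loc[2, "A"] = np.nan  # pretend 50.0 was masked as an outlier


def values_masked_as_nan(data, data_original):
    """Hypothetical helper: original values that are NaN only in `data`."""
    masked = {}
    for col in data.columns:
        # NaN now, but not NaN originally: the value was masked afterwards.
        became_nan = data[col].isna() & data_original[col].notna()
        masked[col] = data_original.loc[became_nan, col]
    return masked


masked = values_masked_as_nan(data, data_original)
```

Values that were already missing in data_original (row 1 of column B here) are not reported, since they were not introduced by the masking step.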
Returns
Dict[str, pd.Series]
    A dictionary mapping column names to Series of outlier values. For columns without outliers, an empty Series is returned.
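As a rough illustration of working with a result of this shape, the sketch below builds a dictionary by hand (the values and index labels are made up, not output from get_outliers) and derives a per-column count and the list of columns that contain outliers.

```python
import pandas as pd

# Hand-built stand-in for a returned dictionary; values are illustrative only.
outliers = {
    "A": pd.Series([10.0, 11.0, 12.0], index=[100, 101, 102]),
    "B": pd.Series([], dtype=float),  # column without outliers
}

# Number of flagged values per column.
counts = pd.Series(
    {col: len(vals) for col, vals in outliers.items()}, name="n_outliers"
)

# Columns with at least one flagged value.
columns_with_outliers = [col for col, vals in outliers.items() if not vals.empty]
```

Since every column has an entry, even an empty one, the dictionary can be iterated without checking for missing keys.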
Raises
ValueError
If data is empty or contains no columns.
Examples
import numpy as np
import pandas as pd
from spotforecast2_safe.preprocessing.outlier import get_outliers

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "A": np.concatenate([rng.normal(loc=0.0, scale=1.0, size=100), [10.0, 11.0, 12.0]]),
    "B": np.concatenate([rng.normal(loc=5.0, scale=2.0, size=100), [100.0, 110.0, 120.0]]),
})
data_original = data.copy()
outliers = get_outliers(data_original, contamination=0.03)
for col, outlier_vals in outliers.items():
    print(f"{col}: {len(outlier_vals)} outliers detected")
assert len(outliers) == 2
A: 4 outliers detected
B: 4 outliers detected