preprocessing.outlier.get_outliers
preprocessing.outlier.get_outliers(
    data,
    data_original=None,
    contamination=0.01,
    random_state=1234,
)
Detect outliers in each column using Isolation Forest.
This function uses scikit-learn’s IsolationForest algorithm to detect outliers in each column of the input DataFrame. The original data (before any NaN values were introduced) can be provided to identify which values were marked as NaN due to outlier detection.
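The per-column approach described above can be sketched directly with scikit-learn. This is only an illustration under stated assumptions, not the library's implementation: the helper name `sketch_column_outliers` is hypothetical, and dropping NaN values before fitting is an assumed choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def sketch_column_outliers(data, contamination=0.01, random_state=1234):
    """Hypothetical helper: fit one Isolation Forest per column."""
    if data.empty:
        raise ValueError("data is empty or contains no columns")
    result = {}
    for col in data.columns:
        # Assumption: missing values are ignored when fitting
        values = data[col].dropna()
        if values.empty:
            result[col] = values
            continue
        model = IsolationForest(
            contamination=contamination, random_state=random_state
        )
        # fit_predict expects a 2-D array and returns -1 for outliers
        labels = model.fit_predict(values.to_numpy().reshape(-1, 1))
        result[col] = values[labels == -1]
    return result


rng = np.random.default_rng(42)
frame = pd.DataFrame(
    {"A": np.concatenate([rng.normal(0, 1, 100), [10, 11, 12]])}
)
found = sketch_column_outliers(frame, contamination=0.03)
```

Fitting each column separately means an unusual value is judged only against other values in the same column, so relationships between columns are not taken into account.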
Parameters
data
pd.DataFrame
The input DataFrame to check for outliers.
required
data_original
Optional[pd.DataFrame]
Optional original DataFrame before outlier marking. If provided, helps identify which values became NaN due to outlier detection. Default: None.
None
contamination
float
The estimated proportion of outliers in the dataset. Default: 0.01.
0.01
random_state
int
Random seed for reproducibility. Default: 1234.
1234
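One way to picture the role of data_original is the comparison below, which finds values that are NaN in data but present in the original. It is a minimal sketch with made-up sample values, and it is an assumption that the function performs a comparison of this kind internally.

```python
import numpy as np
import pandas as pd

# Made-up original data; the last entry was already missing
data_original = pd.DataFrame({"A": [1.0, 2.0, 50.0, np.nan]})

# Copy in which the suspected outlier (50.0) has been replaced by NaN
data = data_original.copy()
data.loc[2, "A"] = np.nan

# NaN now, but present in the original: values lost to outlier marking
newly_missing = data["A"].isna() & data_original["A"].notna()
recovered = data_original.loc[newly_missing, "A"]
```

Values that were already missing in the original (index 3 here) are excluded, because they were not introduced by outlier marking.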
Returns
Dict[str, pd.Series]
A dictionary mapping column names to Series of outlier values. For columns without outliers, an empty Series is returned.
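A result of this shape can be filtered and summarized with plain pandas. The dictionary below is a hand-built stand-in for the return value, and it is an assumption that the returned Series keep the row index of the input DataFrame.

```python
import pandas as pd

# Hand-built stand-in for the documented return value
outliers = {
    "A": pd.Series([10.0, 11.0, 12.0], index=[100, 101, 102]),
    "B": pd.Series([], dtype=float),  # column without outliers
}

# Keep only the columns where something was flagged
flagged = {col: vals for col, vals in outliers.items() if not vals.empty}

# Number of outliers per column, including columns with none
counts = pd.Series({col: len(vals) for col, vals in outliers.items()})
```

Because columns without outliers map to an empty Series and not to a missing key, every input column can be looked up without a membership check.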
Raises
ValueError
If data is empty or contains no columns.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from spotforecast2_safe.preprocessing.outlier import get_outliers
>>>
>>> # Create sample data with outliers
>>> np.random.seed(42)
>>> data = pd.DataFrame({
...     'A': np.concatenate([np.random.normal(0, 1, 100), [10, 11, 12]]),
...     'B': np.concatenate([np.random.normal(5, 2, 100), [100, 110, 120]])
... })
>>> data_original = data.copy()
>>>
>>> # Detect outliers
>>> outliers = get_outliers(data_original, contamination=0.03)
>>> for col, outlier_vals in outliers.items():
...     print(f"{col}: {len(outlier_vals)} outliers detected")