preprocessing.outlier.mark_outliers

preprocessing.outlier.mark_outliers(
    data,
    contamination=0.1,
    random_state=1234,
    verbose=False,
)

Marks outliers as NaN in the dataset using Isolation Forest.
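The idea behind the one-line description can be illustrated with a minimal sketch for a single column. It assumes scikit-learn's `IsolationForest` and its label convention (1 for inliers, -1 for outliers). The helper `mark_series_outliers` is hypothetical and is not the package's implementation; how `mark_outliers` combines columns internally is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def mark_series_outliers(series, contamination=0.1, random_state=1234):
    """Hypothetical helper: flag outliers in one column and set them to NaN."""
    cleaned = series.astype(float).copy()
    labels = np.ones(len(cleaned), dtype=int)  # 1 = inlier, -1 = outlier
    observed = cleaned.notna().to_numpy()
    if observed.sum() == 0:
        return cleaned, labels  # nothing to fit on, e.g. an all-NaN column
    forest = IsolationForest(contamination=contamination, random_state=random_state)
    # fit_predict expects a 2-D array: one feature column, one row per sample
    labels[observed] = forest.fit_predict(cleaned[observed].to_numpy().reshape(-1, 1))
    cleaned[labels == -1] = np.nan  # replace flagged values with NaN
    return cleaned, labels


# Synthetic data (assumption): 200 standard-normal values plus two extremes
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1.0, 200), [50.0, -50.0]])
demo = pd.Series(values, name="A")
cleaned, labels = mark_series_outliers(demo, contamination=0.05, random_state=42)
print(f"Marked {np.mean(labels == -1):.4%} of data points as outliers.")
```

Because `contamination` sets the score threshold, roughly that share of the observed values is flagged regardless of whether they are genuinely anomalous, so the parameter is worth tuning per dataset.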

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | pd.DataFrame | The input dataset. | required |
| contamination | float | The estimated proportion of outliers in the dataset. | 0.1 |
| random_state | int | Random seed for reproducibility. | 1234 |
| verbose | bool | Whether to print additional information. | False |

Returns

| Name | Type | Description |
| --- | --- | --- |
| | tuple[pd.DataFrame, np.ndarray] | A tuple containing the modified dataset with outliers marked as NaN and the outlier labels. |

Examples

from spotforecast2_safe.data.fetch_data import fetch_data, get_package_data_home
from spotforecast2_safe.preprocessing.outlier import mark_outliers

# Load the demo dataset shipped with the package
path_demo = get_package_data_home() / "demo02.csv"
data = fetch_data(filename=path_demo)
print(data.head())

# Mark outliers as NaN; verbose=True prints the share marked per column
cleaned_data, outlier_labels = mark_outliers(
    data, contamination=0.1, random_state=42, verbose=True
)
print(cleaned_data.head())
print(outlier_labels[:10])
                                  A         B      C  D         E         F  \
DateTime                                                                      
1964-08-02 14:00:00+00:00  0.202969  8.255128  334.0  0  0.111049 -0.121741   
1964-08-02 15:00:00+00:00  0.145975  7.542355  339.0  0 -0.003927  0.103541   
1964-08-02 16:00:00+00:00  0.094389  8.174336  344.0  0  0.043963  0.041291   
1964-08-02 17:00:00+00:00 -0.202353  7.387896  341.0  0  0.067118  0.072999   
1964-08-02 18:00:00+00:00 -0.013810  7.581125  335.0  0 -0.138614 -0.006495   

                                G         H  I   J    K  
DateTime                                                 
1964-08-02 14:00:00+00:00  1597.0  0.067896  0 NaN  0.0  
1964-08-02 15:00:00+00:00  1609.0 -0.093175  0 NaN  0.0  
1964-08-02 16:00:00+00:00  1660.0  0.047823  0 NaN  0.0  
1964-08-02 17:00:00+00:00  1567.0 -0.051628  0 NaN  0.0  
1964-08-02 18:00:00+00:00  1467.0  0.016003  0 NaN  0.0  
Column 'A': Marked 9.9934% of data points as outliers.
Column 'B': Marked 9.9997% of data points as outliers.
Column 'C': Marked 9.9347% of data points as outliers.
Column 'D': Marked 6.5549% of data points as outliers.
Column 'E': Marked 9.9577% of data points as outliers.
Column 'F': Marked 9.9787% of data points as outliers.
Column 'G': Marked 9.9682% of data points as outliers.
Column 'H': Marked 9.9840% of data points as outliers.
Column 'I': Marked 9.9546% of data points as outliers.
Column 'J': Marked 0.0000% of data points as outliers.
Column 'K': Marked 9.9074% of data points as outliers.
                                  A   B      C    D         E         F  \
DateTime                                                                  
1964-08-02 14:00:00+00:00  0.202969 NaN  334.0  0.0  0.111049 -0.121741   
1964-08-02 15:00:00+00:00  0.145975 NaN  339.0  0.0 -0.003927  0.103541   
1964-08-02 16:00:00+00:00  0.094389 NaN  344.0  0.0  0.043963  0.041291   
1964-08-02 17:00:00+00:00       NaN NaN  341.0  0.0  0.067118  0.072999   
1964-08-02 18:00:00+00:00 -0.013810 NaN  335.0  0.0 -0.138614 -0.006495   

                                G         H    I   J    K  
DateTime                                                   
1964-08-02 14:00:00+00:00  1597.0  0.067896  0.0 NaN  0.0  
1964-08-02 15:00:00+00:00  1609.0 -0.093175  0.0 NaN  0.0  
1964-08-02 16:00:00+00:00  1660.0  0.047823  0.0 NaN  0.0  
1964-08-02 17:00:00+00:00  1567.0 -0.051628  0.0 NaN  0.0  
1964-08-02 18:00:00+00:00  1467.0  0.016003  0.0 NaN  0.0  
[1 1 1 1 1 1 1 1 1 1]
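To check how many values were newly set to NaN, the input and output frames can be compared column by column. The sketch below uses small hand-made frames as stand-ins for `data` and `cleaned_data`; the values are hypothetical. It counts shares relative to the number of rows, which may differ from the denominator used in the verbose output above. If the function modifies its input in place (not confirmed here), keep a copy of the input before calling it.

```python
import numpy as np
import pandas as pd

# Hand-made stand-ins for the input and the returned frame (hypothetical values)
before = pd.DataFrame({"A": [0.2, 0.1, 9.0, -0.2], "J": [np.nan] * 4})
after = pd.DataFrame({"A": [0.2, 0.1, np.nan, -0.2], "J": [np.nan] * 4})

# A value counts as newly marked if it was present before and is NaN afterwards
newly_marked = after.isna() & before.notna()
share = newly_marked.mean() * 100  # percent of rows per column
print(share.round(4))
```

A column that was already entirely NaN, like `J` here, reports a share of zero because nothing in it could be newly marked.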