preprocessing.outlier_plots
Functions
visualize_outliers_hist
preprocessing.outlier_plots.visualize_outliers_hist(
    data,
    data_original,
    columns=None,
    contamination=0.01,
    random_state=1234,
    figsize=(10, 5),
    bins=50,
    **kwargs,
)
Visualize outliers in a DataFrame using stacked histograms.
Creates a histogram for each specified column, displaying both regular data and detected outliers in different colors. Uses IsolationForest for outlier detection.
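The detection step can be sketched with scikit-learn's IsolationForest. This is a minimal illustration, not the library's actual implementation: the source does not say whether the detector is fitted per column or on all columns jointly, so the per-column fit below is an assumption, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic column: 100 regular readings plus three injected extremes.
rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(20, 5, 100), [50, 60, 70]])
df = pd.DataFrame({"temperature": values})

# Assumption: one detector per column, fitted on a 2-D (n_samples, 1) frame.
detector = IsolationForest(contamination=0.03, random_state=1234)
labels = detector.fit_predict(df[["temperature"]])  # -1 = outlier, 1 = inlier

# Split the column into the two groups that a plot would color differently.
is_outlier = labels == -1
regular = df.loc[~is_outlier, "temperature"]
outliers = df.loc[is_outlier, "temperature"]
```

Here contamination sets the share of points the detector labels as outliers, and random_state makes the labeling reproducible across runs.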
Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | pd.DataFrame | The DataFrame with cleaned data (outliers may be NaN). | required |
| data_original | pd.DataFrame | The original DataFrame before outlier detection. | required |
| columns | Optional[list[str]] | List of column names to visualize. If None, all columns are used. Default: None. | None |
| contamination | float | The estimated proportion of outliers in the dataset. Default: 0.01. | 0.01 |
| random_state | int | Random seed for reproducibility. Default: 1234. | 1234 |
| figsize | tuple[int, int] | Figure size as (width, height). Default: (10, 5). | (10, 5) |
| bins | int | Number of histogram bins. Default: 50. | 50 |
| **kwargs | Any | Additional keyword arguments passed to plt.hist() (e.g., color, alpha, edgecolor). | {} |
Returns
None
None. Displays matplotlib figures.
Raises
ValueError
If data or data_original is empty, or if specified columns don’t exist.
ImportError
If matplotlib is not installed.
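The documented ValueError conditions can be sketched as a small validation helper. The name check_inputs and its messages are hypothetical and only mirror the conditions listed above; the library's own checks may differ.

```python
import pandas as pd


def check_inputs(data, data_original, columns=None):
    """Hypothetical helper mirroring the documented ValueError conditions."""
    if data.empty or data_original.empty:
        raise ValueError("data and data_original must not be empty")
    # With columns=None, fall back to every column of the cleaned frame.
    selected = list(data.columns) if columns is None else list(columns)
    missing = [
        name
        for name in selected
        if name not in data.columns or name not in data_original.columns
    ]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    return selected


frame = pd.DataFrame({"temperature": [20.1, 19.8], "humidity": [60.0, 61.5]})
selected_columns = check_inputs(frame, frame)
```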
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from spotforecast2.preprocessing.outlier_plots import visualize_outliers_hist
>>>
>>> # Create sample data
>>> np.random.seed(42)
>>> data_original = pd.DataFrame({
...     'temperature': np.concatenate([
...         np.random.normal(20, 5, 100),
...         [50, 60, 70]  # outliers
...     ]),
...     'humidity': np.concatenate([
...         np.random.normal(60, 10, 100),
...         [95, 98, 99]  # outliers
...     ])
... })
>>> data_cleaned = data_original.copy()
>>>
>>> # Visualize outliers
>>> visualize_outliers_hist(
...     data_cleaned,
...     data_original,
...     contamination=0.03,
...     figsize=(12, 5),
...     alpha=0.7
... )
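To show what a stacked histogram of regular and outlier values looks like in plain matplotlib, here is a minimal sketch. It assumes the outliers have already been identified; the labels "Regular" and "Outliers" and the axis titles are illustrative choices, not taken from the library.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
regular = rng.normal(20, 5, 100)
outliers = np.array([50.0, 60.0, 70.0])  # assumed to be flagged already

fig, ax = plt.subplots(figsize=(12, 5))
# Passing a list of arrays with stacked=True draws one colored layer per group.
counts, bin_edges, _ = ax.hist(
    [regular, outliers],
    bins=50,
    stacked=True,
    label=["Regular", "Outliers"],
    alpha=0.7,
)
ax.set_xlabel("temperature")
ax.set_ylabel("Frequency")
ax.legend()
plt.close(fig)  # release the figure; call plt.show() instead when interactive
```

Keyword arguments such as alpha and bins go straight to hist, which is the same path the **kwargs parameter is documented to use.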
visualize_outliers_plotly_scatter
preprocessing.outlier_plots.visualize_outliers_plotly_scatter(
    data,
    data_original,
    columns=None,
    contamination=0.01,
    random_state=1234,
    **kwargs,
)
Visualize outliers in time series using Plotly scatter plots.
Creates an interactive time series plot for each specified column, showing regular data as a line and detected outliers as scatter points. Uses IsolationForest for outlier detection.
Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | pd.DataFrame | The DataFrame with cleaned data (outliers may be NaN). | required |
| data_original | pd.DataFrame | The original DataFrame before outlier detection. | required |
| columns | Optional[list[str]] | List of column names to visualize. If None, all columns are used. Default: None. | None |
| contamination | float | The estimated proportion of outliers in the dataset. Default: 0.01. | 0.01 |
| random_state | int | Random seed for reproducibility. Default: 1234. | 1234 |
| **kwargs | Any | Additional keyword arguments passed to go.Figure.update_layout() (e.g., template, height). | {} |
Returns
None
None. Displays Plotly figures.
Raises
ValueError
If data or data_original is empty, or if specified columns don’t exist.
ImportError
If plotly is not installed.
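An ImportError for a missing optional dependency is commonly raised through a guarded import. The helper below is a hypothetical sketch of that pattern using only the standard library; the function name require and the message wording are assumptions, not the library's API.

```python
import importlib


def require(module_name: str, purpose: str):
    """Hypothetical guard: import an optional dependency or fail clearly."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for {purpose}. "
            f"Install it with: pip install {module_name}"
        ) from exc


# A standard-library module stands in for an installed optional dependency.
json_module = require("json", "the demonstration")
```

A plotting function would call such a guard at the top of its body, for example require("plotly", "interactive outlier plots"), so the error surfaces only when the function is actually used.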
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from spotforecast2.preprocessing.outlier_plots import visualize_outliers_plotly_scatter
>>>
>>> # Create sample time series data
>>> np.random.seed(42)
>>> dates = pd.date_range('2024-01-01', periods=103, freq='h')
>>> data_original = pd.DataFrame({
...     'temperature': np.concatenate([
...         np.random.normal(20, 5, 100),
...         [50, 60, 70]  # outliers
...     ]),
...     'humidity': np.concatenate([
...         np.random.normal(60, 10, 100),
...         [95, 98, 99]  # outliers
...     ])
... }, index=dates)
>>> data_cleaned = data_original.copy()
>>>
>>> # Visualize outliers
>>> visualize_outliers_plotly_scatter(
...     data_cleaned,
...     data_original,
...     contamination=0.03,
...     template='plotly_white'
... )