preprocessing.outlier_plots

Functions

| Name | Description |
| --- | --- |
| visualize_outliers_hist | Visualize outliers in DataFrame using stacked histograms. |
| visualize_outliers_plotly_scatter | Visualize outliers in time series using Plotly scatter plots. |

visualize_outliers_hist

preprocessing.outlier_plots.visualize_outliers_hist(
    data,
    data_original,
    columns=None,
    contamination=0.01,
    random_state=1234,
    figsize=(10, 5),
    bins=50,
    **kwargs,
)

Visualize outliers in DataFrame using stacked histograms.

Creates a histogram for each specified column, displaying both regular data and detected outliers in different colors. Uses IsolationForest for outlier detection.
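The idea described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the helper names flag_outliers and stacked_outlier_hist are hypothetical, and fitting one IsolationForest per column is an assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest


def flag_outliers(series, contamination=0.01, random_state=1234):
    """Hypothetical helper: True where IsolationForest labels a value as an outlier."""
    valid = series.dropna()
    model = IsolationForest(contamination=contamination, random_state=random_state)
    # fit_predict expects a 2-D array and returns -1 for outliers, 1 for inliers.
    labels = model.fit_predict(valid.to_numpy().reshape(-1, 1))
    mask = pd.Series(False, index=series.index)
    mask.loc[valid.index] = labels == -1
    return mask


def stacked_outlier_hist(series, mask, bins=50, figsize=(10, 5), **kwargs):
    """Hypothetical helper: stacked histogram of regular values and flagged outliers."""
    fig, ax = plt.subplots(figsize=figsize)
    ax.hist(
        [series[~mask].dropna().to_numpy(), series[mask].to_numpy()],
        bins=bins,
        stacked=True,  # both groups share the same bins and are drawn on top of each other
        label=["regular", "outliers"],
        **kwargs,
    )
    ax.set_title(str(series.name))
    ax.legend()
    return fig, ax


# Sample column: 100 regular readings plus three injected extremes.
rng = np.random.default_rng(42)
temperature = pd.Series(
    np.concatenate([rng.normal(20, 5, 100), [50, 60, 70]]), name="temperature"
)
mask = flag_outliers(temperature, contamination=0.03)
fig, ax = stacked_outlier_hist(temperature, mask, alpha=0.7)
```

Calling plt.show() afterwards displays the figure. Because contamination fixes the share of points that get flagged, a value that is too high will also mark ordinary readings as outliers.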

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | pd.DataFrame | The DataFrame with cleaned data (outliers may be NaN). | required |
| data_original | pd.DataFrame | The original DataFrame before outlier detection. | required |
| columns | Optional[list[str]] | List of column names to visualize. If None, all columns are used. | None |
| contamination | float | The estimated proportion of outliers in the dataset. | 0.01 |
| random_state | int | Random seed for reproducibility. | 1234 |
| figsize | tuple[int, int] | Figure size as (width, height). | (10, 5) |
| bins | int | Number of histogram bins. | 50 |
| **kwargs | Any | Additional keyword arguments passed to plt.hist() (e.g., color, alpha, edgecolor). | {} |

Returns

| Type | Description |
| --- | --- |
| None | Displays matplotlib figures. |

Raises

| Type | Description |
| --- | --- |
| ValueError | If data or data_original is empty, or if specified columns don’t exist. |
| ImportError | If matplotlib is not installed. |

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from spotforecast2.preprocessing.outlier_plots import visualize_outliers_hist
>>>
>>> # Create sample data
>>> np.random.seed(42)
>>> data_original = pd.DataFrame({
...     'temperature': np.concatenate([
...         np.random.normal(20, 5, 100),
...         [50, 60, 70]  # outliers
...     ]),
...     'humidity': np.concatenate([
...         np.random.normal(60, 10, 100),
...         [95, 98, 99]  # outliers
...     ])
... })
>>> data_cleaned = data_original.copy()
>>>
>>> # Visualize outliers
>>> visualize_outliers_hist(
...     data_cleaned,
...     data_original,
...     contamination=0.03,
...     figsize=(12, 5),
...     alpha=0.7
... )
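In the example above, data_cleaned is an unmodified copy of data_original. The data parameter is described as cleaned data in which outliers may be NaN, so a frame of that shape could be prepared along these lines. The sketch assumes the last three rows are the injected outliers and blanks them by position; a real pipeline would use its own detection step.

```python
import numpy as np
import pandas as pd

# Same layout as the example data: 100 regular rows plus three injected extremes.
rng = np.random.default_rng(42)
data_original = pd.DataFrame({
    "temperature": np.concatenate([rng.normal(20, 5, 100), [50, 60, 70]]),
    "humidity": np.concatenate([rng.normal(60, 10, 100), [95, 98, 99]]),
})

# Assumption: the outliers sit in the last three rows, so they are
# replaced with NaN by position rather than by a detection rule.
data_cleaned = data_original.copy()
data_cleaned.iloc[-3:] = np.nan

print(data_cleaned.isna().sum())
```

The two frames keep the same index and columns, so they can be passed together as data and data_original.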

visualize_outliers_plotly_scatter

preprocessing.outlier_plots.visualize_outliers_plotly_scatter(
    data,
    data_original,
    columns=None,
    contamination=0.01,
    random_state=1234,
    **kwargs,
)

Visualize outliers in time series using Plotly scatter plots.

Creates an interactive time series plot for each specified column, showing regular data as a line and detected outliers as scatter points. Uses IsolationForest for outlier detection.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | pd.DataFrame | The DataFrame with cleaned data (outliers may be NaN). | required |
| data_original | pd.DataFrame | The original DataFrame before outlier detection. | required |
| columns | Optional[list[str]] | List of column names to visualize. If None, all columns are used. | None |
| contamination | float | The estimated proportion of outliers in the dataset. | 0.01 |
| random_state | int | Random seed for reproducibility. | 1234 |
| **kwargs | Any | Additional keyword arguments passed to go.Figure.update_layout() (e.g., template, height). | {} |

Returns

| Type | Description |
| --- | --- |
| None | Displays Plotly figures. |

Raises

| Type | Description |
| --- | --- |
| ValueError | If data or data_original is empty, or if specified columns don’t exist. |
| ImportError | If plotly is not installed. |

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from spotforecast2.preprocessing.outlier_plots import visualize_outliers_plotly_scatter
>>>
>>> # Create sample time series data
>>> np.random.seed(42)
>>> dates = pd.date_range('2024-01-01', periods=103, freq='h')
>>> data_original = pd.DataFrame({
...     'temperature': np.concatenate([
...         np.random.normal(20, 5, 100),
...         [50, 60, 70]  # outliers
...     ]),
...     'humidity': np.concatenate([
...         np.random.normal(60, 10, 100),
...         [95, 98, 99]  # outliers
...     ])
... }, index=dates)
>>> data_cleaned = data_original.copy()
>>>
>>> # Visualize outliers
>>> visualize_outliers_plotly_scatter(
...     data_cleaned,
...     data_original,
...     contamination=0.03,
...     template='plotly_white'
... )