preprocessing._binner

QuantileBinner class for binning data into quantile-based bins.

This module contains the QuantileBinner class, which computes quantile-based bin edges with numpy.percentile and assigns values to bins with numpy.searchsorted for fast lookup.

Classes

Name Description
QuantileBinner Bin data into quantile-based bins using numpy.percentile.

QuantileBinner

preprocessing._binner.QuantileBinner(
    n_bins,
    method='linear',
    subsample=200000,
    dtype=np.float64,
    random_state=789654,
)

Bin data into quantile-based bins using numpy.percentile.

This class is similar to scikit-learn's KBinsDiscretizer but is optimized for performance: bin assignment uses numpy.searchsorted. Bin intervals are half-open and follow the convention bins[i-1] <= x < bins[i]. Values outside the fitted range are clipped to the first or last bin.
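
The interval convention and the clipping behaviour can be illustrated with plain NumPy. This is a minimal sketch, not the library's actual code: the helper name assign_bins and the hand-picked edges are assumptions made for illustration.

```python
import numpy as np


def assign_bins(x, edges):
    """Hypothetical helper: map values to bins with edges[i-1] <= x < edges[i]."""
    x = np.asarray(x, dtype=float)
    # side="right" counts the edges <= x, so a value equal to an edge
    # falls into the bin that starts at that edge.
    idx = np.searchsorted(edges, x, side="right") - 1
    # Values below the first edge give -1 and values at or above the last
    # edge give len(edges) - 1; clip both into the valid range of bins.
    return np.clip(idx, 0, len(edges) - 2)


edges = np.array([1.0, 4.0, 7.0, 10.0])  # 4 edges define 3 bins
values = [0.0, 1.0, 3.9, 4.0, 7.0, 10.0, 100.0]
bins = assign_bins(values, edges)
print(bins)  # [0 0 0 1 2 2 2]
```

Note that 4.0 and 7.0 land in the bin that begins at that edge, while 0.0 and 100.0 are clipped to the first and last bin.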

Parameters

Name Type Description Default
n_bins int The number of quantile-based bins to create. Must be >= 2. required
method str The method used to compute quantiles, passed to numpy.percentile. Valid values: "inverse_cdf", "averaged_inverse_cdf", "closest_observation", "interpolated_inverse_cdf", "hazen", "weibull", "linear", "median_unbiased", "normal_unbiased". 'linear'
subsample int Maximum number of samples used to compute quantiles. If the dataset has more samples, a random subset is used. 200000
dtype type Data type of the returned bin indices. np.float64
random_state int Random seed used to draw the random subset. 789654
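
The choice of method changes how numpy.percentile interpolates between observations, which shifts the interior bin edges. The sketch below compares three of the methods with plain NumPy (the method keyword requires NumPy 1.22 or later); the sample data are made up for illustration.

```python
import numpy as np

X = np.arange(1, 11, dtype=float)    # 1.0, 2.0, ..., 10.0
quantiles = np.linspace(0, 100, 4)   # four edges for 3 bins

# Compute the edges once per interpolation method.
edges_by_method = {
    m: np.percentile(X, quantiles, method=m)
    for m in ("linear", "hazen", "weibull")
}

for m, e in edges_by_method.items():
    print(m, np.round(e, 2))
```

On this data the first interior edge is lowest for "weibull" and highest for "linear", so a value close to an edge can be assigned to a different bin depending on the method.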

Attributes

Name Type Description
n_bins int Number of bins to create.
method str Quantile computation method.
subsample int Maximum samples for quantile computation.
dtype type Data type for bin indices.
random_state int Random seed.
n_bins_ int Actual number of bins after fitting (may differ from n_bins if duplicate edges are found).
bin_edges_ np.ndarray Edges of the bins learned during fitting.
internal_edges_ np.ndarray Internal edges for optimized bin assignment.
intervals_ dict Mapping from bin index to (lower, upper) interval bounds.
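
One plausible way the fitted attributes relate to each other is sketched below with plain NumPy. The variable names mirror the attributes above, but the construction itself is an assumption and is not taken from the library's source.

```python
import numpy as np

bin_edges = np.array([1.0, 4.0, 7.0, 10.0])  # example edges, 3 bins
n_bins = len(bin_edges) - 1

# Dropping the outer edges means searchsorted can only return 0..n_bins-1,
# so out-of-range values land in the first or last bin without extra clipping.
internal_edges = bin_edges[1:-1]

# Map each bin index to its (lower, upper) bounds.
intervals = {
    i: (float(bin_edges[i]), float(bin_edges[i + 1])) for i in range(n_bins)
}

idx = np.searchsorted(internal_edges, [0.0, 5.5, 100.0], side="right")

print(internal_edges)  # [4. 7.]
print(intervals)       # {0: (1.0, 4.0), 1: (4.0, 7.0), 2: (7.0, 10.0)}
print(idx)             # [0 1 2]
```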

Examples

>>> import numpy as np
>>> from spotforecast2.preprocessing import QuantileBinner
>>>
>>> # Basic usage: create 3 quantile bins
>>> X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> binner = QuantileBinner(n_bins=3)
>>> _ = binner.fit(X)
>>> result = binner.transform(np.array([1.5, 5.5, 9.5]))
>>> print(result)
[0. 1. 2.]
>>>
>>> # Check bin intervals
>>> print(binner.n_bins_)
3
>>> assert len(binner.intervals_) == 3
>>>
>>> # Use fit_transform for one-step operation
>>> X2 = np.array([10, 20, 30, 40, 50])
>>> binner2 = QuantileBinner(n_bins=2)
>>> bins = binner2.fit_transform(X2)
>>> print(bins)
[0. 0. 1. 1. 1.]

Methods

Name Description
fit Learn bin edges based on quantiles from training data.
fit_transform Fit to data, then transform it.
get_params Get parameters of the quantile binner.
set_params Set parameters of the QuantileBinner.
transform Assign new data to learned bins.
fit
preprocessing._binner.QuantileBinner.fit(X, y=None)

Learn bin edges based on quantiles from training data.

Computes quantile-based bin edges using numpy.percentile. If the dataset contains more samples than subsample, a random subset is used. Duplicate edges (which can occur with repeated values) are removed automatically.
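
The fitting steps described above can be sketched with plain NumPy. The function quantile_edges is a hypothetical stand-in, and the use of numpy.random.default_rng and numpy.unique is an assumption about how subsampling and duplicate removal might be done, not a description of the library's internals.

```python
import numpy as np


def quantile_edges(X, n_bins, method="linear", subsample=200_000, random_state=789654):
    """Hypothetical stand-in for the edge computation done during fit."""
    X = np.asarray(X, dtype=float)
    if X.size == 0:
        raise ValueError("Input data X is empty.")
    if X.size > subsample:
        # Draw a reproducible random subset without replacement.
        rng = np.random.default_rng(random_state)
        X = rng.choice(X, size=subsample, replace=False)
    quantiles = np.linspace(0, 100, n_bins + 1)
    edges = np.percentile(X, quantiles, method=method)
    # np.unique sorts and drops repeated edges, which can reduce the bin count.
    return np.unique(edges)


edges = quantile_edges([1, 1, 1, 2, 2, 2, 3, 3, 3], n_bins=5)
print(edges)           # [1. 2. 3.]
print(len(edges) - 1)  # 2 bins remain out of the 5 requested
```

With heavily repeated values, several requested quantiles coincide, which is why the fitted number of bins can be smaller than n_bins.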

Parameters
Name Type Description Default
X np.ndarray Training data (1D numpy array) for computing quantiles. required
y object Ignored. None
Returns
Name Type Description
object Self for method chaining.
Raises
Name Type Description
ValueError If input data X is empty.
Examples
>>> import numpy as np
>>> from spotforecast2.preprocessing import QuantileBinner
>>>
>>> # Fit with basic data
>>> X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> binner = QuantileBinner(n_bins=3)
>>> _ = binner.fit(X)
>>> print(binner.n_bins_)
3
>>> print(len(binner.bin_edges_))
4
>>>
>>> # Fit with repeated values (may reduce number of bins)
>>> X_repeated = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
>>> binner2 = QuantileBinner(n_bins=5)
>>> _ = binner2.fit(X_repeated)
>>> # n_bins_ may be less than 5 due to duplicates
>>> assert binner2.n_bins_ <= 5
fit_transform
preprocessing._binner.QuantileBinner.fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
Name Type Description Default
X array-like of shape (n_samples, n_features) Input samples. required
y array-like of shape (n_samples,) or (n_samples, n_outputs) Target values (None for unsupervised transformations). None
**fit_params dict Additional fit parameters. {}
Returns
Name Type Description
X_new ndarray of shape (n_samples, n_features_new) Transformed array.
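
For a transformer that ignores y, fit_transform is conventionally equivalent to calling fit and then transform on the same data. The toy class below (TinyBinner, a hypothetical stand-in written with plain NumPy, not the library class) sketches that equivalence.

```python
import numpy as np


class TinyBinner:
    """Hypothetical, stripped-down binner used only to illustrate fit_transform."""

    def __init__(self, n_bins):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        quantiles = np.linspace(0, 100, self.n_bins + 1)
        edges = np.unique(np.percentile(np.asarray(X, dtype=float), quantiles))
        self.internal_edges_ = edges[1:-1]
        return self  # returning self allows chaining

    def transform(self, X, y=None):
        idx = np.searchsorted(self.internal_edges_, X, side="right")
        return idx.astype(np.float64)

    def fit_transform(self, X, y=None):
        # One-step form: learn the edges from X, then bin the same X.
        return self.fit(X).transform(X)


X = np.array([10, 20, 30, 40, 50])
one_step = TinyBinner(n_bins=2).fit_transform(X)
two_step = TinyBinner(n_bins=2).fit(X).transform(X)
print(one_step)  # [0. 0. 1. 1. 1.]
```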

get_params
preprocessing._binner.QuantileBinner.get_params(deep=True)

Get parameters of the quantile binner.

Returns
Name Type Description
dict[str, Any] Dictionary containing the n_bins, method, subsample, dtype, and random_state parameters.
Examples
>>> import numpy as np
>>> from spotforecast2.preprocessing import QuantileBinner
>>>
>>> binner = QuantileBinner(n_bins=5, method='median_unbiased', subsample=1000)
>>> params = binner.get_params()
>>> print(params['n_bins'])
5
>>> print(params['method'])
median_unbiased
>>> print(params['subsample'])
1000
set_params
preprocessing._binner.QuantileBinner.set_params(**params)

Set parameters of the QuantileBinner.

Parameters
Name Type Description Default
**params Any Parameter names and values to set as keyword arguments. {}
Returns
Name Type Description
self 'QuantileBinner' Returns the updated QuantileBinner instance.
Examples
>>> import numpy as np
>>> from spotforecast2.preprocessing import QuantileBinner
>>>
>>> binner = QuantileBinner(n_bins=3)
>>> print(binner.n_bins)
3
>>> _ = binner.set_params(n_bins=5, method='weibull')
>>> print(binner.n_bins)
5
>>> print(binner.method)
weibull
transform
preprocessing._binner.QuantileBinner.transform(X, y=None)

Assign new data to learned bins.

Uses numpy.searchsorted for efficient bin assignment. Values are assigned to bins following the convention: bins[i-1] <= x < bins[i]. Values outside the fitted range are clipped to the first or last bin.
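
numpy.searchsorted returns integer positions, so the float output seen in the examples comes from casting to the configured dtype. A minimal sketch, assuming the cast is a plain astype and using hand-picked rather than fitted edges:

```python
import numpy as np

internal_edges = np.array([4.0, 7.0])  # assumed internal edges for 3 bins
X_test = np.array([1.5, 5.5, 9.5])

raw = np.searchsorted(internal_edges, X_test, side="right")
as_float = raw.astype(np.float64)  # the documented default dtype
as_int = raw.astype(np.int64)      # what an integer dtype would yield

print(np.issubdtype(raw.dtype, np.integer))  # True
print(as_float)  # [0. 1. 2.]
print(as_int)    # [0 1 2]
```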

Parameters
Name Type Description Default
X np.ndarray Data to assign to bins (1D numpy array). required
y object Ignored. None
Returns
Name Type Description
np.ndarray Bin indices as a NumPy array with the dtype specified at construction.
Raises
Name Type Description
NotFittedError If fit() has not been called yet.
Examples
>>> import numpy as np
>>> from spotforecast2.preprocessing import QuantileBinner
>>>
>>> # Fit and transform
>>> X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> binner = QuantileBinner(n_bins=3)
>>> _ = binner.fit(X_train)
>>>
>>> X_test = np.array([1.5, 5.5, 9.5])
>>> result = binner.transform(X_test)
>>> print(result)
[0. 1. 2.]
>>>
>>> # Values outside range are clipped
>>> X_extreme = np.array([0, 100])
>>> result_extreme = binner.transform(X_extreme)
>>> print(result_extreme)  # Both clipped to valid bin indices
[0. 2.]