utils.stats

Functions

Name	Description
calculate_outliers	Calculate the number of outliers using the IQR method.
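compute_coefficients_table	Compute a coefficients table with zero-order, partial, and semipartial correlations, tolerance, and VIF.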
compute_standardized_betas	Computes standardized (beta) coefficients for a fitted statsmodels OLS model.
condition_index	Calculates the Condition Index for a DataFrame to assess multicollinearity.
cov_to_cor	Convert a covariance matrix to a correlation matrix.
fit_all_lm	Fit a linear regression model for all possible combinations of independent variables.
get_all_vars_from_formula	Utility function to extract variables from a formula.
get_combinations	Generates all possible combinations of two values from a list of values. Order is not important.
get_sample_size	Calculate sample size n for comparing two means.
normalize_X	Normalize array X to [0, 1] in each dimension.
partial_correlation	Calculate the partial correlation matrix for a given data set.
partial_correlation_test	The partial correlation coefficient between x and y given z.
plot_coeff_vs_pvals	Plot the coefficient estimates from fit_all_lm against the corresponding p-values.
plot_coeff_vs_pvals_by_included	Generates a panel of scatter plots with effect estimates of all possible models against p-values.
preprocess_df_for_ols	Preprocesses a DataFrame for fitting an OLS regression model using the specified target column and predictors.
vif	Calculates the Variance Inflation Factor (VIF) for each feature in a DataFrame.

calculate_outliers

utils.stats.calculate_outliers(series_or_df, irqmultiplier=1.5)

Calculate the number of outliers using the IQR method.

Accepts either a pandas Series or a pandas DataFrame. For a DataFrame, counts outliers across all numeric columns and returns the total count.

Parameters

Name	Type	Description	Default
series_or_df	Union[pd.Series, pd.DataFrame]	pd.Series or pd.DataFrame containing numeric data.	required
irqmultiplier	float	Multiplier for IQR to define fences. Defaults to 1.5.	`1.5`

Returns

Name	Type	Description
int	int	The number of outliers.

Examples

>>> import pandas as pd
>>> from spotoptim.utils.stats import calculate_outliers
>>> s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
>>> calculate_outliers(s)
1

>>> df = pd.DataFrame({
...     'a': [1, 2, 3, 100],
...     'b': [10, 12, 11, 10]
... })
>>> calculate_outliers(df)
1
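
The IQR rule used for counting can be sketched with the standard library alone. This is a minimal sketch, not the library's implementation: the helper name count_iqr_outliers is hypothetical, and it assumes linearly interpolated quartiles (the pandas default), which this page does not confirm.

```python
from statistics import quantiles


def count_iqr_outliers(values, multiplier=1.5):
    """Count values outside [Q1 - m*IQR, Q3 + m*IQR] (hypothetical helper)."""
    # method="inclusive" interpolates linearly between order statistics,
    # which matches the default behaviour of pandas.Series.quantile.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return sum(1 for v in values if v < lower or v > upper)


# Single series: only 100 lies beyond the upper fence.
series_count = count_iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# Several columns: sum the per-column counts.
columns = {"a": [1, 2, 3, 100], "b": [10, 12, 11, 10]}
frame_count = sum(count_iqr_outliers(col) for col in columns.values())
print(series_count, frame_count)
```

A larger multiplier widens the fences and therefore flags fewer points.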

compute_coefficients_table

utils.stats.compute_coefficients_table(model, X_encoded, y, vif_table=None)

Compute a coefficients table containing:

Variable name
Zero-order correlation
Partial correlation
Semipartial (part) correlation
Tolerance (1 / VIF)
VIF

Parameters

Name	Type	Description	Default
model	`statsmodels`.`regression`.`linear_model`.`RegressionResultsWrapper`	A fitted OLS model from statsmodels.	required
X_encoded	pd.DataFrame	The DataFrame used to fit the model, including ‘const’.	required
y	pd.Series	Dependent variable used in fitting the model.	required
vif_table	pd.DataFrame	A DataFrame with columns [“feature”, “VIF”] for each column in X_encoded (typically computed with statsmodels.stats.outliers_influence.variance_inflation_factor). Default is None.	`None`

Returns

Name	Type	Description
	pd.DataFrame	pd.DataFrame with columns: - “Variable” - “Zero-Order r” - “Partial r” - “Semipartial r” - “Tolerance” - “VIF”

Examples

>>> from spotpython.utils.stats import compute_coefficients_table
>>> import pandas as pd
>>> import statsmodels.api as sm
>>> data = pd.DataFrame({
...     'x1': [1, 2, 3, 4, 5],
...     'x2': [2, 4, 6, 8, 10],
...     'x3': [1, 3, 5, 7, 9]
... })
>>> y = pd.Series([1, 2, 3, 4, 5])
>>> X = sm.add_constant(data)
>>> model = sm.OLS(y, X).fit()
>>> vif_table = pd.DataFrame({
...     'feature': ['x1', 'x2', 'x3'],
...     'VIF': [1, 2, 3]
... })
>>> compute_coefficients_table(model, data, y, vif_table)
   Variable  Zero-Order r  Partial r  Semipartial r  Tolerance  VIF
0       x1           0.0        0.0            0.0        1.0  1.0
1       x2           0.0        0.0            0.0        0.5  2.0
2       x3           0.0        0.0            0.0   0.333333  3.0

compute_standardized_betas

utils.stats.compute_standardized_betas(model, X_encoded, y)

Computes standardized (beta) coefficients for a fitted statsmodels OLS model.

Parameters

Name	Type	Description	Default
model	`statsmodels`.`regression`.`linear_model`.`RegressionResultsWrapper`	The fitted OLS model.	required
X_encoded	pandas.DataFrame	The design matrix of independent variables.	required
y	pandas.Series	The dependent variable.	required

Returns

Name	Type	Description
	pd.DataFrame	pandas.DataFrame: A DataFrame containing the standardized beta coefficients.

Examples

>>> from spotpython.utils.stats import compute_standardized_betas
>>> import pandas as pd
>>> import statsmodels.api as sm
>>> data = pd.DataFrame({
...     'x1': [1, 2, 3, 4, 5],
...     'x2': [2, 4, 6, 8, 10],
...     'x3': [1, 3, 5, 7, 9]
... })
>>> y = pd.Series([1, 2, 3, 4, 5])
>>> X = sm.add_constant(data)
>>> model = sm.OLS(y, X).fit()
>>> compute_standardized_betas(model, data, y)
   Variable  Standardized Beta
0     const           0.000000
1       x1           0.000000
2       x2           0.000000
3       x3           0.000000
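
As a rough illustration of what a standardized coefficient is, the sketch below rescales each raw OLS slope by sd(x_j) / sd(y) and cross-checks the result against a fit on z-scored data. It uses NumPy's least-squares solver rather than statsmodels, uses made-up non-collinear data, and drops the intercept; how compute_standardized_betas treats the constant term is not stated here, so treat those choices as assumptions.

```python
import numpy as np

# Illustrative data (not from the documentation): two non-collinear predictors.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Standardized beta: raw slope times sd(x_j) / sd(y); the intercept is skipped.
std_betas = coef[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Cross-check: the same numbers are the slopes of a fit on z-scored variables.
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yz = (y - y.mean()) / y.std(ddof=1)
coef_z, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
print(std_betas, coef_z)
```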

condition_index

utils.stats.condition_index(df)

Calculates the Condition Index for a DataFrame to assess multicollinearity.

The Condition Index is computed based on the eigenvalues of the covariance matrix of the standardized data. High condition indices suggest potential multicollinearity issues.

Parameters

Name	Type	Description	Default
df	pandas.DataFrame	A DataFrame containing the independent variables.	required

Returns

Name	Type	Description
	pd.DataFrame	pandas.DataFrame: A DataFrame with the following columns: - ‘Index’: The index of the eigenvalue. - ‘Eigenvalue’: The eigenvalue of the covariance matrix. - ‘Condition Index’: The Condition Index for the eigenvalue.

Examples

>>> from spotpython.utils.stats import condition_index
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'x1': [1, 2, 3, 4, 5],
...     'x2': [2, 4, 6, 8, 10],
...     'x3': [1, 3, 5, 7, 9]
... })
>>> condition_index(data)
   Index  Eigenvalue  Condition Index
0      0    1.140000         1.000000
1      1    0.000000              inf
2      2    0.002857        20.000000
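
The computation described above (eigenvalues of the covariance matrix of the standardized data) can be sketched as follows. The formula sqrt(largest eigenvalue / eigenvalue) is the usual definition of a condition index and is assumed here; the data are illustrative and the library's ordering of rows may differ.

```python
import numpy as np

# Illustrative predictors (not from the documentation).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])

# Standardize each column, then take the covariance matrix of the result.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov = np.cov(Z, rowvar=False)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order;
# reverse them so the largest comes first.
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Condition index: square root of (largest eigenvalue / each eigenvalue).
cond_index = np.sqrt(eigenvalues[0] / eigenvalues)
print(eigenvalues, cond_index)
```

An eigenvalue of exactly zero, as produced by perfectly collinear columns, yields an infinite condition index under this definition.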

cov_to_cor

utils.stats.cov_to_cor(covariance_matrix)

Convert a covariance matrix to a correlation matrix.

Parameters

Name	Type	Description	Default
covariance_matrix	numpy.ndarray	A square matrix of covariance values.	required

Returns

Name	Type	Description
	np.ndarray	numpy.ndarray: A corresponding square matrix of correlation coefficients.

Examples

>>> from spotpython.utils.stats import cov_to_cor
>>> import numpy as np
>>> cov_matrix = np.array([[1, 0.8], [0.8, 1]])
>>> cov_to_cor(cov_matrix)
array([[1. , 0.8],
       [0.8, 1. ]])
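
The example above has unit variances, so the output equals the input. A minimal NumPy sketch of the conversion itself, with non-unit variances, divides each covariance by the product of the two standard deviations; it illustrates the formula and is not taken from the library source.

```python
import numpy as np

cov = np.array([[4.0, 2.0], [2.0, 9.0]])

# Standard deviations are the square roots of the diagonal variances.
sd = np.sqrt(np.diag(cov))

# Divide each entry cov[i, j] by sd[i] * sd[j].
cor = cov / np.outer(sd, sd)
print(cor)
```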

fit_all_lm

utils.stats.fit_all_lm(basic, xlist, data, remove_na=True)

Fit a linear regression model for all possible combinations of independent variables.

Parameters

Name	Type	Description	Default
basic	str	The basic model formula.	required
xlist	list	A list of independent variables.	required
data	pandas.DataFrame	The data frame containing the variables.	required
remove_na	bool	Whether to remove missing values from the data frame.	`True`

Returns

Name	Type	Description
dict	dict	A dictionary containing the estimated coefficients, confidence intervals, p-values, AIC values, sample size, and the basic model formula.

Examples

>>> from spotpython.utils.stats import fit_all_lm
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'y': [1, 2, 3],
...     'x1': [4, 5, 6],
...     'x2': [7, 8, 9]
... })
>>> fit_all_lm("y ~ x1", ["x2"], data)
{'estimate':   variables  estimate  conf_low  conf_high    p         aic  n
0    basic  1.000000  1.000000   1.000000  0.0  0.000000  3
1       x2  1.000000  1.000000   1.000000  0.0  0.000000  3}
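
To make "all possible combinations of independent variables" concrete, the sketch below enumerates the candidate formulas: the basic formula plus one formula per non-empty subset of xlist. The helper enumerate_formulas is hypothetical, and whether fit_all_lm builds its formulas this way is an assumption.

```python
from itertools import combinations


def enumerate_formulas(basic, xlist):
    """Return the basic formula plus one formula per non-empty subset of xlist."""
    formulas = [basic]
    for k in range(1, len(xlist) + 1):
        for subset in combinations(xlist, k):
            formulas.append(basic + " + " + " + ".join(subset))
    return formulas


formulas = enumerate_formulas("y ~ x1", ["x2", "x3"])
print(formulas)
```

With k additional variables this gives 2**k formulas, so the number of fitted models grows quickly as xlist gets longer.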

get_all_vars_from_formula

utils.stats.get_all_vars_from_formula(formula)

Utility function to extract variables from a formula.

Parameters

Name	Type	Description	Default
formula	str	A formula.	required

Returns

Name	Type	Description
list	list	A list of variables.

Examples

>>> from spotpython.utils.stats import get_all_vars_from_formula
>>> get_all_vars_from_formula("y ~ x1 + x2")
['y', 'x1', 'x2']
>>> get_all_vars_from_formula("y ~ ")
['y']

get_combinations

utils.stats.get_combinations(ind_list, type='indices')

Generates all possible combinations of two values from a list of values. Order is not important.

Parameters

Name	Type	Description	Default
ind_list	list	A list of target indices.	required
type	str	The type of output, either ‘values’ or ‘indices’. Default is ‘indices’.	`'indices'`

Returns

Name	Type	Description
list	list	A list of tuples, where each tuple contains a combination of two values. The order of the values within a tuple is not important, and each combination appears only once.

Examples

>>> from spotoptim.utils import get_combinations
>>> ind_list = [0, 10, 20, 30]
>>> get_combinations(ind_list, type='indices')
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
>>> get_combinations(ind_list, type='values')
[(0, 10), (0, 20), (0, 30), (10, 20), (10, 30), (20, 30)]
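
The two output types correspond to unordered pairs of positions and unordered pairs of values. A minimal standard-library sketch of the same idea uses itertools.combinations; it is an equivalent illustration, not the library's code.

```python
from itertools import combinations

ind_list = [0, 10, 20, 30]

# Unordered pairs of positions 0 .. len(ind_list) - 1.
index_pairs = list(combinations(range(len(ind_list)), 2))

# Unordered pairs of the values themselves.
value_pairs = list(combinations(ind_list, 2))

print(index_pairs)
print(value_pairs)
```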

get_sample_size

utils.stats.get_sample_size(alpha, beta, sigma, delta)

Calculate sample size n for comparing two means.

Formula: n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2

This corresponds to a two-sided test with equal variance.

Parameters

Name	Type	Description	Default
alpha	float	Significance level (Type I error probability).	required
beta	float	Type II error probability (1 - Power).	required
sigma	float	Standard deviation of the population (assumed equal for both groups).	required
delta	float	Minimum detectable difference (effect size to detect).	required

Returns

Name	Type	Description
float	float	The required sample size n per group.

Examples

>>> from spotoptim.utils.stats import get_sample_size
>>> alpha = 0.05
>>> beta = 0.2  # Power = 80%
>>> sigma = 1.0
>>> delta = 1.0
>>> n = get_sample_size(alpha, beta, sigma, delta)
>>> print(f"{n:.4f}")
15.6978
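
The formula can be evaluated directly with the standard library, which is a convenient way to check a result by hand. The function name sample_size_two_means below is hypothetical; the sketch simply transcribes the formula stated above.

```python
from statistics import NormalDist


def sample_size_two_means(alpha, beta, sigma, delta):
    """n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2, per group."""
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    return 2 * sigma**2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / delta**2


n = sample_size_two_means(alpha=0.05, beta=0.2, sigma=1.0, delta=1.0)
print(f"{n:.4f}")
```

Because delta enters squared in the denominator, halving the detectable difference quadruples the required sample size. In practice the result is rounded up to the next integer.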

normalize_X

utils.stats.normalize_X(X, eps=1e-12)

Normalize array X to [0, 1] in each dimension.

For dimensions where all values are identical (X_max == X_min), the normalized value is set to 0.5 to avoid division by zero.

Parameters

Name	Type	Description	Default
X	np.ndarray	Input array of shape (n, d) to normalize.	required
eps	float	Small value to avoid division by zero when range is very small. Defaults to 1e-12.	`1e-12`

Returns

Name	Type	Description
	np.ndarray	np.ndarray: Normalized array with values in [0, 1] for each dimension. For constant dimensions, values are set to 0.5.

Examples

>>> import numpy as np
>>> from spotoptim.utils.stats import normalize_X
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> normalize_X(X)
array([[0. , 0. ],
       [0.5, 0.5],
       [1. , 1. ]])

>>> # Constant dimension example
>>> X_const = np.array([[1, 5], [1, 5], [1, 5]])
>>> normalize_X(X_const)
array([[0.5, 0.5],
       [0.5, 0.5],
       [0.5, 0.5]])
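
The behaviour described above, including the 0.5 fallback for constant columns, can be sketched in a few lines of NumPy. The helper min_max_normalize is hypothetical, and the exact way the library applies eps (here: a column counts as constant when its range is below eps) is an assumption.

```python
import numpy as np


def min_max_normalize(X, eps=1e-12):
    """Scale each column to [0, 1]; constant columns become 0.5 (sketch)."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = x_max - x_min
    constant = span < eps
    # Replace near-zero spans by 1 so the division below is always safe.
    safe_span = np.where(constant, 1.0, span)
    scaled = (X - x_min) / safe_span
    # Constant columns are mapped to the midpoint 0.5.
    return np.where(constant, 0.5, scaled)


print(min_max_normalize([[1, 2], [3, 4], [5, 6]]))
print(min_max_normalize([[1, 5], [1, 5], [1, 5]]))
```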

partial_correlation

utils.stats.partial_correlation(x, method='pearson')

Calculate the partial correlation matrix for a given data set.

Parameters

Name	Type	Description	Default
x	pandas.DataFrame or numpy.ndarray	The data matrix with variables as columns.	required
method	str	Correlation method, one of ‘pearson’, ‘kendall’, or ‘spearman’.	`'pearson'`

Returns

Name	Type	Description
dict	dict	A dictionary containing the partial correlation estimates, p-values, statistics, sample size (n), number of given parameters (gp), and method used.

Raises

Name	Type	Description
	ValueError	If input is not a matrix-like structure or not numeric.

References

Kim, S. ppcor: An R package for a fast calculation to semi-partial correlation coefficients. Commun Stat Appl Methods 22, 6 (Nov 2015), 665–674.

Examples

>>> from spotpython.utils.stats import partial_correlation
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'A': [1, 2, 3],
...     'B': [4, 5, 6],
...     'C': [7, 8, 9]
... })
>>> partial_correlation(data, method='pearson')
{'estimate': array([[ 1. , -1. ,  1. ],
                    [-1. ,  1. , -1. ],
                    [ 1. , -1. ,  1. ]]),
'p_value': array([[0. , 0. , 0. ],
                  [0. , 0. , 0. ],
                  [0. , 0. , 0. ]]), ...
}
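
For the Pearson case, a partial correlation matrix can be obtained from the inverse of the correlation matrix. The sketch below shows that route on simulated data and cross-checks one entry against the first-order recursion formula. It covers only the estimates, not p-values or test statistics, and is an illustration rather than the library's implementation.

```python
import numpy as np

# Simulated data (illustrative): 50 observations of 3 variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
data[:, 1] += 0.5 * data[:, 0]  # induce some correlation between columns 0 and 1

# Invert the correlation matrix to obtain the precision matrix P.
r = np.corrcoef(data, rowvar=False)
precision = np.linalg.inv(r)

# Partial correlation: -P[i, j] / sqrt(P[i, i] * P[j, j]), with a unit diagonal.
scale = np.sqrt(np.diag(precision))
partial = -precision / np.outer(scale, scale)
np.fill_diagonal(partial, 1.0)

# Cross-check entry (0, 1) with the first-order recursion formula.
r01_2 = (r[0, 1] - r[0, 2] * r[1, 2]) / np.sqrt(
    (1 - r[0, 2] ** 2) * (1 - r[1, 2] ** 2)
)
print(partial[0, 1], r01_2)
```

The inversion fails or becomes numerically unstable when columns are perfectly collinear, as in the three-row example above.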

partial_correlation_test

utils.stats.partial_correlation_test(x, y, z, method='pearson')

The partial correlation coefficient between x and y given z. x and y should be arrays (vectors) of the same length, and z should be a data frame (matrix).

Parameters

Name	Type	Description	Default
x	array-like	The first variable as a 1-dimensional array or list.	required
y	array-like	The second variable as a 1-dimensional array or list.	required
z	pandas.DataFrame	A data frame containing other conditional variables.	required
method	str	Correlation method, one of ‘pearson’, ‘kendall’, or ‘spearman’.	`'pearson'`

Returns

Name	Type	Description
dict	dict	A dictionary with the partial correlation estimate, p-value, statistic, sample size (n), number of given parameters (gp), and method used.

References

Kim, S. ppcor: An R package for a fast calculation to semi-partial correlation coefficients. Commun Stat Appl Methods 22, 6 (Nov 2015), 665–674.

Examples

>>> from spotpython.utils.stats import partial_correlation_test
>>> import pandas as pd
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> z = pd.DataFrame({'C': [7, 8, 9]})
>>> partial_correlation_test(x, y, z)
{'estimate': -1.0, 'p_value': 0.0, 'statistic': -inf, 'n': 3, 'gp': 1, 'method': 'pearson'}
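
One way to read "the partial correlation between x and y given z" is as the Pearson correlation of the residuals left after regressing x and y on z. The sketch below illustrates that interpretation with NumPy on simulated data; the residuals helper is hypothetical, and the library may compute the estimate differently.

```python
import numpy as np

# Simulated data (illustrative): x and y both depend on a single control z.
rng = np.random.default_rng(1)
z = rng.normal(size=(40, 1))
x = 0.8 * z[:, 0] + rng.normal(size=40)
y = -0.5 * z[:, 0] + rng.normal(size=40)


def residuals(target, controls):
    """Residuals of an OLS fit of target on controls plus an intercept."""
    design = np.column_stack([np.ones(len(target)), controls])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


# Partial correlation = Pearson correlation of the two residual vectors.
estimate = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(estimate)
```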

plot_coeff_vs_pvals

utils.stats.plot_coeff_vs_pvals(
    data,
    xlabels=None,
    xlim=(0, 1),
    xlab='p-value',
    ylim=None,
    ylab=None,
    xscale_log=True,
    yscale_log=False,
    title=None,
    show=True,
    y_scaler=1.1,
)

Plot the coefficient estimates from fit_all_lm against the corresponding p-values.

Parameters

Name	Type	Description	Default
data	dict	A dictionary containing the estimated coefficients, p-values, and other information. Generated by the fit_all_lm function.	required
xlabels	list	A list of x-axis labels.	`None`
xlim	tuple	A tuple of the x-axis limits.	`(0, 1)`
xlab	str	The x-axis label.	`'p-value'`
ylim	tuple	A tuple of the y-axis limits.	`None`
ylab	str	The y-axis label.	`None`
xscale_log	bool	Whether to use a log scale on the x-axis.	`True`
yscale_log	bool	Whether to use a log scale on the y-axis.	`False`
title	str	The plot title.	`None`
show	bool	Whether to display the plot.	`True`
y_scaler	float	A scaling factor for the y-axis limits. Default is 1.1, i.e., 10% more than the maximum value.	`1.1`

Returns

Name	Type	Description
	None	None

Notes

Based on the R package ‘allestimates’ by Zhiqiang Wang, see https://cran.r-project.org/package=allestimates

References

Wang, Z. (2007). Two Postestimation Commands for Assessing Confounding Effects in Epidemiological Studies. The Stata Journal, 7(2), 183-196. https://doi.org/10.1177/1536867X0700700203

Examples

>>> from spotpython.utils.stats import plot_coeff_vs_pvals, fit_all_lm
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'y': [1, 2, 3],
...     'x1': [4, 5, 6],
...     'x2': [7, 8, 9]
... })
>>> estimates = fit_all_lm("y ~ x1", ["x2"], data)
>>> plot_coeff_vs_pvals(estimates)

plot_coeff_vs_pvals_by_included

utils.stats.plot_coeff_vs_pvals_by_included(
    data,
    xlabels=None,
    xlim=(0, 1),
    xlab='P value',
    ylim=None,
    ylab=None,
    yscale_log=False,
    title=None,
    grid=True,
    ncol=2,
    show=True,
    y_scaler=1.1,
)

Generates a panel of scatter plots with effect estimates of all possible models against p-values. Uses a dictionary generated by the fit_all_lm function. Each plot includes effect estimates from all models that include a specific variable.

Parameters

Name	Type	Description	Default
data	dict	A dictionary, generated by the fit_all_lm function, containing the following keys: - estimate (pd.DataFrame): A DataFrame containing the estimates. - xlist (list): A list of variables. - fun (str): The function name. - family (str): The family of the model.	required
xlabels	list	A list of x-axis labels.	`None`
xlim	tuple	The x-axis limits.	`(0, 1)`
xlab	str	The x-axis label.	`'P value'`
ylim	tuple	The y-axis limits.	`None`
ylab	str	The y-axis label.	`None`
yscale_log	bool	Whether to scale y-axis to log10. Default is False.	`False`
title	str	The title of the plot.	`None`
grid	bool	Whether to display gridlines. Default is True.	`True`
ncol	int	Number of columns in the plot grid. Default is 2.	`2`
show	bool	Whether to display the plot. Default is True.	`True`
y_scaler	float	A scaling factor for the y-axis limits. Default is 1.1, i.e., 10% more than the maximum value.	`1.1`

Returns

Name	Type	Description
	None	None

Notes

Based on the R package ‘allestimates’ by Zhiqiang Wang, see https://cran.r-project.org/package=allestimates

References

Wang, Z. (2007). Two Postestimation Commands for Assessing Confounding Effects in Epidemiological Studies. The Stata Journal, 7(2), 183-196. https://doi.org/10.1177/1536867X0700700203

Examples

>>> from spotpython.utils.stats import plot_coeff_vs_pvals_by_included
>>> import pandas as pd
>>> data = {
...     "estimate": pd.DataFrame({
...         "variables": ["Crude", "AL", "AM", "AN", "AO"],
...         "estimate": [0.5, 0.6, 0.7, 0.8, 0.9],
...         "conf_low": [0.1, 0.2, 0.3, 0.4, 0.5],
...         "conf_high": [0.9, 1.0, 1.1, 1.2, 1.3],
...         "p": [0.01, 0.02, 0.03, 0.04, 0.05],
...         "aic": [100, 200, 300, 400, 500],
...         "n": [10, 20, 30, 40, 50]
...     }),
...     "xlist": ["AL", "AM", "AN", "AO"],
...     "fun": "all_lm"
... }
>>> plot_coeff_vs_pvals_by_included(data)

preprocess_df_for_ols

utils.stats.preprocess_df_for_ols(df, independent_var_columns, target_col)

Preprocesses a DataFrame for fitting an OLS regression model using the specified target column and predictors.

Parameters

Name	Type	Description	Default
df	pd.DataFrame	Input DataFrame containing the data.	required
independent_var_columns	list of str	List of names for predictor columns.	required
target_col	str	Name of the target/dependent variable column.	required

Returns

Name	Type	Description
X_encoded	pd.DataFrame	Encoded predictors with a constant term.
y	pd.Series	Target variable.

vif

utils.stats.vif(X, sorted=True)

Calculates the Variance Inflation Factor (VIF) for each feature in a DataFrame.

VIF measures the multicollinearity among independent variables within a regression model. High VIF values indicate high multicollinearity, which can cause issues with model interpretation and stability.

Parameters

Name	Type	Description	Default
X	pandas.DataFrame	A DataFrame containing the independent variables.	required
sorted	bool	Whether to sort the output DataFrame by VIF values.	`True`

Returns

Name	Type	Description
	pd.DataFrame	pandas.DataFrame: A DataFrame with two columns: - “feature”: The name of the feature. - “VIF”: The Variance Inflation Factor for the feature.

Examples

>>> from spotpython.utils.stats import vif
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'x1': [1, 2, 3, 4, 5],
...     'x2': [2, 4, 6, 8, 10],
...     'x3': [1, 3, 5, 7, 9]
... })
>>> vif(data)
   feature          VIF
0      x1  1260.000000
1      x2     0.000000
2      x3   630.000000
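
The definition behind the table, VIF_j = 1 / (1 - R_j^2) with R_j^2 taken from regressing feature j on the remaining features, can be sketched with NumPy. The helper vif_by_regression is hypothetical, the data are simulated, and each auxiliary regression includes an intercept; whether the library's vif does the same is an assumption.

```python
import numpy as np

# Simulated features (illustrative): the third column partly depends on the first.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
X[:, 2] += 0.7 * X[:, 0]


def vif_by_regression(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    n, d = X.shape
    out = np.empty(d)
    for j in range(d):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


vifs = vif_by_regression(X)

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
vifs_inv = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
print(vifs, vifs_inv)
```

With perfectly collinear columns, as in the example above, R_j^2 equals 1 and the VIF is undefined, so printed values in that situation reflect floating-point noise rather than meaningful estimates.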