utils.stats

Functions

Name Description
calculate_outliers Calculate the number of outliers using the IQR method.
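compute_coefficients_table Compute a coefficients table with zero-order, partial, and semipartial correlations, tolerance, and VIF.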
compute_standardized_betas Computes standardized (beta) coefficients for a fitted statsmodels OLS model.
condition_index Calculates the Condition Index for a DataFrame to assess multicollinearity.
cov_to_cor Convert a covariance matrix to a correlation matrix.
fit_all_lm Fit a linear regression model for all possible combinations of independent variables.
get_all_vars_from_formula Utility function to extract variables from a formula.
get_combinations Generates all possible combinations of two values from a list of values. Order is not important.
get_sample_size Calculate sample size n for comparing two means.
normalize_X Normalize array X to [0, 1] in each dimension.
partial_correlation Calculate the partial correlation matrix for a given data set.
partial_correlation_test The partial correlation coefficient between x and y given z.
plot_coeff_vs_pvals Plot the coefficient estimates from fit_all_lm against the corresponding p-values.
plot_coeff_vs_pvals_by_included Generates a panel of scatter plots with effect estimates of all possible models against p-values.
preprocess_df_for_ols Preprocesses a DataFrame for fitting an OLS regression model using the specified target column and predictors.
vif Calculates the Variance Inflation Factor (VIF) for each feature in a DataFrame.

calculate_outliers

utils.stats.calculate_outliers(series_or_df, irqmultiplier=1.5)

Calculate the number of outliers using the IQR method.

Accepts either a pandas Series or a pandas DataFrame. For a DataFrame, counts outliers across all numeric columns and returns the total count.

Parameters

Name Type Description Default
series_or_df Union[pd.Series, pd.DataFrame] pd.Series or pd.DataFrame containing numeric data. required
irqmultiplier float Multiplier for IQR to define fences. Defaults to 1.5. 1.5

Returns

Name Type Description
int int The number of outliers.

Examples

>>> import pandas as pd
>>> from spotoptim.utils.stats import calculate_outliers
>>> s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
>>> calculate_outliers(s)
1
>>> df = pd.DataFrame({
...     'a': [1, 2, 3, 100],
...     'b': [10, 12, 11, 10]
... })
>>> calculate_outliers(df)
1
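
The IQR rule described above can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the helper name count_iqr_outliers is hypothetical, and it assumes linearly interpolated quartiles and fences at Q1 - m * IQR and Q3 + m * IQR, where m plays the role of irqmultiplier.

```python
import numpy as np


def count_iqr_outliers(values, multiplier=1.5):
    """Count values outside the IQR fences (hypothetical helper, not the library code)."""
    values = np.asarray(values, dtype=float)
    # Quartiles with NumPy's default linear interpolation.
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # A value is an outlier if it falls strictly outside the fences.
    return int(np.sum((values < lower) | (values > upper)))


n_outliers = count_iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
```

For a DataFrame, the same count would be summed over the numeric columns.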

compute_coefficients_table

utils.stats.compute_coefficients_table(model, X_encoded, y, vif_table=None)

Compute a coefficients table containing

  1. Variable name
  2. Zero-order correlation
  3. Partial correlation
  4. Semipartial (part) correlation
  5. Tolerance (1 / VIF)
  6. VIF

Parameters

Name Type Description Default
model statsmodels.regression.linear_model.RegressionResultsWrapper A fitted OLS model from statsmodels. required
X_encoded pd.DataFrame The DataFrame used to fit the model, including ‘const’. required
y pd.Series Dependent variable used in fitting the model. required
vif_table pd.DataFrame A DataFrame with columns [“feature”, “VIF”] holding one row for each column in X_encoded (typically built with statsmodels.stats.outliers_influence.variance_inflation_factor). Default is None. None

Returns

Name Type Description
pd.DataFrame pd.DataFrame with columns: - “Variable” - “Zero-Order r” - “Partial r” - “Semipartial r” - “Tolerance” - “VIF”

Examples

>>> from spotpython.utils.stats import compute_coefficients_table
>>> import pandas as pd
>>> import statsmodels.api as sm
>>> data = pd.DataFrame({
...     'x1': [1, 2, 3, 4, 5],
...     'x2': [2, 4, 6, 8, 10],
...     'x3': [1, 3, 5, 7, 9]
... })
>>> y = pd.Series([1, 2, 3, 4, 5])
>>> X = sm.add_constant(data)
>>> model = sm.OLS(y, X).fit()
>>> vif_table = pd.DataFrame({
...     'feature': ['x1', 'x2', 'x3'],
...     'VIF': [1, 2, 3]
... })
>>> compute_coefficients_table(model, data, y, vif_table)
  Variable  Zero-Order r  Partial r  Semipartial r  Tolerance  VIF
0       x1           0.0        0.0            0.0   1.000000  1.0
1       x2           0.0        0.0            0.0   0.500000  2.0
2       x3           0.0        0.0            0.0   0.333333  3.0

compute_standardized_betas

utils.stats.compute_standardized_betas(model, X_encoded, y)

Computes standardized (beta) coefficients for a fitted statsmodels OLS model.

Parameters

Name Type Description Default
model statsmodels.regression.linear_model.RegressionResultsWrapper The fitted OLS model. required
X_encoded pandas.DataFrame The design matrix of independent variables. required
y pandas.Series The dependent variable. required

Returns

Name Type Description
pd.DataFrame pandas.DataFrame: A DataFrame containing the standardized beta coefficients.

Examples

>>> from spotpython.utils.stats import compute_standardized_betas
>>> import pandas as pd
>>> import statsmodels.api as sm
>>> data = pd.DataFrame({
...     'x1': [1, 2, 3, 4, 5],
...     'x2': [2, 4, 6, 8, 10],
...     'x3': [1, 3, 5, 7, 9]
... })
>>> y = pd.Series([1, 2, 3, 4, 5])
>>> X = sm.add_constant(data)
>>> model = sm.OLS(y, X).fit()
>>> compute_standardized_betas(model, data, y)
  Variable  Standardized Beta
0    const           0.000000
1       x1           0.000000
2       x2           0.000000
3       x3           0.000000
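
A standardized coefficient is commonly defined as the raw slope rescaled by the ratio of the predictor's standard deviation to the response's standard deviation. The sketch below illustrates that textbook definition with NumPy on synthetic data. Whether compute_standardized_betas uses exactly this route is an assumption, and all variable names are illustrative.

```python
import numpy as np

# Synthetic data: two predictors on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) * np.array([1.0, 10.0])
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
raw_slopes = coef[1:]

# Standardized beta = raw slope * sd(x_j) / sd(y).
std_betas = raw_slopes * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Cross-check: regressing z-scored y on z-scored X gives the same values.
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yz = (y - y.mean()) / y.std(ddof=1)
betas_z, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
```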

condition_index

utils.stats.condition_index(df)

Calculates the Condition Index for a DataFrame to assess multicollinearity.

The Condition Index is computed based on the eigenvalues of the covariance matrix of the standardized data. High condition indices suggest potential multicollinearity issues.

Parameters

Name Type Description Default
df pandas.DataFrame A DataFrame containing the independent variables. required

Returns

Name Type Description
pd.DataFrame pandas.DataFrame: A DataFrame with the following columns: - ‘Index’: The index of the eigenvalue. - ‘Eigenvalue’: The eigenvalue of the covariance matrix. - ‘Condition Index’: The Condition Index for the eigenvalue.

Examples

>>> from spotpython.utils.stats import condition_index
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'x1': [1, 2, 3, 4, 5],
...     'x2': [2, 4, 6, 8, 10],
...     'x3': [1, 3, 5, 7, 9]
... })
>>> condition_index(data)
   Index  Eigenvalue  Condition Index
0      0    1.140000         1.000000
1      1    0.000000              inf
2      2    0.002857        20.000000
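
The eigenvalue-based computation described above can be sketched as follows. This is an illustration under stated assumptions: it takes the condition index of eigenvalue i to be sqrt(lambda_max / lambda_i), which is consistent with the example output above, and the helper name condition_indices is hypothetical.

```python
import numpy as np


def condition_indices(X):
    """Eigenvalues and condition indices of standardized data (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Standardize each column, then take the covariance of the result
    # (this equals the correlation matrix of the original columns).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    cov = np.cov(Z, rowvar=False)
    # eigvalsh returns ascending eigenvalues; reverse to descending order.
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]
    indices = np.sqrt(eigenvalues[0] / eigenvalues)
    return eigenvalues, indices


# Synthetic data with one nearly collinear column.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.01, size=100)
eigenvalues, indices = condition_indices(np.column_stack([x1, x2, x3]))
```

With exactly collinear columns, as in the example above, an eigenvalue is zero and the corresponding index is infinite.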

cov_to_cor

utils.stats.cov_to_cor(covariance_matrix)

Convert a covariance matrix to a correlation matrix.

Parameters

Name Type Description Default
covariance_matrix numpy.ndarray A square matrix of covariance values. required

Returns

Name Type Description
np.ndarray numpy.ndarray: A corresponding square matrix of correlation coefficients.

Examples

>>> from spotpython.utils.stats import cov_to_cor
>>> import numpy as np
>>> cov_matrix = np.array([[1, 0.8], [0.8, 1]])
>>> cov_to_cor(cov_matrix)
array([[1. , 0.8],
       [0.8, 1. ]])
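
The conversion divides each covariance by the product of the two corresponding standard deviations. A minimal NumPy sketch of that identity follows. The helper name cov_to_cor_sketch is hypothetical and is not the library function.

```python
import numpy as np


def cov_to_cor_sketch(cov):
    """Correlation matrix from a covariance matrix (illustrative helper)."""
    cov = np.asarray(cov, dtype=float)
    # Standard deviations are the square roots of the diagonal.
    sd = np.sqrt(np.diag(cov))
    # cor[i, j] = cov[i, j] / (sd[i] * sd[j])
    return cov / np.outer(sd, sd)


cov = np.array([[4.0, 2.0], [2.0, 9.0]])
cor = cov_to_cor_sketch(cov)  # off-diagonal: 2 / (2 * 3) = 1/3
```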

fit_all_lm

utils.stats.fit_all_lm(basic, xlist, data, remove_na=True)

Fit a linear regression model for all possible combinations of independent variables.

Parameters

Name Type Description Default
basic str The basic model formula. required
xlist list A list of independent variables. required
data pandas.DataFrame The data frame containing the variables. required
remove_na bool Whether to remove missing values from the data frame. True

Returns

Name Type Description
dict dict A dictionary containing the estimated coefficients, confidence intervals, p-values, AIC values, sample size, and the basic model formula.

Examples

>>> from spotpython.utils.stats import fit_all_lm
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'y': [1, 2, 3],
...     'x1': [4, 5, 6],
...     'x2': [7, 8, 9]
... })
>>> fit_all_lm("y ~ x1", ["x2"], data)
{'estimate':   variables  estimate  conf_low  conf_high    p         aic  n
0    basic  1.000000  1.000000   1.000000  0.0  0.000000  3
1       x2  1.000000  1.000000   1.000000  0.0  0.000000  3}
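
"All possible combinations" means one model per subset of xlist, each added to the basic formula, so k extra variables give 2^k models. The sketch below only enumerates the formulas. How fit_all_lm builds and fits them internally is an assumption, and the helper name all_formulas is hypothetical.

```python
from itertools import combinations


def all_formulas(basic, xlist):
    """Enumerate basic formula plus every non-empty subset of xlist (illustrative)."""
    formulas = [basic]
    for k in range(1, len(xlist) + 1):
        for subset in combinations(xlist, k):
            formulas.append(basic + " + " + " + ".join(subset))
    return formulas


formulas = all_formulas("y ~ x1", ["x2", "x3"])
# ['y ~ x1', 'y ~ x1 + x2', 'y ~ x1 + x3', 'y ~ x1 + x2 + x3']
```

Because the number of models doubles with each additional variable, long xlist values become expensive quickly.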

get_all_vars_from_formula

utils.stats.get_all_vars_from_formula(formula)

Utility function to extract variables from a formula.

Parameters

Name Type Description Default
formula str A formula. required

Returns

Name Type Description
list list A list of variables.

Examples

>>> from spotpython.utils.stats import get_all_vars_from_formula
>>> get_all_vars_from_formula("y ~ x1 + x2")
['y', 'x1', 'x2']
>>> get_all_vars_from_formula("y ~ ")
['y']

get_combinations

utils.stats.get_combinations(ind_list, type='indices')

Generates all possible combinations of two values from a list of values. Order is not important.

Parameters

Name Type Description Default
ind_list list A list of target indices. required
type str The type of output, either ‘values’ or ‘indices’. Default is ‘indices’. 'indices'

Returns

Name Type Description
list list A list of tuples, where each tuple contains a combination of two values. The order of the values within a tuple is not important, and each combination appears only once.

Examples

>>> from spotoptim.utils import get_combinations
>>> ind_list = [0, 10, 20, 30]
>>> get_combinations(ind_list, type='indices')
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
>>> get_combinations(ind_list, type='values')
[(0, 10), (0, 20), (0, 30), (10, 20), (10, 30), (20, 30)]
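
The same unordered pairs can be produced with the standard library, which may help when checking results. This sketch uses itertools.combinations and is independent of the library function.

```python
from itertools import combinations

ind_list = [0, 10, 20, 30]

# Pairs of positions in the list ('indices' style output).
index_pairs = list(combinations(range(len(ind_list)), 2))

# Pairs of the list's values ('values' style output).
value_pairs = list(combinations(ind_list, 2))
```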

get_sample_size

utils.stats.get_sample_size(alpha, beta, sigma, delta)

Calculate sample size n for comparing two means.

Formula: n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2

This corresponds to a two-sided test with equal variance.

Parameters

Name Type Description Default
alpha float Significance level (Type I error probability). required
beta float Type II error probability (1 - Power). required
sigma float Standard deviation of the population (assumed equal for both groups). required
delta float Minimum detectable difference (effect size to detect). required

Returns

Name Type Description
float float The required sample size n per group.

Examples

>>> from spotoptim.utils.stats import get_sample_size
>>> alpha = 0.05
>>> beta = 0.2  # Power = 80%
>>> sigma = 1.0
>>> delta = 1.0
>>> n = get_sample_size(alpha, beta, sigma, delta)
>>> print(f"{n:.4f}")
15.6978
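
The formula can be evaluated directly with the standard library's normal quantile function. The sketch below is a transcription of the formula stated above, not the library source. The helper name sample_size_two_means is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_means(alpha, beta, sigma, delta):
    """n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2 (illustrative)."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return 2 * sigma**2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / delta**2


n = sample_size_two_means(alpha=0.05, beta=0.2, sigma=1.0, delta=1.0)
n_per_group = ceil(n)  # round up to a whole number of observations
```

Since the result is a float, rounding up gives the number of observations to collect per group.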

normalize_X

utils.stats.normalize_X(X, eps=1e-12)

Normalize array X to [0, 1] in each dimension.

For dimensions where all values are identical (X_max == X_min), the normalized value is set to 0.5 to avoid division by zero.

Parameters

Name Type Description Default
X np.ndarray Input array of shape (n, d) to normalize. required
eps float Small value to avoid division by zero when range is very small. Defaults to 1e-12. 1e-12

Returns

Name Type Description
np.ndarray np.ndarray: Normalized array with values in [0, 1] for each dimension. For constant dimensions, values are set to 0.5.

Examples

>>> from spotoptim.utils.stats import normalize_X
>>> import numpy as np
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> normalize_X(X)
array([[0. , 0. ],
       [0.5, 0.5],
       [1. , 1. ]])
>>> # Constant dimension example
>>> X_const = np.array([[1, 5], [1, 5], [1, 5]])
>>> normalize_X(X_const)
array([[0.5, 0.5],
       [0.5, 0.5],
       [0.5, 0.5]])
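
The behaviour described above, min-max scaling per column with constant columns mapped to 0.5, can be sketched as follows. The helper name normalize_sketch is hypothetical, and treating a range below eps as constant is an assumption about how eps is used.

```python
import numpy as np


def normalize_sketch(X, eps=1e-12):
    """Min-max scale each column to [0, 1]; constant columns become 0.5 (illustrative)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_range = X.max(axis=0) - x_min
    constant = x_range < eps
    # Replace tiny ranges by 1.0 so the division below is always safe.
    safe_range = np.where(constant, 1.0, x_range)
    out = (X - x_min) / safe_range
    out[:, constant] = 0.5
    return out


X_mixed = np.array([[1, 5], [3, 5], [5, 5]])
X_norm = normalize_sketch(X_mixed)  # first column scaled, second column constant
```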

partial_correlation

utils.stats.partial_correlation(x, method='pearson')

Calculate the partial correlation matrix for a given data set.

Parameters

Name Type Description Default
x pandas.DataFrame or numpy.ndarray The data matrix with variables as columns. required
method str Correlation method, one of ‘pearson’, ‘kendall’, or ‘spearman’. 'pearson'

Returns

Name Type Description
dict dict A dictionary containing the partial correlation estimates, p-values, statistics, sample size (n), number of given parameters (gp), and method used.

Raises

Name Type Description
ValueError If input is not a matrix-like structure or not numeric.

References

  1. Kim, S. ppcor: An R package for a fast calculation to semi-partial correlation coefficients. Commun Stat Appl Methods 22, 6 (Nov 2015), 665–674.

Examples

>>> from spotpython.utils.stats import partial_correlation
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'A': [1, 2, 3],
...     'B': [4, 5, 6],
...     'C': [7, 8, 9]
... })
>>> partial_correlation(data, method='pearson')
{'estimate': array([[ 1. , -1. ,  1. ],
                    [-1. ,  1. , -1. ],
                    [ 1. , -1. ,  1. ]]),
'p_value': array([[0. , 0. , 0. ],
                  [0. , 0. , 0. ],
                  [0. , 0. , 0. ]]), ...
}
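
A common way to obtain all pairwise partial correlations at once is to invert the correlation matrix and rescale its entries. The sketch below shows that standard identity for the Pearson case on synthetic data. It is not the library code, the helper name partial_corr_matrix is hypothetical, and it assumes the correlation matrix is invertible, which is not true for exactly collinear columns.

```python
import numpy as np


def partial_corr_matrix(X):
    """Pearson partial correlations of each pair given all other columns (illustrative)."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    precision = np.linalg.inv(corr)  # requires a non-singular correlation matrix
    d = np.sqrt(np.diag(precision))
    # pcor[i, j] = -P[i, j] / sqrt(P[i, i] * P[j, j])
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Synthetic data: x and y are related only through z.
rng = np.random.default_rng(1)
z = rng.normal(size=200)
x = z + rng.normal(size=200)
y = z + rng.normal(size=200)
X = np.column_stack([x, y, z])
pcor = partial_corr_matrix(X)
```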

partial_correlation_test

utils.stats.partial_correlation_test(x, y, z, method='pearson')

The partial correlation coefficient between x and y given z. x and y should be arrays (vectors) of the same length, and z should be a data frame (matrix).

Parameters

Name Type Description Default
x array-like The first variable as a 1-dimensional array or list. required
y array-like The second variable as a 1-dimensional array or list. required
z pandas.DataFrame A data frame containing other conditional variables. required
method str Correlation method, one of ‘pearson’, ‘kendall’, or ‘spearman’. 'pearson'

Returns

Name Type Description
dict dict A dictionary with the partial correlation estimate, p-value, statistic, sample size (n), number of given parameters (gp), and method used.

References

  1. Kim, S. ppcor: An R package for a fast calculation to semi-partial correlation coefficients. Commun Stat Appl Methods 22, 6 (Nov 2015), 665–674.

Examples

>>> from spotpython.utils.stats import partial_correlation_test
>>> import pandas as pd
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> z = pd.DataFrame({'C': [7, 8, 9]})
>>> partial_correlation_test(x, y, z)
{'estimate': -1.0, 'p_value': 0.0, 'statistic': -inf, 'n': 3, 'gp': 1, 'method': 'pearson'}
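
For the Pearson case, the partial correlation of x and y given z equals the correlation of the residuals left after regressing each of them on z. The sketch below illustrates that equivalence on synthetic data. The helper name partial_corr_given_z is hypothetical, and the statistic r * sqrt((n - 2 - gp) / (1 - r^2)) is the form used by the ppcor package cited above, assumed here rather than taken from the library source.

```python
import numpy as np


def partial_corr_given_z(x, y, Z):
    """Correlation of the residuals of x and y after regressing on Z (illustrative)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = np.asarray(Z, dtype=float).reshape(len(x), -1)
    A = np.column_stack([np.ones(len(x)), Z])  # intercept plus conditioning variables
    res_x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    res_y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


rng = np.random.default_rng(3)
z = rng.normal(size=100)
x = z + rng.normal(size=100)
y = z + rng.normal(size=100)

estimate = partial_corr_given_z(x, y, z)
n, gp = len(x), 1  # sample size and number of conditioning variables
statistic = estimate * np.sqrt((n - 2 - gp) / (1 - estimate**2))
```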

plot_coeff_vs_pvals

utils.stats.plot_coeff_vs_pvals(
    data,
    xlabels=None,
    xlim=(0, 1),
    xlab='p-value',
    ylim=None,
    ylab=None,
    xscale_log=True,
    yscale_log=False,
    title=None,
    show=True,
    y_scaler=1.1,
)

Plot the coefficient estimates from fit_all_lm against the corresponding p-values.

Parameters

Name Type Description Default
data dict A dictionary containing the estimated coefficients, p-values, and other information. Generated by the fit_all_lm function. required
xlabels list A list of x-axis labels. None
xlim tuple A tuple of the x-axis limits. (0, 1)
xlab str The x-axis label. 'p-value'
ylim tuple A tuple of the y-axis limits. None
ylab str The y-axis label. None
xscale_log bool Whether to use a log scale on the x-axis. True
yscale_log bool Whether to use a log scale on the y-axis. False
title str The plot title. None
show bool Whether to display the plot. True
y_scaler float A scaling factor for the y-axis limits. Default is 1.1, i.e., 10% more than the maximum value. 1.1

Returns

Name Type Description
None None

Notes

  • Based on the R package ‘allestimates’ by Zhiqiang Wang, see https://cran.r-project.org/package=allestimates

References

Wang, Z. (2007). Two Postestimation Commands for Assessing Confounding Effects in Epidemiological Studies. The Stata Journal, 7(2), 183-196. https://doi.org/10.1177/1536867X0700700203

Examples

>>> from spotpython.utils.stats import plot_coeff_vs_pvals, fit_all_lm
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'y': [1, 2, 3],
...     'x1': [4, 5, 6],
...     'x2': [7, 8, 9]
... })
>>> estimates = fit_all_lm("y ~ x1", ["x2"], data)
>>> plot_coeff_vs_pvals(estimates)

plot_coeff_vs_pvals_by_included

utils.stats.plot_coeff_vs_pvals_by_included(
    data,
    xlabels=None,
    xlim=(0, 1),
    xlab='P value',
    ylim=None,
    ylab=None,
    yscale_log=False,
    title=None,
    grid=True,
    ncol=2,
    show=True,
    y_scaler=1.1,
)

Generates a panel of scatter plots with effect estimates of all possible models against p-values. Uses a dictionary generated by the fit_all_lm function. Each plot includes effect estimates from all models that include a specific variable.

Parameters

Name Type Description Default
data dict A dictionary, generated by the fit_all_lm function, containing the following keys: - estimate (pd.DataFrame): A DataFrame containing the estimates. - xlist (list): A list of variables. - fun (str): The function name. - family (str): The family of the model. required
xlabels list A list of x-axis labels. None
xlim tuple The x-axis limits. (0, 1)
xlab str The x-axis label. 'P value'
ylim tuple The y-axis limits. None
ylab str The y-axis label. None
yscale_log bool Whether to scale y-axis to log10. Default is False. False
title str The title of the plot. None
grid bool Whether to display gridlines. Default is True. True
ncol int Number of columns in the plot grid. Default is 2. 2
show bool Whether to display the plot. Default is True. True
y_scaler float A scaling factor for the y-axis limits. Default is 1.1, i.e., 10% more than the maximum value. 1.1

Returns

Name Type Description
None None

Notes

  • Based on the R package ‘allestimates’ by Zhiqiang Wang, see https://cran.r-project.org/package=allestimates

References

Wang, Z. (2007). Two Postestimation Commands for Assessing Confounding Effects in Epidemiological Studies. The Stata Journal, 7(2), 183-196. https://doi.org/10.1177/1536867X0700700203

Examples

>>> import pandas as pd
>>> from spotpython.utils.stats import plot_coeff_vs_pvals_by_included
>>> data = {
...     "estimate": pd.DataFrame({
...         "variables": ["Crude", "AL", "AM", "AN", "AO"],
...         "estimate": [0.5, 0.6, 0.7, 0.8, 0.9],
...         "conf_low": [0.1, 0.2, 0.3, 0.4, 0.5],
...         "conf_high": [0.9, 1.0, 1.1, 1.2, 1.3],
...         "p": [0.01, 0.02, 0.03, 0.04, 0.05],
...         "aic": [100, 200, 300, 400, 500],
...         "n": [10, 20, 30, 40, 50]
...     }),
...     "xlist": ["AL", "AM", "AN", "AO"],
...     "fun": "all_lm"
... }
>>> plot_coeff_vs_pvals_by_included(data)

preprocess_df_for_ols

utils.stats.preprocess_df_for_ols(df, independent_var_columns, target_col)

Preprocesses a DataFrame for fitting an OLS regression model using the specified target column and predictors.

Parameters

Name Type Description Default
df pd.DataFrame Input DataFrame containing the data. required
independent_var_columns list of str List of names for predictor columns. required
target_col str Name of the target/dependent variable column. required

Returns

Name Type Description
X_encoded pd.DataFrame Encoded predictors with a constant term.
y pd.Series Target variable.
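
Examples

No example is given for this function, so the following is only a sketch of what such preprocessing typically involves: one-hot encoding categorical predictors and prepending a constant column. It uses pandas directly. The use of drop_first=True and the column name const are assumptions, not confirmed behaviour of preprocess_df_for_ols.

```python
import pandas as pd

df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "size": [10, 20, 30, 40],
    "color": ["red", "blue", "red", "green"],
})
predictors = ["size", "color"]

# One-hot encode non-numeric predictors, dropping the first level
# of each to avoid perfect collinearity with the constant.
X_encoded = pd.get_dummies(df[predictors], drop_first=True, dtype=float)

# Prepend an intercept column.
X_encoded.insert(0, "const", 1.0)

y = df["y"]
```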

vif

utils.stats.vif(X, sorted=True)

Calculates the Variance Inflation Factor (VIF) for each feature in a DataFrame.

VIF measures the multicollinearity among independent variables within a regression model. High VIF values indicate high multicollinearity, which can cause issues with model interpretation and stability.

Parameters

Name Type Description Default
X pandas.DataFrame A DataFrame containing the independent variables. required
sorted bool Whether to sort the output DataFrame by VIF values. True

Returns

Name Type Description
pd.DataFrame pandas.DataFrame: A DataFrame with two columns: - “feature”: The name of the feature. - “VIF”: The Variance Inflation Factor for the feature.

Examples

>>> from spotpython.utils.stats import vif
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'x1': [1, 2, 3, 4, 5],
...     'x2': [2, 4, 6, 8, 10],
...     'x3': [1, 3, 5, 7, 9]
... })
>>> vif(data)
  feature          VIF
0      x1  1260.000000
1      x2     0.000000
2      x3   630.000000
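
The VIF of feature j is usually defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all other features. The sketch below implements that definition with NumPy on synthetic data, assuming each auxiliary regression includes an intercept. The helper name vif_sketch is hypothetical, and the library may compute the values differently.

```python
import numpy as np


def vif_sketch(X):
    """VIF_j = 1 / (1 - R_j^2) from auxiliary regressions with intercept (illustrative)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic data: x2 is correlated with x1, x3 is independent.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
vifs = vif_sketch(X)
```

Under this definition every VIF is at least 1, and exactly collinear features make the denominator zero.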