utils.stats
Functions
| Name | Description |
|---|---|
| calculate_outliers | Calculate the number of outliers using the IQR method. |
| compute_coefficients_table | Compute a coefficients table with zero-order, partial, and semipartial correlations, tolerance, and VIF. |
| compute_standardized_betas | Computes standardized (beta) coefficients for a fitted statsmodels OLS model. |
| condition_index | Calculates the Condition Index for a DataFrame to assess multicollinearity. |
| cov_to_cor | Convert a covariance matrix to a correlation matrix. |
| fit_all_lm | Fit a linear regression model for all possible combinations of independent variables. |
| get_all_vars_from_formula | Utility function to extract variables from a formula. |
| get_combinations | Generates all possible combinations of two values from a list of values. Order is not important. |
| get_sample_size | Calculate sample size n for comparing two means. |
| normalize_X | Normalize array X to [0, 1] in each dimension. |
| partial_correlation | Calculate the partial correlation matrix for a given data set. |
| partial_correlation_test | The partial correlation coefficient between x and y given z. |
| plot_coeff_vs_pvals | Plot the coefficient estimates from fit_all_lm against the corresponding p-values. |
| plot_coeff_vs_pvals_by_included | Generates a panel of scatter plots with effect estimates of all possible models against p-values. |
| preprocess_df_for_ols | Preprocesses a df for fitting an OLS regression model using the specified target column and predictors. |
| vif | Calculates the Variance Inflation Factor (VIF) for each feature in a DataFrame. |
calculate_outliers
utils.stats.calculate_outliers(series_or_df, irqmultiplier=1.5)

Calculate the number of outliers using the IQR method.
Accepts either a pandas Series or a pandas DataFrame. For a DataFrame, counts outliers across all numeric columns and returns the total count.
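The IQR fence logic can be illustrated with a minimal, self-contained sketch (the function and parameter names below are illustrative, not the package API):

```python
import pandas as pd

def count_iqr_outliers(series, multiplier=1.5):
    # Fences sit at Q1 - k*IQR and Q3 + k*IQR; values outside count as outliers.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return int(((series < lower) | (series > upper)).sum())
```

With the series from the example below, only the value 100 falls outside the fences.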
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| series_or_df | Union[pd.Series, pd.DataFrame] | pd.Series or pd.DataFrame containing numeric data. | required |
| irqmultiplier | float | Multiplier for IQR to define fences. Defaults to 1.5. | 1.5 |
Returns
| Name | Type | Description |
|---|---|---|
| int | int | The number of outliers. |
Examples
>>> import pandas as pd
>>> from spotoptim.utils.stats import calculate_outliers
>>> s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
>>> calculate_outliers(s)
1
>>> df = pd.DataFrame({
... 'a': [1, 2, 3, 100],
... 'b': [10, 12, 11, 10]
... })
>>> calculate_outliers(df)
1
compute_coefficients_table
utils.stats.compute_coefficients_table(model, X_encoded, y, vif_table=None)

Compute a coefficients table containing:
- Variable name
- Zero-order correlation
- Partial correlation
- Semipartial (part) correlation
- Tolerance (1 / VIF)
- VIF
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | statsmodels.regression.linear_model.RegressionResultsWrapper | A fitted OLS model from statsmodels. | required |
| X_encoded | pd.DataFrame | The DataFrame used to fit the model, including ‘const’. | required |
| y | pd.Series | Dependent variable used in fitting the model. | required |
| vif_table | pd.DataFrame | A DataFrame with columns [“feature”, “VIF”] for each column in X_encoded (typ. from statsmodels.stats.outliers_influence.variance_inflation_factor). Default is None. | None |
Returns
| Name | Type | Description |
|---|---|---|
| pd.DataFrame | pd.DataFrame with columns: - “Variable” - “Zero-Order r” - “Partial r” - “Semipartial r” - “Tolerance” - “VIF” |
Examples
>>> from spotpython.utils.stats import compute_coefficients_table
>>> import pandas as pd
>>> import statsmodels.api as sm
>>> data = pd.DataFrame({
... 'x1': [1, 2, 3, 4, 5],
... 'x2': [2, 4, 6, 8, 10],
... 'x3': [1, 3, 5, 7, 9]
... })
>>> y = pd.Series([1, 2, 3, 4, 5])
>>> X = sm.add_constant(data)
>>> model = sm.OLS(y, X).fit()
>>> vif_table = pd.DataFrame({
... 'feature': ['x1', 'x2', 'x3'],
... 'VIF': [1, 2, 3]
... })
>>> compute_coefficients_table(model, data, y, vif_table)
Variable Zero-Order r Partial r Semipartial r Tolerance VIF
0 x1 0.0 0.0 0.0 1.0 1.0
1 x2 0.0 0.0 0.0 0.5 2.0
2 x3 0.0 0.0 0.0 0.333333 3.0
compute_standardized_betas
utils.stats.compute_standardized_betas(model, X_encoded, y)

Computes standardized (beta) coefficients for a fitted statsmodels OLS model.
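Standardized betas rescale each raw coefficient by the ratio of predictor to response standard deviations. A minimal sketch of that calculation (hypothetical helper, not the package implementation; it takes the fitted parameters directly):

```python
import pandas as pd

def standardized_betas_sketch(params, X, y):
    # beta*_j = b_j * s_{x_j} / s_y; the intercept standardizes to zero.
    rows = []
    for name, b in params.items():
        s_x = X[name].std() if name != "const" else 0.0
        rows.append({"Variable": name, "Standardized Beta": b * s_x / y.std()})
    return pd.DataFrame(rows)
```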
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | statsmodels.regression.linear_model.RegressionResultsWrapper | The fitted OLS model. | required |
| X_encoded | pandas.DataFrame | The design matrix of independent variables. | required |
| y | pandas.Series | The dependent variable. | required |
Returns
| Name | Type | Description |
|---|---|---|
| pd.DataFrame | pandas.DataFrame: A DataFrame containing the standardized beta coefficients. |
Examples
>>> from spotpython.utils.stats import compute_standardized_betas
>>> import pandas as pd
>>> import statsmodels.api as sm
>>> data = pd.DataFrame({
... 'x1': [1, 2, 3, 4, 5],
... 'x2': [2, 4, 6, 8, 10],
... 'x3': [1, 3, 5, 7, 9]
... })
>>> y = pd.Series([1, 2, 3, 4, 5])
>>> X = sm.add_constant(data)
>>> model = sm.OLS(y, X).fit()
>>> compute_standardized_betas(model, data, y)
Variable Standardized Beta
0 const 0.000000
1 x1 0.000000
2 x2 0.000000
3 x3 0.000000
condition_index
utils.stats.condition_index(df)

Calculates the Condition Index for a DataFrame to assess multicollinearity.
The Condition Index is computed based on the eigenvalues of the covariance matrix of the standardized data. High condition indices suggest potential multicollinearity issues.
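The computation described above can be sketched as follows (a hypothetical illustration of the eigenvalue approach; the package implementation may differ in ordering or scaling details):

```python
import numpy as np
import pandas as pd

def condition_index_sketch(df):
    # Standardize the columns, take eigenvalues of the covariance matrix,
    # and report sqrt(lambda_max / lambda_i) for each eigenvalue.
    Z = (df - df.mean()) / df.std()
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(Z.T)))[::-1]  # descending
    with np.errstate(divide="ignore", invalid="ignore"):
        ci = np.sqrt(eigvals[0] / eigvals)
    return pd.DataFrame({"Index": np.arange(len(eigvals)),
                         "Eigenvalue": eigvals,
                         "Condition Index": ci})
```

By construction, the largest eigenvalue always has a condition index of 1; indices above roughly 30 are commonly read as a multicollinearity warning.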
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pandas.DataFrame | A DataFrame containing the independent variables. | required |
Returns
| Name | Type | Description |
|---|---|---|
| pd.DataFrame | pandas.DataFrame: A DataFrame with the following columns: - ‘Index’: The index of the eigenvalue. - ‘Eigenvalue’: The eigenvalue of the covariance matrix. - ‘Condition Index’: The Condition Index for the eigenvalue. |
Examples
>>> from spotpython.utils.stats import condition_index
>>> import pandas as pd
>>> data = pd.DataFrame({
... 'x1': [1, 2, 3, 4, 5],
... 'x2': [2, 4, 6, 8, 10],
... 'x3': [1, 3, 5, 7, 9]
... })
>>> condition_index(data)
Index Eigenvalue Condition Index
0 0 1.140000 1.000000
1 1 0.000000 inf
2 2 0.002857 20.000000
cov_to_cor
utils.stats.cov_to_cor(covariance_matrix)

Convert a covariance matrix to a correlation matrix.
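The conversion divides each covariance by the product of the corresponding standard deviations, r_ij = cov_ij / (sigma_i * sigma_j). A minimal sketch (illustrative helper name, not the package function):

```python
import numpy as np

def cov_to_cor_sketch(cov):
    # Standard deviations come from the square roots of the diagonal.
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)
```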
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| covariance_matrix | numpy.ndarray | A square matrix of covariance values. | required |
Returns
| Name | Type | Description |
|---|---|---|
| np.ndarray | numpy.ndarray: A corresponding square matrix of correlation coefficients. |
Examples
>>> from spotpython.utils.stats import cov_to_cor
>>> import numpy as np
>>> cov_matrix = np.array([[1, 0.8], [0.8, 1]])
>>> cov_to_cor(cov_matrix)
array([[1. , 0.8],
[0.8, 1. ]])
fit_all_lm
utils.stats.fit_all_lm(basic, xlist, data, remove_na=True)

Fit a linear regression model for all possible combinations of independent variables.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| basic | str | The basic model formula. | required |
| xlist | list | A list of independent variables. | required |
| data | pandas.DataFrame | The data frame containing the variables. | required |
| remove_na | bool | Whether to remove missing values from the data frame. | True |
Returns
| Name | Type | Description |
|---|---|---|
| dict | dict | A dictionary containing the estimated coefficients, confidence intervals, p-values, AIC values, sample size, and the basic model formula. |
Examples
>>> from spotpython.utils.stats import fit_all_lm
>>> import pandas as pd
>>> data = pd.DataFrame({
... 'y': [1, 2, 3],
... 'x1': [4, 5, 6],
... 'x2': [7, 8, 9]
... })
>>> fit_all_lm("y ~ x1", ["x2"], data)
{'estimate': variables estimate conf_low conf_high p aic n
0 basic 1.000000 1.000000 1.000000 0.0 0.000000 3
1 x2 1.000000 1.000000 1.000000 0.0 0.000000 3}
get_all_vars_from_formula
utils.stats.get_all_vars_from_formula(formula)

Utility function to extract variables from a formula.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | A formula. | required |
Returns
| Name | Type | Description |
|---|---|---|
| list | list | A list of variables. |
Examples
>>> from spotpython.utils.stats import get_all_vars_from_formula
>>> get_all_vars_from_formula("y ~ x1 + x2")
['y', 'x1', 'x2']
>>> get_all_vars_from_formula("y ~ ")
['y']
get_combinations
utils.stats.get_combinations(ind_list, type='indices')

Generates all possible combinations of two values from a list of values. Order is not important.
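The pair enumeration itself is standard-library territory; a hypothetical sketch with `itertools.combinations` shows how the two output types relate (the variable names here are illustrative):

```python
from itertools import combinations

ind_list = [0, 10, 20, 30]

# Unordered pairs of positions (the 'indices' variant) ...
index_pairs = list(combinations(range(len(ind_list)), 2))

# ... versus unordered pairs of the values themselves (the 'values' variant).
value_pairs = list(combinations(ind_list, 2))
```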
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| ind_list | list | A list of target indices. | required |
| type | str | The type of output, either ‘values’ or ‘indices’. Default is ‘indices’. | 'indices' |
Returns
| Name | Type | Description |
|---|---|---|
| list | list | A list of tuples, where each tuple contains a combination of two values. The order of the values within a tuple is not important, and each combination appears only once. |
Examples
>>> from spotoptim.utils import get_combinations
>>> ind_list = [0, 10, 20, 30]
>>> combinations = get_combinations(ind_list, type='indices')
>>> print(combinations)
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
>>> combinations = get_combinations(ind_list, type='values')
>>> print(combinations)
[(0, 10), (0, 20), (0, 30), (10, 20), (10, 30), (20, 30)]
get_sample_size
utils.stats.get_sample_size(alpha, beta, sigma, delta)

Calculate sample size n for comparing two means.
Formula: n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2. This corresponds to a two-sided test with equal variance.
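The formula can be sketched directly with the standard normal quantile function from SciPy (a hypothetical re-implementation, assuming SciPy is available; not the package code itself):

```python
from scipy.stats import norm

def sample_size_sketch(alpha, beta, sigma, delta):
    # n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    return 2 * sigma**2 * z**2 / delta**2
```

For alpha = 0.05, beta = 0.2, sigma = delta = 1, this reproduces the value 15.6978 shown in the example below; in practice, n is rounded up to the next integer.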
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| alpha | float | Significance level (Type I error probability). | required |
| beta | float | Type II error probability (1 - Power). | required |
| sigma | float | Standard deviation of the population (assumed equal for both groups). | required |
| delta | float | Minimum detectable difference (effect size to detect). | required |
Returns
| Name | Type | Description |
|---|---|---|
| float | float | The required sample size n per group. |
Examples
>>> from spotoptim.utils.stats import get_sample_size
>>> alpha = 0.05
>>> beta = 0.2 # Power = 80%
>>> sigma = 1.0
>>> delta = 1.0
>>> n = get_sample_size(alpha, beta, sigma, delta)
>>> print(f"{n:.4f}")
15.6978
normalize_X
utils.stats.normalize_X(X, eps=1e-12)

Normalize array X to [0, 1] in each dimension.
For dimensions where all values are identical (X_max == X_min), the normalized value is set to 0.5 to avoid division by zero.
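The column-wise min-max scaling with the constant-column special case can be sketched as follows (illustrative helper, not the package implementation):

```python
import numpy as np

def normalize_sketch(X, eps=1e-12):
    # Column-wise min-max scaling; (near-)constant columns map to 0.5.
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    rng = x_max - x_min
    safe = np.where(rng > eps, rng, 1.0)  # avoid division by zero
    out = (X - x_min) / safe
    out[:, rng <= eps] = 0.5
    return out
```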
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| X | np.ndarray | Input array of shape (n, d) to normalize. | required |
| eps | float | Small value to avoid division by zero when range is very small. Defaults to 1e-12. | 1e-12 |
Returns
| Name | Type | Description |
|---|---|---|
| np.ndarray | np.ndarray: Normalized array with values in [0, 1] for each dimension. For constant dimensions, values are set to 0.5. |
Examples
>>> import numpy as np
>>> from spotoptim.utils.stats import normalize_X
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> normalize_X(X)
array([[0. , 0. ],
[0.5, 0.5],
[1. , 1. ]])
>>> # Constant dimension example
>>> X_const = np.array([[1, 5], [1, 5], [1, 5]])
>>> normalize_X(X_const)
array([[0.5, 0.5],
[0.5, 0.5],
[0.5, 0.5]])
partial_correlation
utils.stats.partial_correlation(x, method='pearson')

Calculate the partial correlation matrix for a given data set.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| x | pandas.DataFrame or numpy.ndarray | The data matrix with variables as columns. | required |
| method | str | Correlation method, one of ‘pearson’, ‘kendall’, or ‘spearman’. | 'pearson' |
Returns
| Name | Type | Description |
|---|---|---|
| dict | dict | A dictionary containing the partial correlation estimates, p-values, statistics, sample size (n), number of given parameters (gp), and method used. |
Raises
| Name | Type | Description |
|---|---|---|
| | ValueError | If input is not a matrix-like structure or not numeric. |
References
- Kim, S. ppcor: An R package for a fast calculation to semi-partial correlation coefficients. Commun Stat Appl Methods 22, 6 (Nov 2015), 665–674.
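For Pearson correlations, the full partial correlation matrix can be obtained from the inverse of the correlation matrix via pcor_ij = -P_ij / sqrt(P_ii * P_jj), the approach used in ppcor. A rough sketch of that idea (hypothetical helper, not the package implementation, which also returns p-values and statistics):

```python
import numpy as np

def partial_corr_sketch(X):
    # Invert the correlation matrix (pinv tolerates near-singular input),
    # then rescale off-diagonal entries: pcor_ij = -P_ij / sqrt(P_ii * P_jj).
    P = np.linalg.pinv(np.corrcoef(np.asarray(X, dtype=float), rowvar=False))
    d = np.sqrt(np.diag(P))
    pcor = -P / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor
```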
Examples
>>> from spotpython.utils.stats import partial_correlation
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({
... 'A': [1, 2, 3],
... 'B': [4, 5, 6],
... 'C': [7, 8, 9]
... })
>>> partial_correlation(data, method='pearson')
{'estimate': array([[ 1. , -1. , 1. ],
[-1. , 1. , -1. ],
[ 1. , -1. , 1. ]]),
'p_value': array([[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ]]), ...
}
partial_correlation_test
utils.stats.partial_correlation_test(x, y, z, method='pearson')

The partial correlation coefficient between x and y given z. x and y should be arrays (vectors) of the same length, and z should be a data frame (matrix).
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| x | array-like | The first variable as a 1-dimensional array or list. | required |
| y | array-like | The second variable as a 1-dimensional array or list. | required |
| z | pandas.DataFrame | A data frame containing other conditional variables. | required |
| method | str | Correlation method, one of ‘pearson’, ‘kendall’, or ‘spearman’. | 'pearson' |
Returns
| Name | Type | Description |
|---|---|---|
| dict | dict | A dictionary with the partial correlation estimate, p-value, statistic, sample size (n), number of given parameters (gp), and method used. |
References
- Kim, S. ppcor: An R package for a fast calculation to semi-partial correlation coefficients. Commun Stat Appl Methods 22, 6 (Nov 2015), 665–674.
Examples
>>> from spotpython.utils.stats import partial_correlation_test
>>> import pandas as pd
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> z = pd.DataFrame({'C': [7, 8, 9]})
>>> partial_correlation_test(x, y, z)
{'estimate': -1.0, 'p_value': 0.0, 'statistic': -inf, 'n': 3, 'gp': 1, 'method': 'pearson'}
plot_coeff_vs_pvals
utils.stats.plot_coeff_vs_pvals(
data,
xlabels=None,
xlim=(0, 1),
xlab='p-value',
ylim=None,
ylab=None,
xscale_log=True,
yscale_log=False,
title=None,
show=True,
y_scaler=1.1,
)

Plot the coefficient estimates from fit_all_lm against the corresponding p-values.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict | A dictionary containing the estimated coefficients, p-values, and other information. Generated by the fit_all_lm function. | required |
| xlabels | list | A list of x-axis labels. | None |
| xlim | tuple | A tuple of the x-axis limits. | (0, 1) |
| xlab | str | The x-axis label. | 'p-value' |
| ylim | tuple | A tuple of the y-axis limits. | None |
| ylab | str | The y-axis label. | None |
| xscale_log | bool | Whether to use a log scale on the x-axis. | True |
| yscale_log | bool | Whether to use a log scale on the y-axis. | False |
| title | str | The plot title. | None |
| show | bool | Whether to display the plot. | True |
| y_scaler | float | A scaling factor for the y-axis limits. Default is 1.1, i.e., 10% more than the maximum value. | 1.1 |
Returns
| Name | Type | Description |
|---|---|---|
| None | None | The function displays the plot; nothing is returned. |
Notes
- Based on the R package ‘allestimates’ by Zhiqiang Wang, see https://cran.r-project.org/package=allestimates
References
Wang, Z. (2007). Two Postestimation Commands for Assessing Confounding Effects in Epidemiological Studies. The Stata Journal, 7(2), 183-196. https://doi.org/10.1177/1536867X0700700203
Examples
>>> from spotpython.utils.stats import plot_coeff_vs_pvals, fit_all_lm
>>> import pandas as pd
>>> data = pd.DataFrame({
... 'y': [1, 2, 3],
... 'x1': [4, 5, 6],
... 'x2': [7, 8, 9]
... })
>>> estimates = fit_all_lm("y ~ x1", ["x2"], data)
>>> plot_coeff_vs_pvals(estimates)
plot_coeff_vs_pvals_by_included
utils.stats.plot_coeff_vs_pvals_by_included(
data,
xlabels=None,
xlim=(0, 1),
xlab='P value',
ylim=None,
ylab=None,
yscale_log=False,
title=None,
grid=True,
ncol=2,
show=True,
y_scaler=1.1,
)

Generates a panel of scatter plots with effect estimates of all possible models against p-values. Uses a dictionary generated by the fit_all_lm function. Each plot includes effect estimates from all models including a specific variable.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict | A dictionary, generated by the fit_all_lm function, containing the following keys: - estimate (pd.DataFrame): A DataFrame containing the estimates. - xlist (list): A list of variables. - fun (str): The function name. - family (str): The family of the model. | required |
| xlabels | list | A list of x-axis labels. | None |
| xlim | tuple | The x-axis limits. | (0, 1) |
| xlab | str | The x-axis label. | 'P value' |
| ylim | tuple | The y-axis limits. | None |
| ylab | str | The y-axis label. | None |
| yscale_log | bool | Whether to scale y-axis to log10. Default is False. | False |
| title | str | The title of the plot. | None |
| grid | bool | Whether to display gridlines. Default is True. | True |
| ncol | int | Number of columns in the plot grid. Default is 2. | 2 |
| show | bool | Whether to display the plot. Default is True. | True |
| y_scaler | float | A scaling factor for the y-axis limits. Default is 1.1, i.e., 10% more than the maximum value. | 1.1 |
Returns
| Name | Type | Description |
|---|---|---|
| None | None | The function displays the plot; nothing is returned. |
Notes
- Based on the R package ‘allestimates’ by Zhiqiang Wang, see https://cran.r-project.org/package=allestimates
References
Wang, Z. (2007). Two Postestimation Commands for Assessing Confounding Effects in Epidemiological Studies. The Stata Journal, 7(2), 183-196. https://doi.org/10.1177/1536867X0700700203
Examples
>>> import pandas as pd
>>> from spotpython.utils.stats import plot_coeff_vs_pvals_by_included
>>> data = {
...     "estimate": pd.DataFrame({
...         "variables": ["Crude", "AL", "AM", "AN", "AO"],
...         "estimate": [0.5, 0.6, 0.7, 0.8, 0.9],
...         "conf_low": [0.1, 0.2, 0.3, 0.4, 0.5],
...         "conf_high": [0.9, 1.0, 1.1, 1.2, 1.3],
...         "p": [0.01, 0.02, 0.03, 0.04, 0.05],
...         "aic": [100, 200, 300, 400, 500],
...         "n": [10, 20, 30, 40, 50]
...     }),
...     "xlist": ["AL", "AM", "AN", "AO"],
...     "fun": "all_lm"
... }
>>> plot_coeff_vs_pvals_by_included(data)
preprocess_df_for_ols
utils.stats.preprocess_df_for_ols(df, independent_var_columns, target_col)

Preprocesses a df for fitting an OLS regression model using the specified target column and predictors.
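The typical preprocessing steps for OLS are one-hot encoding of categoricals and adding an intercept column. A minimal sketch of that pipeline (hypothetical helper; the package function may differ in encoding details such as the dropped level):

```python
import pandas as pd

def preprocess_sketch(df, predictors, target):
    # One-hot encode categorical predictors, dropping the first level
    # to avoid the dummy-variable trap, then prepend an intercept column.
    X = pd.get_dummies(df[predictors], drop_first=True)
    X.insert(0, "const", 1.0)
    return X, df[target]
```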
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | Input DataFrame containing the data. | required |
| independent_var_columns | list of str | List of names for predictor columns. | required |
| target_col | str | Name of the target/dependent variable column. | required |
Returns
| Name | Type | Description |
|---|---|---|
| X_encoded | pd.DataFrame | Encoded predictors with a constant term. |
| y | pd.Series | Target variable. |
vif
utils.stats.vif(X, sorted=True)

Calculates the Variance Inflation Factor (VIF) for each feature in a DataFrame.
VIF measures the multicollinearity among independent variables within a regression model. High VIF values indicate high multicollinearity, which can cause issues with model interpretation and stability.
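The definition behind VIF is VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on all other features. A self-contained sketch using only NumPy least squares (illustrative helper; in practice one would use `statsmodels.stats.outliers_influence.variance_inflation_factor`):

```python
import numpy as np
import pandas as pd

def vif_sketch(X):
    # VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j
    # on the remaining columns plus an intercept.
    vals = []
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = np.column_stack([np.ones(len(X)),
                                  X.drop(columns=col).to_numpy(dtype=float)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        vals.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return pd.DataFrame({"feature": X.columns, "VIF": vals})
```

For perfectly uncorrelated predictors, every VIF is 1; values above 5 to 10 are commonly treated as a warning sign.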
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| X | pandas.DataFrame | A DataFrame containing the independent variables. | required |
| sorted | bool | Whether to sort the output DataFrame by VIF values. | True |
Returns
| Name | Type | Description |
|---|---|---|
| pd.DataFrame | pandas.DataFrame: A DataFrame with two columns: - “feature”: The name of the feature. - “VIF”: The Variance Inflation Factor for the feature. |
Examples
>>> from spotpython.utils.stats import vif
>>> import pandas as pd
>>> data = pd.DataFrame({
... 'x1': [1, 2, 3, 4, 5],
... 'x2': [2, 4, 6, 8, 10],
... 'x3': [1, 3, 5, 7, 9]
... })
>>> vif(data)
feature VIF
0 x1 1260.000000
1 x2 0.000000
2 x3 630.000000