utils.pca

Functions

Name Description
get_loading_scores Computes the loading scores matrix for Principal Component Analysis (PCA).
get_pca Scale the numeric data and perform PCA.
get_pca_topk Identify the top k features that have the strongest influence on PC1 and PC2.
plot_loading_scores Creates a heatmap visualization of PCA loading scores.
plot_pca1vs2 Create a scatter plot of the first two principal components from PCA.
plot_pca_scree Create a scree plot for Principal Component Analysis (PCA).

get_loading_scores

utils.pca.get_loading_scores(pca, feature_names)

Computes the loading scores matrix for Principal Component Analysis (PCA).

Creates and returns a DataFrame showing how each original feature contributes to each principal component.

Parameters

Name Type Description Default
pca sklearn.decomposition.PCA Fitted PCA object containing the components_ attribute with the principal components. required
feature_names list-like Names of the original features; must match the order of the features used when fitting the PCA. required

Returns

Name Type Description
pd.DataFrame DataFrame containing the loading scores matrix, with features as rows and principal components as columns.

Example

>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_loading_scores
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Compute and print the loading scores
>>> scores_df = get_loading_scores(pca, feature_names)
>>> print(scores_df)
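As a rough illustration of what such a matrix can look like, the sketch below builds one directly from `pca.components_`. It assumes the loading scores are simply the transposed component vectors and that the columns are labeled `PC1`, `PC2`, and so on. The helper name `loading_scores_sketch` is hypothetical, and this is not necessarily how `get_loading_scores` is implemented (some texts additionally scale loadings by the square root of the explained variance).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA


def loading_scores_sketch(pca, feature_names):
    """Hypothetical helper: features as rows, principal components as columns."""
    # components_ has shape (n_components, n_features), so transpose it.
    columns = [f"PC{i + 1}" for i in range(pca.components_.shape[0])]
    return pd.DataFrame(pca.components_.T, index=list(feature_names), columns=columns)


iris = load_iris()
pca = PCA().fit(iris.data)
scores = loading_scores_sketch(pca, iris.feature_names)
print(scores.round(3))
```

Each row can then be read as the weights with which one original feature enters each principal component.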

get_pca

utils.pca.get_pca(df, n_components=3)

Scale the numeric data and perform PCA.

Parameters

Name Type Description Default
df pd.DataFrame Input DataFrame. required
n_components int Number of principal components to compute. Defaults to 3. 3

Returns

Name Type Description
tuple tuple A tuple containing, in order:
  • pca (PCA): Fitted PCA object.
  • scaled_data (np.ndarray): Scaled numeric data.
  • feature_names (pd.Index): Names of the numeric features.
  • sample_names (pd.Index): Index of the samples.
  • pca_data (np.ndarray): PCA-transformed data.

Examples

>>> import pandas as pd
>>> from spotpython.utils.pca import get_pca
>>> df = pd.DataFrame({
...     "A": [1, 2, 3],
...     "B": [4, 5, 6],
...     "C": ["x", "y", "z"]  # Non-numeric column will be ignored
... })
>>> pca, scaled_data, feature_names, sample_names, pca_data = get_pca(df)
>>> print(feature_names)
Index(['A', 'B'], dtype='object')
>>> print(pca_data.shape)
(3, 2)
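The scale-then-transform pipeline described above can be sketched with scikit-learn alone. The sketch assumes that "scaling" means standardization with `StandardScaler` and that non-numeric columns are dropped with `select_dtypes`; the helper name `scale_and_pca_sketch` is hypothetical, and the actual `get_pca` may differ in these details.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def scale_and_pca_sketch(df, n_components=3):
    """Hypothetical helper mirroring the documented return order."""
    numeric = df.select_dtypes(include="number")  # drop non-numeric columns
    scaled = StandardScaler().fit_transform(numeric)  # zero mean, unit variance
    pca = PCA(n_components=n_components)
    pca_data = pca.fit_transform(scaled)
    return pca, scaled, numeric.columns, numeric.index, pca_data


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 4)), columns=["A", "B", "C", "D"])
df["label"] = list("uvwxyz")  # non-numeric column, excluded from the PCA

pca, scaled, features, samples, pca_data = scale_and_pca_sketch(df)
print(list(features))
print(pca_data.shape)
```

Note that plain scikit-learn requires `n_components` to be no larger than the smaller of the number of samples and the number of numeric features, which is why the sketch uses six rows and four numeric columns.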

get_pca_topk

utils.pca.get_pca_topk(pca, feature_names, k=10)

Identify the top k features that have the strongest influence on PC1 and PC2.

This function analyzes the loading scores (coefficients) of the first two principal components to determine which original features contribute most strongly to these components. The absolute values of the loading scores are used to rank feature importance.

Parameters

Name Type Description Default
pca sklearn.decomposition.PCA Fitted PCA object containing the components_ attribute with the principal components. required
feature_names list-like Names of the original features; must match the order of the features used when fitting the PCA. required
k int Number of top features to select for each principal component. Defaults to 10. 10

Returns

Name Type Description
tuple tuple A tuple containing two lists:
  • list[str]: Names of the k features most influential on PC1.
  • list[str]: Names of the k features most influential on PC2.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_pca_topk
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Get top 2 most influential features for PC1 and PC2
>>> top_pc1, top_pc2 = get_pca_topk(pca,
...                                 feature_names=feature_names,
...                                 k=2)
>>> print("Top PC1 features:", top_pc1)
>>> print("Top PC2 features:", top_pc2)

Note

  • The function assumes that PCA has been fitted on standardized data
  • The length of feature_names must match the number of features in the PCA input
  • k should not exceed the total number of features
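The ranking by absolute loading score described above might look roughly like the following. This is a sketch under the assumption that the loading scores are the rows of `pca.components_`; the helper name `topk_sketch` is hypothetical and is not the library's implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA


def topk_sketch(pca, feature_names, k=10):
    """Hypothetical helper: top-k features by |loading| on PC1 and PC2."""
    names = np.asarray(feature_names)
    top = []
    for component in pca.components_[:2]:  # PC1, then PC2
        order = np.argsort(np.abs(component))[::-1]  # largest magnitude first
        top.append(names[order[:k]].tolist())
    return top[0], top[1]


iris = load_iris()
pca = PCA().fit(iris.data)
top_pc1, top_pc2 = topk_sketch(pca, iris.feature_names, k=2)
print("Top PC1 features:", top_pc1)
print("Top PC2 features:", top_pc2)
```

Taking the absolute value matters because the sign of a component is arbitrary: a strongly negative loading is as influential as a strongly positive one.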

plot_loading_scores

utils.pca.plot_loading_scores(loading_scores, figsize=(12, 8))

Creates a heatmap visualization of PCA loading scores.

Generates a heatmap showing the relationship between original features and principal components, with color intensity indicating the strength and direction of the relationship.

Parameters

Name Type Description Default
loading_scores pd.DataFrame DataFrame containing the loading scores matrix with features as rows and principal components as columns. required
figsize tuple Size of the figure as (width, height). Defaults to (12, 8). (12, 8)

Returns

Name Type Description
None None The function creates and displays a matplotlib plot.

Example

>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_loading_scores, plot_loading_scores
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA and get the loading scores
>>> pca = PCA()
>>> pca.fit(X)
>>> scores_df = get_loading_scores(pca, feature_names)
>>>
>>> # Create the heatmap
>>> plot_loading_scores(scores_df, figsize=(10, 6))
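To show how color can encode both the strength and the direction of a loading, here is a minimal matplotlib sketch. It assumes a diverging colormap (`coolwarm`) with symmetric limits so that zero sits at the neutral color; the helper name `heatmap_sketch` and the loading values are made up for illustration, and the library function may use a different plotting backend or styling.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def heatmap_sketch(loading_scores, figsize=(12, 8)):
    """Hypothetical helper: diverging heatmap centered on zero."""
    values = loading_scores.to_numpy()
    limit = float(np.abs(values).max())  # symmetric color limits around 0
    fig, ax = plt.subplots(figsize=figsize)
    image = ax.imshow(values, cmap="coolwarm", vmin=-limit, vmax=limit, aspect="auto")
    ax.set_xticks(range(loading_scores.shape[1]))
    ax.set_xticklabels(list(loading_scores.columns))
    ax.set_yticks(range(loading_scores.shape[0]))
    ax.set_yticklabels(list(loading_scores.index))
    fig.colorbar(image, ax=ax, label="loading score")
    return fig, ax


# Made-up loading scores, for illustration only.
scores = pd.DataFrame(
    [[0.8, -0.1], [-0.3, 0.9], [0.5, 0.4]],
    index=["f1", "f2", "f3"],
    columns=["PC1", "PC2"],
)
fig, ax = heatmap_sketch(scores, figsize=(6, 4))
```

In an interactive session the backend line can be dropped and the figure shown with `plt.show()` or written out with `fig.savefig(...)`.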

plot_pca1vs2

utils.pca.plot_pca1vs2(pca, pca_data, df_name='', figsize=(12, 6))

Create a scatter plot of the first two principal components from PCA.

This function visualizes the first two principal components (PC1 vs PC2) from a PCA analysis, creating a scatter plot where each point represents a sample in the transformed space. The percentage of variance explained by each component is shown on the axes.

Parameters

Name Type Description Default
pca sklearn.decomposition.PCA Fitted PCA object containing the explained variance ratios and components. required
pca_data array-like PCA-transformed data, where each row represents a sample and each column represents a principal component. required
df_name str Name of the dataset to be displayed in the plot title. Defaults to empty string. ''
figsize tuple Size of the figure as (width, height). Defaults to (12, 6). (12, 6)

Returns

Name Type Description
None None The function creates and displays a matplotlib plot.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import plot_pca1vs2
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>>
>>> # Fit PCA and transform the data
>>> pca = PCA()
>>> pca_data = pca.fit_transform(X)
>>>
>>> # Create PCA scatter plot
>>> plot_pca1vs2(pca,
...             pca_data,
...             df_name="Iris Dataset",
...             figsize=(10, 5))

Note

  • The function assumes that the input data has at least two principal components
  • Sample names are taken from the index of the created DataFrame
  • The percentage of variance explained is rounded to 1 decimal place
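The quantities behind such a plot can be derived from the fitted PCA object alone. The sketch below assumes the axis labels follow a `PC1 - <percent>%` pattern and that the intermediate DataFrame uses a default integer index as sample names; both are assumptions for illustration, not confirmed details of `plot_pca1vs2`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA()
pca_data = pca.fit_transform(iris.data)

# Percentage of variance explained per component, rounded to 1 decimal place.
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
labels = [f"PC{i + 1}" for i in range(len(per_var))]

# One row per sample; the default RangeIndex serves as the sample names here.
pca_df = pd.DataFrame(pca_data, columns=labels)

# Hypothetical axis label format.
x_label = f"PC1 - {per_var[0]:.1f}%"
y_label = f"PC2 - {per_var[1]:.1f}%"
print(x_label, "|", y_label)
```

A scatter plot of `pca_df["PC1"]` against `pca_df["PC2"]` with these two labels on the axes reproduces the essential content of the figure.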

plot_pca_scree

utils.pca.plot_pca_scree(pca, df_name='', max_scree=None, figsize=(12, 6))

Create a scree plot for Principal Component Analysis (PCA).

A scree plot shows the percentage of variance explained by each principal component in descending order. It helps in determining the optimal number of components to retain.

Parameters

Name Type Description Default
pca sklearn.decomposition.PCA Fitted PCA object containing the explained variance ratios. required
df_name str Name of the dataset to be displayed in the plot title. Defaults to empty string. ''
max_scree int Maximum number of principal components to plot. If None, all components are plotted. Defaults to None. None
figsize tuple Size of the figure as (width, height). Defaults to (12, 6). (12, 6)

Returns

Name Type Description
None None The function creates and displays a matplotlib plot.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import plot_pca_scree
>>>
>>> # Load iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Create scree plot
>>> plot_pca_scree(pca,
...                df_name="Iris Dataset",
...                max_scree=4,
...                figsize=(10, 5))
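The numbers a scree plot displays can also be inspected directly, which is handy when choosing how many components to keep. The sketch below assumes that `max_scree` simply truncates the list of components to the first `max_scree` entries; the cumulative column is an addition for illustration and is not something the documentation says the plot shows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

pca = PCA().fit(load_iris().data)

max_scree = 3  # show only the first three components; None would show all
per_var = pca.explained_variance_ratio_ * 100  # percent of variance per component
shown = per_var if max_scree is None else per_var[:max_scree]
cumulative = np.cumsum(per_var)

for i, (pct, cum) in enumerate(zip(shown, cumulative), start=1):
    print(f"PC{i}: {pct:5.1f}% (cumulative {cum:5.1f}%)")
```

A common rule of thumb is to keep components up to the "elbow" where the per-component percentage levels off, or until the cumulative percentage passes a chosen threshold.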