utils.pca

Functions

Name	Description
get_loading_scores	Computes the loading scores matrix for Principal Component Analysis (PCA).
get_pca	Scale the numeric data and perform PCA.
get_pca_topk	Identify the top k features that have the strongest influence on PC1 and PC2.
plot_loading_scores	Creates a heatmap visualization of PCA loading scores.
plot_pca1vs2	Create a scatter plot of the first two principal components from PCA.
plot_pca_scree	Plot the scree plot for Principal Component Analysis (PCA).

get_loading_scores

utils.pca.get_loading_scores(pca, feature_names)

Computes the loading scores matrix for Principal Component Analysis (PCA).

Creates and returns a DataFrame showing how each original feature contributes to each principal component.

Parameters

Name	Type	Description	Default
pca	`sklearn.decomposition.PCA`	Fitted PCA object containing the `components_` attribute with the principal components.	required
feature_names	list-like	Names of the original features; must match the order of the features used when fitting the PCA.	required

Returns

Name	Type	Description
	pd.DataFrame	Loading scores matrix with features as rows and principal components as columns.

Example

>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_loading_scores
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Compute and print the loading scores
>>> scores_df = get_loading_scores(pca, feature_names)
>>> print(scores_df)
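
The relationship between a fitted PCA object and a loading scores matrix can be sketched as follows. The helper `loading_scores_sketch` and the `PC1`, `PC2`, ... column labels are hypothetical, and the sketch assumes the loadings are simply the transposed `components_` array; `get_loading_scores` itself may scale or label them differently.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA


def loading_scores_sketch(pca, feature_names):
    """Hypothetical helper: features as rows, principal components as columns."""
    # components_ has shape (n_components, n_features); transposing it puts
    # one original feature on each row.
    pc_labels = [f"PC{i + 1}" for i in range(pca.components_.shape[0])]
    return pd.DataFrame(pca.components_.T, index=list(feature_names), columns=pc_labels)


iris = load_iris()
pca = PCA().fit(iris.data)
scores = loading_scores_sketch(pca, iris.feature_names)
print(scores.shape)  # (4, 4): four features by four components
```

Each column of the resulting frame is a unit-length direction in feature space, so entries with a large absolute value mark the features that dominate that component.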

get_pca

utils.pca.get_pca(df, n_components=3)

Scale the numeric data and perform PCA.

Parameters

Name	Type	Description	Default
df	pd.DataFrame	Input DataFrame.	required
n_components	int	Number of principal components to compute. Defaults to 3.	`3`

Returns

Name	Type	Description
tuple	tuple	A tuple of five elements: pca (PCA), the fitted PCA object; scaled_data (np.ndarray), the scaled numeric data; feature_names (pd.Index), the names of the numeric features; sample_names (pd.Index), the index of the samples; pca_data (np.ndarray), the PCA-transformed data.

Examples

>>> import pandas as pd
>>> from spotpython.utils.pca import get_pca
>>> df = pd.DataFrame({
...     "A": [1, 2, 3],
...     "B": [4, 5, 6],
...     "C": ["x", "y", "z"]  # Non-numeric column will be ignored
... })
>>> # Only two numeric columns are available, so request two components
>>> pca, scaled_data, feature_names, sample_names, pca_data = get_pca(df, n_components=2)
>>> print(feature_names)
Index(['A', 'B'], dtype='object')
>>> print(pca_data.shape)
(3, 2)
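
The scale-then-PCA workflow can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not the implementation of `get_pca`: the helper name `scale_and_pca_sketch` is hypothetical, and the use of `StandardScaler` for the scaling step is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def scale_and_pca_sketch(df, n_components=2):
    """Hypothetical stand-in for the scale-then-PCA workflow."""
    numeric = df.select_dtypes(include=[np.number])   # drop non-numeric columns
    scaled = StandardScaler().fit_transform(numeric)  # zero mean, unit variance
    pca = PCA(n_components=n_components)
    pca_data = pca.fit_transform(scaled)
    return pca, scaled, numeric.columns, numeric.index, pca_data


df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 3, 6, 5], "C": ["w", "x", "y", "z"]})
pca, scaled, features, samples, pca_data = scale_and_pca_sketch(df)
print(list(features))   # ['A', 'B']
print(pca_data.shape)   # (4, 2)
```

Note that scikit-learn requires `n_components` to be no larger than the smaller of the number of samples and the number of numeric features.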

get_pca_topk

utils.pca.get_pca_topk(pca, feature_names, k=10)

Identify the top k features that have the strongest influence on PC1 and PC2.

This function analyzes the loading scores (coefficients) of the first two principal components to determine which original features contribute most strongly to these components. The absolute values of the loading scores are used to rank feature importance.

Parameters

Name	Type	Description	Default
pca	`sklearn.decomposition.PCA`	Fitted PCA object containing the `components_` attribute with the principal components.	required
feature_names	list-like	Names of the original features; must match the order of the features used when fitting the PCA.	required
k	int	Number of top features to select for each principal component. Defaults to 10.	`10`

Returns

Name	Type	Description
tuple	tuple	A tuple of two lists: the names (list[str]) of the k features most influential on PC1, and the names (list[str]) of the k features most influential on PC2.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_pca_topk
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Get top 2 most influential features for PC1 and PC2
>>> top_pc1, top_pc2 = get_pca_topk(pca,
...                                 feature_names=feature_names,
...                                 k=2)
>>> print("Top PC1 features:", top_pc1)
>>> print("Top PC2 features:", top_pc2)

Note

- The function assumes that PCA has been fitted on standardized data.
- The length of feature_names must match the number of features in the PCA input.
- k should not exceed the total number of features.
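
The ranking by absolute loading described for this function can be sketched with NumPy. The helper `topk_features_sketch` is hypothetical and only illustrates the idea; how `get_pca_topk` orders ties or formats its output is not assumed here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA


def topk_features_sketch(pca, feature_names, k=2):
    """Hypothetical helper: top-k features by |loading| for PC1 and PC2."""
    names = np.asarray(feature_names)
    top = []
    for component in pca.components_[:2]:            # rows for PC1 and PC2
        order = np.argsort(np.abs(component))[::-1]  # largest magnitude first
        top.append(names[order[:k]].tolist())
    return top[0], top[1]


iris = load_iris()
pca = PCA().fit(iris.data)
top_pc1, top_pc2 = topk_features_sketch(pca, iris.feature_names, k=2)
print("Top PC1 features:", top_pc1)
print("Top PC2 features:", top_pc2)
```

Taking absolute values matters because the sign of a principal component is arbitrary: a feature with a loading of -0.8 influences the component as strongly as one with +0.8.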

plot_loading_scores

utils.pca.plot_loading_scores(loading_scores, figsize=(12, 8))

Creates a heatmap visualization of PCA loading scores.

Generates a heatmap showing the relationship between original features and principal components, with color intensity indicating the strength and direction of the relationship.

Parameters

Name	Type	Description	Default
loading_scores	pd.DataFrame	DataFrame containing the loading scores matrix with features as rows and principal components as columns.	required
figsize	tuple	Size of the figure as (width, height). Defaults to (12, 8).	`(12, 8)`

Returns

Name	Type	Description
None	None	The function creates and displays a matplotlib plot.

Example

>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_loading_scores, plot_loading_scores
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA and get the loading scores
>>> pca = PCA()
>>> pca.fit(X)
>>> scores_df = get_loading_scores(pca, feature_names)
>>>
>>> # Create the heatmap
>>> plot_loading_scores(scores_df, figsize=(10, 6))

plot_pca1vs2

utils.pca.plot_pca1vs2(pca, pca_data, df_name='', figsize=(12, 6))

Create a scatter plot of the first two principal components from PCA.

This function visualizes the first two principal components (PC1 vs PC2) from a PCA analysis, creating a scatter plot where each point represents a sample in the transformed space. The percentage of variance explained by each component is shown on the axes.

Parameters

Name	Type	Description	Default
pca	`sklearn.decomposition.PCA`	Fitted PCA object containing the explained variance ratios and components.	required
pca_data	array-like	PCA-transformed data, where each row represents a sample and each column represents a principal component.	required
df_name	str	Name of the dataset to be displayed in the plot title. Defaults to empty string.	`''`
figsize	tuple	Size of the figure as (width, height). Defaults to (12, 6).	`(12, 6)`

Returns

Name	Type	Description
None	None	The function creates and displays a matplotlib plot.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import plot_pca1vs2
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>>
>>> # Fit PCA and transform the data
>>> pca = PCA()
>>> pca_data = pca.fit_transform(X)
>>>
>>> # Create PCA scatter plot
>>> plot_pca1vs2(pca,
...             pca_data,
...             df_name="Iris Dataset",
...             figsize=(10, 5))

Note

- The function assumes that the input data has at least two principal components.
- Sample names are taken from the index of the created DataFrame.
- The percentage of variance explained is rounded to 1 decimal place.
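
How the explained-variance percentages end up on the axes can be sketched with matplotlib directly. The label and title strings below are assumptions chosen for illustration and are not taken from `plot_pca1vs2`; the `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

pca = PCA()
pca_data = pca.fit_transform(load_iris().data)

# Percentage of variance explained per component, rounded to one decimal place.
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(pca_data[:, 0], pca_data[:, 1])  # one point per sample
ax.set_xlabel(f"PC1 - {per_var[0]}%")       # assumed label format
ax.set_ylabel(f"PC2 - {per_var[1]}%")
ax.set_title("PC1 vs PC2 (sketch)")
```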

plot_pca_scree

utils.pca.plot_pca_scree(pca, df_name='', max_scree=None, figsize=(12, 6))

Plot the scree plot for Principal Component Analysis (PCA).

A scree plot shows the percentage of variance explained by each principal component in descending order. It helps in determining the optimal number of components to retain.

Parameters

Name	Type	Description	Default
pca	`sklearn.decomposition.PCA`	Fitted PCA object containing the explained variance ratios.	required
df_name	str	Name of the dataset to be displayed in the plot title. Defaults to empty string.	`''`
max_scree	int	Maximum number of principal components to plot. If None, all components are plotted. Defaults to None.	`None`
figsize	tuple	Size of the figure as (width, height). Defaults to (12, 6).	`(12, 6)`

Returns

Name	Type	Description
None	None	The function creates and displays a matplotlib plot.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import plot_pca_scree
>>>
>>> # Load iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Create scree plot
>>> plot_pca_scree(pca,
...                df_name="Iris Dataset",
...                max_scree=4,
...                figsize=(10, 5))
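
The quantity shown in a scree plot can be computed from `explained_variance_ratio_`. The following is a minimal sketch, not the body of `plot_pca_scree`: the bar-chart styling, axis labels, and the local variable `max_scree` handling are assumptions, and the `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

pca = PCA().fit(load_iris().data)

# Percentage of variance explained by each component; scikit-learn returns
# the ratios already sorted from largest to smallest.
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)

max_scree = 3  # assumed cap on the number of components to draw
shown = per_var[:max_scree]
labels = [f"PC{i + 1}" for i in range(len(shown))]

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(labels, shown)
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance (%)")
ax.set_title("Scree plot (sketch)")
```

A common reading of such a plot is to retain the components before the point where the bars level off.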