utils.pca
Functions
| Name | Description |
|---|---|
| get_loading_scores | Computes the loading scores matrix for Principal Component Analysis (PCA). |
| get_pca | Scale the numeric data and perform PCA. |
| get_pca_topk | Identify the top k features that have the strongest influence on PC1 and PC2. |
| plot_loading_scores | Creates a heatmap visualization of PCA loading scores. |
| plot_pca1vs2 | Create a scatter plot of the first two principal components from PCA. |
| plot_pca_scree | Plot the scree plot for Principal Component Analysis (PCA). |
get_loading_scores
utils.pca.get_loading_scores(pca, feature_names)
Computes the loading scores matrix for Principal Component Analysis (PCA).
Creates and returns a DataFrame showing how each original feature contributes to each principal component.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| pca | sklearn.decomposition.PCA | Fitted PCA object containing the components_ attribute with the principal components. | required |
| feature_names | list-like | Names of the original features; must match the order of the features used in PCA fitting. | required |
Returns
| Name | Type | Description |
|---|---|---|
| | pd.DataFrame | DataFrame containing the loading scores matrix, with features as rows and principal components as columns. |
Example
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_loading_scores
>>>
>>> # Load and prepare iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Compute and print loading scores
>>> scores_df = get_loading_scores(pca, feature_names)
>>> print(scores_df)
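As a rough illustration of what a loading matrix of this shape looks like, the sketch below builds one directly from a fitted PCA object's components_ attribute. It assumes the loadings are simply the transposed components_ rows with columns labeled PC1, PC2, and so on; the helper name loading_matrix_sketch is hypothetical, and the library's own function may label or scale the values differently.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA


def loading_matrix_sketch(pca, feature_names):
    """Hypothetical helper: features as rows, principal components as columns."""
    # components_ has shape (n_components, n_features), so transpose it.
    columns = [f"PC{i + 1}" for i in range(pca.components_.shape[0])]
    return pd.DataFrame(pca.components_.T, index=list(feature_names), columns=columns)


iris = load_iris()
pca = PCA().fit(iris.data)
scores_df = loading_matrix_sketch(pca, iris.feature_names)
print(scores_df.shape)
```

Each column of the resulting DataFrame is one principal axis, so a large absolute value in a row means that feature weighs heavily on that component.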
get_pca
utils.pca.get_pca(df, n_components=3)
Scale the numeric data and perform PCA.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | Input DataFrame. | required |
| n_components | int | Number of principal components to compute. Defaults to 3. | 3 |
Returns
| Name | Type | Description |
|---|---|---|
| tuple | tuple | A 5-tuple (pca, scaled_data, feature_names, sample_names, pca_data): pca (PCA) is the fitted PCA object; scaled_data (np.ndarray) is the scaled numeric data; feature_names (pd.Index) holds the names of the numeric features; sample_names (pd.Index) is the index of the samples; pca_data (np.ndarray) is the PCA-transformed data. |
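The steps behind these return values (select the numeric columns, scale them, fit PCA) can be sketched with pandas and scikit-learn as follows. This is an assumption about the general approach, not the library's code: StandardScaler is one common scaling choice, and the cap on n_components is added here only so that the toy data does not raise an error.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "A": [1.0, 2.0, 4.0],
    "B": [4.0, 3.0, 6.0],
    "C": ["x", "y", "z"],  # non-numeric, dropped below
})

# Keep only numeric columns and remember their names and the sample index.
numeric = df.select_dtypes(include="number")
feature_names = numeric.columns
sample_names = numeric.index

# Standardize each feature to zero mean and unit variance (assumed scaler).
scaled_data = StandardScaler().fit_transform(numeric)

# PCA cannot return more components than min(n_samples, n_features).
n_components = min(3, *scaled_data.shape)
pca = PCA(n_components=n_components)
pca_data = pca.fit_transform(scaled_data)

print(list(feature_names), pca_data.shape)
```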
Examples
>>> import pandas as pd
>>> from spotpython.utils.pca import get_pca
>>> df = pd.DataFrame({
... "A": [1, 2, 3],
... "B": [4, 5, 6],
... "C": ["x", "y", "z"] # Non-numeric column will be ignored
... })
>>> pca, scaled_data, feature_names, sample_names, pca_data = get_pca(df)
>>> print(feature_names)
Index(['A', 'B'], dtype='object')
>>> print(pca_data.shape)
(3, 2)
get_pca_topk
utils.pca.get_pca_topk(pca, feature_names, k=10)
Identify the top k features that have the strongest influence on PC1 and PC2.
This function analyzes the loading scores (coefficients) of the first two principal components to determine which original features contribute most strongly to these components. The absolute values of the loading scores are used to rank feature importance.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| pca | sklearn.decomposition.PCA | Fitted PCA object containing the components_ attribute with the principal components. | required |
| feature_names | list-like | Names of the original features; must match the order of the features used in PCA fitting. | required |
| k | int | Number of top features to select for each principal component. Defaults to 10. | 10 |
Returns
| Name | Type | Description |
|---|---|---|
| tuple | tuple | A tuple containing two lists: the names (list[str]) of the k features most influential on PC1, followed by the names (list[str]) of the k features most influential on PC2. |
Examples
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_pca_topk
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Get top 2 most influential features for PC1 and PC2
>>> top_pc1, top_pc2 = get_pca_topk(pca,
... feature_names=feature_names,
... k=2)
>>> print("Top PC1 features:", top_pc1)
>>> print("Top PC2 features:", top_pc2)
Note
- The function assumes that PCA has been fitted on standardized data
- The length of feature_names must match the number of features in the PCA input
- k should not exceed the total number of features
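The ranking by absolute loading described above can be sketched with NumPy as shown below. The helper name topk_sketch is hypothetical and the data is standardized first, in line with the note; the library's own tie-breaking and return details may differ.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def topk_sketch(pca, feature_names, k):
    """Hypothetical helper: top-k features by |loading| on PC1 and on PC2."""
    names = np.asarray(feature_names)
    result = []
    for row in pca.components_[:2]:  # rows 0 and 1 are PC1 and PC2
        order = np.argsort(np.abs(row))[::-1]  # largest magnitude first
        result.append(names[order[:k]].tolist())
    return result[0], result[1]


iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X_scaled)
top_pc1, top_pc2 = topk_sketch(pca, iris.feature_names, k=2)
print("Top PC1 features:", top_pc1)
print("Top PC2 features:", top_pc2)
```

Taking the absolute value matters because a strongly negative loading influences a component as much as a strongly positive one.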
plot_loading_scores
utils.pca.plot_loading_scores(loading_scores, figsize=(12, 8))
Creates a heatmap visualization of PCA loading scores.
Generates a heatmap showing the relationship between original features and principal components, with color intensity indicating the strength and direction of the relationship.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| loading_scores | pd.DataFrame | DataFrame containing the loading scores matrix with features as rows and principal components as columns. | required |
| figsize | tuple | Size of the figure as (width, height). Defaults to (12, 8). | (12, 8) |
Returns
| Name | Type | Description |
|---|---|---|
| None | None | The function creates and displays a matplotlib plot. |
Example
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_loading_scores, plot_loading_scores
>>>
>>> # Load and prepare iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA and get loading scores
>>> pca = PCA()
>>> pca.fit(X)
>>> scores_df = get_loading_scores(pca, feature_names)
>>>
>>> # Create heatmap
>>> plot_loading_scores(scores_df, figsize=(10, 6))
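For readers who want to see how such a heatmap can be drawn, here is a minimal sketch using plain matplotlib. The colormap, color limits, and labels are illustrative assumptions rather than the library's actual styling, and the loading matrix is built by hand from components_ instead of through the module's own function.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA().fit(iris.data)

# Assumed layout: features as rows, principal components as columns.
scores_df = pd.DataFrame(
    pca.components_.T,
    index=iris.feature_names,
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)

fig, ax = plt.subplots(figsize=(10, 6))
# A diverging colormap centered on zero shows both sign and magnitude.
image = ax.imshow(scores_df.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1, aspect="auto")
ax.set_xticks(range(scores_df.shape[1]))
ax.set_xticklabels(scores_df.columns)
ax.set_yticks(range(scores_df.shape[0]))
ax.set_yticklabels(scores_df.index)
fig.colorbar(image, ax=ax, label="Loading score")
ax.set_title("PCA loading scores")
```

Fixing the color limits to -1 and 1 works here because each principal axis is a unit vector, so no single loading can exceed 1 in magnitude.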
plot_pca1vs2
utils.pca.plot_pca1vs2(pca, pca_data, df_name='', figsize=(12, 6))
Create a scatter plot of the first two principal components from PCA.
This function visualizes the first two principal components (PC1 vs PC2) from a PCA analysis, creating a scatter plot where each point represents a sample in the transformed space. The percentage of variance explained by each component is shown on the axes.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| pca | sklearn.decomposition.PCA | Fitted PCA object containing the explained variance ratios and components. | required |
| pca_data | array-like | PCA-transformed data, where each row represents a sample and each column represents a principal component. | required |
| df_name | str | Name of the dataset to be displayed in the plot title. Defaults to empty string. | '' |
| figsize | tuple | Size of the figure as (width, height). Defaults to (12, 6). | (12, 6) |
Returns
| Name | Type | Description |
|---|---|---|
| None | None | The function creates and displays a matplotlib plot. |
Examples
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import plot_pca1vs2
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>>
>>> # Fit PCA and transform the data
>>> pca = PCA()
>>> pca_data = pca.fit_transform(X)
>>>
>>> # Create PCA scatter plot
>>> plot_pca1vs2(pca,
... pca_data,
... df_name="Iris Dataset",
...              figsize=(10, 5))
Note
- The function assumes that the input data has at least two principal components
- Sample names are taken from the index of the created DataFrame
- The percentage of variance explained is rounded to 1 decimal place
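The quantities behind such a plot can be computed without drawing anything, as in the sketch below: the scatter coordinates are the first two columns of the transformed data, and the axis percentages come from explained_variance_ratio_. The label format is an assumption for illustration; the actual plot may word its axis labels differently.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA()
pca_data = pca.fit_transform(iris.data)

# Percentage of variance explained per component, rounded to 1 decimal place.
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)

# Assumed label format, shown only to illustrate where the numbers come from.
x_label = f"PC1 - {per_var[0]}%"
y_label = f"PC2 - {per_var[1]}%"

# The scatter coordinates are the first two columns of the transformed data.
pc1, pc2 = pca_data[:, 0], pca_data[:, 1]
print(x_label, y_label, pc1.shape, pc2.shape)
```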
plot_pca_scree
utils.pca.plot_pca_scree(pca, df_name='', max_scree=None, figsize=(12, 6))
Plot the scree plot for Principal Component Analysis (PCA).
A scree plot shows the percentage of variance explained by each principal component in descending order. It helps in determining the optimal number of components to retain.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| pca | sklearn.decomposition.PCA | Fitted PCA object containing the explained variance ratios. | required |
| df_name | str | Name of the dataset to be displayed in the plot title. Defaults to empty string. | '' |
| max_scree | int | Maximum number of principal components to plot. If None, all components are plotted. Defaults to None. | None |
| figsize | tuple | Size of the figure as (width, height). Defaults to (12, 6). | (12, 6) |
Returns
| Name | Type | Description |
|---|---|---|
| None | None | The function creates and displays a matplotlib plot. |
Examples
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import plot_pca_scree
>>>
>>> # Load iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Create scree plot
>>> plot_pca_scree(pca,
... df_name="Iris Dataset",
... max_scree=4,
... figsize=(10, 5))
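To make the link between a scree plot and the choice of components concrete, the sketch below computes the bar heights (percentage of variance per component) and their running total. The 95% cutoff is a hypothetical rule of thumb chosen for illustration; it is not something the function applies.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

pca = PCA().fit(load_iris().data)

# Bar heights of a scree plot: percentage of variance per component.
per_var = pca.explained_variance_ratio_ * 100
cumulative = np.cumsum(per_var)

# Hypothetical rule of thumb: keep the fewest components reaching 95%.
threshold = 95.0
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

for i, (pv, cv) in enumerate(zip(per_var, cumulative), start=1):
    print(f"PC{i}: {pv:.1f}% (cumulative {cv:.1f}%)")
print("Components to keep:", n_keep)
```

In practice the cutoff is a judgment call; many readers instead look for the "elbow" where the bars flatten out.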