utils.pca.get_pca_topk

utils.pca.get_pca_topk(pca, feature_names, k=10)

Identify the top k features that have the strongest influence on PC1 and PC2.

This function analyzes the loading scores (coefficients) of the first two principal components to determine which original features contribute most strongly to these components. The absolute values of the loading scores are used to rank feature importance.
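The ranking described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper name `topk_by_abs_loading` and the small `components` array are hypothetical. The sketch only assumes that the first two rows of the array play the role of `pca.components_[0]` and `pca.components_[1]`.

```python
import numpy as np


def topk_by_abs_loading(components, feature_names, k=10):
    """Hypothetical helper: rank features by |loading| on PC1 and PC2."""
    components = np.asarray(components, dtype=float)
    names = np.asarray(feature_names)
    top = []
    for row in components[:2]:  # rows 0 and 1 stand for PC1 and PC2
        # Sort indices by absolute loading, largest first, and keep k of them
        order = np.argsort(np.abs(row))[::-1][:k]
        top.append(names[order].tolist())
    return top[0], top[1]


# Illustrative loadings for three features (made-up numbers, not real output)
components = np.array([
    [0.1, -0.9, 0.4],   # PC1
    [0.7, 0.2, -0.6],   # PC2
])
pc1, pc2 = topk_by_abs_loading(components, ["a", "b", "c"], k=2)
print(pc1)  # ['b', 'c']
print(pc2)  # ['a', 'c']
```

Because the ranking uses absolute values, the large negative loading of feature "b" on PC1 places it first even though its sign is negative.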

Parameters

Name Type Description Default
pca sklearn.decomposition.PCA Fitted PCA object containing the components_ attribute with the principal components. required
feature_names list-like Names of the original features, in the same order as the features used to fit the PCA. required
k int Number of top features to select for each principal component. Defaults to 10. 10

Returns

Name Type Description
tuple tuple A tuple containing two lists: (1) list[str]: names of the k features most influential on PC1; (2) list[str]: names of the k features most influential on PC2.

Examples

>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA
>>> from sklearn.preprocessing import StandardScaler
>>> from spotpython.utils.pca import get_pca_topk
>>>
>>> # Load the iris dataset and standardize it (see Note)
>>> iris = load_iris()
>>> X = StandardScaler().fit_transform(iris.data)
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA on the standardized data
>>> pca = PCA().fit(X)
>>>
>>> # Get top 2 most influential features for PC1 and PC2
>>> top_pc1, top_pc2 = get_pca_topk(pca,
...                                 feature_names=feature_names,
...                                 k=2)
>>> print("Top PC1 features:", top_pc1)
>>> print("Top PC2 features:", top_pc2)

Note

  • The function assumes that the PCA was fitted on standardized data; otherwise features with larger variance tend to dominate the loadings
  • The length of feature_names must match the number of features in the PCA input
  • k should not exceed the total number of features
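
The notes above can be turned into a small pre-flight check before calling the function. The following is only a sketch under stated assumptions: `standardize` and `check_topk_inputs` are hypothetical helpers, not part of `spotpython`, and whether `get_pca_topk` performs any such validation itself is not stated here.

```python
import numpy as np


def standardize(X):
    """Hypothetical helper: scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def check_topk_inputs(n_features, feature_names, k):
    """Hypothetical check mirroring the notes on feature_names and k."""
    if len(feature_names) != n_features:
        raise ValueError(
            f"expected {n_features} feature names, got {len(feature_names)}"
        )
    if not 1 <= k <= n_features:
        raise ValueError(f"k must be between 1 and {n_features}, got {k}")


# Made-up data: the second column has a much larger scale than the first
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]])
Z = standardize(X)

# With a fitted sklearn PCA, n_features could be read from
# pca.components_.shape[1]; here it is taken from the data directly.
check_topk_inputs(n_features=Z.shape[1], feature_names=["f1", "f2"], k=2)
```

After standardization both columns have unit variance, so neither dominates the principal components merely because of its scale.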