utils.pca.get_pca_topk
utils.pca.get_pca_topk(pca, feature_names, k=10)
Identify the top k features that have the strongest influence on PC1 and PC2.
This function analyzes the loading scores (coefficients) of the first two principal components to determine which original features contribute most strongly to these components. The absolute values of the loading scores are used to rank feature importance.
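The ranking described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's actual implementation: the helper name `topk_by_abs_loading` and the toy loading matrix are hypothetical, and the matrix stands in for the `components_` attribute of a fitted PCA object.

```python
import numpy as np

def topk_by_abs_loading(components, feature_names, k=10):
    """Hypothetical helper: rank features by absolute loading on PC1 and PC2."""
    components = np.asarray(components)
    names = np.asarray(feature_names)
    top = []
    for pc in components[:2]:  # rows 0 and 1 correspond to PC1 and PC2
        # Sorting the negated absolute loadings ascending yields descending importance
        order = np.argsort(-np.abs(pc), kind="stable")[:k]
        top.append(names[order].tolist())
    return top[0], top[1]

# Toy loading matrix (2 components x 4 features), made up for illustration
components = np.array([
    [0.1, -0.8, 0.5, 0.3],
    [0.7, 0.2, -0.1, -0.6],
])
feature_names = ["a", "b", "c", "d"]
top_pc1, top_pc2 = topk_by_abs_loading(components, feature_names, k=2)
```

Because the ranking uses absolute values, a feature with a large negative loading (such as `"b"` on the first row) ranks as high as one with an equally large positive loading.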
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| pca | sklearn.decomposition.PCA | Fitted PCA object whose components_ attribute holds the principal components. | required |
| feature_names | list-like | Names of the original features, in the same order as the features used to fit the PCA. | required |
| k | int | Number of top features to select for each principal component. Defaults to 10. | 10 |
Returns
| Name | Type | Description |
|---|---|---|
| tuple | tuple | A tuple of two lists (each a list[str]): first the names of the k features most influential on PC1, then the names of the k features most influential on PC2. |
Examples
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from spotpython.utils.pca import get_pca_topk
>>>
>>> # Load and prepare the iris dataset
>>> iris = load_iris()
>>> X = iris.data
>>> feature_names = iris.feature_names
>>>
>>> # Fit PCA
>>> pca = PCA()
>>> pca.fit(X)
>>>
>>> # Get top 2 most influential features for PC1 and PC2
>>> top_pc1, top_pc2 = get_pca_topk(pca, feature_names=feature_names, k=2)
>>> print("Top PC1 features:", top_pc1)
>>> print("Top PC2 features:", top_pc2)
Note
- The function assumes that the PCA was fitted on standardized data; on unscaled data, loadings reflect each feature's scale as well as its influence
- The length of feature_names must match the number of features in the PCA input
- k should not exceed the total number of features
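The last two notes can be checked before calling the function. The sketch below is a hypothetical pre-check, not part of spotpython: the name `check_topk_inputs` is invented for illustration, and it assumes only that the fitted object exposes `components_` with shape `(n_components, n_features)`, as scikit-learn's PCA does.

```python
from types import SimpleNamespace
import numpy as np

def check_topk_inputs(pca, feature_names, k):
    """Hypothetical pre-check mirroring the notes above; not part of spotpython."""
    n_components, n_features = pca.components_.shape
    if n_components < 2:
        raise ValueError("at least two principal components are required")
    if len(feature_names) != n_features:
        raise ValueError(
            f"expected {n_features} feature names, got {len(feature_names)}"
        )
    if not 1 <= k <= n_features:
        raise ValueError(f"k must be between 1 and {n_features}, got {k}")

# Stand-in for a fitted PCA: only the components_ attribute is needed here
fake_pca = SimpleNamespace(components_=np.zeros((2, 4)))
check_topk_inputs(fake_pca, ["a", "b", "c", "d"], k=2)  # passes silently
```

A real fitted `sklearn.decomposition.PCA` object could be passed in place of `fake_pca`, since the check reads nothing but `components_`.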