Recovering feature names of explained_variance_ratio_ in PCA with sklearn


I'm trying to recover, from a PCA done with scikit-learn, which features are selected as relevant.

A classic example with the iris dataset.
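
The question's code is not reproduced here, but a minimal version of that setup might look like the following sketch (the variable names are illustrative):

    import pandas as pd
    from sklearn import datasets
    from sklearn.decomposition import PCA

    iris = datasets.load_iris()
    X = pd.DataFrame(iris.data, columns=iris.feature_names)

    pca = PCA(n_components=2).fit(X)
    # two ratios, one per component -- but which original features do they correspond to?
    print(pca.explained_variance_ratio_)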

5 Answers

    The way this question is phrased reminds me of a misunderstanding I had of Principal Component Analysis when I was first trying to figure it out. I’d like to go through it here in the hope that others won’t spend as much time on a road to nowhere as I did before the penny finally dropped.

    The notion of “recovering” feature names suggests that PCA identifies those features that are most important in a dataset. That’s not strictly true.

    PCA, as I understand it, identifies the features with the greatest variance in a dataset, and can then use this quality of the dataset to create a smaller dataset with minimal loss of descriptive power. The advantage of a smaller dataset is that it requires less processing power and should contain less noise. But the features of greatest variance are not the "best" or "most important" features of a dataset, insofar as such concepts can be said to exist at all.

    To bring that theory into the practicalities of @Rafa’s sample code above:

    # imports needed for the snippet to run standalone
    import pandas as pd
    from sklearn import datasets, preprocessing
    from sklearn.decomposition import PCA
    
    # load dataset
    iris = datasets.load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    
    # normalize data so every feature has zero mean and unit variance
    data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)
    
    # PCA, reducing to two components
    pca = PCA(n_components=2)
    pca.fit_transform(data_scaled)
    

    Consider the following:

    post_pca_array = pca.fit_transform(data_scaled)
    
    print(data_scaled.shape)     # (150, 4)
    
    print(post_pca_array.shape)  # (150, 2)
    

    In this case, post_pca_array has the same 150 rows of data as data_scaled, but its four columns have been reduced to two.
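
    How little descriptive power was lost in that reduction can be read off explained_variance_ratio_ (the attribute the question title asks about), which gives each component's share of the total variance. Continuing from the fitted pca above:

    print(pca.explained_variance_ratio_)
    # roughly [0.73, 0.23] for the scaled iris data
    print(pca.explained_variance_ratio_.sum())
    # the two components together retain roughly 96% of the variance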

    The critical point here is that the two columns (or components, to be terminologically consistent) of post_pca_array are not the two “best” columns of data_scaled. They are two new columns, determined by the algorithm behind the PCA class in sklearn.decomposition. The second column, PC-2 in @Rafa’s example, is informed by sepal_width more than any other column, but the values in PC-2 and data_scaled['sepal_width'] are not the same.
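
    For the part that is legitimately interesting, namely how much each original column contributes to each component, the fitted PCA object exposes components_, whose rows are the per-component loadings on the original features. A short sketch continuing from the pca fitted above:

    loadings = pd.DataFrame(pca.components_,
                            columns=data_scaled.columns,
                            index=['PC-1', 'PC-2'])
    print(loadings)
    # the largest absolute loading in the PC-2 row belongs to sepal width (cm)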

    As such, while it’s interesting to find out how much each column in the original data contributed to the components of a post-PCA dataset, the notion of “recovering” column names is a little misleading, and it certainly misled me for a long time. The only situation in which there would be a one-to-one match between post-PCA components and original columns is if the number of principal components were set to the same number as columns in the original. However, there would be little point in doing so: the data would not have changed, only been rotated. You would only have gone there to come back again, as it were.
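
    To make that last point concrete: with as many components as original columns, PCA is just a lossless rotation, and inverse_transform maps the rotated data straight back. A minimal sketch, building on data_scaled above (np.allclose is used only to verify the round trip):

    import numpy as np

    pca_full = PCA(n_components=4)                # as many components as original columns
    rotated = pca_full.fit_transform(data_scaled)

    # the rotation loses nothing: inverse_transform recovers the original data
    reconstructed = pca_full.inverse_transform(rotated)
    print(np.allclose(reconstructed, data_scaled))   # True, up to floating-point error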
