PCA For categorical features?

清酒与你 2020-12-13 06:15

In my understanding, PCA can be performed only for continuous features. But while trying to understand the difference between one-hot encoding and label encoding ca…

6 Answers
  • 2020-12-13 06:50

    MCA (multiple correspondence analysis) is a well-known technique for dimensionality reduction of categorical data. In R there are many packages for MCA, some of which even combine it with PCA for mixed continuous/categorical data. In Python there is an mca library too. MCA applies much the same mathematics as PCA; indeed, the French statisticians used to say that "data analysis is finding the correct matrix to diagonalize".

    http://gastonsanchez.com/visually-enforced/how-to/2012/10/13/MCA-in-R/
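
    For concreteness, here is a minimal numpy/sklearn sketch of MCA as correspondence analysis of the one-hot (indicator) matrix; the toy data and values below are invented purely for illustration:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Toy categorical data: 6 observations, 2 categorical variables
    data = [["red", "small"], ["blue", "large"], ["red", "large"],
            ["green", "small"], ["blue", "small"], ["green", "large"]]

    Z = OneHotEncoder().fit_transform(data).toarray()   # indicator matrix

    P = Z / Z.sum()                    # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses

    # Standardized residuals, then SVD (the "matrix to diagonalize")
    S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

    row_coords = (U * sigma) / np.sqrt(r)[:, None]   # principal row coordinates
    print(row_coords[:, :2])           # first two MCA dimensions per observation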

  • 2020-12-13 06:50

    The following publication shows great and meaningful results when computing PCA on categorical variables treated as simplex vertices:

    Niitsuma H., Okada T. (2005) Covariance and PCA for Categorical Variables. In: Ho T.B., Cheung D., Liu H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg

    https://doi.org/10.1007/11430919_61

    It is available via https://arxiv.org/abs/0711.4452 (including as a PDF).

  • 2020-12-13 07:06

    I disagree with the others.

    While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good idea or that it will work very well.

    PCA is designed for continuous variables. It is built on variance (= squared deviations), and the concept of squared deviation breaks down when you have binary variables.

    So yes, you can use PCA, and yes, you get an output. It is even a least-squares output; it's not as if PCA would segfault on such data. It works, but it is just much less meaningful than you would want it to be, and arguably less meaningful than, for example, frequent pattern mining.

  • 2020-12-13 07:09

    PCA is a dimensionality reduction method that can be applied to any set of features. Here is an example using one-hot encoded (i.e. categorical) data:

    from sklearn.preprocessing import OneHotEncoder
    enc = OneHotEncoder()
    X = enc.fit_transform([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]).toarray()
    
    print(X)
    
    > array([[ 1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.],
             [ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.],
             [ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.],
             [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.]])
    
    
    from sklearn.decomposition import PCA
    pca = PCA(n_components=3)
    X_pca = pca.fit_transform(X)
    
    print(X_pca)
    
    > array([[-0.70710678,  0.79056942,  0.70710678],
             [ 1.14412281, -0.79056942,  0.43701602],
             [-1.14412281, -0.79056942, -0.43701602],
             [ 0.70710678,  0.79056942, -0.70710678]])
    
  • 2020-12-13 07:11

    PCA reduces dimensionality by exploiting linear relationships between variables. If there is only a single categorical variable coded as one-hot columns, the only exact linear relationship among those columns is that they sum to 1, so PCA can drop at most one dimension for free and gains little beyond that (see the sketch at the end of this answer).

    But if other variables exist, the one-hot columns may be approximately expressible as linear combinations of those variables.

    So whether PCA can meaningfully reduce the dimensionality depends on the relationships between the variables.
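
    A quick sanity check of the single-variable case mentioned above (toy data, sklearn; the setup is invented for illustration):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 4, size=(200, 1))            # one categorical var, k = 4
    X = OneHotEncoder().fit_transform(labels).toarray()   # shape (200, 4)

    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)
    # The last ratio is ~0: the 4 one-hot columns sum to 1, so only k - 1 = 3
    # dimensions carry information. PCA can drop one column "for free", no more.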

  • 2020-12-13 07:12

    Basically, PCA finds and eliminates less informative (redundant) information in a feature set and reduces the dimensionality of the feature space. In other words, imagine an N-dimensional space: PCA finds the M (M < N) directions along which the data varies most, so that the data can be represented as M-dimensional feature vectors. Mathematically, it is an eigenvalue/eigenvector decomposition of the covariance matrix of the feature space.
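
    As a minimal sketch of that eigenvalue/eigenvector view (random data, made up for illustration): the per-component variances reported by scikit-learn's PCA are exactly the eigenvalues of the feature covariance matrix.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=500)   # one nearly redundant feature

    cov = np.cov(X, rowvar=False)                    # 4x4 covariance matrix
    eigvals, _ = np.linalg.eigh(cov)                 # eigenvalues, ascending

    pca = PCA().fit(X)
    print(np.sort(eigvals)[::-1])     # matches...
    print(pca.explained_variance_)    # ...sklearn's per-component variances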

    So, it is not important whether the features are continuous or not.

    PCA is used widely in many applications, mostly for eliminating noisy, less informative data that comes from sensors or other hardware before classification/recognition.

    Edit:

    Statistically speaking, categorical features can be seen as discrete random variables taking values in the interval [0, 1]. The expectation E{X} and the variance E{(X - E{X})^2} are still valid and meaningful for discrete random variables, so I still stand by the applicability of PCA to categorical features.

    Consider a case where you would like to predict whether it is going to rain on a given day. You have a categorical feature X, "Do I have to go to work on that day?", which is 1 for yes and 0 for no. Clearly the weather does not depend on our work schedule, so P(R|X) = P(R). Assuming 5 work days per week, we have more 1s than 0s for X in our randomly collected dataset. PCA would probably lead to dropping this low-variance dimension in your feature representation.
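
    A hedged illustration of that scenario with synthetic data (the feature names and numbers below are invented): the 0/1 "work day" column has mean p and variance p(1 - p) as discussed above, and because it is independent of the other, higher-variance features it dominates the lowest-variance principal component, which dimension reduction would discard.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n = 5000
    p = 5 / 7                                  # work on 5 of 7 days
    temperature = rng.normal(20, 5, n)         # continuous, variance ~25
    humidity = rng.normal(60, 10, n)           # continuous, variance ~100
    work_day = rng.binomial(1, p, n)           # binary, variance p(1-p) ~ 0.20

    print(work_day.mean(), p)                  # E{X} = p
    print(work_day.var(), p * (1 - p))         # E{(X - E{X})^2} = p(1 - p)

    X = np.column_stack([temperature, humidity, work_day])   # unstandardized

    pca = PCA().fit(X)
    print(pca.explained_variance_)             # last component has tiny variance
    print(np.round(pca.components_[-1], 2))    # ...and loads almost only on work_day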

    At the end of the day, PCA is for dimensionality reduction with minimal loss of information. Intuitively, we rely on the variance of the data along a given axis to measure its usefulness for the task. I don't think there is any theoretical limitation to applying it to categorical features; the practical value depends on the application and the data, which is also the case for continuous variables.
