Calculate correlation with cor(), only for numerical columns

小蘑菇 2020-12-02 05:58

I have a dataframe and would like to calculate the correlation (with Spearman, data is categorical and ranked) but only for a subset of columns. I tried with all, but R's c

4 Answers
  •  孤街浪徒
    2020-12-02 06:40

    For numerical data you already have the solution. But you said the data is categorical, and then life gets a bit more complicated...
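
    For reference, the numeric-only case could be sketched roughly as below. The data frame `df` and its columns are made up for illustration; the idea is just to build a logical mask of numeric columns and pass only those to cor().

```r
# Hypothetical data frame mixing numeric and non-numeric columns
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 5),
  g = c("x", "y", "x", "y", "x"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric
num_cols <- vapply(df, is.numeric, logical(1))

# Spearman correlation matrix over the numeric columns only
rho <- cor(df[, num_cols, drop = FALSE], method = "spearman")
rho
```

    Using `drop = FALSE` keeps the result a data frame even if only one numeric column is left.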

    Well, first: the amount of association between two categorical variables is not measured with a Spearman rank correlation but with, for example, a chi-square test. That is logical, actually. Ranking means there is some order in your data. Now tell me which is larger, yellow or red? I know, R will sometimes compute a Spearman rank correlation on categorical data anyway. If I code yellow as 1 and red as 2, R will treat red as larger than yellow.
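
    A small sketch of that coding problem, using made-up data (the vectors `colour` and `score` and both codings are hypothetical): swapping the arbitrary numeric codes flips the sign of the Spearman correlation, even though the data has not changed.

```r
# Made-up data: an unordered colour label and a numeric score
colour <- c("yellow", "yellow", "yellow", "red", "red", "red")
score  <- c(1, 2, 3, 4, 5, 6)

# Two equally arbitrary numeric codings of the same labels
coding_a <- c(yellow = 1, red = 2)
coding_b <- c(yellow = 2, red = 1)

# Spearman correlation under each coding
rho_a <- cor(unname(coding_a[colour]), score, method = "spearman")
rho_b <- cor(unname(coding_b[colour]), score, method = "spearman")

# Same data, opposite signs
c(coding_a = rho_a, coding_b = rho_b)
```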

    So, forget about Spearman for categorical data. I'll demonstrate the chi-square test with chisq.test() and how to choose columns using combn(). But you would benefit from spending a bit more time with Agresti's book: http://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics/dp/0471360937

    set.seed(1234)
    X <- rep(c("A","B"),20)
    Y <- sample(c("C","D"),40,replace=TRUE)
    
    # Cross-tabulate the two variables and test for association
    table(X,Y)
    chisq.test(table(X,Y),correct=FALSE)
    # correct=FALSE: I don't use Yates' continuity correction
    
    # Let's make a data frame with lots of columns (80 rows, 25 columns)
    
    Data <- as.data.frame(
              matrix(
                sample(letters[1:3],2000,replace=TRUE),
                ncol=25
              )
            )
    
    # You want to select which columns to use
    columns <- c(3,7,11,24)
    vars <- names(Data)[columns]
    
    # Say you need to know which ones are associated with each other:
    # combn() gives every pair of selected columns, one pair per matrix column
    out <-  apply( combn(columns,2),2,function(x){
              chisq.test(table(Data[,x[1]],Data[,x[2]]),correct=FALSE)$p.value
            })
    
    # Put the variable names of each pair next to its p-value
    out <- cbind(as.data.frame(t(combn(vars,2))),out)
    

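    Then you should get something like this (the exact p-values depend on your R version, because the default sample() algorithm changed in R 3.6.0):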

    > out
       V1  V2       out
    1  V3  V7 0.8116733
    2  V3 V11 0.1096903
    3  V3 V24 0.1653670
    4  V7 V11 0.3629871
    5  V7 V24 0.4947797
    6 V11 V24 0.7259321
    

    Here V1 and V2 name the two variables being compared, and "out" gives the p-value of the chi-square test for association. None of the p-values is small, so there is no evidence that any pair of variables is associated. That is what you would expect, as I created the data at random.
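
    If you would rather read the result the way you read the output of cor(), the same p-values can be arranged in a square matrix. This is only a sketch under the same assumptions as the example above; the name `pmat` and the loop are my own illustration, not part of any package.

```r
set.seed(1234)
Data <- as.data.frame(
  matrix(sample(letters[1:3], 2000, replace = TRUE), ncol = 25)
)
vars <- names(Data)[c(3, 7, 11, 24)]

# Empty square matrix, one row and one column per selected variable
pmat <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
               dimnames = list(vars, vars))

# Fill both triangles with the chi-square p-value of each pair
pairs <- combn(vars, 2)
for (k in seq_len(ncol(pairs))) {
  a <- pairs[1, k]
  b <- pairs[2, k]
  p <- chisq.test(table(Data[[a]], Data[[b]]), correct = FALSE)$p.value
  pmat[a, b] <- p
  pmat[b, a] <- p
}

round(pmat, 3)
```

    The diagonal stays NA on purpose, since testing a variable against itself is not meaningful here.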
