Calculate correlation with cor(), only for numerical columns

后端未结

关注

 4  643

小蘑菇 2020-12-02 05:58

I have a dataframe and would like to calculate the correlation (with Spearman, data is categorical and ranked) but only for a subset of columns. I tried with all, but R\'s c

4条回答

孤街浪徒 (楼主)

2020-12-02 06:40
For numerical data you have the solution. But it is categorical data, you said. Then life gets a bit more complicated...

Well, first : The amount of association between two categorical variables is not measured with a Spearman rank correlation, but with a Chi-square test for example. Which is logic actually. Ranking means there is some order in your data. Now tell me which is larger, yellow or red? I know, sometimes R does perform a spearman rank correlation on categorical data. If I code yellow 1 and red 2, R would consider red larger than yellow.

So, forget about Spearman for categorical data. I'll demonstrate the chisq-test and how to choose columns using combn(). But you would benefit from a bit more time with Agresti's book : http://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics/dp/0471360937
```
set.seed(1234)
X <- rep(c("A","B"),20)
Y <- sample(c("C","D"),40,replace=T)

table(X,Y)
chisq.test(table(X,Y),correct=F)
# I don't use Yates continuity correction

#Let's make a matrix with tons of columns

Data <- as.data.frame(
          matrix(
            sample(letters[1:3],2000,replace=T),
            ncol=25
          )
        )

# You want to select which columns to use
columns <- c(3,7,11,24)
vars <- names(Data)[columns]

# say you need to know which ones are associated with each other.
out <-  apply( combn(columns,2),2,function(x){
          chisq.test(table(Data[,x[1]],Data[,x[2]]),correct=F)$p.value
        })

out <- cbind(as.data.frame(t(combn(vars,2))),out)
```
Then you should get :
```
> out
   V1  V2       out
1  V3  V7 0.8116733
2  V3 V11 0.1096903
3  V3 V24 0.1653670
4  V7 V11 0.3629871
5  V7 V24 0.4947797
6 V11 V24 0.7259321
```
Where V1 and V2 indicate between which variables it goes, and "out" gives the p-value for association. Here all variables are independent. Which you would expect, as I created the data at random.
0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...