Question
Suppose I have four objects (a, b, c, d) and I ask five people to label each of them (category 1 or 2) according to their physical appearance or some other criterion. The labels provided by the five people are:
df <- data.frame(a = c(1,2,1,2,1), b = c(1,2,2,1,1), c = c(2,1,2,2,2), d = c(1,2,1,2,1))
In tabular format,
---------
a b c d
---------
1 1 2 1
2 2 1 2
1 2 2 1
2 1 2 2
1 1 2 1
---------
Now I want to calculate the percentage of times a group of objects was given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 of the 5 people, so the group's percentage is 3/5 (= 60%). Objects a and d were given the same label by all five people, so that group's percentage is 5/5 (= 100%).
I can calculate this statistic manually, but my original dataset has 50 such objects, 30 people, and 4 labels (1, 2, 3 and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in R that can calculate them?
Note: a group can be of any size. In the first example the group consists of a, b and d, whereas in the second it consists of a and d.
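For concreteness, the statistic for a single group can be computed directly from df above (a minimal sketch of the manual calculation, not code from the question):
# proportion of people who gave a, b and d the same label
mean(apply(df[c("a", "b", "d")], 1, function(x) length(unique(x)) == 1))
# [1] 0.6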
Answer 1:
There are two tasks here: first, making a list of all the relevant combinations, and second, evaluating and aggregating row-wise similarity. combn can handle the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with prop.table, but here it's simpler to calculate the proportion directly.
Here I've used tidyverse grammar (primarily purrr, which is helpful for handling lists), but you can convert it to base R if you like.
library(tidyverse)

map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>%   # get combinations
  flatten() %>%                                                  # eliminate nesting
  set_names(map_chr(., paste0, collapse = '')) %>%               # add useful names
  # subset df with each combination; check if each row has only one unique value
  map(~apply(df[.x], 1, function(x) n_distinct(x) == 1)) %>%
  map_dbl(~sum(.x) / length(.x))                                 # calculate TRUE proportion
## ab ac ad bc bd cd abc abd acd bcd abcd
## 0.6 0.2 1.0 0.2 0.6 0.2 0.0 0.6 0.2 0.0 0.0
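For those who prefer base R, a rough equivalent of the pipeline above might look like this (my translation of the answer's approach, not code from the answer itself):
# build all combinations of 2 or more column names
combos <- unlist(lapply(2:ncol(df), function(k) combn(names(df), k, simplify = FALSE)),
                 recursive = FALSE)
names(combos) <- sapply(combos, paste0, collapse = "")
# proportion of rows where every column in the combination agrees
sapply(combos, function(g) mean(apply(df[g], 1, function(x) length(unique(x)) == 1)))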
Answer 2:
If you have numeric ratings, you can use diff to check whether each rater gave every object in the group the same label, i.e. whether all consecutive differences within the group are zero:
f <- function(cols, data) {
  # count raters for whom all consecutive differences across the group are zero,
  # then divide by the total number of raters
  sum(colSums(diff(t(data[cols])) == 0) == (length(cols) - 1)) / nrow(data)
}
Results are as expected when applying the function to example groups:
f(c("a","b","d"), df)
#[1] 0.6
f(c("a","d"), df)
#[1] 1
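To evaluate every group automatically rather than one at a time, f can be mapped over all combinations; a hypothetical usage, not part of the original answer:
# all groups of size 3, named by their member columns
groups3 <- combn(names(df), 3, simplify = FALSE)
setNames(sapply(groups3, f, data = df), sapply(groups3, paste0, collapse = ""))
# abc abd acd bcd
# 0.0 0.6 0.2 0.0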
Answer 3:
With base R functions you could do:
groupVec = c("a","b","d")
transDF = t(as.matrix(df))
subDF = transDF[rownames(transDF) %in% groupVec, ]
subDF
# [,1] [,2] [,3] [,4] [,5]
# a 1 2 1 2 1
# b 1 2 2 1 1
# d 1 2 1 2 1
# if a column (rater) has only one unique value, all objects in the group match
# for that rater; matching raters / total raters = match percentage
match_pct = sum(sapply(as.data.frame(subDF), function(x) length(unique(x)) == 1)) / ncol(subDF)
match_pct
# [1] 0.6
Wrapping it in a custom function:
fn_matchPercent = function(groupVec = c("a","d")) {
  transDF = t(as.matrix(df))
  subDF = transDF[rownames(transDF) %in% groupVec, ]
  match_pct = sum(sapply(as.data.frame(subDF), function(x) length(unique(x)) == 1)) / ncol(subDF)
  outputDF = data.frame(groups = paste0(groupVec, collapse = ","), match_pct = match_pct)
  return(outputDF)
}
fn_matchPercent(c("a","d"))
# groups match_pct
# 1 a,d 1
fn_matchPercent(c("a","b","d"))
# groups match_pct
# 1 a,b,d 0.6
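Since the function returns a one-row data frame, results for all groups of a given size can be stacked into a single table, e.g. via combn's FUN argument (a hypothetical extension, not in the original answer):
# one row per three-object group
do.call(rbind, combn(names(df), 3, FUN = fn_matchPercent, simplify = FALSE))
#   groups match_pct
# 1  a,b,c       0.0
# 2  a,b,d       0.6
# 3  a,c,d       0.2
# 4  b,c,d       0.0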
Answer 4:
Try this:
find.unanimous.percentage <- function(df, at.a.time) {
  cols <- as.data.frame(t(combn(names(df), at.a.time)))
  names(cols) <- paste('O', 1:at.a.time, sep = '')
  # for each combination, check row-wise whether all labels agree, then average
  cols$percent.unanimous <- 100 * colMeans(apply(cols, 1, function(x)
    apply(df[x], 1, function(y) length(unique(y)) == 1)))
  return(cols)
}
find.unanimous.percentage(df, 2) # take 2 at a time
O1 O2 percent.unanimous
1 a b 60
2 a c 20
3 a d 100
4 b c 20
5 b d 60
6 c d 20
find.unanimous.percentage(df, 3) # take 3 at a time
O1 O2 O3 percent.unanimous
1 a b c 0
2 a b d 60
3 a c d 20
4 b c d 0
find.unanimous.percentage(df, 4)
O1 O2 O3 O4 percent.unanimous
1 a b c d 0
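Because each group size yields a data frame with a different number of object columns, results for all sizes are most naturally collected in a list; a hypothetical usage, not from the original answer:
# one data frame per group size, from pairs up to all four objects
lapply(2:ncol(df), function(k) find.unanimous.percentage(df, k))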
Answer 5:
Clustering similarity metrics
It seems that you might want a substantially different (better?) metric than the one you propose, if your actual problem is to evaluate various ways of clustering the same data.
This paper gives a good overview of the problem: http://cs.utsa.edu/~qitian/seminar/Spring11/03_11_11/IR2009.pdf. The BCubed precision/recall metrics are commonly used for similar problems in NLP (e.g. http://alias-i.com/lingpipe/docs/api/com/aliasi/cluster/ClusterScore.html).
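As a rough illustration of the kind of metric meant here, BCubed precision and recall can be sketched in a few lines of R (my own sketch of the standard definition; the function name and example vectors are assumptions, not from the linked material):
bcubed <- function(clustering, gold) {
  same_c <- outer(clustering, clustering, "==")  # pairs sharing a cluster
  same_g <- outer(gold, gold, "==")              # pairs sharing a gold category
  both   <- same_c & same_g
  c(precision = mean(rowSums(both) / rowSums(same_c)),
    recall    = mean(rowSums(both) / rowSums(same_g)))
}
bcubed(c(1, 1, 2, 2), c(1, 1, 1, 2))
# precision    recall
# 0.7500000 0.6666667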
Answer 6:
Try this code. It works for your example and should scale to the larger case; note that it evaluates groups of two objects at a time:
df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))
# Find all unique pairs of column names
group_pairs <- data.frame(t(combn(colnames(df), 2)))
# For each pair, calculate the proportion of people who gave both objects the same label
group_pairs$similarities <- apply(group_pairs, 1, function(x) {
  sum(df[x["X1"]] == df[x["X2"]]) / nrow(df)
})
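For reference (my addition, not part of the answer), the result on the example data matches the pairwise figures from Answer 1:
group_pairs
#   X1 X2 similarities
# 1  a  b          0.6
# 2  a  c          0.2
# 3  a  d          1.0
# 4  b  c          0.2
# 5  b  d          0.6
# 6  c  d          0.2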
Source: https://stackoverflow.com/questions/40713096/compute-similarity-percentage-or-compute-correlation-between-more-than-2-objects