Use of column inside sum() function using dplyr's mutate() function

Question

I have a data frame and I want to create a new column prob using dplyr's mutate() function. For each row, prob should hold the probability P(other value > row value), that is, the proportion of rows in the data frame whose value is greater than that row's value. Here is what I tried:

data = data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

require(dplyr)

data %>% mutate(prob = sum(value < data$value) / nrow(data))

This gives the following results:

   value prob
1      1    0
2      2    0
3      3    0
4      3    0
...    ...  ...

Here prob only contains 0 for each row. If I replace value with 2 in the expression sum(value < data$value):

data %>% mutate(prob = sum(2 < data$value) / nrow(data))

I get the following results:

   value      prob
1      1 0.8823529
2      2 0.8823529
3      3 0.8823529
4      3 0.8823529
...    ...  ...

0.8823529 is the proportion of rows in the data frame whose value is greater than 2 (15 of 17). The problem seems to be that mutate() doesn't accept the value column as an argument inside sum().

Answer 1:

Adapting agstudy's code a bit for dplyr:

data %>% mutate(prob = sapply(value, function(x) sum(x < value) / nrow(data)))
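
A short base R sketch of why the original attempt returns 0 while the sapply() version works. It assumes that, inside mutate(), the name value refers to the whole column rather than to one row at a time; the variable names self_compare and prob are only illustrative.

```r
data <- data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

# Stand-in for what mutate() sees: `value` is the full column vector
value <- data$value

# Comparing the column with itself element by element: every entry is FALSE
self_compare <- value < data$value

# So the sum is a single 0, which mutate() then recycles to every row
sum(self_compare) / nrow(data)

# sapply() passes one element `x` at a time and compares it with the whole
# column, which yields one proportion per row
prob <- sapply(value, function(x) sum(x < value) / nrow(data))
```

With the literal 2 in place of value, the comparison is between one number and the whole column, which is why that variant gave a sensible (but constant) result.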

Answer 2:

I think a basic vapply (or sapply) would make much more sense here. However, if you really wanted to take the scenic route, you can try something like this:

data = data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

data %>% 
  rowwise() %>%                ## You are really working by rows here
  do(prob = sum(.$value < data$value) / nrow(data)) %>%
  mutate(prob = c(prob)) %>%   ## The previous value was a list -- unlist here
  cbind(data)                  ## and combine with the original data
#          prob value
# 1  0.94117647     1
# 2  0.88235294     2
# 3  0.76470588     3
# 4  0.76470588     3
# 5  0.58823529     4
# 6  0.58823529     4
# 7  0.58823529     4
# 8  0.47058824     5
# 9  0.47058824     5
# 10 0.41176471     6
# 11 0.35294118     7
# 12 0.05882353     8
# 13 0.05882353     8
# 14 0.05882353     8
# 15 0.05882353     8
# 16 0.05882353     8
# 17 0.00000000     9
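
For comparison, the plain vapply() approach mentioned at the start of this answer might look like the sketch below. It uses base R only; the rank()-based line is an additional vectorized alternative that is not part of the original answer, included here as an assumption-labelled cross-check.

```r
data <- data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))
n <- nrow(data)

# vapply() visits each value once and declares the result type up front
# (numeric(1)), so a wrong return type fails loudly instead of silently
data$prob <- vapply(
  data$value,
  function(x) sum(x < data$value) / n,
  numeric(1)
)

# Alternative (not from the original answer): with ties.method = "max",
# rank() gives the count of values <= each value, so n minus that rank
# is the count of strictly greater values
prob_rank <- (n - rank(data$value, ties.method = "max")) / n

# Both approaches should agree
all.equal(data$prob, prob_rank)
```

The vapply() version does n comparisons for each of the n rows, which is fine for small data; the rank() variant avoids the nested comparison if the data frame is large.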

Source: https://stackoverflow.com/questions/26200978/use-of-column-inside-sum-function-using-dplyrs-mutate-function

Tags

sum

dataframe

probability

dplyr