This question should have a simple, elegant solution but I can\'t figure it out, so here it goes:
Let\'s say I have the following dataset and I want to count the num
Try rowSums:
> set.seed(1)
> ID <- LETTERS[1:5]
> X1 <- sample(1:5, 5,T)
> X2 <- sample(1:5, 5,T)
> X3 <- sample(1:5, 5,T)
> df <- data.frame(ID,X1,X2,X3)
> df
ID X1 X2 X3
1 A 2 5 2
2 B 2 5 1
3 C 3 4 4
4 D 5 4 2
5 E 2 1 4
> rowSums(df == 2)
[1] 2 1 0 1 1
Alternatively, with dplyr:
> df %>% mutate(numtwos = rowSums(. == 2))
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1
One approach is to use a combination of dplyr and tidyr to convert data into long format, and do the computation:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -ID) %>%
group_by(ID) %>%
summarise(numtwos = sum(value == 2)) %>%
inner_join(df, .)
Output is as follows:
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1
Here's another alternative using purrr:
library(purrr)
df %>%
by_row(function(x) {
sum(x[-1] == 2) },
.to = "numtwos",
.collate = "cols"
)
Which gives:
#Source: local data frame [5 x 5]
#
# ID X1 X2 X3 numtwos
# <fctr> <int> <int> <int> <int>
#1 A 2 5 2 2
#2 B 2 5 1 1
#3 C 3 4 4 0
#4 D 5 4 2 1
#5 E 2 1 4 1
As per mentioned in the NEWS, row based functionals are still maturing in dplyr:
We are still figuring out what belongs in
dplyrand what belongs inpurrr. Expect much experimentation and many changes with these functions.
Benchmark
We can see how rowwise() and do() compare to purrr::by_row() for this type of problem and how they "perform" against rowSums() and the tidy data way:
largedf <- df[rep(seq_len(nrow(df)), 10e3), ]
library(microbenchmark)
microbenchmark(
steven = largedf %>%
by_row(function(x) {
sum(x[-1] == 2) },
.to = "numtwos",
.collate = "cols"),
psidom = largedf %>%
rowwise %>%
do(data_frame(numtwos = sum(.[-1] == 2))) %>%
cbind(largedf, .),
gopala = largedf %>%
gather(key, value, -ID) %>%
group_by(ID) %>%
summarise(numtwos = sum(value == 2)) %>%
inner_join(largedf, .),
evan = largedf %>%
mutate(numtwos = rowSums(. == 2)),
times = 10L,
unit = "relative"
)
Results:
#Unit: relative
# expr min lq mean median uq max neval cld
# steven 1225.190659 1261.466936 1267.737126 1227.762573 1276.07977 1339.841636 10 b
# psidom 3677.603240 3759.402212 3726.891458 3678.717170 3728.78828 3777.425492 10 c
# gopala 2.715005 2.684599 2.638425 2.612631 2.59827 2.572972 10 a
# evan 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 10 a
You can use do, which doesn't add the column to your original data frame and you need to add the column to your original data frame.
df %>%
rowwise %>%
do(numtwos = sum(.[-1] == 2)) %>%
data.frame
numtwos
1 2
2 1
3 0
4 1
5 1
Add a cbind to bind the new column to the original data frame:
df %>%
rowwise %>%
do(numtwos = sum(.[-1] == 2)) %>%
data.frame %>% cbind(df, .)
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1
Just wanted to add to the answer of @evan.oman in case you only want to sum rows for specific columns, not all of them. You can use the regular select and/or select_helpers functions. In this example, we don't want to include X1 in rowSums:
df %>%
mutate(numtwos = rowSums(select(., -X1) == 2))
ID X1 X2 X3 numtwos
1 A 2 5 2 1
2 B 2 5 1 0
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 0