Count of values within specified range of value in each row using data.table

匆匆过客 提交于 2020-01-02 21:19:10

问题


To come up with a column of counts for each level (or combination of levels) for categorical variables is data.table syntax can be handled with something like:

#setting up the data so it's pasteable
df <- data.table(var1 = c('dog','cat','dog','cat','dog','dog','dog'),
                 var2 = c(1,5,90,95,91,110,8),
                 var3 = c('lamp','lamp','lamp','table','table','table','table'))

#adding a count column for var1
df[, var1count := .N, by = .(var1)]

#adding a count of each combo of var1 and var3
df[, var1and3comb := .N, by = .(var1,var3)]

I am curious as to how I could instead produce a count column that counts the number of records with a value that is within +- 5 from each value of var2.

In my non-functioning attempt at this,

df[, var2withinrange := .N, by = .(between((var2-5),(var2+5),var2))]

I get a column with the total number of records as opposed to the desired result. I'd be hoping for the first row to hold a value of 2, since the 1 and 5 fall into that range. Row 2 should have a value of 3, since the 1, 5, and 8 all fall into that range for the 5, and so on.

Any help on coming up with a solution is much appreciated. Ideally in data.table code!


回答1:


A solution with data.table:

df[, var2withinrange := df[.(var2min = var2 - 5, var2plus = var2 + 5)
                           , on = .(var2 >= var2min, var2 <= var2plus)
                           , .N
                           , by = .EACHI][, N]][]

which gives:

> df
   var1 var2  var3 var2withinrange
1:  dog    1  lamp               2
2:  cat    5  lamp               3
3:  dog   90  lamp               3
4:  cat   95 table               3
5:  dog   91 table               3
6:  dog  110 table               1
7:  dog    8 table               2



回答2:


The problem with your version is that it aggregates the boolean values. Every line will produce a FALSE as every value - 5 lies outside of its value + 5 and its value. If you would change it to

between(var2, var2 - 5, var2 + 5)

you would get TRUE back but still a 7 in every column because of the vectorized version.

Your problem can be solved with sapply which takes values by value and compares it to every value in the columns vectorized. This works but it is not mainly data.table code.

df$var2withinrange = sapply(df$var2, function(x){ sum(between(x, df$var2 - 5, df$var2 + 5)) })


来源:https://stackoverflow.com/questions/52949020/count-of-values-within-specified-range-of-value-in-each-row-using-data-table

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!