Filter each column of a data.frame based on a specific value

Anonymous (unverified), submitted 2019-12-03 01:57:01

Question:

Consider the following data frame:

df 

Using dplyr, how can I filter on each column (without explicitly naming them) for all values greater than or equal to 2?

Something that would mimic a hypothetical filter_each(funs(. >= 2))

Right now I'm doing:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2) 

Which is equivalent to:

df %>% filter(!rowSums(. < 2))

Note: Let's say I wanted to filter only on the first 4 columns, I would do:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2)  

or

df %>% filter(!rowSums(.[-5] < 2))
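To see why the rowSums() form matches the explicit filter, here is a minimal base R sketch. The data frame is hypothetical (the original df is not shown), and the names below, keep_all, keep_explicit and keep_first4 are introduced only for illustration.

```r
# Hypothetical sample data; the original df is not shown in the question.
df <- data.frame(
  X1 = c(1, 3, 5, 2),
  X2 = c(4, 2, 1, 6),
  X3 = c(7, 8, 2, 3),
  X4 = c(2, 9, 4, 5),
  X5 = c(3, 1, 6, 5)
)

# df < 2 is a logical matrix; rowSums() counts the TRUE entries per row.
below <- rowSums(df < 2)

# A count of 0 negates to TRUE, any positive count negates to FALSE.
keep_all <- !below

# The same mask built column by column, as in the explicit filter() call.
keep_explicit <- with(df, X1 >= 2 & X2 >= 2 & X3 >= 2 & X4 >= 2 & X5 >= 2)

# rowSums() carries row names, so drop them before comparing.
print(identical(unname(keep_all), keep_explicit))

# Dropping the fifth column restricts the test to X1..X4.
keep_first4 <- !rowSums(df[-5] < 2)
print(df[keep_first4, ])
```

In this sample only the last row passes on all five columns, while the second row also passes once X5 is left out.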

Is there a more efficient alternative?

Edit: sub question

How can I specify a column name to mimic a hypothetical filter_each(funs(. >= 2), -X5)?

Benchmark sub question

Since I have to run this on a large dataset, I benchmarked the suggestions.

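df <- …
microbenchmark(
  Marat = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(…, function(x, y) { call(">=", as.name(x), y) }, 2)),
  Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2))),
  times = 50L
)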

Here are the results:

#Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval
#   Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458    50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669    50
# Docendo  874.0247  933.1399  983.5435  985.3697 1026.901 1053.407    50

Answer 1:

Here's another option using slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector to slice, whereas filter takes a logical vector.

df %>% slice(which(!rowSums(select(., -matches("X5")) < 2)))

What I like about this approach is that because we use select inside rowSums you can make use of all the special functions that select supplies, like matches for example.
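As a rough illustration of the integer-versus-logical distinction and of using select helpers inside rowSums(), here is a small sketch. The sample data frame and the names via_slice, via_filter and via_range are hypothetical, chosen only for this example.

```r
library(dplyr)

# Hypothetical sample data; the original df is not shown in the thread.
df <- data.frame(
  X1 = c(1, 3, 5, 2),
  X2 = c(4, 2, 1, 6),
  X3 = c(7, 8, 2, 3),
  X4 = c(2, 9, 4, 5),
  X5 = c(3, 1, 6, 5)
)

# which() turns the logical mask into the row positions that slice() expects.
via_slice <- df %>%
  slice(which(!rowSums(select(., -matches("X5")) < 2)))

# filter() takes the logical mask directly.
via_filter <- df %>%
  filter(!rowSums(select(., -matches("X5")) < 2))

# Other select helpers work the same way, e.g. a column range.
via_range <- df %>%
  slice(which(!rowSums(select(., X1:X4) < 2)))

print(via_slice)
```

All three calls should keep the same rows here, since each one ignores X5 when testing the condition.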


Let's see how it compares to the other answers:

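df <- …
microbenchmark(
  … = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  … = filter_(df, .dots = lapply(…, function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2))),
  times = 50L
)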

Edit note: updated with a more reliable benchmark using 50 repetitions (times = 50L).


A comment suggested that base R would be as fast as the slice approach, without specifying which base R approach was meant. I therefore updated my answer with a comparison against base R, using almost the same approach as in my answer. For base R I used:

base = df[!rowSums(df[-5L] < 2), ]

Benchmark:

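df <- …
microbenchmark(
  … = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  … = filter_(df, .dots = lapply(…, function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2))),
  base = df[!rowSums(df[-5L] < 2), ],
  …,
  times = 50L
)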

These two base R approaches do not really perform better than, or even comparably to, the slice approach.

Edit note #2: added benchmark with base R options.



Answer 2:

Here's an idea that makes it fairly simple to choose the names. You can set up a list of calls to send to the .dots argument of filter_(). First, a function that creates an unevaluated call:

Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)

Now we use filter_(), passing a list of calls into the .dots argument using lapply(), choosing any name and value you want.

nm 

You can have a look at the unevaluated calls created by Call(), for example X4 and X5, with

lapply(names(df)[4:5], Call, 2L)
# [[1]]
# X4 >= 2L
#
# [[2]]
# X5 >= 2L

So if you adjust the names() in the X argument of lapply(), you should be fine.
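Since filter_() and its .dots argument are deprecated in recent dplyr releases, the same list-of-calls idea can be sketched with filter() and the !!! splice operator instead. This is an adaptation, not the answer's original code; the sample data frame and the names cols, conds and res are hypothetical.

```r
library(dplyr)

# Hypothetical sample data; the original df is not shown in the thread.
df <- data.frame(
  X1 = c(1, 3, 5, 2),
  X2 = c(4, 2, 1, 6),
  X3 = c(7, 8, 2, 3),
  X4 = c(2, 9, 4, 5),
  X5 = c(3, 1, 6, 5)
)

# Builds an unevaluated comparison such as X4 >= 2 from a column name.
Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)

# Columns to test: everything except X5.
cols <- setdiff(names(df), "X5")
conds <- lapply(cols, Call, 2)

# !!! splices the list of calls into filter() as separate conditions,
# which filter() combines with a logical AND.
res <- filter(df, !!!conds)
print(res)
```

Changing the vector passed to lapply() is all that is needed to test a different set of columns.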



Answer 3:

How can I specify a column name to mimic a hypothetical filter_each(funs(. >= 2), -X5)?

It might not be the most elegant solution, but it gets the job done:

df %>% filter(!rowSums(.[, !colnames(.) %in% 'X5', drop = FALSE] < 2))

To exclude several columns (e.g. X3 and X5), one can use:

df %>% filter(!rowSums(.[, !colnames(.) %in% c('X3', 'X5'), drop = FALSE] < 2))
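For readers on a newer dplyr, a closer match to the hypothetical filter_each() is if_all(). The sketch below is not from the original answers and assumes dplyr 1.0.4 or later; the sample data frame and the result names are hypothetical.

```r
library(dplyr)

# Hypothetical sample data; the original df is not shown in the thread.
df <- data.frame(
  X1 = c(1, 3, 5, 2),
  X2 = c(4, 2, 1, 6),
  X3 = c(7, 8, 2, 3),
  X4 = c(2, 9, 4, 5),
  X5 = c(3, 1, 6, 5)
)

# Keep rows where every column satisfies the predicate.
all_cols <- df %>% filter(if_all(everything(), ~ .x >= 2))

# Same test on every column except X5.
no_x5 <- df %>% filter(if_all(-X5, ~ .x >= 2))

# Excluding several columns at once.
no_x3_x5 <- df %>% filter(if_all(-c(X3, X5), ~ .x >= 2))

print(no_x5)
```

The first argument of if_all() uses the same selection syntax as select(), so helpers such as matches() or starts_with() can be used there as well.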

