How to delete rows where all the columns are zero

旧时难觅i 2020-12-07 03:32

I have the following data frame

dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))

Which prints:

> dat
  a b c
1 0 1 0
2 0 0 0
3 2 0 1
4 3 0 3

4 Answers
  • 2020-12-07 03:43

    Try dat[rowSums(abs(dat)) != 0,].
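
A small sketch of why the abs() matters here, using a hypothetical data frame (demo is not from the question): without it, a row whose positive and negative values cancel would be dropped by mistake.

```r
# Hypothetical data: row 1 has values that cancel out, row 2 is all zero
demo <- data.frame(a = c(-1, 0, 2), b = c(1, 0, 0))

keep_plain <- rowSums(demo) != 0       # row 1 sums to 0, so it would be dropped too
keep_abs   <- rowSums(abs(demo)) != 0  # only the genuinely all-zero row 2 is dropped

res_abs <- demo[keep_abs, ]
print(res_abs)
```

If the data can never contain negative values, the two masks should agree.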

  • 2020-12-07 03:45

    Why use sum? It is much more efficient to simply check whether all elements are zero. I would do:

    dat = dat[!apply(dat, 1, function(x) all(x == 0)), ]
    

    If you need to keep track of which rows were removed:

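    allzero = apply(dat, 1, function(x) all(x == 0))
    indremoved = which(allzero)   # indices of the rows that will be dropped
    dat = dat[!allzero, ]         # a logical mask is safe even when no row is all zero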
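
One caveat with negative indexing, sketched on a hypothetical data frame (no_zero_rows is made up for illustration): when which() finds no all-zero row it returns an empty index, and negating an empty index selects nothing rather than everything. A logical mask avoids this.

```r
# Hypothetical data frame that has no all-zero row
no_zero_rows <- data.frame(a = c(1, 2), b = c(0, 3))
allzero <- apply(no_zero_rows, 1, function(x) all(x == 0))

by_index <- no_zero_rows[-which(allzero), ]  # empty index: returns 0 rows
by_mask  <- no_zero_rows[!allzero, ]         # keeps both rows, as intended

print(nrow(by_index))
print(nrow(by_mask))
```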
    
  • 2020-12-07 03:50

    A shorter and more efficient approach (at least on my machine) is to use Reduce and |:

    dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))
    dat[Reduce(`|`,dat),]
    #   a b c
    # 1 0 1 0
    # 3 2 0 1
    # 4 3 0 3
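
For readers unfamiliar with Reduce, here is a rough sketch of what the call does on the question's data: it folds | over the columns, and | treats any nonzero number as TRUE. The names keep and same are only for illustration.

```r
dat <- data.frame(a = c(0, 0, 2, 3), b = c(1, 0, 0, 0), c = c(0, 0, 1, 3))

# Reduce folds `|` over the columns, i.e. (a | b) | c.
# `|` coerces numbers to logical, so any nonzero value counts as TRUE.
keep <- Reduce(`|`, dat)
same <- dat$a | dat$b | dat$c   # the same thing written out by hand

print(keep)
print(identical(keep, same))
```

Because the coercion happens column by column, this assumes every column is numeric or logical.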
    

    Handling NAs

    The current solutions don't handle NAs. To adapt mine (using the example from "How to remove rows with all zeros without using rowSums in R?"):

    dat2 <- data.frame(a=c(0,0,0,0),b=c(0,-1,NA,1),c=c(0,1,0,-1),d=c(0,NA,0,0), e=c(0,0,NA,1))
    #   a  b  c  d  e
    # 1 0  0  0  0  0
    # 2 0 -1  1 NA  0
    # 3 0 NA  0  0 NA
    # 4 0  1 -1  0  1
    

    If you want to remove rows that contain only NAs and zeros:

    dat2[Reduce(`|`,`[<-`(dat2,is.na(dat2),value=0)),]
    #   a  b  c  d e
    # 2 0 -1  1 NA 0
    # 4 0  1 -1  0 1
    

    If you want to keep them:

    dat2[Reduce(`|`,`[<-`(dat2,is.na(dat2),value=1)),]
    #   a  b  c  d  e
    # 2 0 -1  1 NA  0
    # 3 0 NA  0  0 NA
    # 4 0  1 -1  0  1
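
The `[<-` call can look cryptic. As far as I can tell it is just the function form of an NA replacement that returns a modified copy, so the original data frame keeps its NAs in the printed result. A sketch on the same example data (filled and copy are illustrative names):

```r
dat2 <- data.frame(a = c(0, 0, 0, 0), b = c(0, -1, NA, 1), c = c(0, 1, 0, -1),
                   d = c(0, NA, 0, 0), e = c(0, 0, NA, 1))

# Function form: returns a copy with the NAs replaced, dat2 itself is untouched
filled <- `[<-`(dat2, is.na(dat2), value = 0)

# The more familiar assignment form, done on an explicit copy
copy <- dat2
copy[is.na(copy)] <- 0

print(any(is.na(filled)))            # the copy has no NAs left
print(any(is.na(dat2)))              # the original still has them
print(isTRUE(all.equal(filled, copy)))
```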
    

    Updated benchmark (all methods assuming no NAs)

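    library(microbenchmark)
    # Codoremifa(), Marco(), Sven(), Sven_2() and Sven_3() are defined in the benchmark answer below
    dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))
    mm <- function() dat[Reduce(`|`,dat),]
    microbenchmark(Codoremifa(), Marco(), Sven(), Sven_2(), Sven_3(),mm(),unit='relative',times=50)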
    # Unit: relative
    #         expr      min       lq     mean   median       uq      max neval
    # Codoremifa() 4.060050 4.020630 3.979949 3.921504 3.814334 4.517048    50
    #      Marco() 2.473624 2.358608 2.397922 2.444411 2.431119 2.365830    50
    #       Sven() 1.932279 1.937906 1.954935 2.013045 1.999980 1.960975    50
    #     Sven_2() 1.857111 1.834460 1.871929 1.885606 1.898201 2.595113    50
    #     Sven_3() 1.781943 1.731038 1.814738 1.800647 1.766469 3.346325    50
    #         mm() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    50
    
    
    # a data frame with 10,000 rows
    set.seed(1)
    dat <- dat[sample(nrow(dat), 10000, TRUE), ]
    library(microbenchmark)
    microbenchmark(Codoremifa(), Marco(), Sven(), Sven_2(), Sven_3(),mm(),unit='relative',times=50)
    # Unit: relative
    #         expr       min        lq      mean    median        uq       max neval
    # Codoremifa()  1.395990  1.496361  3.224857  1.520903  3.146186 26.793544    50
    #      Marco() 35.794446 36.015642 29.930283 35.625356 34.414162 13.379470    50
    #       Sven()  1.347117  1.363027  1.473354  1.375143  1.408369  1.175388    50
    #     Sven_2()  1.268169  1.281210  1.466629  1.299255  1.355403  2.605840    50
    #     Sven_3()  1.067669  1.124846  1.380731  1.122851  1.191207  2.384538    50
    #         mm()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    50
    
  • 2020-12-07 04:01

    You can use (1):

    dat[as.logical(rowSums(dat != 0)), ]
    

    This works for both positive and negative values.

    Another possibility, even faster for large datasets, is (2):

    dat[rowSums(!as.matrix(dat)) < ncol(dat), ]
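
To make the logic of (1) and (2) concrete, here is a sketch on the question's data: (1) counts the nonzero entries per row, (2) counts the zero entries per row, and a row is all zero exactly when the second count equals the number of columns. The variable names are only for illustration.

```r
dat <- data.frame(a = c(0, 0, 2, 3), b = c(1, 0, 0, 0), c = c(0, 0, 1, 3))

nonzero_count <- rowSums(dat != 0)          # (1): nonzero entries per row
zero_count    <- rowSums(!as.matrix(dat))   # (2): zero entries per row

keep1 <- as.logical(nonzero_count)          # TRUE when at least one entry is nonzero
keep2 <- zero_count < ncol(dat)             # TRUE when not every entry is zero

print(nonzero_count)
print(zero_count)
```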
    

    A faster approach for short and long data frames is to use matrix multiplication (3):

    dat[as.logical(abs(as.matrix(dat)) %*% rep(1L, ncol(dat))), ]
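
In case the matrix multiplication looks mysterious: multiplying by a vector of ones simply adds up each row, so (3) computes the row sums of the absolute values. A quick check on the question's data (via_matmul and via_rowsums are illustrative names):

```r
dat <- data.frame(a = c(0, 0, 2, 3), b = c(1, 0, 0, 0), c = c(0, 0, 1, 3))
m <- abs(as.matrix(dat))

via_matmul  <- as.vector(m %*% rep(1L, ncol(dat)))  # row sums via a matrix product
via_rowsums <- unname(rowSums(m))                   # the same sums via rowSums

print(via_matmul)
print(isTRUE(all.equal(via_matmul, via_rowsums)))
```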
    

    Some benchmarks:

    # the original dataset
    dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))
    
    Codoremifa <- function() dat[rowSums(abs(dat)) != 0,]
    Marco <- function() dat[!apply(dat, 1, function(x) all(x == 0)), ]
    Sven <- function() dat[as.logical(rowSums(dat != 0)), ]
    Sven_2 <- function() dat[rowSums(!as.matrix(dat)) < ncol(dat), ]
    Sven_3 <- function() dat[as.logical(abs(as.matrix(dat)) %*% rep(1L,ncol(dat))), ]
    
    library(microbenchmark)
    microbenchmark(Codoremifa(), Marco(), Sven(), Sven_2(), Sven_3())
    # Unit: microseconds
    #          expr     min       lq   median       uq     max neval
    #  Codoremifa() 267.772 273.2145 277.1015 284.0995 1190.197   100
    #       Marco() 192.509 198.4190 201.2175 208.9925  265.594   100
    #        Sven() 143.372 147.7260 150.0585 153.9455  227.031   100
    #      Sven_2() 152.080 155.1900 156.9000 161.5650  214.591   100
    #      Sven_3() 146.793 151.1460 153.3235 157.9885  187.845   100
    
    
    # a data frame with 10,000 rows
    set.seed(1)
    dat <- dat[sample(nrow(dat), 10000, TRUE), ]
    microbenchmark(Codoremifa(), Marco(), Sven(), Sven_2(), Sven_3())
    # Unit: milliseconds
    #          expr       min        lq    median        uq        max neval
    #   Codoremifa()  2.426419  2.471204  3.488017  3.750189  84.268432   100
    #        Marco() 36.268766 37.840246 39.406751 40.791321 119.233175   100
    #         Sven()  2.145587  2.184150  2.205299  2.270764  83.055534   100
    #       Sven_2()  2.007814  2.048711  2.077167  2.207942  84.944856   100
    #       Sven_3()  1.814994  1.844229  1.861022  1.917779   4.452892   100
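
Before comparing timings it is probably worth confirming that all the candidates return the same rows. A minimal sanity check on the original small data frame, assuming no NAs (results and all_same are illustrative names; mm is the Reduce-based variant from the other answer):

```r
dat <- data.frame(a = c(0, 0, 2, 3), b = c(1, 0, 0, 0), c = c(0, 0, 1, 3))

results <- list(
  Codoremifa = dat[rowSums(abs(dat)) != 0, ],
  Marco      = dat[!apply(dat, 1, function(x) all(x == 0)), ],
  Sven       = dat[as.logical(rowSums(dat != 0)), ],
  Sven_2     = dat[rowSums(!as.matrix(dat)) < ncol(dat), ],
  Sven_3     = dat[as.logical(abs(as.matrix(dat)) %*% rep(1L, ncol(dat))), ],
  mm         = dat[Reduce(`|`, dat), ]
)

# Compare every result against the first one
all_same <- all(vapply(results,
                       function(r) isTRUE(all.equal(r, results[[1]])),
                       logical(1)))
print(all_same)
```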
    