Combining more than 2 columns by removing NA's in R

说谎 2020-12-06 07:28

At first sight this seems a duplicate of "Combine/merge columns while avoiding NA?", but in fact it isn't. I sometimes deal with more than two columns instead of just two [...]

3 Answers
  • 2020-12-06 08:06

    This function is a bit long-winded, but (1) it is faster on larger data (see the timings further down) and (2) it offers a good amount of flexibility:

    myFun <- function(inmat, outList = TRUE, fill = NA, origDim = FALSE) {
      ## Split up the data by row and isolate the non-NA values
      myList <- lapply(sequence(nrow(inmat)), function(x) {
        y <- inmat[x, ]
        y[!is.na(y)]
      })
      ## If a `list` is all that you want, the function stops here
      if (isTRUE(outList)) {
        myList
      } else {
        ## If you want a matrix instead, it goes on like this
        Len <- vapply(myList, length, 1L)
        ## The new matrix can be either just the number of columns required
        ##   or it can have the same number of columns as the input matrix
        if (isTRUE(origDim)) Ncol <- ncol(inmat) else Ncol <- max(Len)
        Nrow <- nrow(inmat)
        M <- matrix(fill, ncol = Ncol, nrow = Nrow)
        M[cbind(rep(sequence(Nrow), Len), sequence(Len))] <- 
          unlist(myList, use.names=FALSE)
        M
      }
    }
    

    To test it out, let's create a function to make up some dummy data:

    makeData <- function(nrow = 10, ncol = 5, pctNA = .8, maxval = 25) {
      a <- nrow * ncol
      m <- matrix(sample(maxval, a, TRUE), ncol = ncol)
      m[sample(a, a * pctNA)] <- NA
      m
    }
    
    set.seed(1)
    m <- makeData(nrow = 5, ncol = 4, pctNA=.6)
    m
    #      [,1] [,2] [,3] [,4]
    # [1,]   NA   NA   NA   NA
    # [2,]   10   24   NA   18
    # [3,]   NA   17   NA   25
    # [4,]   NA   16   10   NA
    # [5,]   NA    2   NA   NA
    

    ... and apply it...

    myFun(m)
    # [[1]]
    # integer(0)
    # 
    # [[2]]
    # [1] 10 24 18
    # 
    # [[3]]
    # [1] 17 25
    # 
    # [[4]]
    # [1] 16 10
    # 
    # [[5]]
    # [1] 2
    
    myFun(m, outList = FALSE)
    #      [,1] [,2] [,3]
    # [1,]   NA   NA   NA
    # [2,]   10   24   18
    # [3,]   17   25   NA
    # [4,]   16   10   NA
    # [5,]    2   NA   NA
    
    ## Try also
    ## myFun(m, outList = FALSE, origDim = TRUE)
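
    The matrix branch of myFun relies on indexing a matrix with a two-column matrix of (row, column) positions. A minimal sketch of that step in isolation, using made-up values (vals, len, rows, cols and out are illustrative names, not part of the original function):

    ```r
    ## Hypothetical toy input: the non-NA values of three rows, already split up
    vals <- list(c(10, 24, 18), numeric(0), c(16, 10))
    len  <- lengths(vals)                    # number of values kept per row

    ## Row index: repeat each row number once per value that row holds
    rows <- rep(seq_along(vals), len)
    ## Column index: restart the count at 1 for every row
    cols <- sequence(len)

    ## Start from a matrix of NA and fill only the (row, col) cells listed
    out <- matrix(NA_real_, nrow = length(vals), ncol = max(len))
    out[cbind(rows, cols)] <- unlist(vals, use.names = FALSE)
    out
    ```

    Rows with nothing to keep (the second one here) contribute no index pairs at all, so they simply stay filled with NA.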
    

    And, let's run some timings on bigger data in comparison to the other answers so far:

    set.seed(1)
    m <- makeData(nrow = 1e5, ncol = 5, pctNA = .75)
    
    ## Will return a matrix
    funCP <- function(inmat) t(apply(inmat, 1, sort, na.last = TRUE))
    system.time(funCP(m))
    #    user  system elapsed 
    #   9.776   0.000   9.757 
    
    ## Will return a list in this case
    funJT <- function(inmat) apply(inmat, 1, function(x) x[!is.na(x)])
    system.time(JT <- funJT(m))
    #    user  system elapsed 
    #   0.577   0.000   0.575 
    
    ## Output a list
    system.time(AM <- myFun(m))
    #    user  system elapsed 
    #   0.469   0.000   0.466 
    
    identical(JT, AM)
    # [1] TRUE
    
    ## Output a matrix
    system.time(myFun(m, outList=FALSE, origDim=TRUE))
    #    user  system elapsed 
    #   0.610   0.000   0.612 
    

    So the list output is slightly faster than @JT85's solution (about 0.47 s versus 0.58 s elapsed), and the matrix output is slightly slower (about 0.61 s). Compared to sorting row by row (about 9.8 s), however, both are a definite improvement.

  • 2020-12-06 08:12

    You can use apply for this. If df is your data frame (and every row has the same number of non-NA values):

    df2 <- apply(df,1,function(x) x[!is.na(x)])
    df3 <- data.frame(t(df2))
    colnames(df3) <- colnames(df)[1:ncol(df3)]
    

    Output:

    #      col1 col2
    #         1   13
    #        10   18
    #         7   15
    #         4   16
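
    When rows keep different numbers of values, apply returns a list rather than a matrix, so the t() step above no longer applies. A rough sketch of one way to handle that case by padding each row with NA; the data and the names res, n and padded are made up for illustration:

    ```r
    ## Hypothetical data: row 1 keeps two values, row 2 keeps only one
    df <- data.frame(V1 = c(1, NA), V2 = c(NA, 7), V3 = c(13, NA))

    res <- apply(df, 1, function(x) x[!is.na(x)])
    is.list(res)   # apply falls back to a list when lengths differ

    ## Pad every element to the longest length, then bind into a matrix
    n <- max(lengths(res))
    padded <- t(vapply(
      res,
      function(v) c(unname(v), rep(NA_real_, n - length(v))),
      numeric(n)
    ))
    padded
    ```

    The original column names are dropped on purpose here, since after shifting the values left they no longer describe where each value came from.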
    
  • 2020-12-06 08:13

    You can use apply and na.exclude:

    DF
    ##   V1 V2 V3 V4 V5
    ## 1  1 NA NA 13 NA
    ## 2 NA NA 10 NA 18
    ## 3 NA  7 NA 15 NA
    ## 4  4 NA NA 16 NA
    
    t(apply(DF, 1, na.exclude))
    ##      [,1] [,2]
    ## [1,]    1   13
    ## [2,]   10   18
    ## [3,]    7   15
    ## [4,]    4   16
    

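    If you want to keep the dimensions of the data.frame the same, you can use sort with na.last = TRUE instead. This also handles rows that have unequal numbers of non-NA values. Note that sort puts the values in each row into ascending order, which happens to match the original order in this example.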

    t(apply(DF, 1, sort, na.last = TRUE))
    ##      [,1] [,2] [,3] [,4] [,5]
    ## [1,]    1   13   NA   NA   NA
    ## [2,]   10   18   NA   NA   NA
    ## [3,]    7   15   NA   NA   NA
    ## [4,]    4   16   NA   NA   NA
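
    If the left-to-right order of the values matters, sorting may not be what you want. A small sketch, on made-up data, contrasting sort with an alternative that only moves the NAs to the end (ordering on is.na(x), which leaves ties in their original order); sorted and kept are illustrative names:

    ```r
    ## Hypothetical data whose non-NA values are not in ascending order
    DF <- data.frame(V1 = c(13, NA), V2 = c(NA, 18), V3 = c(1, 10))

    ## sort: NAs go last, but the values themselves are reordered
    sorted <- unname(t(apply(DF, 1, sort, na.last = TRUE)))

    ## order(is.na(x)): NAs go last, the remaining values keep their order
    kept <- unname(t(apply(DF, 1, function(x) x[order(is.na(x))])))

    sorted
    kept
    ```

    Both variants keep the original number of columns; they differ only in whether the surviving values are rearranged.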
    