Can the dplyr package be used for conditional mutating?

庸人自扰 2020-11-22 15:10

Can mutate be used when the mutation is conditional (depending on the values of certain columns)?

This example helps show what I mean.

5 Answers
  •  -上瘾入骨i
    2020-11-22 15:23

    dplyr now has a function, case_when, that offers a vectorised if / else-if. The syntax is a little unusual compared to mosaic::derivedFactor: in versions before 0.7.0 you cannot access variables in the standard dplyr way (see the edit below), and you need to declare the type of NA (e.g. NA_integer_). It is, however, considerably faster than mosaic::derivedFactor.

    library(dplyr)
    # df is the data frame from the question
    df %>%
      mutate(g = case_when(a %in% c(2, 5, 7) | (a == 1 & b == 4) ~ 2L,
                           a %in% c(0, 1, 3, 4) | c == 4 ~ 3L,
                           TRUE ~ NA_integer_))  # typed NA, matching 2L and 3L
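
    For a self-contained illustration, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are hypothetical and are not the data from the question; the sketch assumes dplyr 0.7.0 or later, so that columns can be referenced by bare name. Conditions are tried top to bottom and the first one that is TRUE decides the value.

```r
library(dplyr)

# Hypothetical toy data, not the data frame from the question.
toy <- data.frame(
  a = c(1, 3, 4, 6, 2, 1),
  b = c(4, 3, 4, 2, 7, 6),
  c = c(6, 3, 6, 5, 6, 3)
)

toy_out <- toy %>%
  mutate(g = case_when(
    a %in% c(2, 5, 7) | (a == 1 & b == 4) ~ 2L,          # tried first
    a %in% c(0, 1, 3, 4) | c == 4         ~ 3L,          # only rows not matched above
    TRUE                                  ~ NA_integer_  # everything else
  ))

toy_out$g
# 2 3 3 NA 2 3
```

    Row 6 (a = 1, b = 6) fails the first condition and falls through to the second, which is why it gets 3 rather than 2.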
    

    EDIT: If you're using dplyr::case_when() from before version 0.7.0 of the package, then you need to precede variable names with '.$' (e.g. write .$a == 1 inside case_when).
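
    As a rough sketch of what that older style looks like, the same rules can be written with the '.$' prefix. The data frame `toy` is again hypothetical. This form relies on the `.` placeholder supplied by the magrittr pipe (%>%), so it is an assumption here that the call is piped that way.

```r
library(dplyr)

# Hypothetical toy data, not the data frame from the question.
toy <- data.frame(a = c(1, 3, 6, 2), b = c(4, 3, 2, 7), c = c(6, 3, 5, 6))

# Pre-0.7.0 style: columns are reached through the pipe placeholder `.`
old_style <- toy %>%
  mutate(g = case_when(
    .$a %in% c(2, 5, 7) | (.$a == 1 & .$b == 4) ~ 2L,
    .$a %in% c(0, 1, 3, 4) | .$c == 4           ~ 3L,
    TRUE                                        ~ NA_integer_
  ))

old_style$g
# 2 3 NA 2
```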

    Benchmark: reusing the functions from Arun's post, with a reduced sample size:

    library(data.table)
    library(mosaic)
    library(dplyr)
    library(microbenchmark)
    
    set.seed(42) # To recreate the dataframe
    DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
    setnames(DT, letters[1:6])
    DF <- as.data.frame(DT)
    
    DPLYR_case_when <- function(DF) {
      DF %>%
        mutate(g = case_when(a %in% c(2, 5, 7) | (a == 1 & b == 4) ~ 2L,
                             a %in% c(0, 1, 3, 4) | c == 4 ~ 3L,
                             TRUE ~ as.integer(NA)))
    }
    
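    DT_fun <- function(DT) {
      # Assign 3L first, then 2L, so the 2L rule wins where both match
      # (mirrors the first-match order of case_when / nested ifelse).
      DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
      DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
    }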
    
    DPLYR_fun <- function(DF) {
      mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
                        ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
    }
    
    mosa_fun <- function(DF) {
      mutate(DF, g = derivedFactor(
        "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
        "3" = (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
        .method = "first",
        .default = NA
      ))
    }
    
    # Use name = expression so each row of the output gets a short label
    perf_results <- microbenchmark(
      dt_fun = DT_fun(copy(DT)),
      dplyr_ifelse = DPLYR_fun(copy(DF)),
      dplyr_case_when = DPLYR_case_when(copy(DF)),
      mosa = mosa_fun(copy(DF)),
      times = 100L
    )
    

    This gives:

    print(perf_results)
    Unit: milliseconds
               expr        min         lq       mean     median         uq        max neval
             dt_fun   1.391402    1.560751   1.658337   1.651201   1.716851   2.383801   100
       dplyr_ifelse   1.172601    1.230351   1.331538   1.294851   1.390351   1.995701   100
    dplyr_case_when   1.648201    1.768002   1.860968   1.844101   1.958801   2.207001   100
               mosa 255.591301  281.158350 291.391586 286.549802 292.101601 545.880702   100
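
    Timings only mean something if the approaches compute the same column, so a quick sanity check is worthwhile. The sketch below is an assumption-laden miniature of the benchmark setup: it uses a smaller sample and only columns a, b and c. The names `g_case_when` and `g_ifelse` are made up for this check. The mosaic version is left out because, as far as I know, derivedFactor returns a factor rather than an integer vector.

```r
library(data.table)
library(dplyr)

set.seed(42)
# Smaller, hypothetical stand-in for the benchmark data: columns a, b, c only.
DF <- as.data.frame(
  setNames(lapply(1:3, function(x) sample(7, 1000, TRUE)), c("a", "b", "c"))
)

# case_when: first matching condition wins
g_case_when <- with(DF, case_when(
  a %in% c(2, 5, 7) | (a == 1 & b == 4) ~ 2L,
  a %in% c(0, 1, 3, 4) | c == 4         ~ 3L,
  TRUE                                  ~ NA_integer_
))

# nested ifelse: same rules, same order
g_ifelse <- with(DF, ifelse(a %in% c(2, 5, 7) | (a == 1 & b == 4), 2L,
                     ifelse(a %in% c(0, 1, 3, 4) | c == 4, 3L, NA_integer_)))

# data.table: assign 3L first, then overwrite with 2L where the first rule matches
DT <- as.data.table(DF)
DT[a %in% c(0, 1, 3, 4) | c == 4, g := 3L]
DT[a %in% c(2, 5, 7) | (a == 1 & b == 4), g := 2L]

identical(g_case_when, g_ifelse)
identical(g_case_when, DT$g)
```

    Both comparisons should come back TRUE if the three versions agree row by row, including where g is NA.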
    
