Does ifelse really calculate both of its vectors every time? Is it slow?

慢半拍i 2020-11-22 02:18

Does ifelse really calculate both the yes and no vectors -- as in, the entirety of each vector? Or does it calculate only some of the values?

1 answer
  • 2020-11-22 02:38

    Yes, with one exception.

    ifelse evaluates both its yes argument and its no argument in full, except when the non-NA values of the test condition are all TRUE or all FALSE, in which case only the argument that is needed gets evaluated.

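    We can see this by drawing random numbers inside ifelse and then checking how many were actually generated. To count them, reset the seed, regenerate the sequence, and look up the position of the first number drawn after the ifelse call.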

    # TEST CONDITION, ALL TRUE
    set.seed(1)
    dump  <- ifelse(rep(TRUE, 200), rnorm(200), rnorm(200))
    next.random.number.after.all.true <- rnorm(1)
    
    # TEST CONDITION, ALL FALSE
    set.seed(1)
    dump  <- ifelse(rep(FALSE, 200), rnorm(200), rnorm(200))
    next.random.number.after.all.false <- rnorm(1)
    
    # TEST CONDITION, MIXED
    set.seed(1)
    dump   <- ifelse(c(FALSE, rep(TRUE, 199)), rnorm(200), rnorm(200))
    next.random.number.after.some.TRUE.some.FALSE <- rnorm(1)
    
    # RESET THE SEED, GENERATE SEVERAL RANDOM NUMBERS TO SEARCH FOR A MATCH
    set.seed(1)
    r.1000 <- rnorm(1000)
    
    
    cat("Quantity of random numbers generated during the `ifelse` statement when:", 
        "\n\tAll True  ", which(r.1000 == next.random.number.after.all.true) - 1,
        "\n\tAll False ", which(r.1000 == next.random.number.after.all.false) - 1,
        "\n\tMixed T/F ", which(r.1000 == next.random.number.after.some.TRUE.some.FALSE) - 1 
      )
    

    Gives the following output:

    Quantity of random numbers generated during the `ifelse` statement when: 
      All True   200 
      All False  200 
      Mixed T/F  400   <~~ Notice TWICE AS MANY numbers were
                           generated when `test` had both
                           T & F values present
    

    We can also see it in the source code itself:

    .
    .
    if (any(test[!nas]))    
        ans[test & !nas] <- rep(yes, length.out = length(ans))[test &   # <~~~~ This line and the one below
            !nas]
    if (any(!test[!nas])) 
        ans[!test & !nas] <- rep(no, length.out = length(ans))[!test &  # <~~~~ ... are the culprits
            !nas]
    .
    .
    

    Notice that yes and no are each evaluated only if test has at least one non-NA value that is TRUE or FALSE, respectively.
    Once that happens, and this is the important part for efficiency, the entire vector is computed, not just the elements that end up in the result.
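
    The same behaviour can be checked without random numbers by giving each argument a visible side effect. This is only a minimal sketch: the helpers yes_branch, no_branch and which_evaluated, and the evaluated log, are made-up names for illustration and are not part of ifelse or base R.

    ```r
    # Hypothetical helpers: each one records its name when its body actually runs
    evaluated <- character(0)
    yes_branch <- function(n) { evaluated <<- c(evaluated, "yes"); rep(1, n) }
    no_branch  <- function(n) { evaluated <<- c(evaluated, "no");  rep(0, n) }

    # Run ifelse on a test vector and report which arguments were evaluated
    which_evaluated <- function(test) {
      evaluated <<- character(0)   # clear the log before each run
      ifelse(test, yes_branch(length(test)), no_branch(length(test)))
      evaluated
    }

    all_true  <- which_evaluated(c(TRUE, TRUE, TRUE))     # only yes is needed
    all_false <- which_evaluated(c(FALSE, FALSE, FALSE))  # only no is needed
    mixed     <- which_evaluated(c(TRUE, FALSE, TRUE))    # both are needed

    print(list(all_true = all_true, all_false = all_false, mixed = mixed))
    ```

    With a mixed test vector both helpers run, and each one builds its full-length vector even though only some of its elements are used.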


    Ok, but is it slower?

    Let's see if we can test it:

    library(microbenchmark)
    
    # Create some sample data
      N <- 1e4
      set.seed(1)
      X <- sample(c(seq(100), rep(NA, 100)), N, TRUE)
      Y <- ifelse(is.na(X), rnorm(X), NA)  # Y is non-NA exactly where X is NA
    

    These two statements produce the same result:

    yesifelse <- quote(sort(ifelse(is.na(X), Y+17, X-17 ) ))
    noiflese  <- quote(sort(c(Y[is.na(X)]+17, X[is.na(Y)]-17)))
    
    identical(eval(yesifelse), eval(noiflese))
    # [1] TRUE
    

    but the version without ifelse is roughly two to three times as fast:

    microbenchmark(eval(yesifelse), eval(noiflese), times=50L)
    
    N = 1,000
    Unit: milliseconds
                expr      min       lq   median       uq      max neval
     eval(yesifelse) 2.286621 2.348590 2.411776 2.537604 10.05973    50
      eval(noiflese) 1.088669 1.093864 1.122075 1.149558 61.23110    50
    
    N = 10,000
    Unit: milliseconds
                expr      min       lq   median       uq      max neval
     eval(yesifelse) 30.32039 36.19569 38.50461 40.84996 98.77294    50
      eval(noiflese) 12.70274 13.58295 14.38579 20.03587 21.68665    50
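
    The faster statement wins because each expression is computed only on the elements that need it. If the original element order matters (the benchmark above sorts the result instead), subset assignment is one way to get the same effect. A minimal sketch with small made-up vectors; the names x, y, use_y and by_subset are illustrative only.

    ```r
    # Small illustrative data: y is non-NA exactly where x is NA
    x <- c(5, NA, 20, NA, 40)
    y <- c(NA, 0.5, NA, -1.5, NA)

    # Reference result: y + 17 and x - 17 are both computed for all five elements
    with_ifelse <- ifelse(is.na(x), y + 17, x - 17)

    # Subset assignment: each expression is computed only where it is used
    use_y <- is.na(x)
    by_subset <- numeric(length(x))
    by_subset[use_y]  <- y[use_y] + 17
    by_subset[!use_y] <- x[!use_y] - 17

    print(by_subset)
    print(identical(with_ifelse, by_subset))
    ```

    Whether this pays off in practice depends on how expensive the two expressions are, so it is worth benchmarking on real data as above.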
    