Vectorized “in” function in Julia?

抹茶落季 2020-11-28 13:53

I often want to loop over a long array or column of a dataframe, and for each item, see if it is a member of another array. Rather than doing

giant_list =          


        
5 answers
  •  醉酒成梦
    2020-11-28 14:21

    Performance Review

    The other answers neglect one important aspect: performance. Let me briefly review it. To make the comparison realistic, I create two integer vectors with 100,000 elements each.

    using StatsBase
    
    # Draw 100,000 integers from 1:1_000_000
    # (StatsBase.sample draws with replacement by default).
    a = sample(1:1_000_000, 100_000)
    b = sample(1:1_000_000, 100_000)
    

    To get a baseline for what decent performance looks like, I ran the same test in R, which gave a median time of 4.4 ms:

    # R code
    
    a <- sample.int(1000000, 100000)
    b <- sample.int(1000000, 100000)
    
    microbenchmark::microbenchmark(a %in% b)
    
    Unit: milliseconds
         expr     min       lq     mean   median       uq      max neval
     a %in% b 4.09538 4.191653 5.517475 4.376034 5.765283 65.50126   100
    

    The Performant Solution

    findall(in(b),a)
    
    5.039 ms (27 allocations: 3.63 MiB)
    

    Slower than R, but not by much. The syntax, however, could really use some improvement.

    The Slow Solutions

    a .∈ Ref(b)
    in.(a,Ref(b))
    findall(x -> x in b, a)
    
    3.879468 seconds (6 allocations: 16.672 KiB)
    3.866001 seconds (6 allocations: 16.672 KiB)
    3.936978 seconds (178.88 k allocations: 5.788 MiB)
    

    Roughly 770 times slower than findall(in(b),a), and almost 900 times slower than R: this is really nothing to write home about. In my opinion the syntax of these three isn't very good either, but at least the first one looks better to me than the 'performant solution'.
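
    The slow variants all call in on a plain Vector, which is a linear scan of b for every element of a. A minimal sketch of one way around that, assuming it is acceptable to build a Set from b once and look up against that instead. The small vectors below are hypothetical example data, and I have not timed this variant against the numbers above:

```julia
# Hypothetical small inputs so the result can be checked by eye.
a = [1, 5, 3, 9, 5]
b = [5, 9, 10]

# Build the hash set once; each lookup is then O(1) on average
# instead of a linear scan through b.
bset = Set(b)

# Boolean mask, analogous to R's `a %in% b`.
# Ref(bset) keeps broadcasting from iterating over the set itself.
mask = in.(a, Ref(bset))    # equivalently: a .∈ Ref(bset)

# Subset a with the mask.
subset = a[mask]            # [5, 9, 5]
```

    This keeps the broadcast syntax of the first slow variant and returns a mask rather than indices, which is closer to what a %in% b gives in R.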

    The is-not-a Solution

    This one here

    indexin(a,b)
    
    5.287 ms (38 allocations: 6.53 MiB)
    

    is performant, but for me it is not a solution. The result contains a nothing entry wherever an element of the first vector is not found in the second. In my opinion the main application is subsetting a vector, and that does not work with this result:

    a[indexin(b,a)]
    
    ERROR: ArgumentError: unable to check bounds for indices of type Nothing
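
    If indexin is still the starting point, the nothing entries have to be dealt with before indexing. A minimal sketch, assuming the goal is to keep the elements of a that also occur in b; the small vectors and the names idx, mask, via_indexin and via_findall are hypothetical and only for illustration:

```julia
# Hypothetical small inputs so the result can be checked by eye.
a = [1, 5, 3, 9, 5]
b = [5, 9, 10]

# For each element of a: its first position in b, or nothing if absent.
idx = indexin(a, b)           # [nothing, 1, nothing, 2, 1]

# Turn the positions into a Boolean mask and subset with it.
mask = map(!isnothing, idx)   # [false, true, false, true, true]
via_indexin = a[mask]         # [5, 9, 5]

# The performant solution from above yields indices into a directly,
# so it can be used for subsetting without any extra step.
via_findall = a[findall(in(b), a)]   # [5, 9, 5]
```

    Both routes give the same subset here; the findall form avoids the intermediate vector of positions.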
    
