About GForce in data.table 1.9.2

灰色年华 2020-12-09 11:06

I don't know how to take full advantage of GForce in data.table 1.9.2

New optimization: GForce. Rather than grouping the data, the group locations a

1 Answer
  • 2020-12-09 11:39

    It's nothing to do with na.rm. What you show worked fine before as well. However, I can see why you might have thought that. Here is the rest of the same NEWS item :

    Examples where GForce applies now :
        DT[,sum(x,na.rm=),by=...]                       # yes
        DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...]  # yes
        DT[,lapply(.SD,sum,na.rm=),by=...]              # yes
        DT[,list(sum(x),min(y)),by=...]                 # no. gmin not yet available
    GForce is a level 2 optimization. To turn it off: options(datatable.optimize=1)
    Reminder: to see the optimizations and other info, set verbose=TRUE
    

    You don't need to do anything to benefit; it's an automatic optimization.

    Here's an example on 500 million rows and 4 columns (13GB). First create and illustrate the data :

    $ R
    R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
    Copyright (C) 2013 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)
    
    > require(data.table)
    Loading required package: data.table
    data.table 1.9.2  For help type: help("data.table")
    
    > DT = data.table( grp = sample(1e6,5e8,replace=TRUE), 
                       a = rnorm(1e6),
                       b = rnorm(1e6),
                       c = rnorm(1e6))
    > tables()
         NAME        NROW    MB COLS      KEY
    [1,] DT   500,000,000 13352 grp,a,b,c    
    Total: 13,352MB
    > print(DT)
              grp          a            b          c
    1e+00: 695059 -1.4055192  1.587540028  1.7104991
    2e+00: 915263 -0.8239298 -0.513575696 -0.3429516
    3e+00: 139937 -0.2202024  0.971816721  1.0597421
    4e+00: 651525  1.0026858 -1.157824780  0.3100616
    5e+00: 438180  1.1074729 -2.513939427  0.8357155
       ---                                          
    5e+08: 705823 -1.4773420  0.004369457 -0.2867529
    5e+08: 716694 -0.6826147 -0.357086020 -0.4044164
    5e+08: 217509  0.4939808 -0.012797093 -1.1084564
    5e+08: 501760  1.7081212 -1.772721799 -0.7119432
    5e+08: 765653 -1.1141456 -1.569578263  0.4947304
    

    Now time it with GForce optimization on (the default). Notice that there is no setkey first. This is known as a "cold by" or "ad hoc by", which is common practice when you want to group in lots of different ways.

    > system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
       user  system elapsed 
     47.520   5.651  53.173 
    > system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
       user  system elapsed 
     47.372   5.676  53.049      # immediate repeat to confirm timing
    

    Now turn off GForce optimization (as per NEWS item) to see the difference it makes :

    > options(datatable.optimize=1)
    
    > system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
       user  system elapsed 
     97.274   3.383 100.659 
    > system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
       user  system elapsed 
     97.199   3.423 100.624      # immediate repeat to confirm timing
    

    Finally, confirm the results are the same :

    > identical(ans1,ans2)
    [1] TRUE
    > print(ans1)
                grp          a          b          c
          1: 695059  16.791281  13.269647 -10.663118
          2: 915263  43.312584 -33.587933   4.490842
          3: 139937   3.967393 -10.386636  -3.766019
          4: 651525  -4.152362   9.339594   7.740136
          5: 438180   4.725874  26.328877   9.063309
         ---                                        
     999996: 372601  -2.087248 -19.936420  21.172860
     999997:  13912  18.414226  -1.744378  -7.951381
     999998: 150074  -4.031619   8.433173 -22.041731
     999999: 385718  11.527876   6.807802   7.405016
    1000000: 906246 -13.857315 -23.702011   6.605254
    

    Notice that data.table retains the order of the groups according to when they first appeared. To order the grouped result, use keyby= instead of by=.
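
    As a small illustration of the ordering point, here is a sketch on a made-up five-row table (the table and its values are hypothetical, not part of the timing run above):

    ```r
    library(data.table)

    # Hypothetical toy data: groups first appear in the order 3, 1, 2.
    DT <- data.table(grp = c(3L, 1L, 3L, 2L, 1L), x = c(10, 20, 30, 40, 50))

    by_first_seen <- DT[, sum(x), by = grp]      # groups in order of first appearance
    by_sorted     <- DT[, sum(x), keyby = grp]   # groups sorted; result is keyed by grp

    print(by_first_seen$grp)   # 3 1 2
    print(by_sorted$grp)       # 1 2 3
    print(key(by_sorted))      # "grp"
    ```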

    To turn GForce optimization back on (default is Inf to benefit from all optimizations) :

    > options(datatable.optimize=Inf)
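
    If you would rather not hard-code Inf, one possible pattern is to save the previous setting and restore it afterwards. The sketch below does that on a small stand-in table (the sizes and seed are arbitrary assumptions) and compares the two results with all.equal, which is tolerance-based rather than bit-for-bit:

    ```r
    library(data.table)

    set.seed(1)
    # Small stand-in for the 500-million-row table (sizes chosen arbitrarily).
    DT <- data.table(grp = sample(100L, 1e5, replace = TRUE),
                     a = rnorm(1e5), b = rnorm(1e5), c = rnorm(1e5))

    old <- options(datatable.optimize = 1)       # GForce off; options() returns the old value
    ans_off <- DT[, lapply(.SD, sum), by = grp]
    options(old)                                 # restore whatever was set before

    ans_on <- DT[, lapply(.SD, sum), by = grp]   # runs under the restored setting
    print(all.equal(ans_on, ans_off))
    ```

    On a table this small the timing difference is unlikely to be visible; the point is only that the results agree and the option ends up as it started.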
    

    Aside : if you're not familiar with the lapply(.SD,...) syntax, it's just a way to apply a function to every column, by group. For example, these two lines compute the same sums :

     DT[, lapply(.SD,sum), by=grp]               # (1)
     DT[, list(sum(a),sum(b),sum(c)), by=grp]    # (2) same sums; columns named V1,V2,V3
    

    The first (1) is more useful as you have more columns, especially in combination with .SDcols to control which subset of columns the function is applied to.
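
    A minimal sketch of the .SDcols combination, on a made-up four-row table (the data is hypothetical):

    ```r
    library(data.table)

    # Hypothetical toy data with three value columns.
    DT <- data.table(grp = c(1L, 1L, 2L, 2L),
                     a = c(1, 2, 3, 4),
                     b = c(10, 20, 30, 40),
                     c = c(100, 200, 300, 400))

    # Sum only columns a and b within each group; column c is left out of .SD.
    res <- DT[, lapply(.SD, sum), by = grp, .SDcols = c("a", "b")]
    print(res)
    ```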

    The NEWS item was just trying to convey that it doesn't matter which of these syntaxes is used, or whether you pass na.rm or not: GForce optimization will still be applied. It's saying that you can mix sum() and mean() in one call (which syntax (2) allows), but as soon as you do something else (like min()), GForce won't kick in, since gmin isn't done yet; only mean and sum have GForce optimizations currently. You can use verbose=TRUE to see if GForce is being applied.
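
    To check a particular query, you can pass verbose=TRUE to that one call. The sketch below uses a made-up table and mixes sum() with na.rm and mean() in a single j; the exact wording of the verbose messages varies between data.table versions, so none is reproduced here:

    ```r
    library(data.table)

    # Hypothetical toy data; x contains one NA.
    DT <- data.table(grp = c(1L, 1L, 2L, 2L),
                     x = c(1, NA, 3, 4),
                     y = c(2, 4, 6, 8))

    # verbose = TRUE prints the optimization steps for this call only.
    res <- DT[, list(sx = sum(x, na.rm = TRUE), my = mean(y)), by = grp, verbose = TRUE]
    print(res)
    ```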

    Details of the machine used for this timing :

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                8
    On-line CPU(s) list:   0-7
    Thread(s) per core:    8
    Core(s) per socket:    1
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 62
    Stepping:              4
    CPU MHz:               2494.022
    BogoMIPS:              4988.04
    Hypervisor vendor:     Xen
    Virtualization type:   full
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              25600K
    NUMA node0 CPU(s):     0-7
    