Using data.table i and j arguments in functions

情深已故 2020-12-02 14:58

I am trying to write some wrapper functions to reduce code duplication with data.table.

Here is an example using mtcars. First, set up some

3 Answers
  •  日久生厌
    2020-12-02 15:19

    For the general question of how to control scoping within data.table, Gavin's answer has got you well covered.

    To take full advantage of the data.table package's strengths, though, you should set a key on your data.table objects. A key presorts your data so that rows belonging to the same level (or combination of levels) of the grouping factor(s) are stored in contiguous blocks of memory. This can greatly speed up grouping operations compared to an 'ad hoc' by of the sort used in your example. (Search for 'ad hoc' in the datatable-faq (warning: PDF) for more details.)
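    As a minimal sketch of what setting a key does, the snippet below uses a small made-up table (the values and the name dt are hypothetical, not the question's data). It assumes only that data.table is installed, and shows that setkey() sorts the rows by reference and that grouping on the key column then returns the groups in key order.

```r
library(data.table)

# Hypothetical toy data; column names mirror the example, values are made up.
dt <- data.table(
  car  = c("Datsun", "AMC", "Camaro", "AMC", "Datsun"),
  gear = c(4L, 3L, 3L, 3L, 4L),
  mpg  = c(22.8, 15.2, 13.3, 17.0, 21.4)
)

# setkey() reorders the rows by reference and records "car" as the key.
setkey(dt, car)

key_cols  <- key(dt)                # the key column name(s)
is_sorted <- !is.unsorted(dt$car)   # rows are now physically ordered by car

# Grouping on the key column returns one row per car, in key order.
totals <- dt[, .(Total = .N), by = car]
print(totals)
```

    Here .N (the number of rows in each group) stands in for length(mpg); both count rows per group.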

    In many situations (your example included) using keys also has the happy side-effect of simplifying the code needed to manipulate a data.table. Plus, it automatically outputs the results in the order specified by the key, which is often what you want as well.

    First, if you only need to group by the 'car' column, you could simply do:

    ## Create data.table with a key
    ## (assumes mtcars already has the 'car' column from the question's setup)
    library(data.table)
    group <- "car"
    mtcars <- data.table(mtcars, key = group)

    ## Outputs results in key order
    mtcars[, list(Total = length(mpg)), by = key(mtcars)]
    ##         car Total
    ##         AMC     1
    ##    Cadillac     1
    ##      Camaro     1
    ##    Chrysler     1
    ##      Datsun     1

    Even if your key contains several columns, using a key still makes for simpler code (and you gain the speed-up that's likely your real reason for using data.table in the first place!):

    group <- "car"
    mtcars <- data.table(mtcars, key = c("car", "gear"))
    mtcars[, list(Total=length(mpg)), by = eval(group)]
    

    EDIT: A picky note of caution

    If the by argument groups on a column that is part of the key but is not the first element of the key, the order of the results may still need post-processing. So, in the second example above, if key = c("gear", "car"), then "Dodge" sorts before "Datsun". In a situation like that, I might still prefer to reorder the key beforehand rather than reorder the results after the fact. Perhaps Matthew Dowle will weigh in on which of those two is preferred/faster.
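    The ordering caveat can be reproduced on a small made-up table (hypothetical values and names, not the question's data). The sketch below keys on c("gear", "car"), groups by "car" alone, and then shows two ways to get alphabetical output: re-keying beforehand, as suggested above, and the keyby argument, which is an alternative not mentioned in the answer. It makes no claim about which is faster.

```r
library(data.table)

# Hypothetical toy data, keyed with "gear" first and "car" second.
dt <- data.table(
  car  = c("Datsun", "Dodge", "AMC", "Datsun"),
  gear = c(4L, 3L, 3L, 4L),
  mpg  = c(22.8, 15.5, 15.2, 21.4)
)
setkeyv(dt, c("gear", "car"))

# by = "car" returns groups in order of first appearance, which here
# follows gear first, so "Dodge" (gear 3) comes before "Datsun" (gear 4).
by_first_seen <- dt[, .(Total = .N), by = "car"]

# Option 1: reorder the key beforehand (done on a copy to leave dt intact).
rekeyed <- copy(dt)
setkeyv(rekeyed, c("car", "gear"))
by_rekeyed <- rekeyed[, .(Total = .N), by = "car"]

# Option 2: keyby sorts the result by the grouping column and keys it.
by_keyby <- dt[, .(Total = .N), keyby = "car"]

print(by_first_seen)
print(by_keyby)
```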
