aggregation with data.table in R

Anonymous (unverified), submitted on 2019-12-03 00:50:01

Question:

The exercise consists of aggregating a numeric vector of values by a combination of factors with data.table in R. Take the following data table as an example:

    require(data.table)
    require(plyr)
    dtb <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3),
                                        fac = letters[1:3]),
                            value = rnorm(27)))

Notice that every unique combination of 'month' and 'fac' shows up three times. So, when I try to average values by both these factors, I should expect a data frame with 9 unique rows:

    (agg1 <- ddply(dtb, c("month", "fac"), function(dfr) mean(dfr$value)))
      month fac          V1
    1   Jan   a -0.36030953
    2   Jan   b -0.58444588
    3   Jan   c -0.15472876
    4   Feb   a -0.05674483
    5   Feb   b  0.26415972
    6   Feb   c -1.62346772
    7   Mar   a  0.24560510
    8   Mar   b  0.82548140
    9   Mar   c  0.18721114

However, when aggregating with data.table, I keep getting one row for every redundant combination of the two factors, 27 rows instead of 9:

    (agg2 <- dtb[, value := mean(value), by = list(month, fac)])
        month fac       value
     1:   Jan   a -0.36030953
     2:   Jan   a -0.36030953
     3:   Jan   a -0.36030953
     4:   Feb   a -0.05674483
     5:   Feb   a -0.05674483
     6:   Feb   a -0.05674483
     7:   Mar   a  0.24560510
     8:   Mar   a  0.24560510
     9:   Mar   a  0.24560510
    10:   Jan   b -0.58444588
    11:   Jan   b -0.58444588
    12:   Jan   b -0.58444588
    13:   Feb   b  0.26415972
    14:   Feb   b  0.26415972
    15:   Feb   b  0.26415972
    16:   Mar   b  0.82548140
    17:   Mar   b  0.82548140
    18:   Mar   b  0.82548140
    19:   Jan   c -0.15472876
    20:   Jan   c -0.15472876
    21:   Jan   c -0.15472876
    22:   Feb   c -1.62346772
    23:   Feb   c -1.62346772
    24:   Feb   c -1.62346772
    25:   Mar   c  0.18721114
    26:   Mar   c  0.18721114
    27:   Mar   c  0.18721114
        month fac       value

Is there an elegant way to collapse these results to one row per unique combination of factors with data table?

Answer 1:

The issue (and the reasoning behind it) is that the aggregated value is being assigned, not just calculated.

It is easier to observe this in action if you look at a data.table with more columns than just the ones being used for the computation.

    # Therefore, let's add a new column
    # (LETTERS has only 26 entries, so row 27 gets NA)
    dtb[, newCol := LETTERS[seq(length(value))]]

Note that if we just want to output the computed value, then the expression on the RHS, as you have it, is just fine.

    # This gives the expected results
    dtb[, mean(value), by = list(month, fac)]

    # This on the other hand assigns the respective values to *each* row
    dtb[, value := mean(value), by = list(month, fac)]

In other words, when the value is only computed, the result is reduced to one row per unique combination of the grouping columns.
However, if you save this value back into the SAME data table (which is what happens when using the := operator), then all rows that are identified in i (all rows by default) will be assigned a value. This makes sense when you look at the output with the additional column.

Assigning this data.table to agg2 therefore still carries all the rows along.
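The difference can be checked on a small table. This is only a sketch: `toy`, `grp` and the values are made-up names and data for illustration, not taken from the question.

```r
library(data.table)

# Hypothetical toy data: 2 groups with 3 rows each
toy <- data.table(grp = rep(c("a", "b"), each = 3),
                  value = c(1, 2, 3, 10, 20, 30))

# Plain expression in j: returns a new table, one row per group; toy is untouched
agg <- toy[, list(value = mean(value)), by = grp]
nrow(agg)   # 2

# := assigns by reference: toy keeps all 6 rows, each now holding its group mean
toy[, value := mean(value), by = grp]
nrow(toy)   # 6
```

The first call is the one to assign to a new object when a collapsed table is wanted; the second changes `toy` itself.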

Therefore, if you want to copy to a new table only those rows from your original table that are unique, you can either

a.  wrap the original table inside `unique()` before assigning it, or
b.  assign the table, above, that is returned when you are not assigning the RHS output (which is what @Arun suggested).

An example of a. would be:

    agg2 <- unique(dtb[, value := mean(value), by = list(month, fac)])
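One caveat with option a. is that `unique()` only collapses rows that are identical in every column it compares. A minimal sketch, assuming a made-up table `toy2` with a hypothetical extra column `extra`, shows how the `by` argument of `unique()` for data.tables can restrict the comparison to the grouping column:

```r
library(data.table)

# Hypothetical toy data with an extra column that differs within each group
toy2 <- data.table(grp = rep(c("a", "b"), each = 3),
                   value = c(1, 2, 3, 10, 20, 30),
                   extra = c("x", "y", "z", "x", "y", "z"))

toy2[, value := mean(value), by = grp]

# Comparing all columns: 'extra' keeps every row distinct
nrow(unique(toy2))              # 6

# Comparing only the grouping column: the first row of each group is kept
nrow(unique(toy2, by = "grp"))  # 2
```

Note that with `by = "grp"` the value of `extra` that survives is simply the one from the first row of each group.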

The following example might help illustrate.

(You would need to copy and paste this, as the output is omitted.)

    # SAMPLE DATA, as above
    library(data.table)
    dtb.bak <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm(27))

    #  METHOD 1  #
    #------------#
    dtb <- copy(dtb.bak)  # restore, from sample data.

    dtb[, value := mean(value), by = list(month, fac)]
    dtb

    # this is what you would like to assign
    unique(dtb)

    #  METHOD 2  #
    #------------#
    dtb <- copy(dtb.bak)  # restore, from sample data.

    # this is what you would like to assign
    # next two lines are the same, only difference is column name
    dtb[, mean(value), by = list(month, fac)]
    dtb[, list("mean" = mean(value)), by = list(month, fac)]  # quote marks added for clarity

    # dtb is unchanged.
    dtb

    # NOW COMPARE THE SAME TWO METHODS, BUT IF THERE IS AN ADDITIONAL COLUMN
    dtb.bak[, newCol := rep(c("A", "B", "A"), length(value)/3)]

    dtb1 <- copy(dtb.bak)  # restore, from sample data.
    dtb2 <- copy(dtb.bak)  # restore, from sample data.

    # Method 1
    dtb1[, value := mean(value), by = list(month, fac)]
    dtb1
    unique(dtb1)

    #  METHOD 2  #
    dtb2[, list("mean" = mean(value)), by = list(month, fac)]  # quote marks added for clarity
    dtb2

    # METHOD 2, WITH ADDED COLUMNS IN list() in `j`
    dtb2[, list("mean" = mean(value), newCol), by = list(month, fac)]  # quote marks added for clarity
    # notice this has more columns than
    unique(dtb1)


Answer 2:

You should do:

    agg2 <- dtb[, list(value = mean(value)), by = list(month, fac)]

:= will recycle the values on the RHS to fit the number of elements on the LHS, so each group's single mean is repeated for every row of that group. See ?':=' to read more about this.
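The recycling can be seen on a tiny table. This is an illustrative sketch only: `rec`, `grp` and `grp_mean` are hypothetical names, and the mean is written to a new column so the original values stay visible.

```r
library(data.table)

# Hypothetical toy data: group "a" has 2 rows, group "b" has 1
rec <- data.table(grp = c("a", "a", "b"), value = c(1, 3, 5))

# The RHS has length 1 per group; := repeats it for every row of that group
rec[, grp_mean := mean(value), by = grp]
rec$grp_mean            # 2 2 5

# .N is the group size, i.e. how many rows each mean was recycled over
rec[, .N, by = grp]$N   # 2 1
```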


