The exercise consists of aggregating a numeric vector of values by a combination of factors with data.table in R. Take the following data table as an example:
require(data.table)
require(plyr)
dtb <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm(27)))
Notice that every unique combination of 'month' and 'fac' shows up three times. So, when I try to average values by both these factors, I should expect a data frame with 9 unique rows:
(agg1 <- ddply(dtb, c("month", "fac"), function(dfr) mean(dfr$value)))
  month fac          V1
1   Jan   a -0.36030953
2   Jan   b -0.58444588
3   Jan   c -0.15472876
4   Feb   a -0.05674483
5   Feb   b  0.26415972
6   Feb   c -1.62346772
7   Mar   a  0.24560510
8   Mar   b  0.82548140
9   Mar   c  0.18721114
However, when aggregating with data.table, I keep getting the results provided by every redundant combination of the two factors:
(agg2 <- dtb[, value := mean(value), by = list(month, fac)])
    month fac       value
 1:   Jan   a -0.36030953
 2:   Jan   a -0.36030953
 3:   Jan   a -0.36030953
 4:   Feb   a -0.05674483
 5:   Feb   a -0.05674483
 6:   Feb   a -0.05674483
 7:   Mar   a  0.24560510
 8:   Mar   a  0.24560510
 9:   Mar   a  0.24560510
10:   Jan   b -0.58444588
11:   Jan   b -0.58444588
12:   Jan   b -0.58444588
13:   Feb   b  0.26415972
14:   Feb   b  0.26415972
15:   Feb   b  0.26415972
16:   Mar   b  0.82548140
17:   Mar   b  0.82548140
18:   Mar   b  0.82548140
19:   Jan   c -0.15472876
20:   Jan   c -0.15472876
21:   Jan   c -0.15472876
22:   Feb   c -1.62346772
23:   Feb   c -1.62346772
24:   Feb   c -1.62346772
25:   Mar   c  0.18721114
26:   Mar   c  0.18721114
27:   Mar   c  0.18721114
    month fac       value
Is there an elegant way to collapse these results to one row per unique combination of factors with data table?
The issue (and the reasoning behind it) is that the aggregated value is being assigned, not just calculated.
It is easier to observe this in action if you look at a data.table with more columns than just the ones being used for the computation.
# Therefore, let's add a new column
# (row 27 gets NA, since LETTERS has only 26 entries)
dtb[, newCol := LETTERS[seq(length(value))]]
Note that if you just want to output the computed value, then the expression on the RHS, as you have it, is just fine.
# This gives the expected results
dtb[, mean(value), by = list(month, fac)]

# This on the other hand assigns the respective values to *each* row
dtb[, value := mean(value), by = list(month, fac)]
In other words, in the first case the result is collapsed to one row per unique combination of `month` and `fac`.
However, if you want to save this value back into the SAME data table (which is what happens when you use the `:=` operator), then all rows identified in `i` (all rows by default) will be assigned a value. This makes sense when you look at the output with additional columns.
Assigning the result to `agg2` then still carries over all of the rows.
Therefore, if you want to copy only the unique rows from your original table to a new table, you can either:
a. wrap the original table inside `unique()` before assigning it, or
b. assign the table shown above that is returned when you are not assigning the RHS output (which is what @Arun suggested).
An example of a. would be:
agg2 <- unique(dtb[, value := mean (value), by = list (month, fac)])
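One thing worth keeping in mind with option a. is that `:=` works by reference, so `dtb` itself ends up holding the group means. The sketch below is only an illustration of that side effect: the seed and the name `orig` are my own additions, not part of the question.

```r
library(data.table)

set.seed(1)  # arbitrary seed (an assumption), only to make the sketch reproducible
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

orig <- copy(dtb)  # hypothetical name: an independent copy taken before assigning

# Option a.: assign the group means by reference, then drop duplicate rows
agg2 <- unique(dtb[, value := mean(value), by = list(month, fac)])

nrow(agg2)  # one row per month/fac combination
nrow(dtb)   # dtb keeps all of its rows ...
# ... but its value column no longer holds the original draws
values_overwritten <- !isTRUE(all.equal(dtb$value, orig$value))
values_overwritten
```

If the original values are still needed afterwards, working on a `copy()` (as the longer example does) or using option b. avoids overwriting them.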
The following example might help illustrate.
(You would need to copy + paste this, as the output is omitted.)
# SAMPLE DATA, as above
library(data.table)
dtb.bak <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm(27))

# METHOD 1 #
#------------#
dtb <- copy(dtb.bak)  # restore, from sample data.
dtb[, value := mean(value), by = list(month, fac)]
dtb
# this is what you would like to assign
unique(dtb)

# METHOD 2 #
#------------#
dtb <- copy(dtb.bak)  # restore, from sample data.
# this is what you would like to assign
# next two lines are the same, only difference is column name
dtb[, mean(value), by = list(month, fac)]
dtb[, list("mean" = mean(value)), by = list(month, fac)]  # quote marks added for clarity
# dtb is unchanged.
dtb

# NOW COMPARE THE SAME TWO METHODS, BUT IF THERE IS AN ADDITIONAL COLUMN
dtb.bak[, newCol := rep(c("A", "B", "A"), length(value)/3)]
dtb1 <- copy(dtb.bak)  # restore, from sample data.
dtb2 <- copy(dtb.bak)  # restore, from sample data.

# Method 1
dtb1[, value := mean(value), by = list(month, fac)]
dtb1
unique(dtb1)

# METHOD 2 #
dtb2[, list("mean" = mean(value)), by = list(month, fac)]  # quote marks added for clarity
dtb2

# METHOD 2, WITH ADDED COLUMNS IN list() in `j`
dtb2[, list("mean" = mean(value), newCol), by = list(month, fac)]  # quote marks added for clarity
# notice this has more rows than unique(dtb1)
You should do:
agg2 <- dtb[, list(value = mean(value)), by = list (month, fac)]
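As a quick sanity check, the `list()` form should give the same nine means as taking `unique()` of the assigned table, while leaving `dtb` untouched. The following is a rough sketch under my own assumptions (the seed and the names `agg_list`, `agg_unique` and `before` are made up for illustration):

```r
library(data.table)

set.seed(42)  # arbitrary seed (an assumption), for reproducibility only
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))
before <- copy(dtb)  # hypothetical name: snapshot to compare against later

# Returning a list in j computes one row per group and does not modify dtb
agg_list <- dtb[, list(value = mean(value)), by = list(month, fac)]
dtb_untouched <- identical(dtb$value, before$value)

# The unique() route, run on a copy so that dtb stays as it was
agg_unique <- unique(copy(dtb)[, value := mean(value), by = list(month, fac)])

dim(agg_list)                                # 9 rows, 3 columns
all.equal(agg_list$value, agg_unique$value)  # both routes give the same means
```

Both results list the groups in order of first appearance, which is why the two `value` columns can be compared position by position here.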
`:=` will recycle the values of the RHS to fit the number of elements in the LHS. Do ?':=' to read more about this.
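A tiny example may make the recycling easier to see. The table `dt` and its columns `g`, `v` and `m` are invented for illustration and are not the data from the question:

```r
library(data.table)

# Hypothetical toy data: two groups of unequal size
dt <- data.table(g = c("a", "a", "b", "b", "b"),
                 v = c(1, 3, 10, 20, 30))

# With :=, the single mean per group is recycled to every row of that group
dt[, m := mean(v), by = g]
dt$m  # 2 2 20 20 20

# Without :=, j returns one row per group
agg <- dt[, list(m = mean(v)), by = g]
agg$m  # 2 20
```

The first form keeps all five rows and repeats each group mean, while the second returns a two-row table, which is the collapsed shape asked for in the question.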