Applying a calculation per group within an R data frame

Submitted by て烟熏妆下的殇ゞ on 2019-12-04 16:56:13

To address the final sentence of the question specifically ("What's a more efficient and elegant way of doing that directly on the original data?"): data.table has a new feature for exactly this.

install.packages("data.table", repos="http://R-Forge.R-project.org")
# Needed version 1.8.1 from R-Forge when this was written. := by group has since
# been released to CRAN, so a plain install.packages("data.table") now suffices.
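
The question's data is not reproduced in this answer, so here is a minimal sketch that rebuilds the ten rows visible in the printed output. The column types (numeric category, character country) are an assumption; the original table may have used factors or had further columns.

```r
library(data.table)

# Rebuild the ten rows shown in the console output of this answer.
# Column types are assumed, not taken from the original question.
DT <- data.table(
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)
```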

With your data in DT :

> DT[, countcat:=.N, by=list(country,category)]     # add 'countcat' column
    category country countcat
 1:        1     RUS        3
 2:        2     GER        1
 3:        3     USA        2
 4:        1     RUS        3
 5:        1     USA        1
 6:        1     RUS        3
 7:        3     GER        1
 8:        3     USA        2
 9:        2     RUS        1
10:        2     USA        1

> DT[, weight:=countcat/.N, by=country]     # add 'weight' column
    category country countcat weight
 1:        1     RUS        3   0.75
 2:        2     GER        1   0.50
 3:        3     USA        2   0.50
 4:        1     RUS        3   0.75
 5:        1     USA        1   0.25
 6:        1     RUS        3   0.75
 7:        3     GER        1   0.50
 8:        3     USA        2   0.50
 9:        2     RUS        1   0.25
10:        2     USA        1   0.25

:= adds a column by reference to the data and is an 'old' feature. The new feature is that it now works by group. .N is a symbol that holds the number of rows in each group.

These operations are memory efficient and should scale to large data; e.g., 1e8, 1e9 rows.
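
To make "by reference" concrete, here is a small sketch on the ten sample rows (rebuilt by hand, with assumed column types). It shows that := changes DT itself without any `DT <-` reassignment, and that a deep copy taken beforehand with copy() is left untouched.

```r
library(data.table)

# Sample rows copied from the printed output; types are assumed.
DT <- data.table(
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

snapshot <- copy(DT)  # deep copy taken before any := call

# Adds the column to DT in place; note there is no `DT <-` on the left.
DT[, countcat := .N, by = list(country, category)]

"countcat" %in% names(DT)        # TRUE
"countcat" %in% names(snapshot)  # FALSE
```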

If you don't wish to include the intermediate column countcat, just remove it afterwards. Again, this is an efficient operation which works instantly regardless of the size of the table (by moving pointers internally).

> DT[,countcat:=NULL]     # remove 'countcat' column
    category country weight
 1:        1     RUS   0.75
 2:        2     GER   0.50
 3:        3     USA   0.50
 4:        1     RUS   0.75
 5:        1     USA   0.25
 6:        1     RUS   0.75
 7:        3     GER   0.50
 8:        3     USA   0.50
 9:        2     RUS   0.25
10:        2     USA   0.25
> 
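
If the intermediate column is not wanted at all, one possible variant (a sketch, not something from the original answer) is to chain the two steps and reuse a single column. The as.numeric() is there because := does not change the type of an existing column, so the counts are stored as doubles up front to leave room for the fractions.

```r
library(data.table)

# Sample rows copied from the printed output; types are assumed.
DT <- data.table(
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

# Step 1: store the per-(country, category) count as a double in 'weight'.
# Step 2: divide it by the number of rows in each country.
DT[, weight := as.numeric(.N), by = list(country, category)][
   , weight := weight / .N, by = country]
```

After this, DT has the columns category, country and weight only.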
ilprincipe

I actually asked a similar question some time ago. data.table is really nice for this, especially now that := by group is implemented and a self join is no longer necessary, as illustrated above. The best solution from base R is ave(). tapply() can also be used.

This is similar to the solution above, using ave(). However, I highly recommend you look at data.table.

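# Count rows per (country, category). ave() writes the results back into a copy
# of x, so df$object should be numeric here.
df$count <- ave(x = df$object, df$country, df$category, FUN = length)
# Divide each count by the number of rows in its country.
df$weight <- ave(x = df$count, df$country, FUN = function(x) x/length(x))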
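
For anyone who wants to run this, here is a self-contained sketch on the ten sample rows from the data.table output above. The object column is a made-up placeholder, since the real one is not shown; seq_len(nrow(df)) is used as x so the count does not depend on that column's type. The tapply() route mentioned earlier is included as a cross-check.

```r
# Sample rows copied from the printed output; 'object' is a hypothetical placeholder.
df <- data.frame(
  object   = 1:10,
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# ave(): group size per (country, category), then share within each country.
df$count  <- ave(seq_len(nrow(df)), df$country, df$category, FUN = length)
df$weight <- ave(df$count, df$country, FUN = function(x) x / length(x))

# tapply(): a category-by-country matrix of counts and a per-country total,
# both looked up by name for every row.
n_cell    <- tapply(df$object, list(df$category, df$country), length)
n_country <- tapply(df$object, df$country, length)
idx       <- cbind(as.character(df$category), df$country)
w_tapply  <- as.numeric(n_cell[idx] / n_country[df$country])

all.equal(df$weight, w_tapply)  # TRUE
```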

I don't see a readable way to do it in one line. But it can be quite compact.

# Use table to get the counts.
counts <- table(df[,2:3])
# Normalize the table
weights <- t(t(counts)/colSums(counts))
# Use 'matrix' selection by names.
df$weight <- weights[as.matrix(df[,2:3])]
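
Here is the same idea as a runnable sketch on the ten sample rows, with two substitutions of my own: prop.table() does the column normalization, and the index matrix is built with cbind(as.character(...)). The second change is because as.matrix() on a mixed-type data frame runs numeric columns through format(), which can pad values to a common width (" 1" vs "10") and then no longer match the table's dimnames. The object column is a hypothetical placeholder.

```r
# Sample rows copied from the printed output; 'object' is a hypothetical placeholder.
df <- data.frame(
  object   = 1:10,
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# Rows are categories, columns are countries.
counts <- table(df$category, df$country)

# Normalize so that each country's column sums to 1.
weights <- prop.table(counts, margin = 2)

# Two-column character matrix: one (row name, column name) pair per data row.
idx <- cbind(as.character(df$category), df$country)
df$weight <- as.numeric(weights[idx])
```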