Regression and summary statistics by group within a data.table

Submitted by 守給你的承諾、 on 2019-11-29 17:50:42

Question


I would like to calculate some summary statistics and perform different regressions by group within a data table, and have the results in "wide" format (i.e. one row per group with several columns). I can do it in multiple steps, but it seems like it should be possible to do all at once.

Consider this example data:

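library(data.table)
set.seed(46984)  # the original post wrote `set.seed=46984`, which only creates a variable and does not seed the RNG
dt <- data.table(ID=c(rep('Frank',5),rep('Tony',5),rep('Ed',5)), y=rnorm(15), x=rnorm(15), z=rnorm(15),key="ID")
dt
# (printed values come from the original, unseeded run, so a re-run will show different numbers)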
#       ID          y          x            z
# 1:    Ed  0.2129400 -0.3024061  0.845335632
# 2:    Ed  0.4850342 -0.5159197 -0.087965415
# 3:    Ed  1.8917489  1.7803220  0.760465271
# 4:    Ed -0.4330460 -2.1720944  0.973812545
# 5:    Ed  0.7685060  0.7947470  1.279761200
# 6: Frank  0.4978475 -0.2906851  0.568101004
# 7: Frank  0.6323386 -0.5596599  1.537133025
# 8: Frank -0.8243218 -0.4354885  0.057818033
# 9: Frank  1.2402488  0.3229422  0.005995249
#10: Frank  0.2436210 -0.2651422  0.349532173
#11:  Tony  0.4179568  0.1418463  0.142380549
#12:  Tony  0.7036613  0.4402572  0.141237901
#13:  Tony -0.1978720 -0.9553784  0.480425820
#14:  Tony -1.7269375 -0.1881292  0.370583351
#15:  Tony  1.1064903  0.4375014 -0.798221750

Let's say I want to get the median by ID, perform linear regression on y ~ x by ID, and perform linear regression on y ~ x + z by ID. Here I get the median:

dt.med <- dt[,list(y.med=median(y)),by=ID]
dt.med
#      ID     y.med
#1:    Ed 0.4850342
#2: Frank 0.4978475
#3:  Tony 0.4179568

And thanks to this answer by @DWin, here I get the two individual sets of regression coefficients as columns by ID:

dt.reg.1 <- dt[,as.list(coef(lm(y ~ x))), by=ID]
dt.reg.1
#      ID (Intercept)         x
#1:    Ed  0.63057884 0.5482373
#2: Frank  0.69720351 1.3813007
#3:  Tony  0.08588421 1.0179131

dt.reg.2 <- dt[,as.list(coef(lm(y ~ x + z))), by=ID]
dt.reg.2
#      ID (Intercept)         x          z
#1:    Ed   0.8262577 0.5587170 -0.2582699
#2: Frank   0.4317538 2.7221024  1.1807442
#3:  Tony   0.1494439 0.3166547 -1.2029693

Now I have to join the three result sets, and rename the columns:

dt.ans <- dt.med[dt.reg.1][dt.reg.2]
setnames(dt.ans,c("ID","y.med","reg.1.c0","reg.1.c1","reg.2.c0","reg.2.c1","reg.2.c2"))

Finally, here is the desired output for the example:

dt.ans
#      ID     y.med   reg.1.c0  reg.1.c1  reg.2.c0  reg.2.c1   reg.2.c2
#1:    Ed 0.4850342 0.63057884 0.5482373 0.8262577 0.5587170 -0.2582699
#2: Frank 0.4978475 0.69720351 1.3813007 0.4317538 2.7221024  1.1807442
#3:  Tony 0.4179568 0.08588421 1.0179131 0.1494439 0.3166547 -1.2029693

It seems inefficient to calculate the three results, join them, and then rename the columns. Also, my actual tables are largish so I'd like to make sure I don't use too much system memory. Is it possible to do this all within "one" data.table statement? Or more generally, can this be done more efficiently?

I've tried different things. Here is one failed example that gives the median but ignores the regression coefficients:

dt[,as.list(median(y),coef(lm(y ~ x))), by=ID]
#      ID        V1
#1:    Ed 0.4850342
#2: Frank 0.4978475
#3:  Tony 0.4179568
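
For reference, the coefficients vanish in this attempt because as.list() converts only its first argument; the lm() coefficients are passed as a second argument and never become part of the result. A small sketch with made-up numbers (not the data above) shows the difference from c(), which concatenates everything it is given:

```r
# as.list() converts only its first argument; anything after it is not part of the result.
dropped <- as.list(1, c(2, 3))
length(dropped)   # 1

# c() concatenates its arguments, so a one-element list plus a list of two values keeps all three.
kept <- c(list(1), as.list(c(2, 3)))
length(kept)      # 3
```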

Answer 1:


dt[,c(y.med = median(y),
      reg.1 = as.list(coef(lm(y ~ x))),
      reg.2 = as.list(coef(lm(y ~ x + z)))), by=ID]
#      ID     y.med reg.1.(Intercept)   reg.1.x reg.2.(Intercept)      reg.2.x   reg.2.z
#1:    Ed 0.7280448        0.75977555 0.1132509        0.83322290 -0.484348116 0.7655563
#2: Frank 0.6100339       -0.07830664 0.2700846        0.04720686  0.004027939 0.7168521
#3:  Tony 0.2710623       -0.78319379 0.9166601       -0.35836990  0.622822617 0.4161102
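
The call above works because c() merges the median and the two coefficient lists into one named list per group, and data.table turns each element into a column. The names are built by c() itself (prefix plus coefficient name), which is why they read reg.1.(Intercept) rather than reg.1.c0. If the naming from the question is wanted without a separate setnames() step, one possible sketch is below. It assumes a hypothetical helper, coef_cols, which is not part of data.table or of the original answer, and it regenerates the example data with a working seed, so the numbers will not match the printed output above.

```r
library(data.table)

set.seed(46984)
dt <- data.table(ID = rep(c("Frank", "Tony", "Ed"), each = 5),
                 y = rnorm(15), x = rnorm(15), z = rnorm(15), key = "ID")

# Hypothetical helper (an assumption for illustration): fit `formula` on one
# group's rows and return the coefficients as a list named prefix.c0, prefix.c1, ...
coef_cols <- function(formula, data, prefix) {
  cf <- coef(lm(formula, data = data))
  setNames(as.list(unname(cf)), paste0(prefix, ".c", seq_along(cf) - 1L))
}

# One grouped call: .SD holds the current group's rows (all columns except ID).
dt.ans <- dt[, c(list(y.med = median(y)),
                 coef_cols(y ~ x, .SD, "reg.1"),
                 coef_cols(y ~ x + z, .SD, "reg.2")),
             by = ID]
names(dt.ans)
```

This keeps everything in a single pass over the groups and avoids the intermediate tables and the join; whether it is measurably lighter on memory for large tables would need to be checked on the real data.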


Source: https://stackoverflow.com/questions/19523720/regression-and-summary-statistics-by-group-within-a-data-table
