Fast linear regression by group

后端未结

关注

 5  1953

后悔当初

I have 500K users and I need to compute a linear regression (with intercept) for each of them.

Each user has around 30 records.

I t

相关标签:

5条回答

花落未央

2020-12-08 08:58

Update: As pointed out by Dirk, my original approach can be greatly improved upon by specifying x and Y directly rather than using the formula-based interface of fastLm, which incurs (a fairly significant) processing overhead. For comparison, using the original full size data set,

R> system.time({
  dt[,c("lm_b0", "lm_b1") := as.list(
    unname(fastLm(x, Y)$coefficients))
    ,by = "user_id"]
})
#  user  system elapsed 
#55.364   0.014  55.401 
##
R> system.time({
  dt[,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients))
    ,by = "user_id"]
})
#   user  system elapsed 
#356.604   0.047 356.820

this simple change yields roughly a 6.5x speedup.

[Original approach]

There is probably some room for improvement, but the following took about 25 minutes on a Linux VM (2.6 GHz processor), running 64-bit R:

library(data.table)
library(RcppArmadillo)
##
dt[
  ,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients)),
  by=user_id]
##
R> dt[c(1:2, 31:32, 61:62),]
   user_id   x         Y     lm_b0    lm_b1
1:       1 1.0 1674.8316 -202.0066 744.6252
2:       1 1.5  369.8608 -202.0066 744.6252
3:       2 1.0  463.7460 -144.2961 374.1995
4:       2 1.5  412.7422 -144.2961 374.1995
5:       3 1.0  513.0996  217.6442 261.0022
6:       3 1.5 1140.2766  217.6442 261.0022

Data:

dt <- data.table(
  user_id = rep(1:500000,each=30))
##
dt[, x := seq(1, by=.5, length.out=30), by = user_id]
dt[, Y := 1000*runif(1)*x, by = user_id]
dt[, Y := Y + rnorm(
  30, 
  mean = sample(c(-.05,0,0.5)*mean(Y),1), 
  sd = mean(Y)*.25), 
  by = user_id]

0 讨论(0)

北海茫月

2020-12-08 09:01
An example using Rfast.

Assuming a single response and 500K predictor variables.
```
y <- rnorm(30)
x <- matrnorm(500*1000,30)
system.time( Rfast::univglms(y, x,"normal") )  ## 0.70 seconds
```
Assuming 500K response variables and a singl predictor variable.
```
system.time( Rfast::mvbetas(x,y) )  ## 0.60 seconds
```
Note: The above times will decrease in the nearby future.
0 讨论(0)
发布评论:

提交评论
- 加载中...
野的像风

2020-12-08 09:12
If all you want is coefficients, I'd just use user_id as a factor in the regression. Using @miles2know's simulated data code (though renaming since an object other than exp() sharing that name looks weird to me)
```
dat <- data.frame(id = rep(c("a","b","c"), each = 20),
                  x = rnorm(60,5,1.5),
                  y = rnorm(60,2,.2))

mod = lm(y ~ x:id + id + 0, data = dat)
```
We fit no global intercept (+ 0) so that the intercept for each id is the id coefficient, and no x by itself, so that the x:id interactions are the slopes for each id:
```
coef(mod)
#      ida      idb      idc    x:ida    x:idb    x:idc 
# 1.779686 1.893582 1.946069 0.039625 0.033318 0.000353 
```
So, for the a level of id, the ida coefficient, 1.78, is the intercept and the x:ida coefficient, 0.0396, is the slope.

I'll leave the gathering of these coefficients into appropriate columns of a data frame to you...

This solution should be very fast because you're not having to deal with subsets of data frames. It could probably be sped up even more with fastLm or such.

Note on scalability:

I did just try this on @nrussell's simulated full-size data and ran into memory allocation issues. Depending on how much memory you have it may not work in one go, but you could probably do it in batches of user ids. Some combination of his answer and my answer might be the fastest overall---or nrussell's might just be faster---expanding the user id factor into thousands of dummy variables might not be computationally efficient, as I've been waiting more than a couple minutes now for a run on just 5000 user ids.
0 讨论(0)
发布评论:

提交评论
- 加载中...

情书的邮戳

2020-12-08 09:15

You might give this a try using data.table like this. I've just created some toy data but I'd imagine data.table would give some improvement. It's quite speedy. But that is quite a large data-set so perhaps benchmark this approach on a smaller sample to see if the speed is a lot better. good luck.


    library(data.table)

    exp <- data.table(id = rep(c("a","b","c"), each = 20), x = rnorm(60,5,1.5), y = rnorm(60,2,.2))
    # edit: it might also help to set a key on id with such a large data-set
    # with the toy example it would make no diff of course
    exp <- setkey(exp,id)
    # the nuts and bolts of the data.table part of the answer
    result <- exp[, as.list(coef(lm(y ~ x))), by=id]
    result
       id (Intercept)            x
    1:  a    2.013548 -0.008175644
    2:  b    2.084167 -0.010023549
    3:  c    1.907410  0.015823088

0 讨论(0)

说谎

2020-12-08 09:20

You can just use the basic formulas for calculating slope and regression. lm does a lot of unnecessary things if all you care about are those two numbers. Here I use data.table for the aggregation, but you could do it in base R as well (or dplyr):

system.time(
  res <- DT[, 
    {
      ux <- mean(x)
      uy <- mean(y)
      slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
      list(slope=slope, intercept=uy - slope * ux)
    }, by=user.id
  ]
)

Produces for 500K users ~30 obs each (in seconds):

 user  system elapsed 
 7.35    0.00    7.36

Or about 15 microseconds per user.

Update: I ended up writing a bunch of blog posts that touch on this as well.

And to confirm this is working as expected:

> summary(DT[user.id==89663, lm(y ~ x)])$coefficients
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.1965844  0.2927617 0.6714826 0.5065868
x           0.2021210  0.5429594 0.3722580 0.7120808
> res[user.id == 89663]
   user.id    slope intercept
1:   89663 0.202121 0.1965844

Data:

set.seed(1)
users <- 5e5
records <- 30
x <- runif(users * records)
DT <- data.table(
  x=x, y=x + runif(users * records) * 4 - 2, 
  user.id=sample(users, users * records, replace=T)
)

0 讨论(0)

Fast linear regression by group

Note on scalability: