Fast linear regression by group

后端 未结 5 1918
后悔当初
后悔当初 2020-12-08 08:31

I have 500K users and I need to compute a linear regression (with intercept) for each of them.

Each user has around 30 records.

I t

相关标签:
5条回答
  • 2020-12-08 08:58

    Update: As pointed out by Dirk, my original approach can be greatly improved upon by specifying x and Y directly rather than using the formula-based interface of fastLm, which incurs (a fairly significant) processing overhead. For comparison, using the original full size data set,

    R> system.time({
      dt[,c("lm_b0", "lm_b1") := as.list(
        unname(fastLm(x, Y)$coefficients))
        ,by = "user_id"]
    })
    #  user  system elapsed 
    #55.364   0.014  55.401 
    ##
    R> system.time({
      dt[,c("lm_b0","lm_b1") := as.list(
        unname(fastLm(Y ~ x, data=.SD)$coefficients))
        ,by = "user_id"]
    })
    #   user  system elapsed 
    #356.604   0.047 356.820
    

    this simple change yields roughly a 6.5x speedup.


    [Original approach]

    There is probably some room for improvement, but the following took about 25 minutes on a Linux VM (2.6 GHz processor), running 64-bit R:

    library(data.table)
    library(RcppArmadillo)
    ##
    dt[
      ,c("lm_b0","lm_b1") := as.list(
        unname(fastLm(Y ~ x, data=.SD)$coefficients)),
      by=user_id]
    ##
    R> dt[c(1:2, 31:32, 61:62),]
       user_id   x         Y     lm_b0    lm_b1
    1:       1 1.0 1674.8316 -202.0066 744.6252
    2:       1 1.5  369.8608 -202.0066 744.6252
    3:       2 1.0  463.7460 -144.2961 374.1995
    4:       2 1.5  412.7422 -144.2961 374.1995
    5:       3 1.0  513.0996  217.6442 261.0022
    6:       3 1.5 1140.2766  217.6442 261.0022
    

    Data:

    dt <- data.table(
      user_id = rep(1:500000,each=30))
    ##
    dt[, x := seq(1, by=.5, length.out=30), by = user_id]
    dt[, Y := 1000*runif(1)*x, by = user_id]
    dt[, Y := Y + rnorm(
      30, 
      mean = sample(c(-.05,0,0.5)*mean(Y),1), 
      sd = mean(Y)*.25), 
      by = user_id]
    
    0 讨论(0)
  • 2020-12-08 09:01

    An example using Rfast.

    Assuming a single response and 500K predictor variables.

    y <- rnorm(30)
    x <- matrnorm(500*1000,30)
    system.time( Rfast::univglms(y, x,"normal") )  ## 0.70 seconds
    

    Assuming 500K response variables and a singl predictor variable.

    system.time( Rfast::mvbetas(x,y) )  ## 0.60 seconds
    

    Note: The above times will decrease in the nearby future.

    0 讨论(0)
  • 2020-12-08 09:12

    If all you want is coefficients, I'd just use user_id as a factor in the regression. Using @miles2know's simulated data code (though renaming since an object other than exp() sharing that name looks weird to me)

    dat <- data.frame(id = rep(c("a","b","c"), each = 20),
                      x = rnorm(60,5,1.5),
                      y = rnorm(60,2,.2))
    
    mod = lm(y ~ x:id + id + 0, data = dat)
    

    We fit no global intercept (+ 0) so that the intercept for each id is the id coefficient, and no x by itself, so that the x:id interactions are the slopes for each id:

    coef(mod)
    #      ida      idb      idc    x:ida    x:idb    x:idc 
    # 1.779686 1.893582 1.946069 0.039625 0.033318 0.000353 
    

    So, for the a level of id, the ida coefficient, 1.78, is the intercept and the x:ida coefficient, 0.0396, is the slope.

    I'll leave the gathering of these coefficients into appropriate columns of a data frame to you...

    This solution should be very fast because you're not having to deal with subsets of data frames. It could probably be sped up even more with fastLm or such.

    Note on scalability:

    I did just try this on @nrussell's simulated full-size data and ran into memory allocation issues. Depending on how much memory you have it may not work in one go, but you could probably do it in batches of user ids. Some combination of his answer and my answer might be the fastest overall---or nrussell's might just be faster---expanding the user id factor into thousands of dummy variables might not be computationally efficient, as I've been waiting more than a couple minutes now for a run on just 5000 user ids.

    0 讨论(0)
  • 2020-12-08 09:15

    You might give this a try using data.table like this. I've just created some toy data but I'd imagine data.table would give some improvement. It's quite speedy. But that is quite a large data-set so perhaps benchmark this approach on a smaller sample to see if the speed is a lot better. good luck.

    
        library(data.table)
    
        exp <- data.table(id = rep(c("a","b","c"), each = 20), x = rnorm(60,5,1.5), y = rnorm(60,2,.2))
        # edit: it might also help to set a key on id with such a large data-set
        # with the toy example it would make no diff of course
        exp <- setkey(exp,id)
        # the nuts and bolts of the data.table part of the answer
        result <- exp[, as.list(coef(lm(y ~ x))), by=id]
        result
           id (Intercept)            x
        1:  a    2.013548 -0.008175644
        2:  b    2.084167 -0.010023549
        3:  c    1.907410  0.015823088
    
    0 讨论(0)
  • 2020-12-08 09:20

    You can just use the basic formulas for calculating slope and regression. lm does a lot of unnecessary things if all you care about are those two numbers. Here I use data.table for the aggregation, but you could do it in base R as well (or dplyr):

    system.time(
      res <- DT[, 
        {
          ux <- mean(x)
          uy <- mean(y)
          slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
          list(slope=slope, intercept=uy - slope * ux)
        }, by=user.id
      ]
    )
    

    Produces for 500K users ~30 obs each (in seconds):

     user  system elapsed 
     7.35    0.00    7.36 
    

    Or about 15 microseconds per user.

    Update: I ended up writing a bunch of blog posts that touch on this as well.

    And to confirm this is working as expected:

    > summary(DT[user.id==89663, lm(y ~ x)])$coefficients
                 Estimate Std. Error   t value  Pr(>|t|)
    (Intercept) 0.1965844  0.2927617 0.6714826 0.5065868
    x           0.2021210  0.5429594 0.3722580 0.7120808
    > res[user.id == 89663]
       user.id    slope intercept
    1:   89663 0.202121 0.1965844
    

    Data:

    set.seed(1)
    users <- 5e5
    records <- 30
    x <- runif(users * records)
    DT <- data.table(
      x=x, y=x + runif(users * records) * 4 - 2, 
      user.id=sample(users, users * records, replace=T)
    )
    
    0 讨论(0)
提交回复
热议问题