Parallelization over for loop analyzing a data.frame

我只是一个虾纸丫 提交于 2019-12-11 17:34:04

问题


These days I've been working with a data.frame of 8M registers, and I need to improve a loop that analyzes this data.

I will describe each process of the problem that I am trying to solve. First, I have to arrange all the data.frame in ascending order by three fields ClientID, Date and Time. (Check) Then, using that arranged data.frame, I must operate the differences between each of the observations, where it can be only done when the ClientID is the same. For example:

ClientID|Date(YMD)|Time(HMS)
A|20120101|110000
A|20120101|111500
A|20120101|120000
B|20120202|010000
B|20120202|012030

According to the data up, the result that I want to obtain is the following:

ClientID|Date(YMD)|Time(HMS)|Difference(minutes)
A|20120101|110000|0.00
A|20120101|111500|15.00
A|20120101|120000|45.00
B|20120202|010000|0
B|20120202|012030|20.30

The problem now is that, analyzing all this with a data.frame of 8M observations, it takes like 3 days. I wish I could parallelize this process. My idea is that the data.frame could be segmented by clusters, but this segmentation could be in order and not randomly, and then using the library foreach or another library, could take by clusters the analysis and set it to the number of cores available. For example:

Cluster|ClientID|Date(YMD)|Time(HMS)
CORE 1|
1|A|20120101|110000
1|A|20120101|111500
1|A|20120101|120000
CORE 2|
2|B|20120202|010000
2|B|20120202|012030

回答1:


I wouldn't recommend trying to parallelize this. Using the data.table package and working with times stored in an integer format this should take a pretty trivial amount of time.

Generate some example data

library(data.table)

## Generate Data
RowCount <- 8e6
GroupCount <-1e4

DT <- data.table(ClientID = paste0("Client ",sample.int(GroupCount,size = RowCount, replace = TRUE)),
                 Time = sample.int(12,size = RowCount, replace = TRUE)*900)

DT[, Time := cumsum(Time), keyby = .(ClientID)]
DT[, Time := as.POSIXct(Time, tz = "UTC", origin = "1970-01-01 00:00:00")]

print(DT)

gives

            ClientID                Time
      1:    Client 1 1970-01-01 02:30:00
      2:    Client 1 1970-01-01 04:00:00
      3:    Client 1 1970-01-01 05:30:00
      4:    Client 1 1970-01-01 07:00:00
      5:    Client 1 1970-01-01 10:00:00
     ---                                
7999996: Client 9999 1970-02-20 18:15:00
7999997: Client 9999 1970-02-20 18:30:00
7999998: Client 9999 1970-02-20 21:00:00
7999999: Client 9999 1970-02-20 22:45:00
8000000: Client 9999 1970-02-21 00:30:00

Calculate time differences

system.time({
  ## Create a integer column that stores time as the number of seconds midnight on 1970
  DT[,Time_Unix := as.integer(Time)]

  ## Order by ClientID then Time_Unix
  setkey(DT, ClientID, Time_Unix)

  ## Calculate Elapsed Time in minutes between rows, grouped by ClientID
  DT[, Elapsed_Minutes := (Time_Unix - shift(Time_Unix, n = 1L, type = "lag", fill = NA))/60L, keyby = .(ClientID)]

  ## Clean up the integer time
  DT[,Time_Unix := NULL]
})

...

   user  system elapsed 
  0.416   0.025   0.442 

Results:

print(DT)

...

            ClientID                Time Elapsed_Minutes
      1:    Client 1 1970-01-01 02:30:00              NA
      2:    Client 1 1970-01-01 04:00:00              90
      3:    Client 1 1970-01-01 05:30:00              90
      4:    Client 1 1970-01-01 07:00:00              90
      5:    Client 1 1970-01-01 10:00:00             180
     ---                                                
7999996: Client 9999 1970-02-20 18:15:00             135
7999997: Client 9999 1970-02-20 18:30:00              15
7999998: Client 9999 1970-02-20 21:00:00             150
7999999: Client 9999 1970-02-20 22:45:00             105
8000000: Client 9999 1970-02-21 00:30:00             105


来源:https://stackoverflow.com/questions/49177163/parallelization-over-for-loop-analyzing-a-data-frame

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!