Parallelization over for loop analyzing a data.frame

Question

For the past few days I have been working with a data.frame of 8M records, and I need to speed up a loop that analyzes this data.

I will describe each step of the problem I am trying to solve. First, I have to sort the whole data.frame in ascending order by three fields: ClientID, Date and Time (this part is done). Then, using the sorted data.frame, I must compute the difference between each observation and the previous one, which is only valid when both observations have the same ClientID. For example:

ClientID|Date(YMD)|Time(HMS)
A|20120101|110000
A|20120101|111500
A|20120101|120000
B|20120202|010000
B|20120202|012030

Given the data above, the result I want to obtain is the following:

ClientID|Date(YMD)|Time(HMS)|Difference(minutes)
A|20120101|110000|0.00
A|20120101|111500|15.00
A|20120101|120000|45.00
B|20120202|010000|0
B|20120202|012030|20.30

The problem is that running this analysis on a data.frame of 8M observations takes about 3 days. I would like to parallelize the process. My idea is to split the data.frame into clusters, in order rather than randomly, and then use the foreach library (or another one) to analyze the clusters in parallel across the available cores. For example:

Cluster|ClientID|Date(YMD)|Time(HMS)
CORE 1|
1|A|20120101|110000
1|A|20120101|111500
1|A|20120101|120000
CORE 2|
2|B|20120202|010000
2|B|20120202|012030

Answer 1:

I wouldn't recommend trying to parallelize this. With the data.table package, and with times stored in an integer format, this should take a trivial amount of time.
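The question keeps the date and time in separate YMD and HMS fields, so they first need to be turned into a single integer timestamp. The following is only a minimal sketch of one way to do that on the question's five sample rows. It assumes the two fields were read in as integers (so a time such as 010000 has lost its leading zero) and that all times are UTC; the column names Timestamp and Time_Unix are illustrative.

```r
library(data.table)

# The question's sample rows. Date and Time are ASSUMED to be stored as
# integers, so 01:00:00 appears as 10000 (leading zero dropped).
DT <- data.table(
  ClientID = c("A", "A", "A", "B", "B"),
  Date     = c(20120101L, 20120101L, 20120101L, 20120202L, 20120202L),
  Time     = c(110000L, 111500L, 120000L, 10000L, 12030L)
)

# Pad Time back to six digits, then parse date and time together.
# UTC is an assumption; use the time zone the data was recorded in.
DT[, Timestamp := as.POSIXct(
  paste(Date, sprintf("%06d", Time)),
  format = "%Y%m%d %H%M%S",
  tz = "UTC"
)]

# Whole seconds since 1970-01-01 00:00:00 UTC, stored as an integer
DT[, Time_Unix := as.integer(Timestamp)]

print(DT)
```

If the fields are already character strings with their leading zeros intact, the sprintf() padding step can be dropped and the two columns pasted together directly.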

Generate some example data

library(data.table)

## Generate Data
RowCount <- 8e6
GroupCount <- 1e4

DT <- data.table(ClientID = paste0("Client ",sample.int(GroupCount,size = RowCount, replace = TRUE)),
                 Time = sample.int(12,size = RowCount, replace = TRUE)*900)

DT[, Time := cumsum(Time), keyby = .(ClientID)]
DT[, Time := as.POSIXct(Time, tz = "UTC", origin = "1970-01-01 00:00:00")]

print(DT)

gives

            ClientID                Time
      1:    Client 1 1970-01-01 02:30:00
      2:    Client 1 1970-01-01 04:00:00
      3:    Client 1 1970-01-01 05:30:00
      4:    Client 1 1970-01-01 07:00:00
      5:    Client 1 1970-01-01 10:00:00
     ---                                
7999996: Client 9999 1970-02-20 18:15:00
7999997: Client 9999 1970-02-20 18:30:00
7999998: Client 9999 1970-02-20 21:00:00
7999999: Client 9999 1970-02-20 22:45:00
8000000: Client 9999 1970-02-21 00:30:00

Calculate time differences

system.time({
  ## Create an integer column that stores time as the number of seconds since midnight on 1970-01-01 (UTC)
  DT[,Time_Unix := as.integer(Time)]

  ## Order by ClientID then Time_Unix
  setkey(DT, ClientID, Time_Unix)

  ## Calculate Elapsed Time in minutes between rows, grouped by ClientID
  DT[, Elapsed_Minutes := (Time_Unix - shift(Time_Unix, n = 1L, type = "lag", fill = NA))/60L, keyby = .(ClientID)]

  ## Clean up the integer time
  DT[,Time_Unix := NULL]
})

...

   user  system elapsed 
  0.416   0.025   0.442

Results:

print(DT)

...

            ClientID                Time Elapsed_Minutes
      1:    Client 1 1970-01-01 02:30:00              NA
      2:    Client 1 1970-01-01 04:00:00              90
      3:    Client 1 1970-01-01 05:30:00              90
      4:    Client 1 1970-01-01 07:00:00              90
      5:    Client 1 1970-01-01 10:00:00             180
     ---                                                
7999996: Client 9999 1970-02-20 18:15:00             135
7999997: Client 9999 1970-02-20 18:30:00              15
7999998: Client 9999 1970-02-20 21:00:00             150
7999999: Client 9999 1970-02-20 22:45:00             105
8000000: Client 9999 1970-02-21 00:30:00             105
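
The output above has NA for the first row of each client, whereas the question asks for 0 there. As a small sketch, the same steps can be run on the question's five sample rows, with the NA values replaced by 0 afterwards. The timestamps are typed in directly as UTC for illustration, which is an assumption about the data.

```r
library(data.table)

# The question's sample rows, with date and time already combined (UTC assumed)
DT <- data.table(
  ClientID = c("A", "A", "A", "B", "B"),
  Time = as.POSIXct(
    c("2012-01-01 11:00:00", "2012-01-01 11:15:00", "2012-01-01 12:00:00",
      "2012-02-02 01:00:00", "2012-02-02 01:20:30"),
    tz = "UTC"
  )
)

# Same steps as above: integer seconds, sort, lagged difference per client
DT[, Time_Unix := as.integer(Time)]
setkey(DT, ClientID, Time_Unix)
DT[, Elapsed_Minutes := (Time_Unix - shift(Time_Unix, n = 1L, type = "lag", fill = NA)) / 60,
   by = .(ClientID)]

# The first row of each client has no predecessor; report 0 instead of NA
DT[is.na(Elapsed_Minutes), Elapsed_Minutes := 0]

DT[, Time_Unix := NULL]
print(DT)
```

For client B the last difference comes out as 20.5, because 20 minutes 30 seconds is 20.5 minutes in decimal form; the 20.30 in the question reads as a minutes.seconds notation.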

Source: https://stackoverflow.com/questions/49177163/parallelization-over-for-loop-analyzing-a-data-frame

Tags

parallel-processing

data.table

data-mining