Speeding up an interpolation exercise


There are a lot of things in your code that could be improved, but let's focus on the main bottleneck here. The problem at hand can be considered embarrassingly parallel: your data can be split into multiple smaller pieces that can each be processed on a separate thread with hardly any extra overhead.

To see the parallelisation possibilities for the current problem, first note that you are performing exactly the same calculations for each of the individual firms and/or years separately. You could, for example, split the calculations into smaller subtasks per individual year and then allocate these subtasks to different CPU/GPU cores. A significant performance gain can be obtained in this manner. Finally, when the subtasks have been processed, the only thing left to do is merge the results; the toy snippet below illustrates this split-apply-combine idea.
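To make the idea concrete, here is a minimal sequential sketch of that split-apply-combine pattern on a toy data set (the per-year computation here is just a mean, not the actual salary interpolation):

# conceptual split-apply-combine: split by year, process each chunk independently, recombine
my_data  <- data.frame(year = rep(2000:2002, each = 3), value = rnorm(9))
chunks   <- split(my_data, my_data$year)                       # one piece per year
results  <- lapply(chunks, function(d)                         # process each piece on its own
                   data.frame(year = d$year[1], mean_value = mean(d$value)))
combined <- do.call(rbind, results)                            # merge the per-year results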

However, base R and its built-in libraries run as a single thread. You will have to explicitly split up your data and then assign the subtasks to different cores. A number of R packages provide support for this kind of parallelism; we will use the doParallel package in this example, and the skeleton below shows the general pattern before we apply it to the actual problem.
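For reference, the usual doParallel pattern looks roughly like this. It is only a skeleton, reusing the toy chunks from the previous snippet; the concrete version for your problem follows further down:

library(doParallel)
cl <- makeCluster(detectCores())      # one worker per available core
registerDoParallel(cl)
results <- foreach(d = chunks) %dopar% {               # each chunk is handled by a worker
  data.frame(year = d$year[1], mean_value = mean(d$value))
}
combined <- do.call(rbind, results)                    # merge the per-chunk results
stopCluster(cl)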

You did not provide an explicit dataset that is big enough to test the performance effectively, so we will first create some random data:

library(data.table)

set.seed(42)
# 1e6 dummy observations: 10 firms, years 1996-2015, positions 4-5
wages<-data.table(firm=substr(10001:10010,2,5)[sample(10,size=1e6,replace=TRUE)],
                  year=round(runif(1e6,1996,2015)),
                  position=round(runif(1e6,4,5)),
                  exp=round(runif(1e6,1,40)),
                  salary=round(exp(rnorm(1e6,mean=10.682,sd=.286))))
# key on year so that wages[.(i)] can use fast binary search in the parallel section
setkey(wages,year,firm,position)
> wages
         firm year position exp salary
      1: 0001 1996        4  14  66136
      2: 0001 1996        4   3  42123
      3: 0001 1996        4   9  46528
      4: 0001 1996        4  11  35195
      5: 0001 1996        4   2  43926
     ---                              
 999996: 0010 2015        5  11  43140
 999997: 0010 2015        5  23  64025
 999998: 0010 2015        5  31  35266
 999999: 0010 2015        5  11  36267
1000000: 0010 2015        5   7  44315

Now, let's run the first part of your code:

#get the range of experience for each firm
wages[,min_exp:=min(exp),by=.(year,firm,position)]
wages[,max_exp:=max(exp),by=.(year,firm,position)]
#Can't interpolate if there are only 2 or 3 unique experience cells represented
wages[,node_count:=length(unique(exp)),by=.(year,firm,position)]
#Nor if there are too few teachers
wages[,ind_count:=.N,by=.(year,firm,position)]
#Also troublesome when there is little variation in salaries like so:
wages[,sal_scale_flag:=mean(abs(salary-mean(salary)))<50,by=.(year,firm,position)]
wages[,sal_count_flag:=length(unique(salary))<5,by=.(year,firm,position)]
> wages
         firm year position exp salary min_exp max_exp node_count ind_count sal_scale_flag sal_count_flag
      1: 0001 1996        4  14  66136       1      40         40      1373          FALSE          FALSE
      2: 0001 1996        4   3  42123       1      40         40      1373          FALSE          FALSE
      3: 0001 1996        4   9  46528       1      40         40      1373          FALSE          FALSE
      4: 0001 1996        4  11  35195       1      40         40      1373          FALSE          FALSE
      5: 0001 1996        4   2  43926       1      40         40      1373          FALSE          FALSE
     ---                                                                                                 
 999996: 0010 2015        5  11  43140       1      40         40      1326          FALSE          FALSE
 999997: 0010 2015        5  23  64025       1      40         40      1326          FALSE          FALSE
 999998: 0010 2015        5  31  35266       1      40         40      1326          FALSE          FALSE
 999999: 0010 2015        5  11  36267       1      40         40      1326          FALSE          FALSE
1000000: 0010 2015        5   7  44315       1      40         40      1326          FALSE          FALSE
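The next step calls the cobs_extrap() helper that you defined in your question. For readers who want to reproduce this answer without the original post, a rough stand-in could look like the one below; note that this is only an assumed shape (a monotone B-spline fit via cobs::cobs, evaluated on the 0:40 experience grid after clamping it to the observed range), not your actual function:

library(cobs)
# hypothetical stand-in for the question's cobs_extrap(): fit a constrained
# B-spline of salary on experience and evaluate it on the 0:40 grid,
# clamping the grid to the observed experience range before prediction
cobs_extrap <- function(exp, salary, min_exp, max_exp, grid = 0:40) {
  fit <- cobs(exp, salary, constraint = "increase")
  z   <- pmin(pmax(grid, min_exp[1]), max_exp[1])
  as.numeric(predict(fit, z = z)[, "fit"])
}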

We will now process the wages in a single-threaded manner, as you have done before. Note that we keep the original wages table intact so that we can later run the multithreaded version on the same data and compare the results:

start <- Sys.time()
salary_scales_1 <-
  wages[node_count>=7&ind_count>=10
        &sal_scale_flag==0&sal_count_flag==0,
        .(exp=0:40,salary=cobs_extrap(exp,salary,min_exp,max_exp)),
        by=.(firm,year,position)]
print(paste("No Parallelisation time: ",Sys.time()-start))
> print(paste("No Parallelisation time: ",Sys.time()-start))
[1] "No Parallelisation time:  1.13971961339315"
> salary_scales_1
       firm year position exp   salary
    1: 0001 1996        4   0 43670.14
    2: 0001 1996        4   1 43674.00
    3: 0001 1996        4   2 43677.76
    4: 0001 1996        4   3 43681.43
    5: 0001 1996        4   4 43684.99
   ---                                
16396: 0010 2015        5  36 44464.02
16397: 0010 2015        5  37 44468.60
16398: 0010 2015        5  38 44471.35
16399: 0010 2015        5  39 44472.27
16400: 0010 2015        5  40 43077.70

It took about 1 minute, 8 seconds to process everything; the value printed above is in minutes, because Sys.time()-start returns a difftime whose unit is chosen automatically. Note that we only have 10 different firms in our dummy example, which is why the processing time is modest in comparison to your local results.
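If you prefer both timings in the same unit, you can request seconds explicitly:

# report elapsed time explicitly in seconds so the two timings are directly comparable
elapsed <- difftime(Sys.time(), start, units = "secs")
print(paste("No parallelisation time:", round(as.numeric(elapsed), 2), "seconds"))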

Now, let's try to perform this task in a parallelised manner. As mentioned, for our example we would like to split up the data per year and assign the smaller subsets to separate cores. We will use the doParallel package for this purpose.

The first thing we need to do is create a cluster with a particular number of cores. In our example we will try to use all available cores. Next, we have to register the cluster and export some variables to the global environments of the subnodes. In this case the subnodes only need access to wages. Additionally, the required libraries have to be loaded on the nodes for the computation to work: here the nodes need both the data.table and cobs packages. The code looks like this:

library(doParallel)
start <- Sys.time()
cl <- makeCluster(detectCores())                    # one worker per available core
registerDoParallel(cl)
clusterExport(cl,c("wages"),envir=environment())    # give every node access to wages
clusterEvalQ(cl,library("data.table"))              # load the required packages on each node
clusterEvalQ(cl,library("cobs"))
salary_scales_2 <- foreach(i = 1996:2015) %dopar%
  {
    subSet <- wages[.(i)] # binary subsetting on the year key set above
    subSet[node_count>=7&ind_count>=10
           &sal_scale_flag==0&sal_count_flag==0,
           .(exp=0:40,
             salary=cobs_extrap(exp,salary,min_exp,max_exp)),
           by=.(firm,year,position)]
  }
stopCluster(cl)
print(paste("With parallelisation time: ",Sys.time()-start))
> print(paste("With parallelisation time: ",Sys.time()-start))
[1] "With parallelisation time:  23.4177722930908"

We now have a list of data tables, salary_scales_2, containing the subresults for each individual year. Notice the speedup: this time it took only 23 seconds instead of the original 1.1 minutes, roughly a 65% reduction. The only thing left to do is merge the results. We can use do.call("rbind", salary_scales_2) to stack the rows of the tables (this takes almost no time, about 0.002 seconds on one run). Finally, we perform a small check to verify that the multithreaded results are indeed identical to those of the single-threaded run:

salary_scales_2<-do.call("rbind",salary_scales_2)
identical(salary_scales_1,salary_scales_2)
> identical(salary_scales_1,salary_scales_2)
[1] TRUE
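As a side note, the do.call("rbind", salary_scales_2) line above could equally have been written with data.table's own rbindlist(), which is typically a bit faster on a list of data tables:

# data.table-idiomatic alternative for merging the per-year pieces
salary_scales_2 <- rbindlist(salary_scales_2)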

REPLY TO COMMENT: It's a very interesting example indeed, but I think you might be missing the more important issue here. data.table does perform memory- and structure-related optimisations so that you can query and access your data more efficiently. However, in this example there is no major memory or search bottleneck, especially not when you compare it with the total number-crunching time spent inside the cobs function. For example, the line that you changed, subSet <- wages[year==uniqueYears[i],], takes only about 0.04 seconds per call when you time it.

If you use a profiler on your runs (a quick sketch follows below), you will notice that it is not data.table or any of its operations or groupings that begs for parallelisation; it is the cobs function that takes up almost all of the processing time (and this function does not even take a data.table as input). What we are doing in the example is reassigning the total workload of the cobs calls to different cores in order to achieve the speedup. Our intention is not to split up the data.table operations, since they are not costly at all. However, we do have to split up the data.table because we need separate chunks of data for the separate cobs runs. In fact, we have even taken advantage of data.table's efficiency while splitting and merging the table(s); this added virtually no extra time.
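If you want to check this yourself, base R's Rprof() is enough. A minimal sketch (reusing the wages table and flags from above) looks like this; the top of the by.total output should be dominated by the cobs routines rather than by data.table's subsetting or grouping:

# profile one single-threaded pass: most of the time ends up inside cobs,
# not inside data.table's subsetting or grouping
Rprof("interp_profile.out")
check <- wages[node_count>=7 & ind_count>=10 & sal_scale_flag==0 & sal_count_flag==0,
               .(exp=0:40, salary=cobs_extrap(exp, salary, min_exp, max_exp)),
               by=.(firm, year, position)]
Rprof(NULL)
head(summaryRprof("interp_profile.out")$by.total, 10)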
