Time series aggregation efficiency

傲寒 2020-12-20 15:46

I commonly need to summarize a time series with irregular timing using a given aggregation function (e.g., sum, average). However, the current solution I have seems inefficient.
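
The question's own code is cut off above; a minimal sketch of the kind of loop-based baseline the answers below benchmark against (my reconstruction, not the asker's exact code) is:

    % Loop over groups, using logical indexing to pull out each group's rows.
    groups   = unique(groupIndex);
    aggArray = zeros(numel(groups), size(array, 2));
    for iGr = 1:numel(groups)
        % This comparison rescans all of groupIndex on every iteration.
        aggArray(iGr, :) = sum(array(groups(iGr) == groupIndex, :), 1);
    end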

5 Answers
  •  温柔的废话
    2020-12-20 16:39

    MEX Function 1

    HAMMER TIME: a MEX function to crush it. The baseline test with the original code from the question took 1.334139 seconds on my machine. IMHO the runner-up, @Divakar's answer, is:

    groups2 = unique(groupIndex);
    % Compare every row of groupIndex against every group at once (n-by-1-by-g logical mask),
    % then matrix-multiply the transposed mask into array to sum each group's rows.
    aggArray2 = squeeze(all(bsxfun(@eq, groupIndex, permute(groups2, [3 2 1])), 2)).' * array;
    

    Elapsed time is 0.589330 seconds.

    Then my MEX function:

    [groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));
    

    Elapsed time is 0.079725 seconds.

    Testing that we get the same answer: norm(groups2 - groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15 (floating-point noise). The results also match the original code from the question.

    Code to generate the test conditions:

    array = rand(20006,10);
    groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
    
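    (The "Elapsed time is ..." lines above are the standard printout of MATLAB's tic/toc, e.g.:)

    tic;
    [groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));
    toc;   % prints "Elapsed time is ... seconds."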

    For pure speed, go MEX. If the thought of compiling C++ code and the added complexity is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.

    MEX Approach 2

    Somewhat surprisingly, this code appears even faster than the full MEX version in some cases (e.g., this test took about 0.05 seconds). It uses a MEX function mg_getRowsWithKey to figure out the row indices of each group. I think the difference may be because the array copying in my full MEX function isn't as fast as it could be, and/or because of the overhead of calling 'feval'. It has basically the same algorithmic complexity as the other version.

    [unique_groups, map] = mg_getRowsWithKey(groupIndex);
    
    results = zeros(length(unique_groups), size(array,2));
    
    for iGr = 1:length(unique_groups)
        array_subset = array(map{iGr}, :);
    
        % Apply your collapse function to array_subset, e.g.:
        results(iGr, :) = sum(array_subset, 1);
    end
    

    When you do array(groups(1)==groupIndex,:) to pull out the array entries for a group, you scan the ENTIRE length of groupIndex, once per group. If you have millions of row entries, this will totally suck. array(map{1},:), which uses precomputed indices, is far more efficient.
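
    If you don't want to compile mg_getRowsWithKey, an equivalent index map can be built once in pure MATLAB (a sketch; note accumarray makes no guarantee about the ordering of indices within each group, which is fine for order-insensitive collapses like sum):

    [unique_groups, ~, ic] = unique(groupIndex);
    % Collect, for each group, the row indices belonging to it.
    map = accumarray(ic, (1:numel(ic)).', [], @(rows) {rows});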


    There's still unnecessary memory copying and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in C++ in a way that avoids copying memory, another roughly 2x speedup could probably be achieved.
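
    To make that concrete, here is a minimal C++ MEX sketch of a hard-coded sum aggregator with no feval and no subset copies. This is an illustration only, not the source of mg_aggregate, and it assumes groupIndex is sorted so that each group's rows are contiguous (true for the test data above):

    // sum_by_group.cpp -- illustrative sketch, not the source of mg_aggregate.
    // Assumes groupIndex is sorted so each group's rows are contiguous.
    // Build with:  mex sum_by_group.cpp
    #include "mex.h"
    
    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        if (nrhs != 2)
            mexErrMsgTxt("Usage: [groups, agg] = sum_by_group(array, groupIndex)");
    
        const mwSize n = mxGetM(prhs[0]);      // number of rows
        const mwSize m = mxGetN(prhs[0]);      // number of columns
        const double *A   = mxGetPr(prhs[0]);  // data, column-major
        const double *key = mxGetPr(prhs[1]);  // group label per row
    
        // Pass 1: count contiguous groups.
        mwSize g = (n > 0) ? 1 : 0;
        for (mwSize i = 1; i < n; ++i)
            if (key[i] != key[i - 1]) ++g;
    
        plhs[0] = mxCreateDoubleMatrix(g, 1, mxREAL);  // group labels
        plhs[1] = mxCreateDoubleMatrix(g, m, mxREAL);  // zero-initialized sums
        double *groups = mxGetPr(plhs[0]);
        double *agg    = mxGetPr(plhs[1]);
    
        // Pass 2: accumulate each input row into its group's output row.
        // MATLAB is column-major: element (i, j) of an n-row matrix is at i + j*n.
        mwSize gi = 0;
        for (mwSize i = 0; i < n; ++i) {
            if (i > 0 && key[i] != key[i - 1]) ++gi;
            groups[gi] = key[i];
            for (mwSize j = 0; j < m; ++j)
                agg[gi + j * g] += A[i + j * n];
        }
    }

    Called as [groups3, aggArray3] = sum_by_group(array, groupIndex), this avoids both the feval dispatch and the array_subset copy; a general version would first sort (or hash) unsorted keys.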
