Time series aggregation efficiency

傲寒 2020-12-20 15:46

I commonly need to summarize an irregularly sampled time series with a given aggregation function (e.g., sum or average). However, the current solution that I have seem

5 answers

温柔的废话 (OP)

2020-12-20 16:39
Mex Function 1

HAMMER TIME: a MEX function to crush it. The base-case test with the original code from the question took 1.334139 seconds on my machine. IMHO, the second-fastest answer is the one from @Divakar:
```
groups2 = unique(groupIndex);
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array;
```
Elapsed time is 0.589330 seconds.

Then my MEX function:
```
[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));
```
Elapsed time is 0.079725 seconds.

Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. Results also match original code.

Code to generate the test conditions:
```
array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
```
For pure speed, go MEX. If compiling C++ code or the added complexity is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.
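The equivalence check described above can be tried without the MEX file. The following is a minimal sketch, not the benchmark itself: the data set is tiny and made up, the baseline loop is only an assumption about what the question's original code roughly does, and the indicator-matrix line is a simplified 2-D form of the bsxfun one-liner rather than Divakar's exact expression.

```matlab
% Small, self-contained data set (sizes are illustrative, not the benchmark sizes).
array      = rand(12, 3);
groupIndex = [1 1 1 2 2 5 5 5 5 7 7 7].';   % group label for each row

% Baseline (assumed shape of the original code): one logical mask per group,
% which means a full scan of groupIndex for every group.
groups   = unique(groupIndex);
aggArray = zeros(numel(groups), size(array, 2));
for iGr = 1:numel(groups)
    aggArray(iGr, :) = sum(array(groupIndex == groups(iGr), :), 1);
end

% Indicator-matrix variant: build an N-by-G membership matrix, then do one
% matrix product to get the G-by-C per-group sums.
groups2   = unique(groupIndex);
indicator = bsxfun(@eq, groupIndex, groups2.');   % true where row n belongs to group g
aggArray2 = double(indicator).' * array;

% The two results should agree up to floating-point rounding.
maxAbsDiff = max(abs(aggArray(:) - aggArray2(:)));
fprintf('max abs difference: %g\n', maxAbsDiff);
```

Note that the matrix-product trick only covers sums (and, after dividing by group sizes, means); an arbitrary collapse function still needs a per-group loop.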

Mex Approach 2

Somewhat surprisingly to me, this code appears even faster than the full MEX version in some cases (e.g., in this test it took about 0.05 seconds). It uses a MEX function, mg_getRowsWithKey, to figure out the row indices of each group. I think it may be because the array copying in my full MEX function isn't as fast as it could be and/or because of overhead from calling 'feval'. It has basically the same algorithmic complexity as the other version.
```
[unique_groups, map] = mg_getRowsWithKey(groupIndex);

results = zeros(length(unique_groups), size(array,2));

for iGr = 1:length(unique_groups)
   array_subset             = array(map{iGr},:);

   %// do your collapse function on array_subset. eg.
   results(iGr,:)           = sum(array_subset, 1);
end
```
When you do array(groups(1)==groupIndex,:) to pull out the array entries associated with the first group, you're searching through the ENTIRE length of groupIndex, and you repeat that scan for every group. If you have millions of rows, this will totally suck. array(map{1},:) is far more efficient because map{1} already lists just the rows of that group.
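The source of mg_getRowsWithKey isn't shown here, so the following is only a pure-MATLAB stand-in for the same idea: build the group-to-rows map once, then index with it. It assumes the map is a cell array with one vector of row numbers per unique group, as the loop above suggests, and it is not expected to match the speed of the MEX version.

```matlab
% Illustrative data: a group label per row of array (values are made up).
groupIndex = [1 1 1 2 2 5 5 5 5 7 7 7].';
array      = rand(numel(groupIndex), 3);

% idx(k) is the position of groupIndex(k) within unique_groups.
[unique_groups, ~, idx] = unique(groupIndex);

% Collect the row numbers of each group into a cell array:
% map{g} lists the rows of array that belong to unique_groups(g).
rowNumbers = (1:numel(groupIndex)).';
map = accumarray(idx, rowNumbers, [], @(r) {sort(r)});

% Same loop as above, now indexing with a short row list per group
% instead of a full-length logical mask.
results = zeros(numel(unique_groups), size(array, 2));
for iGr = 1:numel(unique_groups)
    results(iGr, :) = sum(array(map{iGr}, :), 1);
end
```

The sort inside the anonymous function is there because accumarray does not promise to hand the values of a group to the function in their original order.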

There's still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in C++ in a way that avoids copying memory, another 2x speedup can probably be achieved.