Time series aggregation efficiency


I commonly need to summarize a time series with irregular timing using a given aggregation function (e.g., sum, average, etc.). However, the current solution that I have seem

5 answers

南方客

2020-12-20 16:16
Well, I have a solution that is almost as quick as the MEX version but uses only MATLAB. The logic is the same as in most of the other answers: build a dummy 2D mask matrix. Instead of using @eq, though, I initialize a logical array from the start and set its entries by linear index.

Elapsed time for mine is 0.172975 seconds. Elapsed time for Divakar 0.289122 seconds.
```
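function aggArray = aggregate(array, group, collapseFn)
    % group: vector of positive integer labels, one per row of array
    [m,~] = size(array);
    n = max(group);
    D = false(m,n);                      % D(r,g) is true when row r belongs to group g
    row = (1:m)';
    idx = m*(group(:) - 1) + row;        % linear index of (row, group(row)) in D
    D(idx) = true;
    aggArray = zeros(n,size(array,2));   % one output row per group
    for ii = 1:n
        aggArray(ii,:) = collapseFn(array(D(:,ii),:),1);
    end
end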
```
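As a small worked illustration of the linear-index trick used above, here is a sketch on toy data (the values and the names `aggSum` and `g` are made up for illustration, and the labels are assumed to be positive integers):

```matlab
% Toy data: 5 rows, 2 columns, integer group labels 1..3 (illustrative values)
array = [1 10; 2 20; 3 30; 4 40; 5 50];
group = [1; 1; 2; 3; 3];

m = size(array, 1);
n = max(group);
D = false(m, n);
D(m*(group - 1) + (1:m)') = true;   % column-major linear index of (row, group(row))

% D has exactly one true entry per row, in the column of that row's group
aggSum = zeros(n, size(array, 2));
for g = 1:n
    aggSum(g, :) = sum(array(D(:, g), :), 1);   % per-group column sums
end
disp(aggSum)
```

Because MATLAB stores matrices column by column, entry (r, g) of an m-row matrix sits at linear index (g-1)*m + r, which is what the assignment into `D` relies on.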

粉色の甜心

2020-12-20 16:22

Doing away with the inner loop, i.e.
```
function aggArray = aggregate(array, groupIndex, collapseFn)

groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));

for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    aggArray(iGr,:) = collapseFn(array(grIdx,:));
end
```

and calling the collapse function with a dimension parameter
```
res = aggregate(a, b, @(x) sum(x,1));
```
already gives some speedup (3x on my machine). It also avoids the errors that functions such as sum or mean produce when they are called without a dimension parameter on a single row of data: they then collapse across the columns rather than across the rows of the group.
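To make the single-row pitfall concrete, here is a tiny sketch with made-up values (the variable names are illustrative only):

```matlab
oneRow  = [1 2 3];          % a group that happens to contain a single row with 3 columns
noDim   = sum(oneRow);      % no dimension given: sums across the columns, giving a scalar
withDim = sum(oneRow, 1);   % dimension 1: sums down the rows, so the row comes back unchanged
```

Only `withDim` has one value per data column, which is what the aggregation loop expects when it assigns into a row of the output.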

If you have just one group label vector, i.e. the same group labels for all columns of data, you can speed things up further:
```
function aggArray = aggregate(array, groupIndex, collapseFn)

ng = max(groupIndex);   % labels are assumed to be the integers 1..ng
aggArray = nan(ng, size(array, 2));

for iGr = 1:ng
    aggArray(iGr,:) = collapseFn(array(groupIndex==iGr,:));
end
```
The latter function gives identical results for your example, with a 6x speedup, but cannot handle different group labels per data column.

Assuming a 2D test case for the group index (provided here as well, with 10 different columns for groupIndex):
```
a = rand(20006,10);
B = []; % make random length periods for each of the 10 signals
for i = 1:size(a,2)
    n0 = randi(10);
    b = transpose([ones(1,n0) 2*ones(1,11-n0) sort(repmat((3:4001), [1 5]))]);
    B = [B b];
end
tic; erg0 = aggregate(a, B, @sum); toc            % original method
tic; erg1 = aggregate2(a, B, @(x)sum(x,1)); toc   % just remove the inner loop
tic; erg2 = aggregate3(a, B, @(x)sum(x,1)); toc   % use function below
```
Elapsed time is 2.646297 seconds. Elapsed time is 1.214365 seconds. Elapsed time is 0.039678 seconds (!!!!).

```
function aggArray = aggregate3(array, groupIndex, collapseFn)
% Relies on the rows of each group being contiguous in groupIndex.

[groups, ix1] = unique(groupIndex, 'rows', 'first');  % first row of each group
[~, ix2]      = unique(groupIndex, 'rows', 'last');   % last row of each group

ng = size(groups,1);
aggArray = nan(ng, size(array, 2));

for iGr = 1:ng
    aggArray(iGr,:) = collapseFn(array(ix1(iGr):ix2(iGr),:));
end
```
I think this is as fast as it gets without using MEX. Thanks to Matthew Gunn for the suggestion! Profiling shows that 'unique' is really cheap here, and getting just the first and last index of the repeating rows in groupIndex speeds things up considerably. I get an 88x speedup with this iteration of the aggregation.
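Slicing from the first to the last occurrence only works when each group occupies one contiguous block of rows, as in the test data. A minimal sketch of how one might check that before trusting the result (the check and the names `counts`, `span` and `isContiguous` are my additions, not part of the answer):

```matlab
groupIndex = [1; 1; 2; 2; 2; 3];                           % toy labels, already grouped in blocks
[groups, ix1, jx] = unique(groupIndex, 'rows', 'first');   % first row of each group
[~, ix2]          = unique(groupIndex, 'rows', 'last');    % last row of each group

counts = accumarray(jx, 1);      % number of rows carrying each label
span   = ix2 - ix1 + 1;          % rows covered by the slice ix1:ix2
isContiguous = isequal(span, counts);

% Counter-example: label 1 is interrupted by label 2
shuffled = [1; 2; 1];
[~, s1, sj] = unique(shuffled, 'rows', 'first');
[~, s2]     = unique(shuffled, 'rows', 'last');
isContiguousShuffled = isequal(s2 - s1 + 1, accumarray(sj, 1));
```

If the check fails, sorting the data by group first (or falling back to one of the mask-based variants) would be needed.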


爱一瞬间的悲伤

2020-12-20 16:24

A little late to the party, but a single loop using accumarray makes a huge difference:
```
function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)

  [groups, ~, index] = unique(groupIndex, 'rows');   % index maps each row to its group
  numCols = size(array, 2);
  aggArray = nan(size(groups, 1), numCols);          % one output row per unique group
  for col = 1:numCols
    aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
  end

end
```
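For readers less familiar with accumarray, here is a minimal sketch on toy data (values and variable names are illustrative) showing how the third output of unique turns arbitrary labels into the subscripts that accumarray expects:

```matlab
values = [1; 2; 3; 4; 5];
labels = [7; 7; 9; 12; 12];             % arbitrary, non-consecutive labels
[groups, ~, index] = unique(labels);    % index maps each row to 1..numel(groups)

sums  = accumarray(index, values, [], @sum);    % one sum per unique label
means = accumarray(index, values, [], @mean);   % one mean per unique label
```

Each entry of `index` says which output slot the matching entry of `values` is accumulated into, so the per-column loop in the function above only has to call accumarray once per data column.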

Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:

original | 1.127141
 gnovice | 0.002205

Over a 500x speedup!


佛祖请我去吃肉

2020-12-20 16:30
Method #1

You can create the mask corresponding to grIdx across all groups in one go with bsxfun(@eq,..). Now, for collapseFn as @sum, you can bring in matrix-multiplication and thus have a completely vectorized approach, like so -
```
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = M.'*array
```
For collapseFn as @mean, you need to do a bit more work, as shown here -
```
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array
```
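As a rough illustration of Method #1, the following sketch runs both variants on toy data. The values are made up, `groups` is assumed to be `unique(groupIndex,'rows')` as in the question, and the explicit `double(M)` conversion is my addition for clarity:

```matlab
array      = [1 10; 2 20; 3 30; 4 40; 5 50];
groupIndex = [1; 1; 2; 3; 3];
groups     = unique(groupIndex, 'rows');

% M(r,g) is true when row r of groupIndex equals row g of groups
M = squeeze(all(bsxfun(@eq, groupIndex, permute(groups, [3 2 1])), 2));

sumAgg  = double(M).' * array;                    % per-group sums
W       = bsxfun(@rdivide, double(M), sum(M, 1)); % each column scaled by 1/(group size)
meanAgg = W.' * array;                            % per-group means
```

The matrix product adds up exactly the rows selected by each column of `M`, which is why this only applies to collapse functions that can be written as a weighted sum.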
Method #2

In case you are working with a generic collapseFn, you can use the 2D mask M created with the previous method to index into the rows of array, thus changing the complexity from O(n^2) to O(n). Some quick tests suggest that this gives an appreciable speedup over the original loopy code. Here's the implementation -
```
n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
    out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end
```
Please note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn would be applied, so that 1 is essential there.

Bonus

Judging by its name, groupIndex seems to hold integer values. That can be exploited for a more efficient creation of M: treat each row of groupIndex as an indexing tuple, convert it into a scalar, and so obtain a 1D array version of groupIndex. This should be more efficient, as the data size would be O(n) now. The resulting M can be fed to all the approaches listed in this post. So, we would have M like so -
```
dims = max(groupIndex,[],1);
agg_dims = cumprod([1 dims(end:-1:2)]);
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).');

m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + [1:m]') = 1;
```
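To see what the row-to-scalar conversion does, here is a small sketch with a made-up two-column groupIndex of positive integers (the names `keys` and `idxRows` are illustrative):

```matlab
groupIndex = [1 1; 1 1; 1 2; 2 1; 2 1];    % toy two-column labels, positive integers

dims     = max(groupIndex, [], 1);          % largest label in each column
agg_dims = cumprod([1 dims(end:-1:2)]);     % place values, least significant first
keys     = groupIndex * agg_dims(end:-1:1).';   % one scalar key per row
[~, ~, idx] = unique(keys);

% For comparison: the same grouping obtained directly from the rows
[~, ~, idxRows] = unique(groupIndex, 'rows');
```

Each row is read like a multi-digit number whose place values come from the column maxima, so two rows get the same key only if they are identical.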
温柔的废话

2020-12-20 16:39
Mex Function 1

HAMMER TIME: a MEX function to crush it. The base-case test with the original code from the question took 1.334139 seconds on my machine. IMHO, the second-fastest answer, from @Divakar, is:
```
groups2 = unique(groupIndex);
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array;
```
Elapsed time is 0.589330 seconds.

Then my MEX function:
```
[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));
```
Elapsed time is 0.079725 seconds.

Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. Results also match original code.

Code to generate the test conditions:
```
array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
```
For pure speed, go MEX. If the thought of compiling C++ code, or the added complexity, is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.

Mex Approach 2

Somewhat surprisingly to me, this code appears even faster than the full MEX version in some cases (e.g., in this test it took about 0.05 seconds). It uses a MEX function, mg_getRowsWithKey, to figure out the row indices of each group. I think the difference may be because my array copying in the full MEX function isn't as fast as it could be, and/or because of the overhead of calling 'feval'. It has basically the same algorithmic complexity as the other version.
```
[unique_groups, map] = mg_getRowsWithKey(groupIndex);

results = zeros(length(unique_groups), size(array,2));

for iGr = 1:length(unique_groups)
   array_subset             = array(map{iGr},:);

   %// do your collapse function on array_subset. eg.
   results(iGr,:)           = sum(array_subset, 1);
end
```
When you do array(groups(1)==groupIndex,:) to pull out array entries associated with the full group, you're searching through the ENTIRE length of groupIndex. If you have millions of row entries, this will totally suck. array(map{1},:) is far more efficient.

There's still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in c++ in such a way to avoid copying of memory, probably another 2x speedup can be achieved.
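The source of mg_getRowsWithKey is not shown here. Judging only from how `map` is used above (a cell array holding the row indices of each group), a pure-MATLAB stand-in could look roughly like the sketch below. This is an assumption about the output format, not the author's implementation, and it will not match the MEX version's speed:

```matlab
array      = [1 10; 2 20; 3 30; 4 40; 5 50];   % toy data
groupIndex = [3; 1; 3; 2; 1];                   % toy labels, in no particular order

% Hypothetical stand-in for mg_getRowsWithKey: one cell of row indices per group
[unique_groups, ~, idx] = unique(groupIndex, 'rows');
map = accumarray(idx, (1:numel(idx))', [], @(r) {sort(r)});

results = zeros(size(unique_groups, 1), size(array, 2));
for iGr = 1:size(unique_groups, 1)
    results(iGr, :) = sum(array(map{iGr}, :), 1);   % collapse only that group's rows
end
```

Building `map` once means each group is looked up by a short index list instead of a comparison over the full length of groupIndex on every iteration.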
