Question
I'm quite new to MATLAB, and I'm curious how to do this:
I have a rather large (27000x11) matrix. The 8th column contains a number that changes occasionally but stays the same for roughly 2000 rows (not necessarily consecutive).
I would like to calculate the mean of the entries in the 3rd column over the rows where the 8th column has the same value, and do this for each distinct value in the 8th column. I would also like to plot the 3rd column's means as a function of the 8th column's value, but I can do that myself once I have a new matrix (N x 2) containing [mean_of_3rd, 8th].
Example (a smaller matrix for convenience):
1 2 3 4 5
3 7 5 3 2
1 3 2 5 3
4 5 7 5 8
2 4 7 4 4
Since the 4th column has the same value (4) in rows 1 and 5, I'd like to calculate the mean of 2 and 4 (the corresponding elements of column 2) and put it in another matrix together with the 4th column's value. The same goes for 3 and 5 (column 2 of rows 3 and 4), since the 4th column has the same value (5) in those two rows.
3 4
4 5
And so on. Is there an easy way to do this?
Answer 1:
Use the almighty, underused accumarray:
This line gives you the mean values of the 2nd column, grouped by the value in the 4th column:
means = accumarray( A(:,4) ,A(:,2),[],@mean)
This line gives you the number of elements in each set:
count = accumarray( A(:,4) ,ones(size(A(:,4))))
Now, if you want to keep only those values that occur more than once:
>> filtered = means(count>1)
filtered =
3
4
This works only for positive integers in the 4th column.
Another way to count the number of elements in each set:
count = accumarray( A(:,4) ,A(:,4),[],@numel)
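To get the two-column matrix the question asks for, the group values have to be paired with their means. Here is a minimal sketch on the question's sample matrix; the variable names keys, vals, keep and result are mine, and it still assumes positive integers in the 4th column:

```matlab
% Sample data from the question
A = [1 2 3 4 5
     3 7 5 3 2
     1 3 2 5 3
     4 5 7 5 8
     2 4 7 4 4];

keys  = A(:,4);                                % grouping column (positive integers only)
means = accumarray(keys, A(:,2), [], @mean);   % mean of column 2 for each key value
count = accumarray(keys, 1);                   % number of rows for each key value
vals  = (1:numel(means)).';                    % row k of means belongs to key value k
keep  = count > 1;                             % key values that occur more than once
result = [means(keep) vals(keep)]
```

For the sample matrix this gives the rows [3 4] and [4 5], which is the output the question describes.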
Answer 2:
A slightly refined approach based on the ideas of Andrey and Rody. We cannot use accumarray directly, since the data is real-valued, not integer. But we can use unique to find the indices of the repeating entries, and then operate on those integer indices.
% get unique entries in 4th column
[R, I, J] = unique(A(:,4));
% count the repeating entries: now we have integer indices!
counts = accumarray(J, 1, size(R));
% sum the 2nd column for all entries
sums = accumarray(J, A(:,2), size(R));
% compute means
means = sums./counts;
% choose only the entries that show more than once in 4th column
inds = counts>1;
result = [means(inds) R(inds)];
Time comparison for the following synthetic data:
A=randi(100, 1000000, 5);
% Rody's solution
Elapsed time is 0.448222 seconds.
% The above code
Elapsed time is 0.148304 seconds.
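As a quick check that the unique + accumarray approach copes with non-integer keys, here is a small self-contained sketch. The matrix below is made up for illustration (it is not from the question), and the J(:) is only there to force a column vector regardless of MATLAB version:

```matlab
% Made-up data: column 4 holds non-integer keys, column 2 the values to average
A = [0.5 2 0 1.50 0
     0.1 7 0 0.25 0
     0.3 3 0 2.75 0
     0.4 5 0 2.75 0
     0.2 4 0 1.50 0];

[R, ~, J] = unique(A(:,4));                    % R: sorted distinct keys, J: group index per row
counts = accumarray(J(:), 1, size(R));         % rows per distinct key
sums   = accumarray(J(:), A(:,2), size(R));    % sum of column 2 per distinct key
means  = sums ./ counts;
inds   = counts > 1;                           % keys that occur more than once
result = [means(inds) R(inds)]
```

Here the key 0.25 occurs once and is dropped, while 1.5 and 2.75 each occur twice, so result has the rows [3 1.5] and [4 2.75].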
Answer 3:
My official answer:
A4 = A(:,4);
R = unique(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
result = [means(inds) R(inds)];
The reasoning is as follows. Here are all of the alternatives we've come up with, in profiling form:
%# sample data
A = [
1 2 3 4 5
3 7 5 3 2
1 3 2 5 3
4 5 7 5 8
2 4 7 4 4];
%# accumarray
%# works only on positive integers in A(:,4)
tic
for ii = 1:1e4
means = accumarray( A(:,4) ,A(:,2),[],@mean);
count = accumarray( A(:,4) ,ones(size(A(:,4))));
filtered = means(count>1);
end
toc
%# arrayfun
%# works only on integers in A(:,4)
tic
for ii = 1:1e4
B = arrayfun(@(x) A(A(:,4)==x, 2), min(A(:,4)):max(A(:,4)), 'uniformoutput', false);
filtered = cellfun(@mean, B(cellfun(@(x) numel(x)>1, B)) );
end
toc
%# ordinary loop
%# works only on integers in A(:,4)
tic
for ii = 1:1e4
A4 = A(:,4);
R = min(A4):max(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
filtered = means(inds);
end
toc
Results:
Elapsed time is 1.238352 seconds. %# (accumarray)
Elapsed time is 7.208585 seconds. %# (arrayfun + cellfun)
Elapsed time is 0.225792 seconds. %# (for loop)
The ordinary loop is clearly the way to go here.
Note the absence of mean in the inner loop. This is because mean is not a MATLAB built-in function (at least on R2010), so using it inside the loop disqualifies the loop from JIT compilation, which slows it down by a factor of more than 10. Using the form above makes the loop almost 5.5 times as fast as the accumarray solution.
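The two forms of the inner loop can be compared side by side with a sketch like the one below. It only shows that sum(...)/sum(I) reproduces what mean returns and how one might time the two; the variable names m_mean, m_sum, t_mean and t_sum are mine, and the timings will depend heavily on the MATLAB version and on the size of the data, so no numbers are claimed here:

```matlab
A = [1 2 3 4 5
     3 7 5 3 2
     1 3 2 5 3
     4 5 7 5 8
     2 4 7 4 4];
A4 = A(:,4);
R  = unique(A4);
m_mean = zeros(size(R));
m_sum  = zeros(size(R));

tic
for jj = 1:numel(R)
    m_mean(jj) = mean(A(A4 == R(jj), 2));   % calls mean inside the loop
end
t_mean = toc;

tic
for jj = 1:numel(R)
    I = A4 == R(jj);
    m_sum(jj) = sum(A(I,2)) / sum(I);       % same quantity via sum and a count
end
t_sum = toc;

fprintf('mean: %g s, sum/count: %g s\n', t_mean, t_sum);
```

Both loops produce the same per-group means; for a meaningful timing the loops would need to run on much larger data or be repeated many times, as in the profiling code above.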
Judging by your comment: it is almost trivial to change the loop to work on all entries in A(:,4) (not just the integers):
A4 = A(:,4);
R = unique(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
filtered = means(inds);
Which I will copy-paste to the top as my official answer :)
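Applied to the layout in the original question (values in column 3, keys in column 8), the same loop might look like the sketch below. The matrix M is a made-up stand-in for the 27000x11 data, and every distinct key is kept, since the question asks for a mean per value of the 8th column:

```matlab
% Made-up stand-in for the real matrix: only columns 3 and 8 are filled in
M = zeros(6, 11);
M(:,8) = [0.5; 1.5; 0.5; 2.5; 1.5; 0.5];   % grouping values (need not be integers)
M(:,3) = [1; 2; 3; 4; 6; 5];               % values to average

keys  = M(:,8);
R     = unique(keys);                      % sorted distinct values of column 8
means = zeros(size(R));
for jj = 1:numel(R)
    I = keys == R(jj);                     % rows belonging to this value
    means(jj) = sum(M(I,3)) / sum(I);      % mean of column 3 over those rows
end
result = [means R]                         % one row per value: [mean_of_3rd, 8th]
```

The means can then be plotted against the 8th column's values with something like plot(result(:,2), result(:,1), 'o-').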
Source: https://stackoverflow.com/questions/12788767/in-matlab-calculate-mean-in-a-part-of-one-column-where-another-column-satisfies