Optimizing manually-coded k-means in MATLAB?

前端未结


So I'm writing a k-means script in MATLAB, since the native function doesn't seem to be very efficient, and it seems to be fully operational. It appears to work on the sma


1 answer

无人及你

2020-12-18 14:59
Profiling will help, but the main thing to rework is the loop over the data points (for point = 1:size(data,1)). Vectorize that.

Here is a quick partial example of what could go inside your iteration loop:
```
[nPoints,nDims] = size(data);

% Calculate all high-dimensional distances at once
kdiffs = bsxfun(@minus,data,permute(mu_k,[3 2 1])); % NxDx1 - 1xDxK => NxDxK
distances = sum(kdiffs.^2,2); % no need to do sqrt
distances = squeeze(distances); % Nx1xK => NxK

% Find closest cluster center for each point
[~,ik] = min(distances,[],2); % Nx1

% Calculate the new cluster centers (mean the data)
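mu_k_new = zeros(c,nDims);      % c is the number of clusters
clustersizes = zeros(c,1);
for i = 1:c
    indk = ik==i;
    clustersizes(i) = nnz(indk);
    % Give the dimension explicitly: without it, a cluster holding a single
    % point would be averaged across its columns and collapse to a scalar
    mu_k_new(i,:) = mean(data(indk,:),1);
end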
```
This isn't the only (or the best) way to do it, but it should be a decent example.
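As one alternative, if the intermediate NxDxK array gets too large for memory, the squared distances can be built straight into an NxK matrix from the expansion ||x - mu||^2 = ||x||^2 - 2*x*mu' + ||mu||^2. This is only a sketch: the toy data and centers below are made up for illustration, and the clipping step is my own precaution against round-off, not something from the code above.

```matlab
% Toy data (hypothetical): 6 points in 2-D and 2 cluster centers
data = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];
mu_k = [0 0; 10 10];

% Squared norms of the points and of the centers
xx = sum(data.^2, 2);       % Nx1
mm = sum(mu_k.^2, 2)';      % 1xK

% ||x||^2 + ||mu||^2 - 2*x*mu' => NxK, with no NxDxK intermediate
distances = bsxfun(@plus, xx, mm) - 2*(data*mu_k');
distances = max(distances, 0);   % clip tiny negatives caused by round-off

% Find closest cluster center for each point
[~, ik] = min(distances, [], 2); % Nx1
```

The matrix product does most of the work here, which MATLAB tends to handle well, but the expansion can lose some precision when points sit very close to a center.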

Some other comments:
1. Instead of using input, make this script into a function to efficiently handle input arguments.
2. If you want an easy way to specify a file, see uigetfile.
3. With many MATLAB functions, such as max, min, sum, mean, etc., you can specify a dimension over which the function should operate. This way you can run it on a matrix and compute values for multiple conditions/dimensions at the same time.
4. Once you get decent performance, consider iterating longer, specifically until the centers no longer change or the number of samples that change clusters becomes small.
5. The closest cluster for each point, ik, comes out the same whether you compare Euclidean or squared Euclidean distances, because the square root is monotonic. That is why the sqrt can be skipped.
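To make comments 3 and 4 concrete, here is a rough sketch of a full loop that stops once no point changes cluster or the centers barely move. The toy data, the initialization, maxIter and tol are all assumptions of mine for illustration. Keeping the old center when a cluster ends up empty is just one possible policy, not the only reasonable one.

```matlab
% Toy data (hypothetical): two well-separated groups in 2-D
data = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];
mu_k = data([1 4], :);   % assumed initialization: one point from each group
c = size(mu_k, 1);       % number of clusters
maxIter = 100;           % assumed safety cap on the number of iterations
tol = 1e-8;              % assumed threshold on center movement

ik = zeros(size(data,1), 1);
for iter = 1:maxIter
    % Assignment step: squared distances, NxDxK summed over dim 2 => NxK
    kdiffs = bsxfun(@minus, data, permute(mu_k, [3 2 1]));
    distances = reshape(sum(kdiffs.^2, 2), [], c);
    [~, ik_new] = min(distances, [], 2);

    % Update step: mean along dimension 1; an empty cluster keeps its old center
    mu_k_new = mu_k;
    for i = 1:c
        indk = ik_new == i;
        if any(indk)
            mu_k_new(i,:) = mean(data(indk,:), 1);
        end
    end

    % Convergence checks: points that switched cluster, largest center shift
    nChanged = nnz(ik_new ~= ik);
    shift = max(abs(mu_k_new(:) - mu_k(:)));
    mu_k = mu_k_new;
    ik = ik_new;
    if nChanged == 0 || shift < tol
        break
    end
end
```

You could loosen the stopping rule to something like nChanged being below a small fraction of the points if exact convergence takes too long on large data.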