optimizing manually-coded k-means in MATLAB?

傲寒 2020-12-18 14:26

So I'm writing a k-means script in MATLAB, since the native function doesn't seem to be very efficient, and it seems to be fully operational. It appears to work on the sma

1 answer
  • 2020-12-18 14:59

    Profiling will help, but the main thing to rework is the loop over the data points (for point = 1:size(data,1)). Vectorize that.

    Here is a quick partial example of what could go inside your iteration loop:

    [nPoints,nDims] = size(data);
    
    % Calculate all high-dimensional distances at once
    kdiffs = bsxfun(@minus,data,permute(mu_k,[3 2 1])); % NxDx1 - 1xDxK => NxDxK
    distances = sum(kdiffs.^2,2); % no need to do sqrt
    distances = squeeze(distances); % Nx1xK => NxK
    
    % Find closest cluster center for each point
    [~,ik] = min(distances,[],2); % Nx1
    
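    % Calculate the new cluster centers (mean of the points assigned to each)
    mu_k_new = zeros(c,nDims);
    clustersizes = zeros(c,1);
    for i = 1:c
        indk = ik==i;
        clustersizes(i) = nnz(indk);
        % Average over rows (dim 1) so a single-point cluster is handled correctly;
        % an empty cluster yields a row of NaN
        mu_k_new(i,:) = mean(data(indk,:),1);
    end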
    

    This isn't the only (or the best) way to do it, but it should be a decent example.
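
    As one possible alternative, the squared distances can be expanded as ||x||^2 - 2*x*mu' + ||mu||^2, which avoids building the NxDxK array, and accumarray can replace the loop over clusters. This is only a sketch: the toy data, the initial centers and the variable K are assumptions made up for illustration, not taken from the question.

    ```matlab
    % Toy inputs (assumed for illustration only)
    data = [0 0; 0 1; 10 10; 10 11];   % N-by-D points
    mu_k = [0 0.5; 10 10.5];           % K-by-D current centers
    K = size(mu_k,1);
    nDims = size(data,2);

    % Squared distances via ||x||^2 + ||mu||^2 - 2*x*mu'  => N-by-K
    distances = bsxfun(@plus,sum(data.^2,2),sum(mu_k.^2,2)') - 2*(data*mu_k');

    % Closest center for each point
    [~,ik] = min(distances,[],2);      % N-by-1

    % Cluster sizes and per-cluster sums; the loop runs over dimensions, not points
    clustersizes = accumarray(ik,1,[K 1]);             % K-by-1
    sums = zeros(K,nDims);
    for d = 1:nDims
        sums(:,d) = accumarray(ik,data(:,d),[K 1]);
    end

    % New centers; an empty cluster gives 0/0 = NaN in its row
    mu_k_new = bsxfun(@rdivide,sums,clustersizes);
    ```

    The expanded form can lose a little precision through cancellation (tiny negative values are possible), which usually does not matter when only the index of the minimum is needed.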

    Some other comments:

    1. Instead of using input, make this script into a function to efficiently handle input arguments.
    2. If you want an easy way to specify a file, see uigetfile.
    3. With many MATLAB functions, such as max, min, sum, mean, etc., you can specify the dimension over which the function should operate. This way you can run it on a matrix and compute values for multiple conditions/dimensions at the same time.
    4. Once you get decent performance, consider iterating longer, specifically until the centers no longer change or the number of samples that change clusters becomes small.
    5. Skipping the sqrt is safe: the closest cluster for each point, ik, is the same whether you compare Euclidean or squared Euclidean distances, because sqrt is monotonic.
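
    Points 1 and 4 above could be combined roughly as follows. This is a minimal sketch, not a definitive implementation: the function name kmeans_sketch, its arguments maxIter and tol, and their defaults are all hypothetical, and the stopping rule used here (largest center movement below tol) is just one of the two criteria mentioned.

    ```matlab
    function [mu_k,ik,nIter] = kmeans_sketch(data,mu_k,maxIter,tol)
    % KMEANS_SKETCH  Hypothetical wrapper; name, arguments and defaults are assumptions.
    %   data    : N-by-D points
    %   mu_k    : K-by-D initial centers
    %   maxIter : iteration cap (default 100)
    %   tol     : stop once no center moves farther than this (default 1e-6)
    if nargin < 3 || isempty(maxIter), maxIter = 100; end
    if nargin < 4 || isempty(tol), tol = 1e-6; end
    K = size(mu_k,1);
    for nIter = 1:maxIter
        % Squared distances, N-by-K (no sqrt: it does not change the argmin)
        kdiffs = bsxfun(@minus,data,permute(mu_k,[3 2 1]));   % N-by-D-by-K
        distances = permute(sum(kdiffs.^2,2),[1 3 2]);        % N-by-K
        [~,ik] = min(distances,[],2);                         % N-by-1
        % Update centers; an empty cluster keeps its previous center
        mu_k_new = mu_k;
        for i = 1:K
            indk = ik==i;
            if any(indk)
                mu_k_new(i,:) = mean(data(indk,:),1);
            end
        end
        % Largest Euclidean movement of any center in this iteration
        shift = max(sqrt(sum((mu_k_new - mu_k).^2,2)));
        mu_k = mu_k_new;
        if shift <= tol, break; end
    end
    end
    ```

    A call such as [mu,ik,nIter] = kmeans_sketch(data,data([1 3],:)) would use two rows of data as initial centers. Passing the data and initial centers as arguments also makes it easy to time different inputs with the profiler.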