Scaling the testing data for LIBSVM: MATLAB implementation

Posted by 蹲街弑〆低调 on 2019-12-03 12:27:26

Question


I currently use the MATLAB version of the LIBSVM support vector machine to classify my data. The LIBSVM documentation mentions that scaling before applying SVM is very important and we have to use the same method to scale both training and testing data.

The documentation explains the "same method of scaling" with an example: suppose that we scaled the first attribute of the training data from [-10, +10] to [-1, +1]. If the first attribute of the testing data lies in the range [-11, +8], we must scale the testing data to [-1.1, +0.8].
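To make the arithmetic of that example concrete, here is a minimal MATLAB sketch. The variable names (`train_min`, `train_max`, `lo`, `hi`, `test_vals`) are hypothetical and chosen only for illustration; this is not code from LIBSVM.

```matlab
% Range of the first attribute observed in the TRAINING data (from the example).
train_min = -10;
train_max = 10;

% Target interval that the training data was scaled to.
lo = -1;
hi = 1;

% Extreme values of the same attribute in the TESTING data.
test_vals = [-11, 8];

% Apply the linear map derived from the training data to the test values.
scaled = lo + (hi - lo) * (test_vals - train_min) / (train_max - train_min);

disp(scaled)   % the results are allowed to fall outside [lo, hi]
```

The test values are not rescaled to fill [-1, +1] themselves; they simply go through the map fitted on the training range, which gives -1.1 and 0.8.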

Scaling the training data to the range [0, 1] can be done using the following MATLAB code:

(data - repmat(min(data,[],1),size(data,1),1))*spdiags(1./(max(data,[],1)-min(data,[],1))',0,size(data,2),size(data,2))

But I don't know how to scale the testing data correctly.

Thank you very much for your help.


Answer 1:


The code you give essentially subtracts the minimum and then divides by the range. You need to store the minimum and range of each training data feature and reuse them when scaling the test data.

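% Compute the scaling parameters from the training data only.
minimums = min(data, [], 1);
ranges = max(data, [], 1) - minimums;

% Scale the training data to [0, 1].
data = (data - repmat(minimums, size(data, 1), 1)) ./ repmat(ranges, size(data, 1), 1);

% Scale the test data with the same training minimums and ranges.
test_data = (test_data - repmat(minimums, size(test_data, 1), 1)) ./ repmat(ranges, size(test_data, 1), 1);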
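As a small worked check of this approach, the sketch below applies it to made-up matrices (`train` and `test` are hypothetical data, not from the question) and uses `bsxfun` instead of `repmat`, which is only a stylistic alternative. The helper `scale` is an illustrative anonymous function, not part of LIBSVM.

```matlab
% Hypothetical data: rows are observations, columns are features.
train = [-10  0;
           0  5;
          10 10];
test  = [-11  2;
           8 12];

% Scaling parameters are taken from the training data only.
minimums = min(train, [], 1);
ranges   = max(train, [], 1) - minimums;

% bsxfun expands the 1-by-n row vectors across all rows of X.
scale = @(X) bsxfun(@rdivide, bsxfun(@minus, X, minimums), ranges);

train_scaled = scale(train);   % each column spans exactly [0, 1]
test_scaled  = scale(test);    % values may fall outside [0, 1]

disp(test_scaled)
```

With these numbers, the test value -11 maps to -0.05 and the test value 12 maps to 1.2, so scaled test data can leave the [0, 1] interval, consistent with the documentation's example.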



Answer 2:


Richante's code is, unfortunately, not correct if there are columns in which all of the observations have the same value (which may happen if the data is sparse). An example:

>> data = [1 2 3; 5 2 8; 7 2 100]

data =

     1     2     3
     5     2     8
     7     2   100

>> test_data = [1 2 3; 4 5 6; 7 8 9];
>> minimums = min(data,[],1);
>> ranges = max(data, [], 1) - minimums;
>> data = (data - repmat(minimums, size(data, 1), 1)) ./ repmat(ranges, size(data, 1), 1);
>> data

data =

         0       NaN         0
    0.6667       NaN    0.0515
    1.0000       NaN    1.0000

So you have to check whether there are columns that contain only a single value. But what if a column has only a single value in the entire training set, yet takes several values in the test set? And what should we do in the leave-one-out scenario, where the test set contains only one observation: what if all the values in a column of the training set are 0 and the corresponding value in the test set is 100? These are degenerate cases, but they can happen.

However, when I checked the file svm_scale.c in the LIBSVM library, I noticed this part:

void output(int index, double value)
{
    /* skip single-valued attribute */
    if(feature_max[index] == feature_min[index])
        return;

    if(value == feature_min[index])
        value = lower;
    else if(value == feature_max[index])
        value = upper;
    else
        value = lower + (upper-lower) * 
            (value-feature_min[index])/
            (feature_max[index]-feature_min[index]);

    if(value != 0)
    {
        printf("%d:%g ",index, value);
        new_num_nonzeros++;
    }
}

So should we ignore these cases? I don't really know. I'm not an authority on this issue, so I'm going to wait for another answer, preferably from LIBSVM's authors themselves, to clear things up.
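In the meantime, one possible way to mimic in MATLAB what the quoted `output` function does is sketched below. It rests on an assumption: since `output` prints nothing for a single-valued attribute, and an attribute omitted from LIBSVM's sparse format is read as zero, the sketch sets such columns to zero in both the training and the test set. The names `constant` and `safe_ranges` are hypothetical, and this is not the authors' recommended procedure.

```matlab
data      = [1 2 3; 5 2 8; 7 2 100];   % training data from the example above
test_data = [1 2 3; 4 5 6; 7 8 9];     % test data from the example above

minimums = min(data, [], 1);
ranges   = max(data, [], 1) - minimums;

% Columns with a single value in the training set would divide by zero.
constant = (ranges == 0);

% Use a divisor of 1 for those columns so that no NaN or Inf is produced.
safe_ranges = ranges;
safe_ranges(constant) = 1;

data_scaled = bsxfun(@rdivide, bsxfun(@minus, data, minimums), safe_ranges);
test_scaled = bsxfun(@rdivide, bsxfun(@minus, test_data, minimums), safe_ranges);

% Assumption: zero out single-valued training columns in both sets,
% so that the feature is effectively ignored by the classifier.
data_scaled(:, constant) = 0;
test_scaled(:, constant) = 0;

disp(data_scaled)
disp(test_scaled)
```

Dropping those columns from both matrices (for example with `data(:, ~constant)`) would be an alternative design. Whether either choice is the right treatment for the degenerate test-set cases described above remains an open question here.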



Source: https://stackoverflow.com/questions/10055396/scaling-the-testing-data-for-libsvm-matlab-implementation
