Scaling the testing data for LIBSVM: MATLAB implementation

Question

I currently use the MATLAB version of the LIBSVM support vector machine to classify my data. The LIBSVM documentation mentions that scaling before applying SVM is very important and we have to use the same method to scale both training and testing data.

The "same method of scaling" is explained as follows: suppose that we scaled the first attribute of the training data from [-10, +10] to [-1, +1]. If the first attribute of the testing data lies in the range [-11, +8], we must scale the testing data to [-1.1, +0.8].
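The arithmetic in that example can be checked with a short sketch. The variable names (`train_min`, `train_max`, `new_min`, `new_max`) are illustrative and not taken from LIBSVM; the point is only that the linear map is fitted on the training range and then reused unchanged on the test values.

```matlab
% Training range of the first attribute and the target range.
train_min = -10;  train_max = 10;
new_min   = -1;   new_max   = 1;

% Linear map fitted on the TRAINING range only.
scale = @(x) new_min + (new_max - new_min) .* (x - train_min) ./ (train_max - train_min);

% Test values outside the training range land outside [-1, +1].
test_values = [-11 8];
scaled_test = scale(test_values);   % approximately [-1.1 0.8]
disp(scaled_test)
```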

Scaling the training data to the range [0,1] can be done using the following MATLAB code:

(data - repmat(min(data,[],1),size(data,1),1))*spdiags(1./(max(data,[],1)-min(data,[],1))',0,size(data,2),size(data,2))

But I don't know how to scale the testing data correctly.

Thank you very much for your help.

Answer 1:

The code you give is essentially subtracting the minimum and then dividing by the range. You need to store the minimum and range of each training data feature and apply those same values to the test data.

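% Per-feature (column-wise) statistics, computed on the training data only
minimums = min(data, [], 1);
ranges = max(data, [], 1) - minimums;

% Scale the training data to [0, 1]
data = (data - repmat(minimums, size(data, 1), 1)) ./ repmat(ranges, size(data, 1), 1);

% Scale the test data with the training minimums and ranges
test_data = (test_data - repmat(minimums, size(test_data, 1), 1)) ./ repmat(ranges, size(test_data, 1), 1);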
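If a target range other than [0, 1] is wanted (for example the [-1, +1] range used in the LIBSVM documentation's example), the same idea can be sketched as below. The matrices are made-up values, and `rescale_with`, `new_min` and `new_max` are hypothetical names introduced for illustration; `bsxfun` is used here only as an alternative to `repmat`.

```matlab
% Small illustrative matrices (made-up values): rows are observations.
train = [-10 0; 0 5; 10 10];
test  = [-11 2; 8 12];

% Fit the scaling parameters on the training data only.
minimums = min(train, [], 1);
ranges   = max(train, [], 1) - minimums;

% Hypothetical helper: maps any matrix X into [new_min, new_max] using the
% stored training parameters. bsxfun expands the row vectors across rows.
new_min = -1;  new_max = 1;
rescale_with = @(X) new_min + (new_max - new_min) .* ...
    bsxfun(@rdivide, bsxfun(@minus, X, minimums), ranges);

train_scaled = rescale_with(train);  % every column spans exactly [-1, 1]
test_scaled  = rescale_with(test);   % may fall outside [-1, 1]
```

Because the test data reuses the training minimums and ranges, test values outside the training range map outside the target range, which is the behaviour the documentation describes. This sketch still divides by zero for a training column whose values are all equal.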

Answer 2:

Richante's code (Answer 1) is, unfortunately, not correct if there are columns in which all of the observations have the same value (which may happen if the data is sparse). An example:

>> data = [1 2 3; 5 2 8; 7 2 100]

data =

     1     2     3
     5     2     8
     7     2   100

>> test_data = [1 2 3; 4 5 6; 7 8 9];
>> minimums = min(data,[],1);
>> ranges = max(data, [], 1) - minimums;
>> data = (data - repmat(minimums, size(data, 1), 1)) ./ repmat(ranges, size(data, 1), 1);
>> data

data =

         0       NaN         0
    0.6667       NaN    0.0515
    1.0000       NaN    1.0000

So you have to check whether any column contains only a single value. But what if a column has a single value across the entire training set, yet takes several values in the test set? And what about the leave-one-out scenario, where the test set contains only one observation: what if all the values in a column of the training set are 0 and the corresponding value in the test set is 100? These are really degenerate cases, but they might happen. However, when I checked the file svm-scale.c in the LIBSVM library, I noticed this part:

void output(int index, double value)
{
    /* skip single-valued attribute */
    if(feature_max[index] == feature_min[index])
        return;

    if(value == feature_min[index])
        value = lower;
    else if(value == feature_max[index])
        value = upper;
    else
        value = lower + (upper-lower) * 
            (value-feature_min[index])/
            (feature_max[index]-feature_min[index]);

    if(value != 0)
    {
        printf("%d:%g ",index, value);
        new_num_nonzeros++;
    }
}

So should we ignore these cases? I don't really know. As I've said, I'm not an authority on this issue, so I'm going to wait for another answer, preferably from LIBSVM's authors themselves, to clear things up.
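A minimal sketch of one way to carry the "skip single-valued attribute" idea from the `output` function above over to dense MATLAB matrices, assuming that simply dropping such columns from both sets is acceptable. This is one reading of that branch, not a confirmed recommendation from LIBSVM's authors, and the `keep` mask and `scale` helper are hypothetical names. Note that it also discards any variation the test set shows in a column that was constant during training.

```matlab
data      = [1 2 3; 5 2 8; 7 2 100];
test_data = [1 2 3; 4 5 6; 7 8 9];

% Statistics come from the training data only.
minimums = min(data, [], 1);
ranges   = max(data, [], 1) - minimums;

% Columns that vary in the training set; constant ones are dropped,
% loosely mirroring the "skip single-valued attribute" branch.
keep = ranges > 0;

% Assumed helper: scale only the kept columns to [0, 1] (training range).
scale = @(X) bsxfun(@rdivide, bsxfun(@minus, X(:, keep), minimums(keep)), ranges(keep));

data_scaled = scale(data);       % 3-by-2, no NaN values
test_scaled = scale(test_data);  % 3-by-2, scaled with training statistics
```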

Source: https://stackoverflow.com/questions/10055396/scaling-the-testing-data-for-libsvm-matlab-implementation

Tags: matlab, testing, input, scaling, libsvm