Weka ignoring unlabeled data

问题

I am working on an NLP classification project using Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, hence working with unlabeled data. When I test the model obtained from my labeled training data on an independent set of unlabeled test data, Weka ignores all the unlabeled instances. Can anybody please guide me how to solve this? Someone has already asked this question here before but there wasn't any appropriate solution provided. Here is a sample test file:

@relation referents
@attribute feature1      NUMERIC
@attribute feature2      NUMERIC
@attribute feature3      NUMERIC
@attribute feature4      NUMERIC
@attribute class{1 -1}
@data
1, 7, 1, 0, ?
1, 5, 1, 0, ?
-1, 1, 1, 0, ?
1, 1, 1, 1, ?
-1, 1, 1, 1, ?

回答1:

The problem is that when you specify a training set -t train.arff and a test set test.arff, the mode of operation is to calculate the performance of the model based on the test set. But you can't calculate a performance of any kind without knowing the actual class. Without the actual class, how will you know if your prediction if right or wrong?

I used the data you gave as train.arff and as test.arff with arbitrary class labels assigned by me. The relevant output lines are:

=== Error on training data ===

Correctly Classified Instances           4               80      %
Incorrectly Classified Instances         1               20      %
Kappa statistic                          0.6154
Mean absolute error                      0.2429
Root mean squared error                  0.4016
Relative absolute error                 50.0043 %
Root relative squared error             81.8358 %
Total Number of Instances                5     


=== Confusion Matrix ===

 a b   <-- classified as
 2 1 | a = 1
 0 2 | b = -1

and

=== Error on test data ===

Total Number of Instances                0     
Ignored Class Unknown Instances                  5     


=== Confusion Matrix ===

 a b   <-- classified as
 0 0 | a = 1
 0 0 | b = -1

Weka can give you those statistics for the training set, because it knows the actual class labels and the predicted ones (applying the model on the training set). For the test set, it can't get any information about the performance, because it doesn't know about the true class labels.

What you might want to do is:

java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff -p 1-4

which in my case would give you:

=== Predictions on test data ===

 inst#     actual  predicted error prediction (feature1,feature2,feature3,feature4)
     1        1:?        1:1       1 (1,7,1,0)
     2        1:?        1:1       1 (1,5,1,0)
     3        1:?       2:-1       0.786 (-1,1,1,0)
     4        1:?       2:-1       0.861 (1,1,1,1)
     5        1:?       2:-1       0.861 (-1,1,1,1)

So, you can get the predictions, but you can't get a performance, because you have unlabeled test data.

来源：https://stackoverflow.com/questions/16432121/weka-ignoring-unlabeled-data

标签

nlp

classification

weka

arff