Weka ignoring unlabeled data

我在风中等你 2020-12-21 09:42

I am working on an NLP classification project using the Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, so I am working with unlabeled data. When

1 answer
  • 2020-12-21 10:17

    The problem is that when you specify a training set with -t train.arff and a test set with -T test.arff, Weka's mode of operation is to measure the performance of the model on the test set. But no performance measure can be computed without knowing the actual class. Without the actual class, how will you know if your prediction is right or wrong?

    I used the data you gave as train.arff and as test.arff with arbitrary class labels assigned by me. The relevant output lines are:

    === Error on training data ===
    
    Correctly Classified Instances           4               80      %
    Incorrectly Classified Instances         1               20      %
    Kappa statistic                          0.6154
    Mean absolute error                      0.2429
    Root mean squared error                  0.4016
    Relative absolute error                 50.0043 %
    Root relative squared error             81.8358 %
    Total Number of Instances                5     
    
    
    === Confusion Matrix ===
    
     a b   <-- classified as
     2 1 | a = 1
     0 2 | b = -1
    

    and

    === Error on test data ===
    
    Total Number of Instances                0     
    Ignored Class Unknown Instances                  5     
    
    
    === Confusion Matrix ===
    
     a b   <-- classified as
     0 0 | a = 1
     0 0 | b = -1
    

    Weka can give you those statistics for the training set, because it knows the actual class labels and the predicted ones (applying the model on the training set). For the test set, it can't get any information about the performance, because it doesn't know about the true class labels.
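
    The empty test block can be illustrated with a few lines of Python. This is only a sketch of the bookkeeping, not Weka's actual code: the function `evaluate` and the label lists are hypothetical, chosen so that the labeled case reproduces the 4 correct / 1 incorrect counts from the training output.

```python
from typing import Optional, Sequence


def evaluate(actual: Sequence[Optional[str]], predicted: Sequence[str]) -> dict:
    """Compare predictions with true labels, skipping instances whose class is unknown (None)."""
    correct = incorrect = ignored = 0
    for truth, guess in zip(actual, predicted):
        if truth is None:
            ignored += 1          # nothing to compare against
        elif truth == guess:
            correct += 1
        else:
            incorrect += 1
    total = correct + incorrect   # only labeled instances count
    accuracy = correct / total if total else None
    return {
        "total": total,
        "ignored": ignored,
        "correct": correct,
        "incorrect": incorrect,
        "accuracy": accuracy,
    }


# Hypothetical labels matching the training confusion matrix (2+1 for class 1, 2 for class -1).
predicted = ["1", "1", "-1", "-1", "-1"]
labeled = evaluate(["1", "1", "1", "-1", "-1"], predicted)

# The same predictions, but every true class is "?" (unknown).
unlabeled = evaluate([None] * 5, predicted)

print(labeled)
print(unlabeled)
```

    With labels present there are five instances to score; with every class unknown, all five are ignored and no accuracy can be computed.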

    What you might want to do is:

    java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff -p 1-4
    

    which in my case would give you:

    === Predictions on test data ===
    
     inst#     actual  predicted error prediction (feature1,feature2,feature3,feature4)
         1        1:?        1:1       1 (1,7,1,0)
         2        1:?        1:1       1 (1,5,1,0)
         3        1:?       2:-1       0.786 (-1,1,1,0)
         4        1:?       2:-1       0.861 (1,1,1,1)
         5        1:?       2:-1       0.861 (-1,1,1,1)
    

    So you can get the predictions, but you can't measure performance, because your test data is unlabeled.
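
    If you want to work with these predictions programmatically, the text output can be parsed. The following is a minimal sketch that assumes the column layout shown above. The names `parse_predictions` and `ROW`, and the 0.9 confidence threshold, are hypothetical and not part of Weka. The optional `+` in the pattern is an assumption about how the error column is marked when the actual class is known.

```python
import re

# Sample text copied from the "Predictions on test data" output above.
WEKA_OUTPUT = """\
=== Predictions on test data ===

 inst#     actual  predicted error prediction (feature1,feature2,feature3,feature4)
     1        1:?        1:1       1 (1,7,1,0)
     2        1:?        1:1       1 (1,5,1,0)
     3        1:?       2:-1       0.786 (-1,1,1,0)
     4        1:?       2:-1       0.861 (1,1,1,1)
     5        1:?       2:-1       0.861 (-1,1,1,1)
"""

# inst#, index:actual, index:predicted, optional "+" error mark, probability, (attributes)
ROW = re.compile(
    r"^\s*(\d+)\s+\d+:(\S+)\s+\d+:(\S+)\s+(?:\+\s+)?([\d.]+)\s*(?:\((.*)\))?\s*$"
)


def parse_predictions(text):
    """Return one dict per prediction row; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        inst, actual, predicted, confidence, attrs = match.groups()
        rows.append({
            "inst": int(inst),
            "actual": None if actual == "?" else actual,  # "?" means unlabeled
            "predicted": predicted,
            "confidence": float(confidence),
            "attributes": attrs.split(",") if attrs else [],
        })
    return rows


preds = parse_predictions(WEKA_OUTPUT)

# Hypothetical threshold: keep only predictions the model is very sure about.
confident = [p for p in preds if p["confidence"] >= 0.9]

for p in confident:
    print(p["inst"], p["predicted"], p["confidence"])
```

    Filtering by the reported probability is one possible way to decide which predicted labels to trust in a semi-supervised setup; the threshold would need tuning for real data.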
