_joint_log_likelihood gives me wrong values

Submitted by 旧时模样 on 2020-08-26 04:55:47

Question


I have code like this:

x_train=data['TOKEN'].loc[:2]
y=data['label'].loc[:2]
x_test=data['TOKEN'].loc[3:]

which gives 3 training documents, one for each class (-1, 0, 1), and 1 test document.
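
For reference, here is a minimal reconstruction of input data that is consistent with the TF-IDF matrices below; the actual token strings are my assumption, not the original dataset:

import pandas as pd

# hypothetical documents chosen so that the vocabulary is a..f and the
# document frequencies match the TF-IDF matrices shown below
data = pd.DataFrame({
    'TOKEN': ['d e f', 'c f', 'a b c f', 'c f'],
    'label': [-1, 0, 1, None],   # the last row is the unlabeled test document
})
x_train = data['TOKEN'].loc[:2]
y = data['label'].loc[:2]
x_test = data['TOKEN'].loc[3:]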

#TFIDF training
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(smooth_idf=False, norm=None)
x_tfidf_train = tfidf.fit_transform(x_train)
tfidfframe_train = pd.DataFrame(x_tfidf_train.toarray(), columns=tfidf.get_feature_names())
# the output of tfidfframe_train:
    a       b       c       d       e       f
0   0.0     0.0     0.0     1.477   1.477   1.0   -> class -1, training doc 1
1   0.0     0.0     1.176   0.0     0.0     1.0   -> class 0,  training doc 2
2   1.477   1.477   1.176   0.0     0.0     1.0   -> class 1,  training doc 3
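
These idf weights can be checked by hand. With smooth_idf=False and norm=None, the unnormalized weight of term t in document d is tf(t, d) * (log(n_docs / df(t)) + 1). The numbers above are consistent with base-10 logs (scikit-learn's TfidfVectorizer itself uses the natural log internally, so raw library output would differ):

import numpy as np

n_docs = 3
print(1 + np.log10(n_docs / 1))  # a, b, d, e (appear in 1 doc)  -> 1.477
print(1 + np.log10(n_docs / 2))  # c (appears in 2 docs)         -> 1.176
print(1 + np.log10(n_docs / 3))  # f (appears in all 3 docs)     -> 1.0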

#TFIDF testing
x_tfidf_test = tfidf.transform(x_test)
tfidfframe_test = pd.DataFrame(x_tfidf_test.toarray(), columns=tfidf.get_feature_names())
    a     b    c     d    e    f
0   0.0  0.0  1.17  0.0  0.0  1.0

So now we know that the test document contains the words c and f. I fit the training data to MultinomialNB:

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB(alpha=1.0)
classifier = model.fit(x_tfidf_train, y)
print('class log prior:\n', classifier.class_log_prior_)
# output (base-10 logs): each class prior is log10(1/3) = -0.47712125, which is correct
[-0.47712125 -0.47712125 -0.47712125]
print('Conditional Probabilities:\n', classifier.feature_log_prob_)  # conditional log probabilities log P(w|c)
# output: this is also correct. Each entry is the (base-10) log of the
# smoothed conditional probability P(w|c), computed from the training
# TF-IDF values above.
      a           b           c           d           e           f
[[-0.99800822 -0.99800822 -0.99800822 -0.60406095 -0.60406095 -0.69697822]   -> class -1, doc 1
 [-0.91254573 -0.91254573 -0.57486863 -0.91254573 -0.91254573 -0.61151573]   -> class 0,  doc 2
 [-0.65256092 -0.65256092 -0.70883108 -1.04650819 -1.04650819 -0.74547819]]  -> class 1,  doc 3
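
These values can be reproduced from MultinomialNB's Laplace-smoothed estimate log((N_cw + alpha) / (N_c + alpha * n_features)), where N_cw is the (TF-IDF-weighted) count of word w in class c and N_c is the class total. A minimal check, again in base-10 logs to match the numbers above:

import numpy as np

# training TF-IDF matrix from above; with one document per class,
# each row is also the per-class feature count vector
X = np.array([
    [0.0,   0.0,   0.0,   1.477, 1.477, 1.0],  # class -1
    [0.0,   0.0,   1.176, 0.0,   0.0,   1.0],  # class 0
    [1.477, 1.477, 1.176, 0.0,   0.0,   1.0],  # class 1
])
alpha, n_features = 1.0, X.shape[1]

feature_log_prob = np.log10(
    (X + alpha) / (X.sum(axis=1, keepdims=True) + alpha * n_features)
)
print(feature_log_prob)  # matches the matrix above up to rounding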

Now the problem is when I try to calculate the maximum class log posterior of the test data. It should be log P(c) + Σ log P(w|c), which sklearn computes in _joint_log_likelihood.

So we can calculate that manually for the test document, whose words are [c f]:

     c             f           log10 P(c)
-0.99800822 + -0.69697822 + -0.47712125 = -2.17210769  -> class -1
-0.57486863 + -0.61151573 + -0.47712125 = -1.66350561  -> class 0
-0.70883108 + -0.74547819 + -0.47712125 = -1.93143052  -> class 1
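
A quick check of this arithmetic:

import numpy as np

log_prior = -0.47712125          # log10(1/3)
# log10 P(w|c) for the test words c and f, one row per class (-1, 0, 1)
log_cond = np.array([
    [-0.99800822, -0.69697822],  # class -1
    [-0.57486863, -0.61151573],  # class 0
    [-0.70883108, -0.74547819],  # class 1
])
print(log_cond.sum(axis=1) + log_prior)
# [-2.17210769 -1.66350561 -1.93143052]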

But when I output it from the library, the values don't match:

jll = classifier._joint_log_likelihood(x_tfidf_test)
# output, one column per class (-1, 0, 1):
   class -1    class 0     class 1
[[-2.34784822 -1.76473496 -2.05624949]]

What is wrong in MultinomialNB's _joint_log_likelihood? The source of naive_bayes.py defines it as:

 def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X"""
        return (safe_sparse_dot(X, self.feature_log_prob_.T) +
                self.class_log_prior_)
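
Note that this is a dot product of X with feature_log_prob_.T, so each word's log probability is multiplied by its TF-IDF weight in the test row rather than counted once; here c carries weight 1 + log10(3/2) ≈ 1.176, not 1. A minimal reproduction with the numbers above:

import numpy as np

feature_log_prob = np.array([
    [-0.99800822, -0.99800822, -0.99800822, -0.60406095, -0.60406095, -0.69697822],
    [-0.91254573, -0.91254573, -0.57486863, -0.91254573, -0.91254573, -0.61151573],
    [-0.65256092, -0.65256092, -0.70883108, -1.04650819, -1.04650819, -0.74547819],
])
class_log_prior = np.array([-0.47712125, -0.47712125, -0.47712125])

# test row: TF-IDF weights, not 0/1 indicators
X_test = np.array([[0.0, 0.0, 1.17609126, 0.0, 0.0, 1.0]])

print(X_test @ feature_log_prob.T + class_log_prior)
# [[-2.34784822 -1.76473496 -2.05624949]] -- the library's output above, up to rounding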

Maybe you can review this and tell me what I am missing. Hope you guys can answer it.

Source: https://stackoverflow.com/questions/63184256/joint-log-likelihood-give-me-wrong-values
