Inverse Document Frequency Formula

Submitted by 谁说胖子不能爱 on 2020-06-15 07:25:38

Question


I'm having trouble manually calculating tf-idf values. Python's scikit-learn keeps producing different values than I'd expect.

I keep reading that

idf(term) = log(# of docs / # of docs with term)

If so, won't you get a divide by zero error if there are no docs with the term?

To solve that problem, I read that you do

log(# of docs / (# of docs with term + 1))

But then if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me.
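
To make the two cases concrete, here is a minimal Python sketch. The function name and the corpus size of 4 are made up purely for illustration; it only reproduces the arithmetic described above and is not what scikit-learn does.

```python
import math

def idf_denominator_only(n_docs, n_docs_with_term):
    # Hypothetical helper: adds 1 to the denominator only,
    # i.e. log(# of docs / (# of docs with term + 1)).
    return math.log(n_docs / (n_docs_with_term + 1))

n = 4  # assumed corpus size, for illustration only

# Term appears in no document: no division by zero any more.
print(idf_denominator_only(n, 0))

# Term appears in every document: log(4 / 5), which is below zero.
print(idf_denominator_only(n, n))
```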

What am I not getting?


Answer 1:


The trick you describe is called Laplace smoothing (also known as additive or add-one smoothing), and it is supposed to add the same summand to the other part of the fraction too: to the numerator as well as the denominator.

In other words, you should add 1 to the total number of docs:

log((# of docs + 1) / (# of docs with term + 1))
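
A minimal Python sketch of this add-one variant, assuming natural logarithms (the function name and the corpus size of 4 are hypothetical, chosen just for the example):

```python
import math

def smoothed_idf(n_docs, n_docs_with_term):
    # Add 1 to both the numerator and the denominator.
    return math.log((n_docs + 1) / (n_docs_with_term + 1))

n = 4  # assumed corpus size, for illustration only

# Term in no document: well defined, equals log(n + 1).
print(smoothed_idf(n, 0))

# Term in every document: log(5 / 5) = 0.0, so never negative.
print(smoothed_idf(n, n))
```

As far as I know, scikit-learn's TfidfTransformer with its default smooth_idf=True uses this same ratio but then adds 1 to the logarithm and, by default, L2-normalizes each document vector, which may be one reason hand-computed values differ from its output.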

Btw, it is often better to use a smaller summand, especially for a small corpus:

log((# of docs + a) / (# of docs with term + a)),

where a = 0.001 or something like that.
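
A rough sketch of the generalized form, with the summand exposed as a parameter. The name additive_idf and the example counts are assumptions for illustration; a = 0.001 is the value suggested above.

```python
import math

def additive_idf(n_docs, n_docs_with_term, a=0.001):
    # Add the same small summand `a` to numerator and denominator.
    return math.log((n_docs + a) / (n_docs_with_term + a))

n = 4  # assumed corpus size, for illustration only

# With a small `a`, a term found in half the docs stays close to
# the unsmoothed value log(4 / 2).
print(additive_idf(n, 2), math.log(n / 2))

# A term found in every document still gets exactly zero.
print(additive_idf(n, n))

# A term found in no document gets a finite but large value.
print(additive_idf(n, 0))
```

Note that the smaller a is, the larger the value for a term that appears in no document, so the choice of a is a trade-off rather than a free improvement.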



Source: https://stackoverflow.com/questions/32279651/inverse-document-frequency-formula
