Does PostgreSQL use tf-idf?

Posted by 家住魔仙堡 on 2019-12-12 11:12:19

Question


I would like to know whether full text search in PostgreSQL 9.3 with a GIN/GiST index uses tf-idf (term frequency-inverse document frequency).

In particular, my columns of phrases contain some words that are very common and others that are quite unique (e.g., names). I want to index these columns so that matches on the unique words are weighted higher than matches on common words.


Answer 1:


No. The ts_rank function has no native way to rank results by a term's global (corpus) frequency. The ranking algorithm does, however, take into account how often a term occurs within the document:

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html

So if I search for "dog|chihuahua", the following two documents get the same rank, even though "chihuahua" is the rarer word:

"I want a dog"
"I want a chihuahua"

However, the following document would be ranked higher than the previous two, because it contains the stemmed token "dog" twice:

"dog lovers have an average of 1.5 dogs"

In short: higher term frequency within the document results in a higher rank, but a lower term frequency in the corpus has no impact.
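
To make the difference concrete, here is a minimal Python sketch that scores the three example documents twice: once by in-document term frequency only, and once with an added corpus-level idf weight. This is not the formula ts_rank actually uses; the tokenizer, the toy "strip a trailing s" stemmer, and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

DOCS = [
    "I want a dog",
    "I want a chihuahua",
    "dog lovers have an average of 1.5 dogs",
]
QUERY = ["dog", "chihuahua"]


def tokens(text):
    """Lowercase, keep alphabetic words, strip a trailing 's' (toy stemmer, an assumption)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def tf_score(doc, query):
    """Score by in-document term frequency only (no corpus statistics)."""
    counts = Counter(tokens(doc))
    return sum(counts[t] for t in query)


def idf(term, docs):
    """Inverse document frequency: terms found in fewer documents weigh more."""
    df = sum(1 for d in docs if term in tokens(d))
    return math.log(len(docs) / df) if df else 0.0


def tfidf_score(doc, query, docs):
    """Score by term frequency weighted by corpus-level idf."""
    counts = Counter(tokens(doc))
    return sum(counts[t] * idf(t, docs) for t in query)


TF_SCORES = [tf_score(d, QUERY) for d in DOCS]
TFIDF_SCORES = [tfidf_score(d, QUERY, DOCS) for d in DOCS]

if __name__ == "__main__":
    for doc, tf, tfidf in zip(DOCS, TF_SCORES, TFIDF_SCORES):
        print(f"{doc!r}: tf={tf}, tf-idf={tfidf:.3f}")
```

With frequency-only scoring, the first two documents tie and the third scores highest, which mirrors the behaviour described above. With the idf weight added, the "chihuahua" document moves ahead of the others because that term occurs in only one document of this tiny corpus.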

One caveat: text search ignores stop words, so you will not match on very high-frequency words like "the", "a", "of", "for", etc. (assuming you have set your language correctly).




Answer 2:


No, Postgres does not use TF-IDF as a similarity measure among documents.

ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.

ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.

There is an extension from the text search creators, called smlar, that lets you calculate the similarity between arrays using TF-IDF. It also lets you turn tsvectors into arrays, and it supports fast indexing.
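
As a rough illustration of what TF-IDF similarity between arrays means, the following Python sketch computes a TF-IDF-weighted cosine similarity between two arrays of terms. It does not use smlar's API or reproduce its exact formula; the function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Map each term to log(N / df), where df counts the arrays containing it."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(doc, idf):
    """Weight each term's count in the array by its idf (unknown terms get 0)."""
    return {term: count * idf.get(term, 0.0) for term, count in Counter(doc).items()}


def tfidf_cosine(a, b, corpus):
    """Cosine similarity between the TF-IDF vectors of arrays a and b."""
    idf = idf_table(corpus)
    va, vb = tfidf_vector(a, idf), tfidf_vector(b, idf)
    dot = sum(weight * vb.get(term, 0.0) for term, weight in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical data: "smith" is common in the corpus, "zaphod" is rare.
A = ["zaphod", "smith"]
B = ["zaphod", "jones"]
C = ["arthur", "smith"]
CORPUS = [A, B, C, ["ford", "smith"], ["trillian", "smith"]]

SIM_RARE = tfidf_cosine(A, B, CORPUS)    # the arrays share the rare term
SIM_COMMON = tfidf_cosine(A, C, CORPUS)  # the arrays share the common term

if __name__ == "__main__":
    print(f"shared rare term:   {SIM_RARE:.3f}")
    print(f"shared common term: {SIM_COMMON:.3f}")
```

In this sample, two arrays that share the rare term come out as more similar than two arrays that share only the common one, which is the weighting the question asks for.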




Answer 3:


Mostly. The details are described at http://www.postgresql.org/docs/9.1/static/textsearch-controls.html

The basic problem is that the term frequency is not really something based on the corpus you are indexing but rather set in the dictionary. So it looks to me like, as long as you properly select a language, you should be ok.



Source: https://stackoverflow.com/questions/18296444/does-postgresql-use-tf-idf
