Getting total term frequency throughout entire index (Elasticsearch)

Posted by ↘锁芯ラ on 2019-12-03 11:40:47

Question


I am trying to calculate the total number of times a particular term occurs across an entire index (its collection frequency). I tried term vectors, but that API is tied to a single document. Even for terms that occur in the specified document, the counts in the response seem to max out at a certain doc_count (within field_statistics), which makes me doubt their accuracy.

Request:

http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true

The document id being used here is "AVmk-ky6XMskTDwIwpih", although the term statistics should not be specific to a document.

Response:

This is what I get for the term "cancer" for one of the fields:

 "cancer" : {
      "doc_freq" : 5297,
      "ttf" : 10587,
      "term_freq" : 1,
      "tokens" : [
        {
          "position" : 15,
          "start_offset" : 115,
          "end_offset" : 121
        }
      ]
    },

If I total the ttf for all fields, I get 18915. However, the actual total term frequency for "cancer" is in fact 542829. This leads me to believe that it is limiting the term_vector stats to a subset of documents within the index.
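
For reference, the per-field totalling described above can be sketched in a few lines of Python. This is only an illustration: the helper name total_ttf and the field name "brief_title" are made up for the example, and the sample dictionary merely mirrors the shape of a _termvectors response using the numbers quoted above. The sum can only be as complete as the statistics Elasticsearch returned.

```python
from typing import Any, Dict


def total_ttf(response: Dict[str, Any], term: str) -> int:
    """Sum the `ttf` reported for `term` across every field of a
    _termvectors response. Fields that lack the term are skipped."""
    total = 0
    for field_data in response.get("term_vectors", {}).values():
        stats = field_data.get("terms", {}).get(term)
        if stats is not None:
            # `ttf` is only present when term_statistics=true was requested.
            total += stats.get("ttf", 0)
    return total


# Hypothetical response fragment: "brief_title" is an assumed field name,
# the numbers are the ones shown in the response above.
sample = {
    "term_vectors": {
        "brief_title": {
            "terms": {
                "cancer": {"doc_freq": 5297, "ttf": 10587, "term_freq": 1}
            }
        }
    }
}

print(total_ttf(sample, "cancer"))
```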

Any advice here would be greatly appreciated.


Answer 1:


The counts differ because term vector statistics are not accurate unless the index has a single shard. In an index with multiple shards the documents are distributed across the shards, and the statistics are retrieved from one shard only: the shard that holds the requested document (a shard is picked at random only when term vectors are requested for an artificial document). The frequency returned is therefore a per-shard figure, not the index-wide total.

Thus, the returned frequency is only a relative measure and not the absolute value you expect; see the Behaviour section of the term vectors documentation. To test this, you can create a single-shard index and request the frequency there (it should give you the actual total).
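
A rough sketch of the single-shard test, assuming the host from the question and a hypothetical index name clinicaltrials_single. The number of shards is fixed when an index is created, so the documents would have to be indexed again into the new index. The code only builds the request; actually sending it is left commented out.

```python
import json
import urllib.request


def single_shard_index_request(base_url: str, index: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that creates `index`
    with exactly one primary shard."""
    body = {"settings": {"index": {"number_of_shards": 1}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# "clinicaltrials_single" is a made-up index name for illustration.
req = single_shard_index_request("http://myip:9200", "clinicaltrials_single")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually create the index:
# urllib.request.urlopen(req)
```

Once the documents are in the single-shard index, the same _termvectors request can be repeated against it and the ttf values compared.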




Answer 2:


I believe you need to set term_statistics to true, as per the Elasticsearch documentation:

Term statistics

Setting term_statistics to true (default is false) will return

total term frequency (how often a term occurs in all documents)

document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.
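
As an illustration, here is one way to build such a request URL in Python. The helper termvectors_url is hypothetical, and the index, type and document id are the ones from the question; term_statistics and fields are passed as query-string parameters of the _termvectors endpoint.

```python
from typing import Optional, Sequence
from urllib.parse import quote, urlencode


def termvectors_url(base_url: str, index: str, doc_type: str, doc_id: str,
                    fields: Optional[Sequence[str]] = None) -> str:
    """Build a _termvectors URL with term statistics switched on.
    `fields` optionally restricts the response to the named fields."""
    params = {"term_statistics": "true"}
    if fields:
        params["fields"] = ",".join(fields)
    # Percent-encode each path segment so unusual ids cannot break the URL.
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    # Keep commas literal so the field list stays readable.
    return f"{base_url}/{path}/_termvectors?{urlencode(params, safe=',')}"


url = termvectors_url("http://myip:9200", "clinicaltrials", "trial",
                      "AVmk-ky6XMskTDwIwpih")
print(url)
```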



Source: https://stackoverflow.com/questions/41711305/getting-total-term-frequency-throughout-entire-index-elasticsearch
