Get the three highest values in a TreeMap

守給你的承諾、 submitted on 2019-12-05 15:33:26

I would count the frequencies with a hash map, then loop over all of the entries once, selecting the top 3. You minimize comparisons this way, and never have to sort. Use a selection algorithm.
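A minimal sketch of the counting step, assuming the input is whitespace-separated text. The class name `WordCounts` and the method `count` are made up for illustration and are not from the original question.

```java
import java.util.HashMap;
import java.util.Map;

class WordCounts {
    // Counts how often each whitespace-separated token occurs in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // merge() inserts 1 for a new key, otherwise adds 1 to the old count.
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("a b a c b a"));
    }
}
```

A plain `HashMap` is enough here: nothing in the counting step needs the keys to be sorted, which is what a `TreeMap` would add.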

Edit: the Wikipedia page details many different implementations of the selection algorithm. To be specific, just use a bounded priority queue with its size set to 3. Don't get fancy and implement the queue as a heap or anything; just use an array.
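One way the array-backed bounded queue could look, as a sketch rather than a definitive implementation. The names `TopThree` and `topKeys` are hypothetical, and it assumes the counts are already in a `Map<String, Integer>`. The array is kept sorted in descending order, so each map entry costs at most three comparisons.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopThree {
    static final int K = 3;

    // Returns the keys with the (up to) three largest counts, largest first.
    static List<String> topKeys(Map<String, Integer> freq) {
        String[] keys = new String[K];
        int[] counts = new int[K];
        int size = 0;

        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int c = e.getValue();
            // Once the array is full, skip anything that cannot beat its smallest entry.
            if (size == K && c <= counts[K - 1]) {
                continue;
            }
            // Start at the first free slot, or overwrite the smallest entry when full,
            // then shift smaller entries down until the right position is found.
            int i = (size < K) ? size : K - 1;
            while (i > 0 && counts[i - 1] < c) {
                keys[i] = keys[i - 1];
                counts[i] = counts[i - 1];
                i--;
            }
            keys[i] = e.getKey();
            counts[i] = c;
            if (size < K) {
                size++;
            }
        }

        List<String> result = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            result.add(keys[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("a", 5);
        freq.put("b", 2);
        freq.put("c", 9);
        freq.put("d", 7);
        System.out.println(topKeys(freq)); // prints [c, d, a]
    }
}
```

Ties are broken by whichever entry the map happens to yield first, so with equal counts the result depends on iteration order.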

If you really want a scalable and lightning-fast solution, take a look at Lucene, as this kind of thing is something it does before getting out of bed in the morning. All you'd have to do is index a single document with all your text and then retrieve the top-ranking terms. There's a piece of code somewhere to find the top-ranking terms, involving a PriorityQueue. I've got a copy in Clojure; even if you don't know the language, you can glean the relevant API calls from it (or at least search for them and find the Java version):

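;; Returns the n terms with the highest document frequency in field f.
;; Assumes `searcher` is an IndexSearcher defined elsewhere.
(defn top-terms [n]
  (let [f "field-name"
        ;; Term enumeration positioned at the first term of field f.
        tenum (-> ^IndexSearcher searcher .getIndexReader (.terms (Term. f)))
        ;; Queue elements are [doc-freq term] vectors, ordered by doc-freq.
        q (proxy [org.apache.lucene.util.PriorityQueue] []
            (lessThan [a b] (< (a 0) (b 0))))]
    ;; `initialize` is not public, so call it reflectively to set the capacity to n.
    (-> org.apache.lucene.util.PriorityQueue
        (.getDeclaredMethod "initialize" (into-array [Integer/TYPE]))
        (doto (.setAccessible true)) (.invoke q (into-array [(Integer/valueOf n)])))
    ;; Walk the terms of field f; the bounded queue drops its least element on overflow.
    (loop [] (when (= (-> tenum .term .field) f)
               (.insertWithOverflow q [(.docFreq tenum) (.term tenum)])
               (when (.next tenum) (recur))))
    ;; Pop smallest-first and prepend, so the result is ordered largest-first.
    (loop [terms nil] (if (> (.size q) 0) (recur (conj terms (.pop q))) terms))))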
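
For readers who don't know Clojure, here is a rough Java sketch of the same bounded-queue idea. It deliberately uses `java.util.PriorityQueue` instead of Lucene's queue, so it is not the Lucene version mentioned above. The names `BoundedTopN` and `topN` are made up for illustration, and the input is assumed to be a plain map of counts rather than a Lucene index.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class BoundedTopN {
    // Returns the keys with the n largest counts, largest first.
    static List<String> topN(Map<String, Integer> freq, int n) {
        LinkedList<String> result = new LinkedList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap on the count: the head is always the smallest entry kept so far.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<Map.Entry<String, Integer>>(
                n, (a, b) -> Integer.compare(a.getValue(), b.getValue()));

        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (heap.size() < n) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                // Emulates insert-with-overflow: evict the smallest, keep the newcomer.
                heap.poll();
                heap.offer(e);
            }
        }
        // Entries come out smallest first, so prepend to get descending order.
        while (!heap.isEmpty()) {
            result.addFirst(heap.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("a", 5);
        freq.put("b", 2);
        freq.put("c", 9);
        freq.put("d", 7);
        System.out.println(topN(freq, 2)); // prints [c, d]
    }
}
```

The heap only pays off when n is large; for n = 3 a small sorted array does the same job with less machinery.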