Extract most important keywords from a set of documents

故里飘歌 2020-12-21 16:00

I have a set of 3000 text documents and I want to extract the top 300 keywords (each could be a single word or multiple words).

I have tried the approaches below:

RAK

4 answers
  •  野趣味 (original poster)
    2020-12-21 16:56

    The easiest and most effective approach is to use a tf-idf implementation to find the most important words. If you have a stop-word list, filter those words out before applying this code. Hope this works for you.

    import java.util.List;
    
    /**
     * Class to calculate TfIdf of term.
     * @author Mubin Shrestha
     */
    public class TfIdf {
    
        /**
         * Calculates the term frequency (tf) of termToCheck in one document.
         * @param totalterms all the words of the document being processed
         * @param termToCheck the term whose tf is to be calculated
         * @return occurrences of termToCheck divided by the document length
         */
        public double tfCalculator(String[] totalterms, String termToCheck) {
            double count = 0;  //to count the overall occurrence of the term termToCheck
            for (String s : totalterms) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                }
            }
            return count / totalterms.length;
        }
    
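        /**
         * Calculates the idf of term termToCheck.
         * @param allTerms the tokenized documents, one String[] per document
         * @param termToCheck the term whose idf is to be calculated
         * @return idf (inverse document frequency) score, or 0 if no document contains the term
         */
        public double idfCalculator(List<String[]> allTerms, String termToCheck) {
            double count = 0;  // number of documents that contain termToCheck
            for (String[] ss : allTerms) {
                for (String s : ss) {
                    if (s.equalsIgnoreCase(termToCheck)) {
                        count++;
                        break;  // count each document at most once
                    }
                }
            }
            if (count == 0) {
                return 0;  // avoid dividing by zero for a term that never occurs
            }
            return 1 + Math.log(allTerms.size() / count);
        }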
    }
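
The class above scores one term at a time, so it still needs a step that ranks every term in the corpus to get a top-N list. Below is a minimal sketch of one way to do that. `KeywordRanker` and `topKeywords` are hypothetical names, not part of the original answer, and summing each term's tf-idf over all documents is an assumption of this sketch (taking the maximum or the mean would also be reasonable). It uses the same tf and idf formulas as `TfIdf`, and assumes the documents are already tokenized and stop words already removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: ranks terms across a corpus by summed tf-idf. */
final class KeywordRanker {

    /**
     * Returns up to topK terms ordered by their tf-idf summed over all documents.
     * Summing is an assumption of this sketch, not something the answer prescribes.
     */
    static List<String> topKeywords(List<String[]> docs, int topK) {
        // Document frequency: number of documents that contain each term.
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String t : doc) {
                seen.add(t.toLowerCase());
            }
            for (String t : seen) {
                df.merge(t, 1, Integer::sum);
            }
        }

        // Accumulate tf * idf per term: tf = count / length, idf = 1 + ln(N / df).
        Map<String, Double> score = new HashMap<>();
        for (String[] doc : docs) {
            if (doc.length == 0) {
                continue;  // an empty document contributes nothing
            }
            Map<String, Integer> counts = new HashMap<>();
            for (String t : doc) {
                counts.merge(t.toLowerCase(), 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.length;
                double idf = 1 + Math.log((double) docs.size() / df.get(e.getKey()));
                score.merge(e.getKey(), tf * idf, Double::sum);
            }
        }

        // Sort by score, highest first; break ties alphabetically for a stable order.
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int c = Double.compare(score.get(b), score.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(topK, terms.size())));
    }

    public static void main(String[] args) {
        // Toy corpus, made up for illustration only.
        List<String[]> docs = Arrays.asList(
            "solar panels convert sunlight".split(" "),
            "solar energy storage".split(" "),
            "wind energy turbines".split(" "));
        System.out.println(topKeywords(docs, 3));
    }
}
```

For the question's case you would call `topKeywords(docs, 300)` on the 3000 tokenized documents. To allow multi-word keywords, one option is to append bigrams or trigrams (for example "solar energy" as a single string) to each document's token array before ranking.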
    
