I have a set of 3,000 text documents and I want to extract the top 300 keywords (each could be a single word or a multi-word phrase).
I have tried the below approaches -
RAK
The easiest and most effective way is to apply a tf-idf implementation to find the most important words. If you have a stop-word list, filter out the stop words before applying this code. Hope this works for you.
import java.util.List;

/**
 * Class to calculate TfIdf of term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : Array of all the words under processing document
     * @param termToCheck : term of which tf is to be calculated.
     * @return tf(term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0; // to count the overall occurrence of the term termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : all the terms of all the documents, one String[] per document
     * @param termToCheck : term of which idf is to be calculated.
     * @return idf(inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0; // number of documents that contain termToCheck
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break; // count each document at most once
                }
            }
        }
        if (count == 0) {
            return 0; // term appears in no document; avoid dividing by zero
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}
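
To get from per-term scores to a ranked keyword list, something along the following lines might work. It is only a sketch: it uses the same tf and idf formulas as the class above, but the class name KeywordRanker, the tiny STOP_WORDS set, the regex tokenizer, and the choice to sum tf-idf over all documents to get one score per term are all my own assumptions, not part of the original code. It also handles single words only; multi-word keywords would need an extra step such as scoring n-grams.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class (not from the answer above); requires Java 9+ for Set.of/List.of.
class KeywordRanker {

    // Assumption: a deliberately tiny stop-word list; swap in a real one for actual use.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "is", "to", "in", "on");

    // Lowercases the text, splits on non-letter characters, and drops stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns up to n terms with the highest tf-idf, summed over all documents.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topKeywords(List<String> documents, int n) {
        List<Map<String, Integer>> counts = new ArrayList<>(); // term counts per document
        List<Integer> lengths = new ArrayList<>();             // token count per document
        Map<String, Integer> docFreq = new HashMap<>();        // documents containing each term

        for (String doc : documents) {
            List<String> tokens = tokenize(doc);
            Map<String, Integer> c = new HashMap<>();
            for (String t : tokens) {
                c.merge(t, 1, Integer::sum);
            }
            for (String t : c.keySet()) {
                docFreq.merge(t, 1, Integer::sum);
            }
            counts.add(c);
            lengths.add(tokens.size());
        }

        Map<String, Double> scores = new HashMap<>();
        for (int i = 0; i < counts.size(); i++) {
            for (Map.Entry<String, Integer> e : counts.get(i).entrySet()) {
                double tf = e.getValue() / (double) lengths.get(i);
                double idf = 1 + Math.log(documents.size() / (double) docFreq.get(e.getKey()));
                scores.merge(e.getKey(), tf * idf, Double::sum);
            }
        }

        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> scores.get(t))
                .reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return terms.subList(0, Math.min(n, terms.size()));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "the cat sat on the mat",
                "the dog chased the cat",
                "quantum computing is the future of computing");
        System.out.println(topKeywords(docs, 2)); // prints [computing, cat]
    }
}
```

For the 3,000-document case you would pass the document texts as the list and call topKeywords(docs, 300). Counting term and document frequencies once per document, as done here, avoids rescanning the whole corpus for every term, which is what repeated calls to tfCalculator and idfCalculator would do.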