A relatively simple approach is the rapidrake-java library, which implements the RAKE algorithm on top of OpenNLP models.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;

import io.github.crew102.rapidrake.RakeAlgorithm;
import io.github.crew102.rapidrake.model.RakeParams;
import io.github.crew102.rapidrake.model.Result;

public class KeywordExtractor {

    // Characters that split the text into candidate phrases
    private static String delims = "[-,.?():;\"!/]";
    // Paths to the OpenNLP POS tagging and sentence detection models
    private static String posUrl = "model-bin/en-pos-maxent.bin";
    private static String sentUrl = "model-bin/en-sent.bin";

    public static void main(String[] args) throws IOException {
        String[] stopWords;
        // Read one stop word per line; try-with-resources closes the stream
        try (InputStream stream = new FileInputStream("res/stopwords-terrier.txt")) {
            stopWords = IOUtils.readLines(stream, "UTF-8").toArray(new String[0]);
        }
        // POS tags to exclude from keywords (here: past-tense verbs)
        String[] stopPOS = {"VBD"};
        // Arguments: stop words, stop POS tags, min word length, stemming flag, phrase delimiters
        RakeParams params = new RakeParams(stopWords, stopPOS, 0, true, delims);
        RakeAlgorithm rakeAlg = new RakeAlgorithm(params, posUrl, sentUrl);
        Result aRes = rakeAlg.rake("I'm looking for a Java library to extract keywords from a block of text.");
        System.out.println(aRes);
        // OUTPUT:
        // [looking (1), java library (4), extract keywords (4), block (1), text (1)]
    }
}
As the sample output shows, each extracted keyword is printed together with its score.
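To make the numbers in that output less mysterious, here is a rough sketch of the scoring step as described for the original RAKE algorithm: each word gets degree / frequency, and a phrase gets the sum of its word scores. This is not rapidrake-java's own code (the library also does stemming and POS filtering, which are skipped here); the class name `RakeScoreSketch` and the pre-split candidate phrases are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of rapidrake-java.
class RakeScoreSketch {

    // Scores candidate phrases that have already been split out of the text:
    // word score = degree / frequency, phrase score = sum of its word scores.
    static Map<String, Double> score(List<String> phrases) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        Map<String, Integer> degree = new LinkedHashMap<>();
        for (String phrase : phrases) {
            String[] words = phrase.trim().split("\\s+");
            for (String w : words) {
                freq.merge(w, 1, Integer::sum);
                // Degree counts every word in the phrase, including w itself.
                degree.merge(w, words.length, Integer::sum);
            }
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String phrase : phrases) {
            double total = 0.0;
            for (String w : phrase.trim().split("\\s+")) {
                total += (double) degree.get(w) / freq.get(w);
            }
            scores.put(phrase, total);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Candidate phrases taken from the sample output above.
        List<String> candidates =
            List.of("looking", "java library", "extract keywords", "block", "text");
        System.out.println(score(candidates));
        // prints {looking=1.0, java library=4.0, extract keywords=4.0, block=1.0, text=1.0}
    }
}
```

So a two-word phrase whose words appear nowhere else scores 2 + 2 = 4, while a single word on its own scores 1, which matches the sample output.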
As explained at https://github.com/crew102/rapidrake-java, you need to download the files en-pos-maxent.bin and en-sent.bin from the OpenNLP download page. Put them into a model-bin folder in your project root (it must be a sibling of your src folder if you use the Maven project structure). The stopwords file can be taken, for example, from https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt; the code above expects it at res/stopwords-terrier.txt.
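If you want a clearer error than a stack trace when one of those files is in the wrong place, a small pre-flight check along these lines may help. It only uses the standard library; the class name `ResourceCheck` and the method `missing` are made up for this sketch, and the paths are the ones used in the code above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of rapidrake-java.
class ResourceCheck {

    // Returns the relative paths that do not exist as regular files under root.
    static List<String> missing(Path root, String... relativePaths) {
        List<String> notFound = new ArrayList<>();
        for (String rel : relativePaths) {
            if (!Files.isRegularFile(root.resolve(rel))) {
                notFound.add(rel);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Check relative to the working directory, i.e. the project root.
        List<String> notFound = missing(Paths.get(""),
            "model-bin/en-pos-maxent.bin",
            "model-bin/en-sent.bin",
            "res/stopwords-terrier.txt");
        if (notFound.isEmpty()) {
            System.out.println("All resources found.");
        } else {
            System.out.println("Missing files: " + notFound);
        }
    }
}
```

Run it from the project root; if anything is listed as missing, the relative paths in KeywordExtractor will not resolve either.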