Java library for keywords extraction from input text [closed]

谎友^ 2020-12-04 18:16

3 answers

悲&欢浪女 (original poster)

2020-12-04 18:48

A relatively simple approach is the RAKE algorithm combined with OpenNLP models, both wrapped by the rapidrake-java library.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;

import io.github.crew102.rapidrake.RakeAlgorithm;
import io.github.crew102.rapidrake.model.RakeParams;
import io.github.crew102.rapidrake.model.Result;

public class KeywordExtractor {

    // Characters that split the text into candidate phrases
    private static final String delims = "[-,.?():;\"!/]";
    // Paths to the OpenNLP POS tagging and sentence detection models
    private static final String posUrl = "model-bin/en-pos-maxent.bin";
    private static final String sentUrl = "model-bin/en-sent.bin";

    public static void main(String[] args) throws IOException {

        // Read the stopword list, one word per line; try-with-resources closes the stream
        String[] stopWords;
        try (InputStream stream = new FileInputStream("res/stopwords-terrier.txt")) {
            stopWords = IOUtils.readLines(stream, "UTF-8").toArray(new String[0]);
        }
        // Part-of-speech tags to treat as stop words (here: past-tense verbs)
        String[] stopPOS = {"VBD"};
        // Arguments: stop words, stop POS tags, minimum word length, stemming on/off, phrase delimiters
        RakeParams params = new RakeParams(stopWords, stopPOS, 0, true, delims);
        RakeAlgorithm rakeAlg = new RakeAlgorithm(params, posUrl, sentUrl);
        Result aRes = rakeAlg.rake("I'm looking for a Java library to extract keywords from a block of text.");
        System.out.println(aRes);
        // OUTPUT:
        // [looking (1), java library (4), extract keywords (4), block (1), text (1)]
    }
}

As the sample output shows, you get the extracted keywords together with their relative weights (the number in parentheses after each keyword).
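The weights in the sample output are consistent with the standard RAKE scoring rule: each word scores degree / frequency, and a phrase scores the sum of its word scores. The following is only a minimal sketch of that rule, not the rapidrake-java implementation; the class name RakeScoreSketch and the method scorePhrases are made up for illustration, and the candidate phrases are typed in by hand rather than produced by stopword and POS filtering.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of rapidrake-java.
class RakeScoreSketch {

    // Scores candidate phrases with the classic RAKE rule:
    // word score = degree(word) / frequency(word),
    // phrase score = sum of the scores of its words.
    static Map<String, Double> scorePhrases(List<String> phrases) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        Map<String, Integer> degree = new LinkedHashMap<>();
        for (String phrase : phrases) {
            String[] words = phrase.split("\\s+");
            for (String w : words) {
                freq.merge(w, 1, Integer::sum);
                // The degree counts every word of the phrase, including w itself.
                degree.merge(w, words.length, Integer::sum);
            }
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String phrase : phrases) {
            double score = 0.0;
            for (String w : phrase.split("\\s+")) {
                score += (double) degree.get(w) / freq.get(w);
            }
            scores.put(phrase, score);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Candidate phrases taken from the sample output above.
        List<String> candidates = Arrays.asList(
                "looking", "java library", "extract keywords", "block", "text");
        System.out.println(scorePhrases(candidates));
        // prints {looking=1.0, java library=4.0, extract keywords=4.0, block=1.0, text=1.0}
    }
}
```

A two-word phrase whose words occur nowhere else gets 2 + 2 = 4, which is why "java library" and "extract keywords" outrank the single words.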

As explained at https://github.com/crew102/rapidrake-java, you need to download the files en-pos-maxent.bin and en-sent.bin from the OpenNLP download page and put them into a model-bin folder in your project root (it must be a sibling of your src folder if you use the Maven project structure). The stopword file can be taken, for example, from https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt; the code above reads it from res/stopwords-terrier.txt.
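Since the models and the stopword list are loaded from relative paths, a misplaced file only shows up as an exception at run time. One possible pre-flight check, using only the JDK, is sketched below; the class name ResourceCheck and the method findMissing are hypothetical, and the paths are the ones used in the code above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of rapidrake-java or OpenNLP.
class ResourceCheck {

    // Returns those of the given paths that are not readable regular files.
    static List<String> findMissing(String... paths) {
        List<String> missing = new ArrayList<>();
        for (String p : paths) {
            Path path = Paths.get(p);
            if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
                missing.add(p);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = findMissing(
                "model-bin/en-pos-maxent.bin",
                "model-bin/en-sent.bin",
                "res/stopwords-terrier.txt");
        if (missing.isEmpty()) {
            System.out.println("All resource files found.");
        } else {
            // Relative paths resolve against the working directory, so print it too.
            System.err.println("Missing files (relative to "
                    + Paths.get("").toAbsolutePath() + "): " + missing);
        }
    }
}
```

Run it from the project root, because that is the directory the relative paths are resolved against when the program is started from there.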
