quicker way to detect n-grams in a string?

I found this solution on SO to detect n-grams in a string: (here: N-gram generation from a sentence)

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

=> this bit of code takes by far the longest processing time (28 seconds for the detection of 1-grams, 2-grams, 3-grams and 4grams for my corpus: 4Mb of raw text), as compared to milliseconds for other operations (removal of stopwords etc.)

Does anybody know of solutions in Java which would go faster than the looping solution presented above? (I was thinking multithreading, use of collections, or maybe creative ways to split a string...?) Thanks!

You could try something like this:

public class NGram {

    private final int n;
    private final String text;

    private final int[] indexes;
    private int index = -1;
    private int found = 0;

    public NGram(String text, int n) {
        this.text = text;
        this.n = n;
        indexes = new int[n];
    }

    private boolean seek() {
        if (index >= text.length()) {
            return false;
        }
        push();
        while(++index < text.length()) {
            if (text.charAt(index) == ' ') {
                found++;
                if (found<n) {
                    push();
                } else {
                    return true;
                }
            }
        }
        return true;
    }

    private void push() {
        for (int i = 0; i < n-1; i++) {
            indexes[i] = indexes[i+1];
        }
        indexes[n-1] = index+1;
    }

    private List<String> list() {
        List<String> ngrams = new ArrayList<String>();
        while (seek()) {
            ngrams.add(get());
        }
        return ngrams;
    }

    private String get() {
        return text.substring(indexes[0], index);
    }
}

Testing on about 5mb of text it seems to perform about 10 times faster than the original code. The main difference here is that regex is not used to split and the ngram strings are not created by concatenation.

Update: This is the output I get when running on the text mentioned above, ngram 1-4. I run with 2GB of memory to decide the impact on GC during the runs. I ran multiple times to see the impact of the hotspot compiler.

Loop 01 Code mine ngram 1 time 071ms ngrams 294121
Loop 01 Code orig ngram 1 time 534ms ngrams 294121
Loop 01 Code mine ngram 2 time 016ms ngrams 294120
Loop 01 Code orig ngram 2 time 360ms ngrams 294120
Loop 01 Code mine ngram 3 time 082ms ngrams 294119
Loop 01 Code orig ngram 3 time 319ms ngrams 294119
Loop 01 Code mine ngram 4 time 014ms ngrams 294118
Loop 01 Code orig ngram 4 time 439ms ngrams 294118

Loop 10 Code mine ngram 1 time 013ms ngrams 294121
Loop 10 Code orig ngram 1 time 268ms ngrams 294121
Loop 10 Code mine ngram 2 time 014ms ngrams 294120
Loop 10 Code orig ngram 2 time 323ms ngrams 294120
Loop 10 Code mine ngram 3 time 013ms ngrams 294119
Loop 10 Code orig ngram 3 time 412ms ngrams 294119
Loop 10 Code mine ngram 4 time 014ms ngrams 294118
Loop 10 Code orig ngram 4 time 423ms ngrams 294118

Running about 5 megs of Lorus Ipsum text thru the code you supplied generally took about a little over 7 seconds to run for detecting 1 - 4 n-grams. I reworked the code to make a list of the longest n-gram and then iterate over this list producing lists of successively shorter ngrams. In testing it took about 2.6 seconds for the same text. Also, it took a lot less memory.

import java.util.*;

public class Test {

    public static List<String> ngrams(int max, String val) {
        List<String> out = new ArrayList<String>(1000);
        String[] words = val.split(" ");
        for (int i = 0; i < words.length - max + 1; i++) {
            out.add(makeString(words, i,  max));
        }
        return out;
    }

    public static String makeString(String[] words, int start, int length) {
        StringBuilder tmp= new StringBuilder(100);
        for (int i = start; i < start + length; i++) {
            tmp.append(words[i]).append(" ");
        }
        return tmp.substring(0, tmp.length() - 1);
    }

    public static List<String> reduceNgrams(List<String> in, int size) {
        if (1 < size) {
            List<String> working = reduceByOne(in);
            in.addAll(working);
            for (int i = size -2 ; i > 0; i--) {
                working = reduceByOne(working);
                in.addAll(working);
            }
        }
        return in;
    }

    public static List<String> reduceByOne(List<String> in) {
        List<String> out = new ArrayList<String>(in.size());
        int end;
        for (String s : in) {
            end = s.lastIndexOf(" ");
            out.add(s.substring(0, -1 == end ? s.length() : end));  
        }
        //the last one will always reduce twice - words 0, n-1 are in the loop this catches the words 1, n
        String s = in.get(in.size() -1);
        out.add(s.substring(s.indexOf(" ")+1));
        return out;
    }

    public static void main(String[] args) {
        long start;
        start = System.currentTimeMillis();
        List<String> ngrams = ngrams(3, "Your text goes here, actual mileage may vary");
        reduceNgrams(ngrams, 3);
        System.out.println(System.currentTimeMillis() - start);
    }
}

来源：https://stackoverflow.com/questions/8701610/quicker-way-to-detect-n-grams-in-a-string

标签

java

nlp

n-gram