Efficient Regular Expression for big data, if a String contains a word

别等时光非礼了梦想. Submitted on 2019-12-13 13:41:50

Question


I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient, because I will search for hundreds of keywords in thousands of documents.

How can I find the keywords efficiently, without falsely matching a longer word that merely contains the keyword?

For example:

String keyword = "ac";
String document = "...";  // a file a few pages long

If I use:

if(document.contains(keyword) ){
//do something
}

It will also return true if the document contains a word like "account".

So I tried to use a regular expression as follows:

String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
   //do something
}

Summary:

Here is a summary; hopefully it will be useful to someone else:

  1. My regular expression would work in principle, but it was extremely impractical for big data (it didn't terminate).
  2. @anubhava perfected the regular expression. It was easy to understand and implement, and it managed to terminate, which is a big thing, but it was still a bit slow (roughly 240 seconds).
  3. @Tomalak's solution is a bit complex to implement and understand, but it was the fastest solution, so hats off, mate (18 seconds).

So @Tomalak's solution was roughly 13 times faster than @anubhava's (about 240 seconds vs. 18 seconds).


Answer 1:


The fastest-possible way to find substrings in Java is to use String.indexOf().

To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:

public class IndexOfWordSample {
    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);

        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }

    public static int indexOfWord(String input, String word) {
        // Matches an empty string (start or end of input) or one non-word character.
        String nonWord = "^\\W?$", before, after;
        int index, before_i, after_i, from = 0;

        // An empty search word never counts as a hit.
        if (word.isEmpty()) return -1;

        while (true) {
            index = input.indexOf(word, from);
            if (index == -1) break;

            before_i = index - 1;
            after_i = index + word.length();
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");

            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
            // Resume one character later so overlapping candidates are not skipped.
            from = index + 1;
        }
        return -1;
    }
}

This would print:

Hit for "long" at position 44.

This should perform better than a pure regular expressions approach.

Consider whether ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.

For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.

I've created a Gist with an alternative implementation that does that.
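
The idea of checking the neighbouring characters with the Character class instead of a regex can be sketched as follows. This is a minimal sketch, not the implementation from the Gist: the class name IndexOfWordChars and the helper isWordChar are hypothetical, and it assumes a "word character" is a letter, a digit, or an underscore. Note that Character.isLetterOrDigit is Unicode-aware, whereas \W only considers ASCII by default, so the two variants can disagree on non-ASCII text.

```java
class IndexOfWordChars {
    // Assumption: a "word character" is a letter, a digit, or an underscore.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Returns the index of the first whole-word occurrence of word, or -1.
    public static int indexOfWord(String input, String word) {
        if (word.isEmpty()) return -1;
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) return -1;

            int end = index + word.length();
            // A hit counts only if it is not glued to a word character on either side.
            boolean startOk = index == 0 || !isWordChar(input.charAt(index - 1));
            boolean endOk = end == input.length() || !isWordChar(input.charAt(end));
            if (startOk && endOk) return index;

            from = index + 1; // partial hit: keep scanning one character later
        }
    }

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        System.out.println(indexOfWord(input, "long"));      // 44
        System.out.println(indexOfWord("my account", "ac")); // -1
    }
}
```

No regex engine is involved per candidate here, which is where the extra speed would be expected to come from when a document contains many partial hits.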




Answer 2:


I don't think you need .* in your regex.

Try this regex:

String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";

Here \\b matches a word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the keyword, or the word boundaries will fail to match.

Also, you must use Pattern.quote if your keyword contains special regex characters.
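
Put together, the word-boundary approach might look like the sketch below. The class name WordBoundaryDemo and the helpers wholeWord and containsWord are hypothetical names for illustration. It relies on Matcher.find() scanning for a match anywhere in the input, which is why no leading or trailing .* is needed, and it compiles each pattern once so it can be reused across documents.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Compile once per keyword and reuse the Pattern for every document.
    static Pattern wholeWord(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
    }

    static boolean containsWord(Pattern p, String document) {
        // find() looks for a match anywhere in the document.
        return p.matcher(document).find();
    }

    public static void main(String[] args) {
        Pattern ac = wholeWord("ac");
        System.out.println(containsWord(ac, "my account is open")); // false
        System.out.println(containsWord(ac, "turn the ac on."));    // true

        // The caveat about special characters: "c++" ends in a non-word
        // character, so the trailing \b cannot match before a space.
        System.out.println(containsWord(wholeWord("c++"), "I like c++ a lot")); // false
    }
}
```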

EDIT: You might use this regex instead if your keywords are always delimited by whitespace.

String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
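
A short sketch of how the lookaround variant behaves, using a keyword that ends in non-word characters. The class name WhitespaceDelimitedDemo and the helper spaceDelimited are hypothetical. Keep in mind that this variant only accepts whitespace or the start/end of the input as delimiters, so a keyword followed directly by punctuation is not matched.

```java
import java.util.regex.Pattern;

class WhitespaceDelimitedDemo {
    // Keyword must be preceded by whitespace or start of input,
    // and followed by whitespace or end of input.
    static Pattern spaceDelimited(String keyword) {
        return Pattern.compile("(?<=\\s|^)" + Pattern.quote(keyword) + "(?=\\s|$)");
    }

    public static void main(String[] args) {
        Pattern cpp = spaceDelimited("c++");
        System.out.println(cpp.matcher("I like c++ a lot").find()); // true
        System.out.println(cpp.matcher("I like c++, too").find());  // false: a comma follows

        System.out.println(spaceDelimited("ac").matcher("my account").find()); // false
    }
}
```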


Source: https://stackoverflow.com/questions/24674318/efficient-regular-expression-for-big-data-if-a-string-contains-a-word
