How to use regular expression while searching in HashSet

问题

I am writing a Java program in which I need to search a particular word from a Set. The word that has to be searched is something like ("wo.d") where '.' can be replaced by any other alphabet. I am using regex to match such type of word cases.

This is what I have so far

HashSet<String> words = new HashSet<String>();//this set is already populated
String word = "t.st";
if(word.contains(".")){
    Pattern p = Pattern.compile(word);
    Matcher m;
    boolean match = false;
    for(String setWord : words){
        m = p.matcher(setWord);
        if(m.matches())
            match = true;
    }
    if(match)
        System.out.println("Its a match");
    else
        System.out.println("Its not a match");
}
else{
    System.out.println("The word does not contain regex do other stuff");
}

The code above works but is not efficient because it is being called many times in a second. So it produces a lag in the program.

回答1:

You need to stop iterating as soon as you get a match, so assuming that you use Java 8, your for loop could be rewritten efficiently as next:

boolean match = words.stream().anyMatch(w -> p.matcher(w).matches());

You could also parallelize the research using parallelStream() instead of stream() especially if your Set has a lot of words.

If you don't use Java 7, it could still be done using FluentIterable from Google Guava but without the ability to parallelize the research unfortunately.

boolean match = FluentIterable.from(words).anyMatch(
    new Predicate<String>() {
        @Override
        public boolean apply(@Nullable final String w) {
            return p.matcher(w).matches();
        }
    }
);

But in your case, I don't believe that using FluentIterable can be more interesting than simply adding a break when you get a match, as it will still be easier to read and maintain

if (p.matcher(setWord).matches()) {
    match = true;
    break;
}

So, if you really need to use a regular expression and you cannot use Java 8, your best option is to use break as described above, there is no magic trick to consider.

Assuming that you will only have one character to replace, it could be done using startsWith(String) and endsWith(String) which will always be much faster than a regular expression. Something like this:

// Your words should be in a TreeSet to be already sorted alphabetically 
// in order to get a match as fast as possible
Set<String> words = new TreeSet<String>(); //this set is already populated
int index = word.indexOf('.');
if (index != -1) {
    String prefix = word.substring(0, index);
    String suffix = word.substring(index + 1);
    boolean match = false;
    for (String setWord : words){
        // From the fastest to the slowest thing to check 
        // to get the best possible performances
        if (setWord.length() == word.length() 
            && setWord.startsWith(prefix) 
            && setWord.endsWith(suffix)) {
            match = true;
            break;
        }
    }
    if(match)
        System.out.println("Its a match");
    else
        System.out.println("Its not a match");
}
else {
    System.out.println("The word does not contain regex do other stuff");
}

回答2:

Use TreeSet instead of HashSet. And test for sub range of the set.

TreeSet<String> words = new TreeSet<>();// this set is already populated
String word = "t.st";
if (word.contains(".")) {
    String from = word.replaceFirst("\\..*", "");
    String to = from + '\uffff';
    Pattern p = Pattern.compile(word);
    Matcher m;
    boolean match = false;
    for (String setWord : words.subSet(from, to)) {
        m = p.matcher(setWord);
        if (m.matches()) {
            match = true;
            break;
        }
    }
    if (match)
        System.out.println("Its a match");
    else
        System.out.println("Its not a match");
} else {
    System.out.println("The word does not contain regex do other stuff");
}

In this case words.subSet(from, to) contains only the words start with "t".

回答3:

Just break out of loop to stop further regex matching of your HashSet as soon as you get a match:

if(m.matches()) {
   match = true;
   break;
}

Full Code:

HashSet<String> words = new HashSet<String>();//this set is already populated
String word = "t.st";
if(word.contains(".")){
    Pattern p = Pattern.compile(word);
    Matcher m;
    boolean match = false;
    for(String setWord : words){
        m = p.matcher(setWord);
        if(m.matches()) {
            match = true;
            break:
        }
    }
    if(match)
        System.out.println("Its a match");
    else
        System.out.println("Its not a match");
}
else{
    System.out.println("The word does not contain regex do other stuff");
}

回答4:

Use original matching method like this.

static boolean match(String wild, String s) {
    int len = wild.length();
    if (len != s.length())
        return false;
    for (int i = 0; i < len; ++i) {
        char w = wild.charAt(i);
        if (w == '.')
            continue;
        else if (w != s.charAt(i))
            return false;
    }
    return true;
}

and

HashSet<String> words = new HashSet<>();// this set is already populated
String word = "t.st";
boolean match = false;
if (word.contains(".")) {
    for (String setWord : words) {
        if (match(word, setWord)) {
            match = true;
            break;
        }
    }
    if (match)
        System.out.println("Its a match");
    else
        System.out.println("Its not a match");
} else {
    System.out.println("The word does not contain regex do other stuff");
}

来源：https://stackoverflow.com/questions/39674216/how-to-use-regular-expression-while-searching-in-hashset

标签

java

regex

set