regex: Match at least two search terms

妖精的绣舞 submitted on 2019-12-23 17:57:54

Question


I have a list of search terms and I would like to have a regex that matches all items that have at least two of them.

Terms: war|army|fighting|rebels|clashes

Match: The war between the rebels and the army resulted in several clashes this week. (4 hits)

Non-Match: In the war on terror, the Obama administration wants to increase the number of drone strikes. (only 1 hit)

Background: I use Tiny Tiny RSS to collect and filter a large number of feeds for a news reporting project. I get 1000 to 2000 feed items per day and would like to filter them by keywords. Using just an | (OR) expression gives me too many false positives, so I figured I could require two matches per feed item.

Thanks!

EDIT:

I know very little about regex, so I have stuck with the simple | (OR) operator so far. I tried putting the search terms in parentheses, (war|fighting|etc){2,}, but that only matches if an item uses the same word twice.

EDIT2: Sorry for the confusion, I'm new to regex and the like. The fact is that the regex queries a MySQL database. It is entered in the tt-rss backend as a filter, which allows only one line (although a theoretically unlimited number of characters). The filter is applied when a feed item is imported into the MySQL database.


Answer 1:


(.*?\b(war|army|fighting|rebels|clashes)\b){2,}
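To see what the lazy .*? buys you, here is a small sketch in Python. Python's re module stands in for whatever engine is really used, which is an assumption, and the helper name matches is made up for illustration. Without the filler, the repeated group only matches when two terms sit directly next to each other; with it, any text may separate them.

```python
import re

TERMS = r"(war|army|fighting|rebels|clashes)"

# Pattern from the question: the group has to repeat back to back.
adjacent_only = re.compile(TERMS + r"{2,}")

# Pattern from this answer: lazy filler is allowed before each term.
with_filler = re.compile(r"(.*?\b" + TERMS + r"\b){2,}")


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None
```

Note that with_filler also accepts "war war", since nothing stops the same term from being counted twice.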

If you need to avoid matching the same term, you can use:

.*?\b(war|army|fighting|rebels|clashes)\b.*?(\b(?!\1)(war|army|fighting|rebels|clashes)\b)

which matches a term as a whole word, then uses the negative lookahead (?!\1) with a backreference to the first term so that the second hit cannot be the same term again.

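In Java:

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiwordDemo {
    public static void main(String[] args) {
        // Group 1 is the first term; (?!\1) rejects the same term the second time.
        Pattern multiword = Pattern.compile(
            ".*?(\\b(war|army|fighting|rebels|clashes)\\b)" +
            ".*?(\\b(?!\\1)(war|army|fighting|rebels|clashes)\\b)"
        );
        for (String str : Arrays.asList(
                "war",
                "war war war",
                "warm farmy people",
                "In the war on terror rebels eating faces"
        )) {
            Matcher m = multiword.matcher(str);
            if (m.find()) {
                System.out.println(str + " : " + m.group(0));
            } else {
                System.out.println(str + " : no match.");
            }
        }
    }
}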

Prints:

war : no match.
war war war : no match.
warm farmy people : no match.
In the war on terror rebels eating faces : In the war on terror rebels



Answer 2:


This isn't (entirely) a job for regular expressions. A better approach is to scan the text for the terms and then count the unique matches.

In Ruby, it would be very simple to branch based on your match count. For example:

terms = /\b(?:war|army|fighting|rebels|clashes)\b/
text = "The war between the rebels and the army resulted in..."

# The real magic happens here.
match = text.scan(terms).uniq

# Do something if your minimum match count is met.
if match.count >= 2
  p match
end

This will print ["war", "rebels", "army"].
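For comparison, the same scan-then-deduplicate idea as a rough Python sketch, checked against the two sample sentences from the question. The function names distinct_hits and is_relevant and the minimum parameter are hypothetical, and matching case-insensitively is my assumption, not something the question asked for.

```python
import re

# Whole-word match on any search term, ignoring case (an assumption).
TERMS = re.compile(r"\b(?:war|army|fighting|rebels|clashes)\b", re.IGNORECASE)


def distinct_hits(text):
    """Return the set of distinct search terms found in the text, lowercased."""
    return {hit.lower() for hit in TERMS.findall(text)}


def is_relevant(text, minimum=2):
    """True if at least `minimum` different terms occur in the text."""
    return len(distinct_hits(text)) >= minimum
```

With the question's examples, the "Match" sentence yields four distinct terms and the "Non-Match" sentence yields only one.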




Answer 3:


Regular expressions could do the trick, but the expression would be quite large.

Remember, classical regular expressions are simple tools (based on finite-state automata) and have no memory for which words were already seen. So such a regex, even though possible, would probably look like one large alternation with a branch for every ordered pair of distinct terms; for five terms that is 20 branches.
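To give an idea of what that lump of or's looks like, here is a sketch in Python that generates it instead of typing it by hand. The function name two_term_pattern is made up for illustration. The result uses only alternation, .* and word boundaries, with no lookahead or backreference; whether a particular engine accepts \b as the word-boundary syntax is an assumption you would have to check.

```python
import re
from itertools import permutations


def two_term_pattern(terms):
    """Build one alternative per ordered pair of distinct terms."""
    alternatives = [
        r"\b{}\b.*\b{}\b".format(re.escape(first), re.escape(second))
        for first, second in permutations(terms, 2)
    ]
    return "|".join(alternatives)


TERMS = ["war", "army", "fighting", "rebels", "clashes"]
PATTERN = two_term_pattern(TERMS)
```

For the five terms this produces 20 alternatives on a single line, so it grows quickly as the term list gets longer.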

I recommend doing the parsing yourself, for instance (in pseudocode):

var searchTerms = set(yourWords);
int found = 0;
foreach (var x in words(input)) {
    if (x in searchTerms) {
        searchTerms.remove(x);
        ++found;
    }
    if (found >= 2) return true;
}
return false;
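The pseudocode above could look roughly like this in Python. The function name has_two_terms, the needed parameter and the simple \w+ tokenizer are my assumptions; a real feed item may need smarter word splitting.

```python
import re


def has_two_terms(text, search_terms, needed=2):
    """Return True as soon as `needed` distinct search terms have been seen."""
    remaining = {term.lower() for term in search_terms}
    found = 0
    # Split into lowercase words; this simple tokenizer is an assumption.
    for word in re.findall(r"\w+", text.lower()):
        if word in remaining:
            remaining.remove(word)  # count each term only once
            found += 1
            if found >= needed:
                return True
    return False
```

Removing a term from the set once it has been seen is what keeps "war war war" from counting as more than one hit, and the early return stops scanning as soon as the threshold is reached.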



Answer 4:


If you want to do it all with a single regex, it's not likely to be easy.

You can however do something like this:

<?php
...
$string = "The war between the rebels and the army resulted in several clashes this week. (4 hits)";


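// Collect every whole-word occurrence of a search term.
preg_match_all('@\b(war|army|fighting|rebels|clashes)\b@', $string, $matches);
// $matches[0] holds the full matches; keep each distinct term once.
$uniqueMatchingWords = array_unique($matches[0]);
if (count($uniqueMatchingWords) >= 2) {
    // At least two different terms were found.
}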


Source: https://stackoverflow.com/questions/10832519/regex-match-at-least-two-search-terms
