Question
I have a list of search terms and I would like to have a regex that matches all items that have at least two of them.
Terms: war|army|fighting|rebels|clashes
Match: The war between the rebels and the army resulted in several clashes this week. (4 hits)
Non-Match: In the war on terror, the Obama administration wants to increase the number of drone strikes. (only 1 hit)
Background: I use Tiny Tiny RSS to collect and filter a large number of feeds for a news reporting project. I get 1000 - 2000 feed items per day and would like to filter them by keywords. Using just a plain | (OR) expression gives me too many false positives, so I figured I could require two matches in a feed item.
Thanks!
EDIT:
I know very little about regex, so I have stuck with the simple | (OR) operator so far. I tried putting the search terms in parentheses, (war|fighting|etc){2,}, but that only matches if an item uses the same word twice.
EDIT2: Sorry for the confusion, I'm new to regex and the like. The fact is that the regex queries a MySQL database. It is entered in the tt-rss backend as a filter, which allows only one line (although theoretically an unlimited number of characters). The filter is applied when a feed item is imported into the MySQL database.
Answer 1:
(.*?\b(war|army|fighting|rebels|clashes)\b){2,}
If you need to avoid matching the same term, you can use:
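.*?\b(war|army|fighting|rebels|clashes)\b.*?(\b(?!\1)(war|army|fighting|rebels|clashes)\b)
This matches one term, then uses a negative lookahead on the backreference \1 so that the second term cannot be the same as the first. The \b on both sides of each group keeps "war" from matching inside a longer word such as "warm".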
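In Java:
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiwordDemo {
    public static void main(String[] args) {
        // Group 1 captures the first term; (?!\1) stops the second term from repeating it.
        Pattern multiword = Pattern.compile(
            ".*?(\\b(war|army|fighting|rebels|clashes)\\b)" +
            ".*?(\\b(?!\\1)(war|army|fighting|rebels|clashes)\\b)"
        );
        for (String str : Arrays.asList(
                "war",
                "war war war",
                "warm farmy people",
                "In the war on terror rebels eating faces")) {
            Matcher m = multiword.matcher(str);
            if (m.find()) {
                System.out.println(str + " : " + m.group(0));
            } else {
                System.out.println(str + " : no match.");
            }
        }
    }
}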
Prints:
war : no match.
war war war : no match.
warm farmy people : no match.
In the war on terror rebels eating faces : In the war on terror rebels
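To make the difference between the two patterns above concrete, here is a minimal sketch that runs both against the same inputs. The class and method names (TwoTermPatterns, anyTwo, twoDistinct) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TwoTermPatterns {
    static final String TERMS = "war|army|fighting|rebels|clashes";

    // At least two hits; the same term may be counted twice.
    static final Pattern ANY_TWO = Pattern.compile(
        "(.*?\\b(" + TERMS + ")\\b){2,}");

    // At least two hits; the second term must differ from group 1.
    static final Pattern TWO_DISTINCT = Pattern.compile(
        ".*?\\b(" + TERMS + ")\\b.*?\\b(?!\\1)(" + TERMS + ")\\b");

    static boolean anyTwo(String s) {
        return ANY_TWO.matcher(s).find();
    }

    static boolean twoDistinct(String s) {
        return TWO_DISTINCT.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "war war war",
            "the war and the army",
            "In the war on terror"
        };
        for (String s : samples) {
            System.out.println(s + " : anyTwo=" + anyTwo(s)
                + ", twoDistinct=" + twoDistinct(s));
        }
    }
}
```

"war war war" satisfies the first pattern but not the second, which is the case the negative lookahead is there to exclude.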
Answer 2:
This isn't (entirely) a job for regular expressions. A better approach is to scan the text, and then count the unique match groups.
In Ruby, it would be very simple to branch based on your match count. For example:
# Word boundaries keep "war" from matching inside "warm".
terms = /\b(?:war|army|fighting|rebels|clashes)\b/
text = "The war between the rebels and the army resulted in..."
# The real magic happens here: collect every hit, then drop duplicates.
match = text.scan(terms).uniq
# Do something if your minimum match count is met.
if match.count >= 2
  p match
end
This will print ["war", "rebels", "army"].
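The same scan-then-deduplicate idea can be sketched in Java, the language used in the first answer. This is only an illustration: the class name UniqueHits and the choice to lower-case hits and match case-insensitively are assumptions, not something the answer specifies.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class UniqueHits {
    // Assumption: matching should ignore case ("War" counts as "war").
    static final Pattern TERMS = Pattern.compile(
        "\\b(?:war|army|fighting|rebels|clashes)\\b",
        Pattern.CASE_INSENSITIVE);

    // Collects every distinct term found in the text, in order of first appearance.
    static Set<String> uniqueHits(String text) {
        Set<String> hits = new LinkedHashSet<>();
        Matcher m = TERMS.matcher(text);
        while (m.find()) {
            hits.add(m.group().toLowerCase(Locale.ROOT));
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> hits = uniqueHits(
            "The war between the rebels and the army resulted in...");
        // Keep the item only if the minimum match count is met.
        if (hits.size() >= 2) {
            System.out.println(hits);
        }
    }
}
```

Because the count lives outside the regex, raising the threshold from two to three is a one-character change.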
Answer 3:
Regular expressions could do the trick, but the expression would be quite large.
Remember, classic regular expressions are simple tools (based on finite-state automata), so they have no memory that would let them remember which words were already seen. Without extensions such as backreferences, such a regex, even though possible, has to spell out the orderings explicitly and would probably just look like a big lump of ORs (as in, one "or" for every possible order of inputs, or something similar).
I recommend doing the parsing yourself, for instance like this:
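As a rough illustration of that "lump of ORs", the sketch below generates such a pattern mechanically: one branch per possible first term, followed by an alternation of the remaining terms. It uses no backreferences or lookahead. The class name EnumeratedPattern is hypothetical, and the code assumes the terms are plain words with no regex metacharacters; engines that spell word boundaries differently from \b would need that part adapted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class EnumeratedPattern {
    // Assumption: terms are plain words, so no escaping is applied.
    static String build(List<String> terms) {
        List<String> branches = new ArrayList<>();
        for (String first : terms) {
            // Every term except the one chosen as the first hit.
            List<String> others = new ArrayList<>(terms);
            others.remove(first);
            branches.add("\\b" + first + "\\b.*\\b(?:"
                + String.join("|", others) + ")\\b");
        }
        return String.join("|", branches);
    }

    public static void main(String[] args) {
        String regex = build(Arrays.asList(
            "war", "army", "fighting", "rebels", "clashes"));
        System.out.println(regex);
        // DOTALL lets ".*" span line breaks inside a feed item.
        Pattern p = Pattern.compile(regex, Pattern.DOTALL);
        System.out.println(p.matcher("The war between the rebels and the army").find());
    }
}
```

The generated pattern grows with the square of the number of terms, and requiring three or more distinct terms would multiply the branches again.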
var searchTerms = set(yourWords);
int found = 0;
foreach (var x in words(input)) {
    if (x in searchTerms) {
        searchTerms.remove(x);
        ++found;
    }
    if (found >= 2) return true;
}
return false;
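The pseudocode above could be made runnable along these lines in Java. This is a sketch under assumptions: the class and method names are invented, words are taken to be runs of letters and digits, and comparison is done in lower case; none of that is specified in the answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class TermCounter {
    static boolean hasAtLeastTwoTerms(String input, Set<String> terms) {
        // Work on a copy so the caller's set is left untouched.
        Set<String> remaining = new HashSet<>(terms);
        int found = 0;
        // Assumption: anything that is not a letter or digit separates words.
        for (String word : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // remove() returns true only the first time a term is seen.
            if (remaining.remove(word)) {
                found++;
                if (found >= 2) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList(
            "war", "army", "fighting", "rebels", "clashes"));
        System.out.println(hasAtLeastTwoTerms(
            "The war between the rebels and the army resulted in several clashes this week.",
            terms));
    }
}
```

Removing each term from the working set as it is found is what keeps a repeated word from being counted twice.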
Answer 4:
If you want to do it all with a regex it's not likely to be easy.
You can however do something like this:
<?php
...
$string = "The war between the rebels and the army resulted in several clashes this week. (4 hits)";
preg_match_all("@(\b(war|army|fighting|rebels|clashes))\b@", $string, $matches);
$uniqueMatchingWords = array_unique($matches[0]);
if (count($uniqueMatchingWords) >= 2) {
    // bingo
}
Source: https://stackoverflow.com/questions/10832519/regex-match-at-least-two-search-terms