I'm writing a simple debugging program that takes as input simple strings that can contain stars to indicate a match-any wildcard:
*.wav // matches
You can also use the quotation escape sequences \Q and \E. Everything between them is treated as literal text and is not considered part of the regex to be evaluated. Thus this code should work:
String input = "*.wav";
String regex = "\\Q" + input.replace("*", "\\E.*?\\Q") + "\\E";
// regex = "\\Q\\E.*?\\Q.wav\\E"
Note that, depending on how you want your wildcard to behave, it might be better to match * only against word characters, using \w* instead of .*?
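To make the difference between the two wildcard behaviours concrete, here is a minimal sketch. The class name QuotedWildcardDemo and the helper toRegex are made up for this illustration, and the helper assumes the input never contains the two-character sequence \E.

```java
import java.util.regex.Pattern;

class QuotedWildcardDemo {
    // Hypothetical helper (not from the answer above): quotes the whole input
    // with \Q...\E and lets each '*' step outside the quoted region to emit starRegex.
    // Assumption: the input does not itself contain the sequence \E.
    static String toRegex(String wildcard, String starRegex) {
        return "\\Q" + wildcard.replace("*", "\\E" + starRegex + "\\Q") + "\\E";
    }

    public static void main(String[] args) {
        String anyChars = toRegex("*.wav", ".*?");   // star matches any characters
        String wordChars = toRegex("*.wav", "\\w*"); // star matches word characters only

        System.out.println(anyChars);
        System.out.println(Pattern.matches(anyChars, "abcd.wav"));
        // The dot sits inside \Q...\E, so it only matches a real dot.
        System.out.println(Pattern.matches(anyChars, "abcdXwav"));
        // A space is not a word character, so only the first variant accepts this name.
        System.out.println(Pattern.matches(anyChars, "my song.wav"));
        System.out.println(Pattern.matches(wordChars, "my song.wav"));
    }
}
```

Which variant is right depends on whether names containing spaces, dashes or extra dots should be accepted by the star.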
Just escape everything - no harm will come of it.
String input = "*.wav";
String regex = ("\\Q" + input + "\\E").replace("*", "\\E.*\\Q");
System.out.println(regex); // \Q\E.*\Q.wav\E
System.out.println("abcd.wav".matches(regex)); // true
Or you can use character classes:
String input = "*.wav";
String regex = input.replaceAll(".", "[$0]").replace("[*]", ".*");
System.out.println(regex); // .*[.][w][a][v]
System.out.println("abcd.wav".matches(regex)); // true
It's easier to "escape" the characters by putting each one in its own character class, as almost all characters lose any special meaning there. The exceptions are ^, \, [ and ], which are still special inside a class, so unless you're expecting weird file names containing those, this will work.
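As a quick check of that claim, the sketch below runs the same replaceAll/replace chain on a name containing parentheses, which would normally be regex metacharacters. The class and method names are invented for the example, and the input deliberately avoids ^, \, [ and ], which remain special inside a character class.

```java
class CharClassWildcardDemo {
    // Same technique as above: wrap every character in its own class,
    // then turn the wrapped star "[*]" into ".*".
    // Assumption: the input contains none of ^ \ [ ] (they stay special in a class).
    static String toRegex(String wildcard) {
        return wildcard.replaceAll(".", "[$0]").replace("[*]", ".*");
    }

    public static void main(String[] args) {
        String regex = toRegex("track(1)*.wav");
        System.out.println(regex);
        // The parentheses are matched literally instead of forming a group.
        System.out.println("track(1)-final.wav".matches(regex));
        System.out.println("track1-final.wav".matches(regex));
    }
}
```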
Regex While Accommodating A DOS/Windows Path
Implementing the quotation escape sequences \Q and \E is probably the best approach. However, since a backslash is typically used as the DOS/Windows file separator, a "\E" sequence within the path could affect the pairing of \Q and \E. While accounting for the * and ? wildcard tokens, the backslash can be handled in this manner:
Search: [^*?\\]+|(\*)|(\?)|(\\)
Two new lines would be added in the replace function of the "Using A Simple Regex" example to accommodate the new search pattern. The code would still be "Linux-friendly". As a method, it could be written like this:
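import java.util.regex.Matcher;
import java.util.regex.Pattern;

public String wildcardToRegex(String wildcardStr) {
    // Alternatives: a run of literal text | '*' (group 1) | '?' (group 2) | '\' (group 3)
    Pattern regex = Pattern.compile("[^*?\\\\]+|(\\*)|(\\?)|(\\\\)");
    Matcher m = regex.matcher(wildcardStr);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        if (m.group(1) != null) m.appendReplacement(sb, ".*");
        else if (m.group(2) != null) m.appendReplacement(sb, ".");
        else if (m.group(3) != null) m.appendReplacement(sb, "\\\\\\\\"); // emits \\ (a literal backslash)
        // quoteReplacement keeps a '$' in the literal text from being read as a group reference
        else m.appendReplacement(sb, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
    }
    m.appendTail(sb);
    return sb.toString();
}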
Code to demonstrate the implementation of this method could be written like this:
String s = "C:\\Temp\\Extra\\audio??2012*.wav";
System.out.println("Input: "+s);
System.out.println("Output: "+wildcardToRegex(s));
This produces the following output:
Input: C:\Temp\Extra\audio??2012*.wav
Output: \QC:\E\\\QTemp\E\\\QExtra\E\\\Qaudio\E..\Q2012\E.*\Q.wav\E
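As an alternative to treating the backslash as its own token, the JDK's Pattern.quote can do the quoting, since it already copes with a "\E" inside the text it is given. The following is only a sketch of that idea, not the method above: the class QuoteRunsDemo and the helper globToRegex are hypothetical names, and the helper walks the input by hand instead of using a search pattern.

```java
import java.util.regex.Pattern;

class QuoteRunsDemo {
    // Hypothetical helper: '*' becomes ".*", '?' becomes ".", and every run of
    // other characters is quoted with Pattern.quote, which handles an embedded \E.
    static String globToRegex(String glob) {
        StringBuilder out = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*' || c == '?') {
                // Flush the pending literal run before emitting the wildcard.
                if (literal.length() > 0) {
                    out.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                out.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            out.append(Pattern.quote(literal.toString()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The "\Extra" directory puts a \E sequence inside the literal text.
        String regex = globToRegex("C:\\Temp\\Extra\\audio??2012*.wav");
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "C:\\Temp\\Extra\\audio012012-final.wav"));
        System.out.println(Pattern.matches(regex, "C:\\Temp\\Other\\audio012012-final.wav"));
    }
}
```

The generated regex looks different from the output shown above, but it should accept the same paths. The trade-off is a hand-written loop in place of the compact search pattern.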
There is a small utility method in the Apache Commons IO library, org.apache.commons.io.FilenameUtils#wildcardMatch(), which you can use without the intricacies of regular expressions.
The API documentation can be found at: https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/FilenameUtils.html#wildcardMatch(java.lang.String,%20java.lang.String)
Lucene has classes that provide this capability, with additional support for backslash as an escape character: ? matches a single character, * matches 0 or more characters, and \ escapes the following character. It supports Unicode code points. It is supposed to be fast, but I haven't tested that.
CharacterRunAutomaton characterRunAutomaton;
boolean matches;
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Walmart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // false
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal*mart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // true
matches = characterRunAutomaton.run("Waldomart"); // true
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal\\*mart")));
matches = characterRunAutomaton.run("Walmart"); // false
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false
Using A Simple Regex
One of this method's benefits is that we can easily add tokens besides * (see Adding Tokens at the bottom).
Search: [^*]+|(\*)
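- The left side of the | matches any run of chars that are not a star
- The right side captures a star in Group 1
- If Group 1 is not set: replace with \Q + Match + \E
- If Group 1 is set: replace with .*
Here is some working code.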
Input: audio*2012*.wav
Output: \Qaudio\E.*\Q2012\E.*\Q.wav\E
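import java.util.regex.Matcher;
import java.util.regex.Pattern;

String subject = "audio*2012*.wav";
// Alternatives: a run of non-star characters | a star (group 1)
Pattern regex = Pattern.compile("[^*]+|(\\*)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
    if (m.group(1) != null) m.appendReplacement(b, ".*");
    // quoteReplacement keeps '\' and '$' in the literal text from being interpreted
    else m.appendReplacement(b, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println(replaced);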
Adding Tokens
Suppose we also want to convert the wildcard ?, which stands for a single character, to a dot. We just add a capture group to the regex and exclude it from the match-all on the left:
Search: [^*?]+|(\*)|(\?)
In the replace function we then add something like:
else if(m.group(2) != null) m.appendReplacement(b, ".");
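Putting both tokens together, a self-contained version might look like the sketch below. The class name TokenWildcardDemo and the method toRegex are made up for the example. It also wraps the literal replacement in Matcher.quoteReplacement, which is an addition on my part: appendReplacement treats $ and \ in the replacement string specially, so a file name containing either would otherwise be mangled or rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenWildcardDemo {
    // A run of literal text | '*' (group 1) | '?' (group 2)
    private static final Pattern TOKENS = Pattern.compile("[^*?]+|(\\*)|(\\?)");

    // Hypothetical helper combining the '*' and '?' tokens described above.
    static String toRegex(String wildcard) {
        Matcher m = TOKENS.matcher(wildcard);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) m.appendReplacement(b, ".*");
            else if (m.group(2) != null) m.appendReplacement(b, ".");
            // Literal run: quote it for the regex, and protect '$' and '\' from appendReplacement.
            else m.appendReplacement(b, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
        }
        m.appendTail(b);
        return b.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("audio??2012*.wav");
        System.out.println(regex);
        System.out.println("audio012012-take3.wav".matches(regex));
        System.out.println("audio12012.wav".matches(regex)); // only one character where ?? needs two
        System.out.println("price$2024.txt".matches(toRegex("price$*.txt")));
    }
}
```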