Question
Any simple Unicode string such as زسس or یسیتنانت matches in C# with the following regex pattern, but does not match in Java.
Can anyone explain this? How do I correct the pattern so that it works in Java?
"\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b"
C# code (it matches the strings):
private static readonly Regex s_regexEngine;
private static readonly string s_wordPattern = @"\b[\w\p{M}\u200B\u200C\u00AC\u001F\u200D\u200E\u200F]+\b";

static PersianWordTokenizer()
{
    s_regexEngine = new Regex(s_wordPattern, RegexOptions.Multiline);
}

public static List<string> Tokenize(string text, bool removeSeparators, bool standardized)
{
    List<string> tokens = new List<string>();
    int strIndex = 0;
    foreach (Match match in s_regexEngine.Matches(text))
    {
        // Execution enters this block
    }
    return tokens;
}
Java code (it does not match the strings):
private static final String s_wordPattern = "\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b";

static
{
    s_regexpattern = Pattern.compile(Pattern.quote(s_wordPattern));
}

public static java.util.ArrayList<String> Tokenize(String text, boolean removeSeparators, boolean standardized)
{
    java.util.ArrayList<String> tokens = new java.util.ArrayList<String>();
    int strIndex = 0;
    s_regexEngine = s_regexpattern.matcher(text);
    while (s_regexEngine.find())
    {
        // Execution never enters this block
    }
    return tokens;
}
Answer 1:
Look at the "any letter" Unicode character class, \p{L}, or at the Pattern.UNICODE_CHARACTER_CLASS flag of Java's Pattern.compile method.
The second option is specific to Java, but it is worth mentioning.
import java.util.regex.Pattern;

/**
 * @author Luc
 */
public class Test {
    /**
     * @param args
     */
    public static void main(final String[] args) {
        test("Bonjour");
        test("یسیتنانت");
        test("世界人权宣言 ");
    }

    private static void test(final String text) {
        showMatch(Pattern.compile("\\b\\p{L}+\\b"), text);
        showMatch(Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS), text);
    }

    private static void showMatch(final Pattern pattern, final String text) {
        System.out.println("With pattern \"" + pattern + "\": " + text + " " + pattern.matcher(text).find());
    }
}
Results:
With pattern "\b\p{L}+\b": Bonjour true
With pattern "\b\w+\b": Bonjour true
With pattern "\b\p{L}+\b": یسیتنانت true
With pattern "\b\w+\b": یسیتنانت true
With pattern "\b\p{L}+\b": 世界人权宣言 true
With pattern "\b\w+\b": 世界人权宣言 true
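Applied to the pattern from the question, the same idea could look like the sketch below. It is only an illustration, not the asker's actual tokenizer: the class name UnicodeTokenizerSketch and the tokenize helper are made up here, and the only changes relative to the question's Java code are that Pattern.quote is dropped and Pattern.UNICODE_CHARACTER_CLASS (available since Java 7) is passed to Pattern.compile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
public class UnicodeTokenizerSketch {
    // The question's pattern string, unchanged. It is compiled as a regex
    // (no Pattern.quote) and with UNICODE_CHARACTER_CLASS so that \w and \b
    // follow Unicode rules instead of the ASCII-only default for \w.
    private static final Pattern WORD = Pattern.compile(
            "\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Collects every match of the word pattern in the given text.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Bonjour یسیتنانت"));
        System.out.println(tokenize("زسس"));
    }
}
```

Compiling the Pattern once in a static field and creating a fresh Matcher per call keeps the helper safe to call repeatedly, since a Matcher holds per-input state.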
Answer 2:
The pattern string itself does not need to change between .NET and Java, so here is roughly how you would use it in Java.
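package regexdemo;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is a placeholder; the original snippet omitted it.
public class RegexDemo {
    public static void main(String[] args) {
        String term = "Hello-World";
        boolean found = false;
        // Same pattern as the C# version, with backslashes doubled for a Java string literal
        Pattern p = Pattern.compile("\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b");
        Matcher m = p.matcher(term);
        if (m.find()) {
            found = true;
        }
    }
}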
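To make the difference from the question's Java code concrete, here is a small sketch that compiles the same pattern string both ways: directly, and through Pattern.quote as the question does. The class and method names are invented for this example. Note that without Pattern.UNICODE_CHARACTER_CLASS, Java's \w covers only [a-zA-Z_0-9], so the ASCII input from the snippet is used here; for Persian text the flag from Answer 1 would likely still be needed.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
public class QuoteVersusCompileSketch {
    static final String WORD_PATTERN =
            "\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b";

    // Compiles the string as a regular expression.
    public static boolean findsAsRegex(String text) {
        return Pattern.compile(WORD_PATTERN).matcher(text).find();
    }

    // Compiles the string the way the question's Java code does.
    // Pattern.quote turns the whole string into a literal, so it can only
    // match input that contains the pattern's own characters.
    public static boolean findsAsQuotedLiteral(String text) {
        return Pattern.compile(Pattern.quote(WORD_PATTERN)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(findsAsRegex("Hello-World"));
        System.out.println(findsAsQuotedLiteral("Hello-World"));
        System.out.println(findsAsQuotedLiteral(WORD_PATTERN));
    }
}
```

The quoted variant finds a match only when the input is (or contains) the pattern text itself, which would explain why the loop in the question is never entered.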
As a starting point for learning how the different regex flavors differ, I'd recommend these sites:
http://docs.oracle.com/javase/tutorial/essential/regex/index.html
http://www.regular-expressions.info/
Answer 3:
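Wrap the regex string in a call to java.util.regex.Pattern.quote, e.g. java.util.regex.Pattern.quote(yourCSharpRegexString), only if the string should be matched as literal text: Pattern.quote escapes every metacharacter, so the result no longer behaves as a regular expression.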
Source: https://stackoverflow.com/questions/14280005/convert-c-sharp-regex-to-java-regex