Regex accent insensitive?

為{幸葍}努か posted on 2019-11-28 10:06:49

Case-insensitive matching works for me in this example:

     string input =@"âãäåæçèéêëìíîïðñòóôõøùúûüýþÿı";
     string pattern = @"\w+";
     MatchCollection matches = Regex.Matches (input, pattern, RegexOptions.IgnoreCase);
Paolo Moretti

You could simply replace diacritics with their alphabetic (near-)equivalents, and then use your current regex.

See for example:

How do I remove diacritics (accents) from a string in .NET?

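using System.Globalization;
using System.Text;

static string RemoveDiacritics(string input)
{
    // Decompose each accented character into a base character plus combining marks.
    string normalized = input.Normalize(NormalizationForm.FormD);
    var builder = new StringBuilder();

    foreach (char ch in normalized)
    {
        // Keep everything except the combining (non-spacing) marks.
        if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(ch);
        }
    }

    // Recompose whatever is left into canonical composed form.
    return builder.ToString().Normalize(NormalizationForm.FormC);
}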

string s1 = "Renato Núñez David DeJesús Edwin Encarnación";
string s2 = RemoveDiacritics(s1);
// s2 = "Renato Nunez David DeJesus Edwin Encarnacion"
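To connect this back to the regex question, here is a minimal sketch that strips the diacritics first and then runs an unaccented pattern against the result. The class name and both method names are hypothetical, and it assumes the pattern itself is written without accents.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class AccentInsensitiveRegex
{
    // Same approach as RemoveDiacritics: decompose, drop combining marks, recompose.
    public static string StripMarks(string input)
    {
        var builder = new StringBuilder();
        foreach (char ch in input.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(ch);
            }
        }
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    // Hypothetical helper: matches an unaccented pattern against the stripped input.
    public static bool IsMatchIgnoringAccents(string input, string pattern)
    {
        return Regex.IsMatch(
            StripMarks(input),
            pattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

For example, AccentInsensitiveRegex.IsMatchIgnoringAccents("Renato N\u00FA\u00F1ez", @"\bnunez\b") should return true. Keep in mind that any match positions refer to the stripped string, not the original input.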

Use \p{L} instead of the class \w.

\p{L} matches any Unicode code point in the category "letter", so it includes for example "äöüéè" and so on.

You can also use it inside your own character class, for example if you want to include the space or the dot as well: [\p{L} .]
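A short sketch of both uses, in case it helps. The class and method names are made up for illustration, and the accented letters are written as \u escapes so the example does not depend on the source file's encoding.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterClassDemo
{
    // Returns every run of Unicode letters found in the input.
    public static string[] LetterRuns(string input)
    {
        return Regex.Matches(input, @"\p{L}+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    // True when the input consists only of letters, spaces and dots.
    public static bool IsNameLike(string input)
    {
        return Regex.IsMatch(input, @"^[\p{L} .]+$");
    }
}
```

With this, LetterClassDemo.IsNameLike("J. Encarnaci\u00F3n") is true and LetterClassDemo.IsNameLike("R2-D2") is false. One caveat: \p{L} does not cover combining marks, so if the input may be in decomposed form (a base letter followed by U+0301, say), you would probably want [\p{L}\p{M}] instead.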

Update:

OK, I have since realized that \w in .NET also includes the Unicode letters and not only the ASCII ones.

So I am not sure what you are asking. If you want to allow characters that just look like a letter but aren't, then I think you will end up using \S (any non-whitespace character).

Maybe it helps if you show some examples.
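The behaviour of \w described in the update can be checked with a small sketch like the one below. The class and method names are only illustrative. RegexOptions.ECMAScript is the documented switch that narrows \w to [a-zA-Z_0-9].

```csharp
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // Default .NET behaviour: \w is Unicode-aware and accepts accented letters.
    public static bool DefaultWordMatch(string input)
    {
        return Regex.IsMatch(input, @"^\w+$");
    }

    // ECMAScript-compliant behaviour: \w only accepts ASCII letters, digits and underscore.
    public static bool EcmaWordMatch(string input)
    {
        return Regex.IsMatch(input, @"^\w+$", RegexOptions.ECMAScript);
    }
}
```

So if accented letters are being rejected by \w, it is worth checking whether RegexOptions.ECMAScript is set somewhere, or whether the pattern is really being run by a different engine (client-side JavaScript validation, for instance).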

Try this:

 String pattern = @"[\p{L}\w]+"; 

Can you try this and see if it works:

[\u00E9-\u00F8\w]

Don't shoot me down for this, but if you're just trying to match a filename, then why not go the other way and use excluded characters?

 [^<>:"/\|?*]
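A rough sketch of that exclusion approach, with hypothetical class and method names. Note that inside a character class \| is just an escaped |, so the class as written above does not exclude the backslash. The sketch assumes that both \ and | were meant to be excluded and escapes the backslash explicitly.

```csharp
using System.Text.RegularExpressions;

public static class FileNameCheck
{
    // Rejects < > : " / \ | ? * and accepts everything else, accented letters included.
    // In the verbatim string, "" is a literal double quote and \\ is an escaped backslash.
    private static readonly Regex Allowed = new Regex(@"^[^<>:""/\\|?*]+$");

    public static bool LooksLikeFileName(string name)
    {
        return Allowed.IsMatch(name);
    }
}
```

This only checks for the listed punctuation. It is not a full validity check for any particular file system (reserved names, control characters and length limits are not covered). Path.GetInvalidFileNameChars() may be a better source for the excluded set on the platform you target.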

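Did you try . (the dot)? It matches any single character except a newline character. The cheat sheet linked below describes \w as matching any word character including underscore, equivalent to "[A-Za-z0-9_]", which would explain accented letters being excluded. In .NET, however, that ASCII-only definition applies only when RegexOptions.ECMAScript is set; by default \w also matches Unicode letters.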

http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
