Regex accent insensitive?

天命终不由人 2020-12-09 17:38

I need a Regex in a C# program.


I have to capture the name of a file that has a specific structure.

I used the \w cha

7 Answers
  • 2020-12-09 18:17

    Use \p{L} instead of the class \w.

    \p{L} matches any Unicode code point in the category "letter", so it includes, for example, "äöüéè" and so on.

    You can also use it inside your own character class, for example to allow a space or a dot as well: [\p{L} .]

    Update:

    OK, I realized that \w in .NET also includes the Unicode letters and not only the ASCII ones.

    So I am not sure what you are asking. If you want to allow stuff that just looks like a letter, but isn't, then I think you will end up using \S (not a whitespace).

    Maybe it helps if you show some examples.
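
    As a rough illustration of the character class above, here is a minimal sketch. The file names are hypothetical and written with escape sequences so the accented letters are precomposed; a decomposed name (base letter plus combining mark) would need \p{M} in the class as well.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical file names, used only for illustration.
    string accented = "N\u00FA\u00F1ez r\u00E9sum\u00E9.txt"; // "Núñez résumé.txt"
    string withDigit = "file1.txt";

    // [\p{L} .]+ allows letters of any script, spaces, and dots.
    string pattern = @"^[\p{L} .]+$";

    bool accentedOk = Regex.IsMatch(accented, pattern);
    bool withDigitOk = Regex.IsMatch(withDigit, pattern);

    Console.WriteLine(accentedOk);  // True
    Console.WriteLine(withDigitOk); // False: the digit is not in the class
    ```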

  • 2020-12-09 18:21

    Don't shoot me down for this, but if you're just trying to match a filename, then why not go the other way and use excluded characters?

     [^<>:"/\\|?*]
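
    A minimal sketch of the exclusion approach, assuming the goal is to reject the characters Windows forbids in file names. The sample names are hypothetical, and the backslash is escaped inside the class so that it is excluded too.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // In a verbatim string, "" is a literal quote and \\ reaches the regex
    // engine as an escaped (literal) backslash.
    var validName = new Regex(@"^[^<>:""/\\|?*]+$");

    bool plainOk = validName.IsMatch("Encarnaci\u00F3n 2020.txt"); // "Encarnación 2020.txt"
    bool pipeOk = validName.IsMatch("a|b.txt");

    Console.WriteLine(plainOk); // True
    Console.WriteLine(pipeOk);  // False: '|' is excluded
    ```

    If the exact set of forbidden characters matters, Path.GetInvalidFileNameChars() returns the list for the current platform and may be a safer source than a hand-written class.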
    
  • 2020-12-09 18:32

    Case-insensitive matching works for me in this example:

         // Requires: using System.Text.RegularExpressions;
         string input = @"âãäåæçèéêëìíîïðñòóôõøùúûüýþÿı";
         string pattern = @"\w+";
         MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);
    
  • 2020-12-09 18:35

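    Did you try "."? It matches any single character except a newline character. According to the cheat sheet below, \w matches any word character including underscore and is equivalent to "[A-Za-z0-9_]", which would explain why accented letters are excluded. Note, however, that in .NET this equivalence only holds with RegexOptions.ECMAScript; by default \w also matches Unicode letters.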

    http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
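
    A quick way to check how \w treats an accented letter in .NET is to compare the default behaviour with RegexOptions.ECMAScript, which restricts \w to ASCII. This is only a small sketch with a single sample character:

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    string input = "\u00E9"; // "é"

    // Default options: \w is Unicode-aware.
    bool defaultMatch = Regex.IsMatch(input, @"^\w$");

    // ECMAScript option: \w behaves like [a-zA-Z_0-9].
    bool ecmaMatch = Regex.IsMatch(input, @"^\w$", RegexOptions.ECMAScript);

    Console.WriteLine(defaultMatch); // True
    Console.WriteLine(ecmaMatch);    // False
    ```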

  • 2020-12-09 18:37

    You could simply replace the diacritics with their alphabetic (near-)equivalents, and then use your current regex.

    See for example:

    How do I remove diacritics (accents) from a string in .NET?

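    using System.Globalization;
    using System.Text;

    // Decomposes each character (FormD), drops the combining marks,
    // then recomposes what is left (FormC).
    static string RemoveDiacritics(string input)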
    {
        string normalized = input.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder();
    
        foreach (char ch in normalized)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(ch);
            }
        }
    
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }
    
    string s1 = "Renato Núñez David DeJesús Edwin Encarnación";
    string s2 = RemoveDiacritics(s1);
    // s2 = "Renato Nunez David DeJesus Edwin Encarnacion"
    
  • 2020-12-09 18:40

    Try this:

     string pattern = @"[\p{L}\w]+";
    