How can I perform a culture-sensitive “starts-with” operation from the middle of a string?

前端 未结 3 1634
长情又很酷
长情又很酷 2020-12-28 11:12

I have a requirement which is relatively obscure, but it feels like it should be possible using the BCL.

For context, I\'m parsing a date/time string in Nod

3条回答
  •  臣服心动
    2020-12-28 12:09

    I'll consider the problem of many<->one/many casemappings first and separately from handling different Normalization forms.

    For example:

    x heiße y
      ^--- cursor
    

    Matches heisse but then moves cursor 1 too much. And:

    x heisse y
      ^--- cursor
    

    Matches heiße but then moves cursor 1 too less.

    This will apply to any character that doesn't have a simple one-to-one mapping.

    You would need to know the length of the substring that was actually matched. But Compare, IndexOf ..etc throw that information away. It could be possible with regular expressions but the implementation doesn't do full case folding and so doesn't match ß to ss/SS in case-insensitive mode even though .Compare and .IndexOf do. And it would probably be costly to create new regexes for every candidate anyway.

    The simplest solution to this is to just internally store strings in case folded form and do binary comparisons with case folded candidates. Then you can move the cursor correctly with just .Length since the cursor is for internal representation. You also get most of the lost performance back from not having to use CompareOptions.IgnoreCase.

    Unfortunately there is no case fold function built-in and the poor man's case folding doesn't work either because there is no full case mapping - the ToUpper method doesn't turn ß into SS.

    For example this works in Java (and even in Javascript), given string that is in Normal Form C:

    //Poor man's case folding.
    //There are some edge cases where this doesn't work
    public static String toCaseFold( String input, Locale cultureInfo ) {
        return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
    }
    

    Fun to note that Java's ignore case comparison doesn't do full case folding like C#'s CompareOptions.IgnoreCase. So they are opposite in this regard: Java does full casemapping, but simple case folding - C# does simple casemapping, but full case folding.

    So it's likely that you need a 3rd party library to case fold your strings before using them.


    Before doing anything you have to be sure that your strings are in normal form C. You can use this preliminary quick check optimized for Latin script:

    public static bool MaybeRequiresNormalizationToFormC(string input)
    {
        if( input == null ) throw new ArgumentNullException("input");
    
        int len = input.Length;
        for (int i = 0; i < len; ++i)
        {
            if (input[i] > 0x2FF)
            {
                return true;
            }
        }
    
        return false;
    }
    

    This gives false positives but not false negatives, I don't expect it to slow down 460k parses/s at all when using Latin script characters even though it needs to be performed on every string. With a false positive you would use IsNormalized to get a true negative/positive and only after that normalize if necessary.


    So in conclusion, the processing is to ensure normal form C first, then case fold. Do binary comparisons with the processed strings and move cursor as you are moving it currently.

提交回复
热议问题