We have recently upgraded all our projects from .NET 3.5 to .NET 4. I have come across a rather strange issue with respect to string.IndexOf().
My code
Your string consists of two characters: a soft hyphen (Unicode code point 173, U+00AD) and a hyphen (Unicode code point 45, U+002D).
Wiki: According to the Unicode standard, a soft hyphen is not displayed if the line is not broken at that point.
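Since the soft hyphen is normally invisible, it can help to dump the string's code units before debugging any further. The following is a minimal sketch; the class name CodePointDump and the helper CodeUnits are made up for illustration and are not part of the framework.

```csharp
using System;
using System.Linq;

static class CodePointDump
{
    // Hypothetical helper: returns the UTF-16 code units of s as integers.
    public static int[] CodeUnits(string s)
    {
        return s.Select(c => (int)c).ToArray();
    }

    static void Main()
    {
        // Soft hyphen (U+00AD) followed by hyphen-minus (U+002D).
        string text = "\xAD\x2D";

        // Prints one line per code unit, in hex and decimal:
        // U+00AD (173)
        // U+002D (45)
        foreach (int cu in CodeUnits(text))
        {
            Console.WriteLine("U+{0:X4} ({0})", cu);
        }
    }
}
```

This only inspects the raw UTF-16 values, so the output does not depend on the framework version or the current culture.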
When using "\xAD\x2D".IndexOf("\xAD\x2D") in .NET 4, it seems to ignore that you're looking for the soft hyphen, returning a starting index of 1 (the index of \x2D). In .NET 3.5, this returns 0.
More fun, if you run this code (so when only looking for the soft hyphen):
string text = "\xAD\x2D";
string shy = "\xAD";
int i1 = text.IndexOf(shy);
then i1 becomes 0, regardless of the .NET version used. Only the result of text.IndexOf(text) varies between versions, which at a glance looks like a bug to me.
As far as I can trace it back through the framework, older .NET versions use an InternalCall to IndexOfString() (I can't figure out which API call that maps to), while from .NET 4 onwards a QCall to InternalFindNLSStringEx() is made, which in turn calls FindNLSStringEx().
The issue (I really can't figure out if this is intended behaviour) indeed occurs when calling FindNLSStringEx directly:
LPCWSTR lpStringSource = L"\xAD\x2D";
LPCWSTR lpStringValue = L"\xAD";
int length;

// First call: search for the soft hyphen alone.
int i = FindNLSStringEx(
    LOCALE_NAME_SYSTEM_DEFAULT,
    FIND_FROMSTART,
    lpStringSource,
    -1,
    lpStringValue,
    -1,
    &length,
    NULL,
    NULL,
    1);
Console::WriteLine(i);

// Second call: search for the whole string within itself.
i = FindNLSStringEx(
    LOCALE_NAME_SYSTEM_DEFAULT,
    FIND_FROMSTART,
    lpStringSource,
    -1,
    lpStringSource,
    -1,
    &length,
    NULL,
    NULL,
    1);
Console::WriteLine(i);
Console::ReadLine();
Prints 0 and then 1. Note that length, an out parameter indicating the length of the found string, is 0 after the first call and 1 after the second; the soft hyphen is counted as having a length of 0.
The workaround is to use text.IndexOf(text, StringComparison.OrdinalIgnoreCase), as you've noted. This makes a QCall to InternalCompareStringOrdinalIgnoreCase(), which in turn calls FindStringOrdinal(), which returns 0 for both cases.
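To see which comparison modes are affected on a given machine, one option is to run the same two searches under every StringComparison value. This is only a sketch: the class name IndexOfModes and the helper Find are hypothetical, and the culture-sensitive rows are deliberately left without expected values because they depend on the framework version and globalization settings. Only the ordinal rows can be relied on to print 0.

```csharp
using System;

static class IndexOfModes
{
    // Hypothetical helper: thin wrapper around String.IndexOf(string, StringComparison).
    public static int Find(string haystack, string needle, StringComparison mode)
    {
        return haystack.IndexOf(needle, mode);
    }

    static void Main()
    {
        // Soft hyphen (U+00AD) followed by hyphen-minus (U+002D).
        string text = "\xAD\x2D";

        // Run both searches from the discussion above under each comparison mode.
        foreach (StringComparison mode in Enum.GetValues(typeof(StringComparison)))
        {
            Console.WriteLine(
                "{0,-28} whole string: {1}  soft hyphen only: {2}",
                mode,
                Find(text, text, mode),
                Find(text, "\xAD", mode));
        }
    }
}
```

If the Ordinal and OrdinalIgnoreCase rows show 0 while the culture-sensitive rows differ, that points at the linguistic comparison rather than at the string data itself.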
It seems to be a bug in .NET 4. The changes described below were reverted in .NET 4 Beta 1, so the behavior should be the same as in .NET 2.0/3.0/3.5.
What's New in the BCL in .NET 4.0 CTP (MSDN blogs):
String Security Changes in .NET 4
The default partial matching overloads on System.String (StartsWith, EndsWith, IndexOf, and LastIndexOf) have been changed to be culture-agnostic (ordinal) by default.
This change affected the behavior of the String.IndexOf method by changing it to perform an ordinal (byte-for-byte) comparison by default, and it will be changed to use CultureInfo.InvariantCulture instead of CultureInfo.CurrentCulture.
UPDATE for .NET 4 Beta 1
In order to maintain high compatibility between .NET 4 and previous releases, we have decided to revert this change. The behavior of String's default partial matching overloads and String and Char's ToUpper and ToLower methods now behave the same as they did in .NET 2.0/3.0/3.5. The change back to the original behavior is present in .NET 4 Beta 1.
To fix this, change the string comparison method to an overload that accepts the System.StringComparison enumeration as a parameter, and specify either Ordinal or OrdinalIgnoreCase.
// string contains a soft hyphen (\xAD) followed by a hyphen (\x2D)
string text = "\xAD\x2D";
// works in .NET 2.0/3.0/3.5 and .NET 4 Beta 1 and later,
// but seems to be buggy in .NET 4 because of the culture-sensitive comparison
int index = text.IndexOf(text);
// fixed version: ordinal comparison, independent of culture
index = text.IndexOf(text, StringComparison.Ordinal);
From the documentation (my emphasis):
This method performs a word (case-sensitive and culture-sensitive) search using the current culture.
I.e., some distinct code points will be treated as equal.
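A small sketch of what that means for the soft hyphen, which the .NET documentation describes as an ignorable character in culture-sensitive comparisons. The class and method names here are made up for illustration. The culture-sensitive line has no expected output on purpose, since it can vary with the framework version and globalization mode.

```csharp
using System;

static class IgnorableDemo
{
    // Hypothetical helper: compares the raw UTF-16 code units.
    public static bool SameOrdinal(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    // Hypothetical helper: compares using the current culture's linguistic rules.
    public static bool SameForCulture(string a, string b)
    {
        return string.Equals(a, b, StringComparison.CurrentCulture);
    }

    static void Main()
    {
        string plain = "animal";
        string withSoftHyphen = "ani\u00ADmal"; // soft hyphen between "ani" and "mal"

        // Ordinal: the strings differ in length and content, so this prints False.
        Console.WriteLine("Ordinal equal: {0}", SameOrdinal(plain, withSoftHyphen));

        // Culture-sensitive: the soft hyphen may be ignored, in which case
        // the two strings compare as equal.
        Console.WriteLine("Culture equal: {0}", SameForCulture(plain, withSoftHyphen));
    }
}
```

If the second line prints True, the same rule explains why a culture-sensitive IndexOf can report a zero-length match for the soft hyphen.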
What happens if you use an overload that takes a StringComparison value and pass StringComparison.Ordinal to avoid the cultural dependencies?