According to the following table for the ISO-8859-1 standard, there seems to be an entity name and an entity number associated with each reserved HTML character.
That's how the method has been implemented. For a few known characters it uses the corresponding named entity, and for the other characters it encodes it uses a numeric reference (the decimal code, not hex, as the excerpt shows). There is not much you can do to modify this behavior. Excerpt from the implementation of System.Net.WebUtility.HtmlEncode (as seen with Reflector):
    ...
    if (ch <= '>')
    {
        switch (ch)
        {
            case '&':
            {
                output.Write("&amp;");
                continue;
            }
            case '\'':
            {
                output.Write("&#39;");
                continue;
            }
            case '"':
            {
                output.Write("&quot;");
                continue;
            }
            case '<':
            {
                output.Write("&lt;");
                continue;
            }
            case '>':
            {
                output.Write("&gt;");
                continue;
            }
        }
        // Any other character up to '>' is written unchanged.
        output.Write(ch);
        continue;
    }
    // Characters U+00A0 through U+00FF become decimal numeric references.
    if ((ch >= '\u00a0') && (ch < '\u0100'))
    {
        output.Write("&#");
        output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
        output.Write(';');
    }
    ...
That said, you shouldn't need to care, as this method always produces valid, safe and correctly encoded HTML.
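To see the mixture of named and numeric output described above, here is a minimal sketch. It assumes .NET 4.0 or later, where System.Net.WebUtility is available; the class name EncodeDemo is hypothetical and only there for illustration.

```csharp
using System;
using System.Net;

// Hypothetical helper class, not part of the framework.
public static class EncodeDemo
{
    // Thin wrapper around the framework call so results are easy to inspect.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }
}
```

Calling EncodeDemo.Encode("<a & b>") should give &lt;a &amp; b&gt; (named entities), while EncodeDemo.Encode("café") should give caf&#233; (a decimal numeric reference).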
ISO-8859-1 is not really relevant to HTML character encoding. From Wikipedia:
Numeric references always refer to Unicode code points, regardless of the page's encoding.
Only for references to undefined or control code points do some browsers fall back to a legacy encoding, and even then it is Windows-1252 rather than ISO-8859-1:
Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
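A small sketch of the point that numeric references name Unicode code points: the euro sign is U+20AC (decimal 8364) in Unicode, even though it sits at byte 0x80 in Windows-1252. The class name ReferenceDemo is hypothetical; the only framework call used is System.Net.WebUtility.HtmlDecode.

```csharp
using System;
using System.Net;

// Hypothetical helper class, not part of the framework.
public static class ReferenceDemo
{
    // Decodes a character reference and returns the resulting text.
    public static string Decode(string reference)
    {
        return WebUtility.HtmlDecode(reference);
    }
}
```

ReferenceDemo.Decode("&#8364;"), ReferenceDemo.Decode("&#x20AC;") and ReferenceDemo.Decode("&euro;") should all return the same one-character string, "€".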
Now to answer your question: For search to work best, you should really search the unencoded HTML (stripping the HTML tags first) using an unencoded search string. Matching encoded strings will lead to unexpected results, like hits based on HTML tags or comments, and hits missing because of differences in the HTML that are invisible in the text.
HtmlEncode is following the spec. HTML defines named entities for a set of characters, and any character can also be written as a numeric reference; where a name exists, the name and the number are equivalent. Therefore, a conforming implementation of HtmlEncode is free to encode all points as numbers, or to use names wherever they exist, or some mixture of the two.
I suggest that you approach your problem from the other direction: call HtmlDecode on the target text, then search through the decoded text using the raw string.
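The decode-then-search approach suggested above can be sketched roughly as follows. The class and method names (DecodedSearch, ContainsText) are hypothetical, the comparison is assumed to be case-insensitive, and the sketch does not strip HTML tags, so it only fits text that contains entities but no markup.

```csharp
using System;
using System.Net;

// Hypothetical helper class, not part of the framework.
public static class DecodedSearch
{
    // Decodes the entity-encoded text first, then looks for the raw query in it.
    public static bool ContainsText(string encodedText, string rawQuery)
    {
        string decoded = WebUtility.HtmlDecode(encodedText);
        return decoded.IndexOf(rawQuery, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With this, "caf&#233;" and "caf&eacute;" both match the raw query "café", so the search no longer depends on which form the encoder happened to choose.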
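I wrote this function, which encodes every character as a decimal numeric reference; I think it will help:
    string BasHtmlEncode(string x)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in x)
        {
            // Cast to int: Convert.ToInt16 would overflow for characters above U+7FFF.
            sb.Append(String.Format("&#{0};", (int)c));
        }
        return sb.ToString();
    }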
I wrote the following code, which leaves a-z, A-Z and 0-9 unencoded and encodes everything else:
    public static string Encode(string source)
    {
        if (string.IsNullOrEmpty(source)) return string.Empty;
        var sb = new StringBuilder(source.Length);
        foreach (char c in source)
        {
            if (c >= 'a' && c <= 'z')
            {
                sb.Append(c);
            }
            else if (c >= 'A' && c <= 'Z')
            {
                sb.Append(c);
            }
            else if (c >= '0' && c <= '9')
            {
                sb.Append(c);
            }
            else
            {
                // Everything else becomes a decimal numeric reference.
                sb.AppendFormat("&#{0};", Convert.ToInt32(c));
            }
        }
        return sb.ToString();
    }