C# HtmlEncode - ISO-8859-1 Entity Names vs Numbers

名媛妹妹 2020-12-10 13:36

According to the following table for the ISO-8859-1 standard, there seems to be an entity name and an entity number associated with each reserved HTML character.

5 Answers

予麋鹿 (original poster)

2020-12-10 14:13

ISO-8859-1 is not really relevant to HTML character encoding. From Wikipedia:

Numeric references always refer to Unicode code points, regardless of the page's encoding.

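A legacy single-byte encoding comes into play only for undefined code points and control characters, and there the fallback is Windows-1252 rather than ISO-8859-1 proper: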

Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
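To make the point about entity names and numbers concrete, here is a minimal sketch, assuming .NET's System.Net.WebUtility is available. The class and method names (EntityDemo, DecodeToSame) are made up for illustration. It checks that a named reference, a decimal reference, and a hexadecimal reference all decode to the same Unicode character.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of any library.
public static class EntityDemo
{
    // Returns true when every character reference in the list
    // decodes to the same expected text.
    public static bool DecodeToSame(string expected, params string[] references)
    {
        foreach (var reference in references)
        {
            // WebUtility.HtmlDecode resolves both named and numeric references.
            if (WebUtility.HtmlDecode(reference) != expected)
                return false;
        }
        return true;
    }
}
```

For example, EntityDemo.DecodeToSame("<", "&lt;", "&#60;", "&#x3C;") compares the three spellings of the less-than sign. The number in a numeric reference is the Unicode code point, so it does not change with the page's encoding.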

Now to answer your question: for search to work best, search the decoded text (strip the HTML tags first, then decode the character references) using an unencoded search string. Matching encoded strings leads to unexpected results: false hits inside HTML tags or comments, and missed hits when the same visible character is encoded differently in the markup, for example &amp; versus &#38;.
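The approach described above could be sketched roughly as follows. This is only an illustration: the names TextSearch, ToPlainText, and ContainsText are hypothetical, and the regex-based tag stripping is a deliberate simplification that a real HTML parser should replace for anything beyond simple markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TextSearch
{
    // Reduce HTML to the text a reader would see (naive approximation).
    public static string ToPlainText(string html)
    {
        // Drop comments first so their contents can never produce a hit.
        string noComments = Regex.Replace(html, "<!--.*?-->", " ", RegexOptions.Singleline);
        // Assumption: tags contain no '>' inside attribute values.
        string noTags = Regex.Replace(noComments, "<[^>]+>", " ");
        // Decode named and numeric references into plain characters.
        return WebUtility.HtmlDecode(noTags);
    }

    // Case-insensitive search of the visible text for an unencoded query.
    public static bool ContainsText(string html, string query)
    {
        return ToPlainText(html).IndexOf(query, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With this sketch, TextSearch.ContainsText("<p>Fish &#38; chips</p>", "Fish & chips") finds the phrase no matter how the ampersand was encoded, while a query for "chips" does not match a document where the word appears only in a class attribute. Replacing each tag with a space keeps adjacent paragraphs from running together, at the cost of splitting a word that has inline markup in the middle of it.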
