Spec for handling of HTML entities in a[href]

Submitted by 北慕城南 on 2021-02-10 08:38:49

Question


I'm looking for a spec on handling HTML entities in the href attribute of <a> tags. So far, no luck (I might be searching for something too specific).

In detail:

The bug I'm trying to fix is part of the cheerio project.

Some entities don't require a semicolon at the end. One of them is &curren. Anyway, this leads to problems when a source links to /test/example.jsp?item=123&currentSize=S&currentQty=1.
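The symptom can be reproduced outside cheerio. The snippet below is only an illustrative sketch: it uses Python's standard html.unescape (not cheerio's own decoder, which is JavaScript), a function that decodes character references without knowing whether the input came from an attribute value.

```python
import html

# The URL from the question, with unescaped ampersands.
href = "/test/example.jsp?item=123&currentSize=S&currentQty=1"

# html.unescape has no notion of attribute context, so the legacy
# name "curren" is matched even without a trailing semicolon.
decoded = html.unescape(href)
print(decoded)
```

Each "&curren" is replaced by the currency sign, so the printed value contains "¤tSize" and "¤tQty" and no longer works as a link.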

Browsers (at least Chrome) handle this nicely. I still haven't figured out why though.


Answer 1:


Regarding HTML up to and including HTML 4.01, see @Quentin’s answer.

Regarding any flavor of XHTML, including HTML5 in XHTML serialization, &currentSize= contains a well-formedness error, so any display of the document is aborted (when the document is processed as truly XHTML).

In HTML5 in HTML serialization, there are tricky ad hoc rules for parsing character references. They imply that in text content, &currentSize= would be parsed as if it were written &curren;tSize=, i.e. as ¤tSize=. Within an attribute value, however, as in <a href="...">, the reference is not recognized under certain conditions, because it is not terminated by a semicolon.

Specifically, the conditions described there are: “If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or in the range ASCII digits, uppercase ASCII letters, or lowercase ASCII letters, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.” So no &foobar= will be recognized in an attribute value, even if foobar is a defined name.
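The quoted rule can be sketched in a few lines. This is a simplified illustration, not the full HTML5 tokenizer and not cheerio's code: the helper decode_named_refs and its in_attribute flag are hypothetical names, numeric references are ignored, and the entity table is borrowed from Python's html.entities.html5.

```python
import re
from html.entities import html5  # contains both "curren" and "curren;"

# An ampersand followed by a run of ASCII alphanumerics and an optional ";".
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*;?)")


def decode_named_refs(value: str, in_attribute: bool) -> str:
    """Hypothetical helper: decode named character references only."""

    def replace(match):
        candidate = match.group(1)
        # Find the longest prefix of the candidate that is a known name.
        for end in range(len(candidate), 0, -1):
            name = candidate[:end]
            if name in html5:
                break
        else:
            return match.group(0)  # no known name: leave the text alone
        rest = candidate[end:]
        if in_attribute and not name.endswith(";"):
            # Attribute rule: if the name lacks ";" and the next character
            # is "=" or an ASCII alphanumeric, unconsume everything.
            next_char = rest[:1] or value[match.end():match.end() + 1]
            if next_char == "=" or (next_char.isascii() and next_char.isalnum()):
                return match.group(0)
        return html5[name] + rest

    return _NAMED_REF.sub(replace, value)


href = "/test/example.jsp?item=123&currentSize=S&currentQty=1"
as_attribute = decode_named_refs(href, in_attribute=True)
as_text = decode_named_refs(href, in_attribute=False)
```

With in_attribute=True the URL comes back unchanged, because "&curren" is followed by the letter "t". With in_attribute=False the same input is decoded as text content and "&curren" becomes "¤".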

The reason is that authors have widely written URLs in attribute values without escaping & and browsers have adapted to this.




Answer 2:


“I might be searching for something too specific.”

You are. They are treated the same way as they are everywhere else (outside of elements defined as containing CDATA).

I can't find anything specific and explicit that says where character references are evaluated in HTML, but the attributes section implies it with:

“all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.”

HTML 5 changes the rules with:

“must be one that is terminated by a ";" (U+003B) character.”

… and variations of the same.

However, some browsers still support the old standard, in which the semicolon was optional when the entity was followed by a non-name character. The standard for this is the ISO SGML spec, which you have to pay for, but HTML 4.0 says:

“Note: In SGML, it is possible to eliminate the final ";" after a numeric or named character reference in some cases (e.g., at a line break or directly before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.”

In short, for backwards compatibility and clarity, if you want to include a & character in a URL in an href attribute, then just represent it as &amp;. That works everywhere.
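As a small sketch of that advice, assuming the markup is generated from Python, the standard html.escape function can produce the href value. The variable names are illustrative only.

```python
import html

url = "/test/example.jsp?item=123&currentSize=S&currentQty=1"

# Escape the URL before placing it in the attribute; "&" becomes "&amp;".
escaped = html.escape(url, quote=True)
anchor = '<a href="' + escaped + '">example</a>'
print(anchor)

# Decoding the escaped value gives back the original URL, since every
# ampersand is now an explicit, semicolon-terminated reference.
round_tripped = html.unescape(escaped)
```

Because each reference ends with a semicolon, the result no longer depends on whether the parser applies the lenient attribute rule.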



Source: https://stackoverflow.com/questions/16164835/spec-for-handling-of-html-entities-in-ahref
