When receiving user input on forms, I want to verify that fields like "username" or "address" do not contain markup that has a special meaning in XML (RSS feeds) or
If you're just "looking for protection for print '...' . $name . '...'", then yes, at least the second approach is adequate, since it checks whether the value would be interpreted as markup if it weren't escaped. (In this case, $name would appear in element content, and only the characters &, <, and > have special meaning there.) (For href and similar attributes, a check for "javascript:" may also be necessary, but as you stated in a comment, that isn't a goal.)
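As a rough illustration of the element-content check described above, here is a minimal sketch in Python rather than PHP. The function name contains_element_markup is hypothetical and does not come from the question; the sketch only reports whether a value contains one of the three characters that are special in element content.

```python
# Characters that have special meaning in element content: &, <, and >.
ELEMENT_CONTENT_SPECIAL = frozenset("&<>")


def contains_element_markup(value: str) -> bool:
    """Return True if value contains &, <, or > (hypothetical helper)."""
    return any(ch in ELEMENT_CONTENT_SPECIAL for ch in value)


if __name__ == "__main__":
    print(contains_element_markup("Alice"))        # False
    print(contains_element_markup("Tom & Jerry"))  # True
    print(contains_element_markup("<b>bold</b>"))  # True
```

A check like this only detects the characters; whether to reject the input or escape it on output is a separate decision.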
For official sources, I can refer to the XML specification:
- Content production in section 3.1: Here, content consists of elements, CDATA sections, processing instructions, and comments (which must begin with <), references (which must begin with &), and character data (which contains any other legal character). (Although a leading > is treated as character data in element content, many people escape it along with <, and it's better safe than sorry to treat it as special.)
- Attribute value production in section 2.3: A valid attribute value consists of either references (which must begin with &) or character data (which contains any other legal character, but not < or the quote symbol used to wrap the attribute value). If you need to place string inputs in attributes in addition to element content, the characters " and ' need to be checked in addition to &, <, and possibly > (and other characters illegal in XML).
- Section 2.2: Defines which Unicode code points are legal in XML. In particular, the null character (U+0000) is illegal in an XML document and may not display properly in HTML.
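The checks implied by these sections can be sketched as follows. This is a minimal Python sketch, not a complete validator: the names is_legal_xml_char and find_suspicious_chars are hypothetical, and the code point ranges are those of the Char production in XML 1.0, section 2.2.

```python
from typing import List


def is_legal_xml_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0, section 2.2."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Characters special in element content, plus both quote characters,
# which matter when the value is placed in an attribute.
ATTRIBUTE_SPECIAL = frozenset("&<>\"'")


def find_suspicious_chars(value: str) -> List[str]:
    """Return the distinct characters in value that are special in an
    attribute value or illegal in XML, in order of first appearance."""
    found: List[str] = []
    for ch in value:
        suspicious = ch in ATTRIBUTE_SPECIAL or not is_legal_xml_char(ch)
        if suspicious and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    print(find_suspicious_chars("plain text"))  # []
    print(find_suspicious_chars("a\x00b"))      # ['\x00']
    print(find_suspicious_chars("x < y & z"))   # ['<', '&']
```

Returning the offending characters, rather than a plain boolean, makes it easier to tell the user why a value was flagged.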
HTML5 (the latest working draft, which is a work in progress) describes a very elaborate parsing algorithm for HTML documents:
- In element content, string input should not contain < (which begins a new tag) or & (which begins a character reference).
- In a double-quoted attribute value, string input should not contain " (which ends the attribute value) or & (which begins a character reference).
If string inputs are to be placed in attribute values (unless placing them there is solely for display purposes), there are additional considerations to keep in mind. For example, HTML 4 specifies:
User agents should interpret attribute values as follows:
- Replace character entities with characters,
- Ignore line feeds,
- Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values[.]
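To see why this matters for input checks, the quoted HTML 4 rules can be approximated in a few lines of Python. This is only a sketch under stated assumptions: normalize_html4_attribute is a hypothetical name, the steps are applied in the order listed (the quoted text does not say whether a line feed produced by a character reference should also be ignored), and html.unescape follows HTML5 character reference rules, which may differ in detail from HTML 4.

```python
import html


def normalize_html4_attribute(value: str, strip_whitespace: bool = False) -> str:
    """Approximate the HTML 4 attribute value interpretation quoted above."""
    # 1. Replace character entities with characters (HTML5 rules, an approximation).
    value = html.unescape(value)
    # 2. Ignore line feeds.
    value = value.replace("\n", "")
    # 3. Replace each carriage return or tab with a single space.
    value = value.replace("\r", " ").replace("\t", " ")
    # Optional: user agents may ignore leading and trailing white space.
    if strip_whitespace:
        value = value.strip()
    return value


if __name__ == "__main__":
    # A line feed inside the value disappears after normalization.
    print(normalize_html4_attribute("java\nscript:alert(1)"))  # javascript:alert(1)
    print(normalize_html4_attribute("Tom &amp; Jerry"))        # Tom & Jerry
```

The first example shows that a check for a prefix such as "javascript:" is more reliable when run on the normalized value than on the raw input.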
Attribute value normalization is also specified in the XML specification, but apparently not in HTML5.
EDIT (Apr. 25, 2019): Also, be suspicious of inputs containing—
...assuming htmlspecialchars doesn't escape those code points already.