Question
Why do I get:
Nokogiri::HTML('<a href="/test_$4b.html">test</a>').to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"/test_%244b.html\">test</a></body></html>\n"
I thought the $ symbol was valid in a URL?
Followup:
Why do browsers handle this differently? For example, on the page: http://www.pmlive.com/pharma_news/its_on_shire_and_abbvie_agree_32bn_takeover_586969
The link: http://www.pmlive.com/pharma_news/mylan_buys_abbotts_non-us_generics_in_$5.3bn_deal_585883 works.
But Nokogiri would output this link as: http://www.pmlive.com/pharma_news/mylan_buys_abbotts_non-us_generics_in_%245.3bn_deal_585883 which does not work (returns 404).
Are browsers deciding that $ is actually safe to leave unescaped, and that this is the better choice?
Answer 1:
RFC 3986 lists the dollar sign as a reserved sub-delimiter (page 12):
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
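To make the grammar above concrete, here is a minimal Ruby sketch that classifies a single character. The `gen-delims` and `sub-delims` sets are copied from the excerpt. The `unreserved` set (letters, digits, "-", ".", "_", "~") comes from section 2.3 of the same RFC and is not part of the quoted text. The name `classify` is hypothetical, chosen only for illustration.

```ruby
# Character sets from the RFC 3986 grammar quoted above.
GEN_DELIMS = ':/?#[]@'.chars.freeze
SUB_DELIMS = "!$&'()*+,;=".chars.freeze

# Hypothetical helper: report which RFC 3986 class a single character is in.
def classify(char)
  return :gen_delim if GEN_DELIMS.include?(char)
  return :sub_delim if SUB_DELIMS.include?(char)
  # Unreserved set per RFC 3986 section 2.3 (not in the excerpt above).
  return :unreserved if char.match?(/\A[A-Za-z0-9\-._~]\z/)

  :other
end

puts classify('$') # sub_delim
puts classify('/') # gen_delim
puts classify('a') # unreserved
```

So `$` is reserved, but only as a possible delimiter. Whether it actually needs escaping depends on the component it appears in.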
It also recommends how reserved characters should be handled:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
The authors of Nokogiri likely decided that, since their library may be used by anyone for any purpose, there is no way to determine automatically whether a reserved character would conflict with a delimiter. The "safest" way to handle it (short of testing the URI directly) is therefore to escape it, as the RFC recommends.
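The conservative strategy described above can be sketched in plain Ruby with no dependencies. This is an illustration of the idea only, not Nokogiri's actual implementation. The helper names `escape_sub_delims` and `percent_decode` are hypothetical.

```ruby
# Sub-delims from the RFC 3986 grammar quoted earlier.
RESERVED_SUB_DELIMS = "!$&'()*+,;=".chars.freeze

# Hypothetical helper: percent-encode every sub-delim, on the assumption
# that any of them might conflict with a delimiter.
def escape_sub_delims(path)
  path.each_char.map do |ch|
    RESERVED_SUB_DELIMS.include?(ch) ? format('%%%02X', ch.ord) : ch
  end.join
end

# Hypothetical helper: turn %XX sequences back into characters.
def percent_decode(str)
  str.gsub(/%(\h{2})/) { Regexp.last_match(1).hex.chr }
end

escaped = escape_sub_delims('/test_$4b.html')
puts escaped                 # /test_%244b.html
puts percent_decode(escaped) # /test_$4b.html
```

Decoding recovers the original path, so a server that percent-decodes before routing treats both forms alike. The 404 in the follow-up question suggests that this particular server does not.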
Source: https://stackoverflow.com/questions/24878109/what-is-nokogiri-encoding-character