Nokogiri leaving HTML entities untouched

一个人想着一个人 提交于 2019-12-04 00:49:52

Not an ideal answer, but you can force it to generate entities (if not nice names) by setting the allowed encoding:

#encoding: UTF-8
require 'nokogiri'
html = Nokogiri::HTML.fragment('<p>&reg;</p>')
puts html.to_html                              #=> <p>®</p>
puts html.to_html( encoding:'US-ASCII' )       #=> <p>&#174;</p>

It would be nice if Nokogiri used 'nice' names of entities where defined, instead of always using the terse hexadecimal entity, but even that wouldn't be 'preserving' the original.

The root of the problem is that, in HTML, the following all describe the exact same content:

<p>®</p>
<p>&reg;</p>
<p>&#xAE;</p>  
<p>&#174;</p>

If you wanted the to_s representation of a text node to be actually &reg; then the markup describing that would really be: <p>&amp;reg;</p>.

If Nokogiri was to always return the same encoding per character as was used to enter the document it would need to store each character as a custom node recording the entity reference. There exists a class that might be used for this (Nokogiri::XML::EntityReference):

require 'nokogiri'
html = Nokogiri::HTML.fragment("<p>Foo</p>")
html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
puts html
#=> <p>Foo&reg;</p>

However, I can't find a way to cause these to be created during parsing using Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT during parsing does not appear to cause one to be created:

require 'nokogiri'
html = "<p>Foo&reg;</p>"
[ Nokogiri::XML::ParseOptions::NOENT,
  Nokogiri::XML::ParseOptions::DEFAULT_HTML,
  Nokogiri::XML::ParseOptions::DEFAULT_XML,
  Nokogiri::XML::ParseOptions::STRICT
].each do |parse_option|
  p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
end
#=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!