Nokogiri producing different results on Heroku?

Submitted by 南楼画角 on 2019-12-11 02:51:32

Question


I'm having a very strange problem and I'd appreciate help tracking it down.

I'm using the nokogiri gem to parse some HTML, and the file I'm parsing has a weird character in it. I'm not entirely sure what this character is; in vim it shows as ^Q.

On my own computer everything works fine. On Heroku, however, the parser inserts a </body></html><html> when it hits the character, and selectors only return the elements before the weird character.

To illustrate: Nokogiri::HTML(open("http://thoms.net.nz/e2.html")).css("body div").count is 1 on Heroku and 2 on my computer. The file containing this character can be downloaded from http://thoms.net.nz/e2.html.

Both my computer and Heroku are running nokogiri 1.5.5 with Ruby 1.9.3.


Answer 1:


The ^Q is the XON software flow-control character (DC1, U+0011), which isn't supposed to appear in HTML. I suspect its unexpected presence is confusing both Nokogiri and Heroku, but in different ways.

HTML documents from the wilds of the internet can be corrupted in any number of ways. I've seen all sorts of garbage in them, and if I couldn't make sense of it using iconv or a Unicode transliteration, I'd resort to a quick global search and replace to remove anything not in the normal ASCII range before further processing.


In Ruby, global search and replace uses String#gsub.

doc = Nokogiri::HTML(html.gsub("\u0011", ''))
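If the input might contain other stray control characters besides ^Q, the same idea can be widened to a character class. The following is only a sketch: strip_control_chars is a hypothetical helper name (not part of Nokogiri), the sample string is made up, and it assumes the input is a validly encoded string and that tab, line feed and carriage return should be kept.

```ruby
# Hypothetical helper, not part of Nokogiri: removes the C0 control
# characters (0x00-0x1F) except tab, line feed and carriage return,
# and also removes DEL (0x7F).
def strip_control_chars(html)
  html.gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/, '')
end

# Made-up sample input with an XON (^Q, U+0011) between two divs.
dirty = "<div>one</div>\u0011<div>two</div>\n"
clean = strip_control_chars(dirty)

puts clean.inspect
```

The cleaned string can then be passed to Nokogiri::HTML in the same way as the gsub result in the answer's one-liner.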


Source: https://stackoverflow.com/questions/12085250/nokogiri-producing-different-results-on-heroku
