Open iso-8859-1 encoded html with nokogiri messes up accents

杀马特。学长 韩版系。学妹 posted on 2019-12-11 08:02:34

Question


I'm trying to make some changes to an HTML page encoded with charset=iso-8859-1. I load it like this:

doc = Nokogiri::HTML(open(html_file))

Calling puts doc.to_html then shows every accented character in the page garbled, so if I save the document back to disk it looks broken in the browser as well.

I'm still on Rails 3.0.6... Any hints how to fix this problem?

Here's one of the pages suffering from that for example: http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html

I've also asked on GitHub, but I have the feeling this will be faster. I'll update both places if I find a cure for the problem.

UPDATE 1 24 March 2012

Thanks for the comments. I managed to partially solve this issue. I believe it has nothing to do with Nokogiri, however: as I mentioned in a comment, just opening and saving the file is enough to get the accents messed up.

The closest to a fix I got is doing this:

thefile = File.open(html_file, "r")
text = thefile.read
thefile.close
doc = Nokogiri::HTML(text)
# ... do any stuff with nokogiri
File.open(html_file, 'w') {|f| f.write(doc.to_html) }

The original file was ISO-8859-1 and the saved one comes out as UTF-8, and it looks pretty much OK: the accents are in place, except for the accents on capital letters :-P There I get question marks, as in Econom�a, where there should be an í (i with an accent).

Getting closer I think. If someone has a hint to cover the capital letters as well it might be almost done.

Cheers.


Answer 1:


The method you used to download the file may have changed the encoding, breaking the accents in the file. Try this to see it working correctly:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = 'http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html'
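# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs;
# on older Rubies (before 2.5) use open(url) as in the original answer.
doc = Nokogiri::HTML(URI.open(url))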
File.open("1331108705.html", "w") {|f| f.write(doc.to_html)}
system('open', '1331108705.html') # on Mac OS X, this will open the html file in your browser

How did you download the file?
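To make the encoding question concrete, here is a minimal sketch that uses only the Ruby standard library, not Nokogiri. It assumes the file's bytes really are ISO-8859-1. The helper names latin1_to_utf8 and convert_file_to_utf8 are made up for illustration. The sketch shows why bytes that are labelled UTF-8 without being transcoded show up as broken characters, and how an explicit transcode avoids that.

```ruby
require 'tmpdir'

# Hypothetical helper: treat raw bytes as ISO-8859-1 and transcode them to UTF-8.
def latin1_to_utf8(raw_bytes)
  raw_bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Hypothetical helper: read src_path as ISO-8859-1 bytes and write dst_path as UTF-8.
# It does not touch any charset declaration inside the HTML itself.
def convert_file_to_utf8(src_path, dst_path)
  utf8 = latin1_to_utf8(File.binread(src_path))
  File.open(dst_path, 'w:UTF-8') { |f| f.write(utf8) }
  utf8
end

# "Economía" as ISO-8859-1 bytes: í is the single byte 0xED.
latin1_bytes = "Econom\xEDa".b

# Labelling those bytes as UTF-8 without transcoding gives an invalid string,
# which is the kind of input that ends up displayed as a replacement character.
mislabeled = latin1_bytes.dup.force_encoding('UTF-8')
puts mislabeled.valid_encoding?

# Transcoding instead of relabelling keeps the accent intact.
converted = latin1_to_utf8(latin1_bytes)
puts converted
puts converted.valid_encoding?

# Round trip through files in a temporary directory.
Dir.mktmpdir do |dir|
  src = File.join(dir, 'latin1.html')
  dst = File.join(dir, 'utf8.html')
  File.binwrite(src, latin1_bytes)
  convert_file_to_utf8(src, dst)
  puts File.binread(dst).bytesize
end
```

With Nokogiri, the equivalent idea would be to state the source encoding explicitly rather than relying on detection, for example Nokogiri::HTML(text, nil, 'ISO-8859-1'), where the third argument is the encoding. Whether that resolves the capital-letter case in this particular page is untested here.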



Source: https://stackoverflow.com/questions/9741023/open-iso-8859-1-encoded-html-with-nokogiri-messes-up-accents
