I am using open-uri to read a webpage which claims to be encoded in iso-8859-1. When I read the contents of the page, open-uri returns a string encoded in ASCII-8BIT.
open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") {|f| p f.content_type, f.charset, f.read.encoding }
=> ["text/html", "iso-8859-1", #<Encoding:ASCII-8BIT>]
I am guessing this is because the webpage contains the byte (or character) \x92, which is not a valid ISO-8859-1 character (see http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
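For example (an irb sketch of my guess; the byte mappings are the point here):

"\x92".force_encoding('ISO-8859-1').encode('UTF-8')   # => "\u0092", an invisible C1 control
"\x92".force_encoding('Windows-1252').encode('UTF-8') # => "\u2019", a right single quote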
I need to store webpages as UTF-8 encoded files. Any ideas on how to deal with webpages where the declared encoding is incorrect? I could catch the exception and try to guess the correct encoding, but that seems cumbersome and error-prone.
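To be concrete, this is the failure I mean (sketch; since the string is tagged ASCII-8BIT, bytes above 0x7F have no UTF-8 mapping):

body = open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") { |f| f.read }
body.encoding        # => #<Encoding:ASCII-8BIT>
body.encode('UTF-8') # raises Encoding::UndefinedConversionError on "\x92"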
ASCII-8BIT is an alias for BINARY. open-uri does a funny thing: if the file is less than 10kb (or something like that), it returns a StringIO, and if it's bigger it returns a Tempfile. That can be confusing if you're trying to deal with encoding issues.
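In practice that mostly matters if you hang on to the handle; reading inside the block gives you a plain String either way. A quick sketch (same URL as in the question):

require 'open-uri'
# #read works whether open-uri buffered to a StringIO or a Tempfile
body = open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") { |f| f.read }
body.encoding # => #<Encoding:ASCII-8BIT>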
If the files aren't huge, I would recommend manually loading them into strings:
require 'uri'
require 'net/http'
require 'net/https'

uri = URI.parse url_to_file
http = Net::HTTP.new(uri.host, uri.port)
if uri.scheme == 'https'
  http.use_ssl = true
  # possibly useful if you see ssl errors
  # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
end
body = http.start { |session| session.get uri.request_uri }.body
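If you also want the charset the server declared (like open-uri's f.charset), it's still in the response headers; something like this (sketch, the regexp is just illustrative):

response = http.start { |session| session.get uri.request_uri }
declared = response['Content-Type'].to_s[/charset=([\w-]+)/i, 1] # => "iso-8859-1", or nil if absent
body     = response.body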
Then you can use the ensure-encoding gem (https://rubygems.org/gems/ensure-encoding):
require 'ensure/encoding'
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)
I have been pretty happy with ensure-encoding... we use it in production at http://data.brighterplanet.com
Note that you can also say :invalid_characters => :ignore instead of :transcode.
Also, if you know the encoding somehow, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff.
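Putting it together for your goal of storing pages as UTF-8 files, the whole pipeline is only a few lines (sketch; 'page.html' is just an illustrative filename):

require 'ensure/encoding'

utf8_body = body.ensure_encoding('UTF-8',
                                 :external_encoding  => :sniff,
                                 :invalid_characters => :transcode)
File.open('page.html', 'w:UTF-8') { |f| f.write utf8_body }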
Source: https://stackoverflow.com/questions/5712096/open-uri-returning-ascii-8bit-from-webpage-encoded-in-iso-8859