How to get Mechanize to auto-convert body to UTF8?

余生长醉 submitted on 2019-12-05 12:17:58
nagoya0

Since Mechanize 2.0, the arguments passed to pre_connect_hooks() and post_connect_hooks() have changed.

See the Mechanize documentation:

pre_connect_hooks()

A list of hooks to call before retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

 

post_connect_hooks()

A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

You can no longer change the internal response body from a hook, because the argument is not an array. The next best approach is to replace the internal parser with your own:

require 'mechanize' # also loads Nokogiri
require 'nkf'       # only needed for the NKF example below

class MyParser
  def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
    # Insert your conversion code here. For example:
    # thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/, "utf-8") # rewrite the declared charset if one exists
    Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
  end
end

agent = Mechanize.new
agent.html_parser = MyParser
page = agent.get('http://somewhere.com/')
...
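As a rough illustration of the "insert your conversion code here" step, the conversion can also be done with Ruby's built-in String#encode instead of NKF. This is only a sketch: to_utf8_html is a hypothetical helper name, not part of Mechanize or Nokogiri, and the source encoding is assumed to be known in advance.

```ruby
# Hypothetical helper (not a Mechanize or Nokogiri API).
# Transcodes a raw HTML body to UTF-8 and updates the declared charset to match.
def to_utf8_html(body, source_encoding)
  utf8 = body.dup.force_encoding(source_encoding)
             .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
  # Rewrite the first charset declaration so it agrees with the new encoding.
  utf8.sub(/charset=(["']?)[\w-]+/i) { "charset=#{$1}utf-8" }
end

# Sample Shift_JIS input, built here only for demonstration.
sjis_html = '<meta charset="Shift_JIS"><p>日本語</p>'.encode('Shift_JIS')
converted = to_utf8_html(sjis_html, 'Shift_JIS')
puts converted.encoding   # UTF-8
puts converted
```

Inside MyParser.parse, the result of such a helper would take the place of thing before it is handed to Nokogiri.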
Dmitry Polushkin

I found a solution that works pretty well:

class HtmlParser
  def self.parse(body, url, encoding)
    body.encode!('UTF-8', encoding, invalid: :replace, undef: :replace, replace: '')
    Nokogiri::HTML::Document.parse(body, url, 'UTF-8')
  end
end

Mechanize.new.tap do |web|
  web.html_parser = HtmlParser
end

I have not run into any issues with it so far.
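To see what the encode options in that parser do, here is a small standalone sketch that needs neither Mechanize nor Nokogiri. The sample bytes and the Windows-1252 source encoding are assumptions picked for the demonstration.

```ruby
# Windows-1252 bytes: 0xE9 is "é"; 0x81 has no character assigned in that code page.
raw = "caf\xE9 \x81!".b

# A strict conversion raises on the unassigned byte.
strict_failed = false
begin
  raw.encode('UTF-8', 'Windows-1252')
rescue Encoding::UndefinedConversionError => e
  strict_failed = true
  puts "strict conversion failed: #{e.class}"
end

# With the options from the answer, the offending byte is dropped
# and the rest of the body is converted.
clean = raw.encode('UTF-8', 'Windows-1252', invalid: :replace, undef: :replace, replace: '')
puts clean   # café !
```

Note that replace: '' discards bad bytes silently. If you would rather see where data was lost, pass a visible marker such as '?' instead.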

How about something like this:

class Mechanize
  alias_method :original_get, :get

  def get(*args)
    doc = original_get(*args)
    # Non-HTML responses may not support encoding=, so check first.
    doc.encoding = 'utf-8' if doc.respond_to?(:encoding=)
    doc
  end
end

In your script, just enter: page.encoding = 'utf-8'

However, depending on your scenario, you may need to do the reverse and set the encoding the website actually uses. To find it, open the website in Firefox, select Tools in the menu bar, and open Page Info. The page's encoding is listed there.

Using that information, set the encoding the page is really in (for example, page.encoding = 'windows-1252').
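Why the source encoding matters can be shown in plain Ruby, without Mechanize. In this sketch the sample bytes are an assumption standing in for what a Windows-1252 server might send.

```ruby
# Bytes for "café" as a Windows-1252 page would contain them (0xE9 is "é").
raw = "caf\xE9".b

# Labeled as UTF-8, the byte sequence is not valid text.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?   # false

# Labeled with the real source encoding, it decodes cleanly.
decoded = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
puts decoded                   # café
```

The same reasoning applies to page.encoding: the value should name the encoding the bytes are actually in, not the encoding you would like to end up with.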
