How to get Mechanize to auto-convert body to UTF8?

Submitted by 送分小仙女 on 2019-12-07 11:10:48

Question


I found some solutions that use post_connect_hooks and pre_connect_hooks, but they don't seem to work. I'm using the latest Mechanize version (2.1). The new version no longer has the [:response] fields, and I don't know where to find their replacement.

  • https://gist.github.com/search?q=pre_connect_hooks
  • https://gist.github.com/search?q=post_connect_hooks

Is it possible to make Mechanize return a UTF-8 encoded body, instead of having to convert it manually with iconv?


Answer 1:


Since Mechanize 2.0, the arguments passed to the pre_connect_hooks() and post_connect_hooks() callbacks have changed.

See the Mechanize documentation:

pre_connect_hooks()

A list of hooks to call before retrieving a response. Hooks are called with the agent and the request to be performed.

post_connect_hooks()

A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

As a result, you can no longer change the internal response body from a hook: the body is passed as a plain argument, not inside a mutable container. The next best option is to replace the internal parser with your own:

require 'mechanize'
# require 'nkf' # uncomment if you use the NKF example below

class MyParser
  # Mechanize calls html_parser.parse(body, url, encoding).
  def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
    # Insert your conversion code here. For example, for Shift_JIS pages:
    # thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/, "utf-8") # also rewrite the declared charset, if present
    Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
  end
end

agent = Mechanize.new
agent.html_parser = MyParser
page = agent.get('http://somewhere.com/')
# ...
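To see why a hook cannot alter the body, here is a minimal plain-Ruby sketch. It does not use Mechanize: the lambda is a stand-in that follows the four-argument order quoted above, and it assumes the hook is handed a freshly read copy of the body. The URL and the `body_io` name are illustrative only.

```ruby
require 'stringio'

# Stand-in for a post-connect hook (agent, uri, response, body).
# This is NOT Mechanize itself, only an illustration of Ruby argument semantics.
hook = lambda do |agent, uri, response, body|
  body = body.upcase # rebinds the block-local variable only
end

# Assumption: the caller keeps its own copy of the body and passes a fresh string.
body_io = StringIO.new('hello')
hook.call(nil, 'http://example.com/', nil, body_io.read)

body_io.rewind
original = body_io.read
puts original # the caller's body is unchanged
```

Because the caller never looks at what the hook did to its argument, any conversion has to happen somewhere the result is actually used, such as a custom parser.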



Answer 2:


I found a solution that works pretty well:

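require 'mechanize'

class HtmlParser
  def self.parse(body, url = nil, encoding = nil)
    # Fall back to the string's own encoding if Mechanize passes none.
    source = encoding || body.encoding
    # Transcode to UTF-8, dropping invalid or unmappable bytes.
    # encode (not encode!) leaves the caller's string untouched.
    utf8 = body.encode('UTF-8', source, invalid: :replace, undef: :replace, replace: '')
    Nokogiri::HTML::Document.parse(utf8, url, 'UTF-8')
  end
end

# Keep a reference to the configured agent.
agent = Mechanize.new.tap do |web|
  web.html_parser = HtmlParser
end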

I haven't run into any issues with it so far.
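The conversion step in that parser can be tried on its own, without Mechanize or Nokogiri. The following is a small sketch using only core Ruby. `to_utf8` is a hypothetical helper name, and the sample bytes are made up for illustration.

```ruby
# Hypothetical helper (not part of Mechanize): transcode bytes from a known
# source encoding to UTF-8, silently dropping anything that cannot be converted.
def to_utf8(body, source_encoding)
  body.encode('UTF-8', source_encoding, invalid: :replace, undef: :replace, replace: '')
end

raw = "caf\xE9".b                        # "café" as Windows-1252 bytes
converted = to_utf8(raw, 'Windows-1252')
puts converted.encoding                  # UTF-8
puts converted.valid_encoding?           # true

# A byte that is invalid in the declared source encoding is dropped.
cleaned = to_utf8("a\xFFb".b, 'US-ASCII')
puts cleaned
```

Note that `replace: ''` discards bad bytes instead of marking them, so data loss is silent. Passing a visible replacement such as `'?'` makes problems easier to spot.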




Answer 3:


How about something like this:

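require 'mechanize'

class Mechanize
  alias_method :original_get, :get

  # Wrap #get so every HTML page is treated as UTF-8.
  def get(*args, &block)
    doc = original_get(*args, &block)
    # Non-HTML responses (e.g. Mechanize::File) have no encoding= setter.
    doc.encoding = 'utf-8' if doc.respond_to?(:encoding=)
    doc
  end
end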



Answer 4:


In your script, just set page.encoding = 'utf-8'.

However, depending on your scenario, you may need to set the opposite: the encoding the website actually uses. To find it, open the site in Firefox, choose Tools in the menu bar, and open Page Info, which shows the page's encoding.

Then set that encoding instead (for example, page.encoding = 'windows-1252').
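Why the source encoding matters can be shown with plain Ruby strings. This is only an analogy for what declaring a page encoding does, not a description of Mechanize internals. The sample bytes are invented for illustration.

```ruby
raw = "caf\xE9".b # bytes of "café" as served by a Windows-1252 site

# Labelling the bytes as UTF-8 does not convert them; the result is invalid.
mislabelled = raw.dup.force_encoding('UTF-8')
puts mislabelled.valid_encoding? # false

# Declaring the real source encoding first lets Ruby transcode correctly.
converted = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
puts converted.valid_encoding?   # true
puts converted
```

So if a page shows garbled characters after setting 'utf-8', the page is probably not UTF-8 to begin with, and the site's real encoding is the value to set.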



Source: https://stackoverflow.com/questions/8864493/how-to-get-mechanize-to-auto-convert-body-to-utf8
