Nokogiri, open-uri, and Unicode Characters

故里飘歌 2020-11-30 01:56

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these?

8 answers
  •  伪装坚强ぢ
    2020-11-30 02:26

    Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read (URI.open(...).read on Ruby 3.0 and later, where Kernel#open no longer accepts URLs) and pass the resulting string to Nokogiri.

    Analysis: If I fetch the page using curl, the headers correctly show Content-Type: text/html; charset=UTF-8 and the body contains valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment in the Ruby file and the document encoding set explicitly, the text comes out garbled:

    # encoding: UTF-8
    require 'nokogiri'
    require 'open-uri'
    
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
    doc = Nokogiri::HTML(URI.open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
    doc.encoding = 'utf-8'
    h52 = doc.css('h5')[1]
    puts h52.text, h52.text.encoding
    #=> Genealogà a de Jesucristo
    #=> UTF-8
    

    We can see that this is not the fault of open-uri:

    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
    html = URI.open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
    gene = html.read[/Gene\S+/]
    puts gene, gene.encoding
    #=> Genealogía
    #=> UTF-8
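
    The garbled text is consistent with the page's UTF-8 bytes being decoded as a single-byte encoding such as ISO-8859-1 at some point before the text reaches the caller. That is an assumption on my part, not something confirmed here. The sketch below reproduces the effect with the standard library only; `misread_as_latin1` and `undo_latin1_misread` are hypothetical helper names made up for the illustration.

```ruby
# Hypothetical helpers for illustration; they are not part of Nokogiri or open-uri.

# Simulate a parser that wrongly treats UTF-8 bytes as ISO-8859-1.
def misread_as_latin1(utf8_text)
  utf8_text.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Undo the damage: map each character back to a single byte, then relabel as UTF-8.
def undo_latin1_misread(garbled_text)
  garbled_text.encode('ISO-8859-1').force_encoding('UTF-8')
end

original = "Genealog\u00EDa"   # "Genealogía"
garbled  = misread_as_latin1(original)

puts garbled.encoding                          # the label is still UTF-8
puts garbled == original                       # the content no longer matches
puts garbled.length - original.length          # the two-byte "í" became two characters
puts undo_latin1_misread(garbled) == original  # reversible while no bytes were lost
```

    This would also explain why the encoding reported for the garbled string is still UTF-8: the label is correct, but the characters were already wrong by the time they were re-encoded.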
    

    So the problem appears to lie in how Nokogiri handles the object returned by open-uri. It can be worked around by reading the response first and passing the HTML to Nokogiri as a string:

    # encoding: UTF-8
    require 'nokogiri'
    require 'open-uri'
    
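    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
    html = URI.open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
    # Read the body into a String before handing it to Nokogiri
    doc = Nokogiri::HTML(html.read)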
    doc.encoding = 'utf-8'
    h52 = doc.css('h5')[1].text
    puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
    #=> Genealogía de Jesucristo
    #=> UTF-8
    #=> true
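
    If the same workaround is needed in several places, the "read first, then parse" step could be wrapped in a small helper. This is a minimal sketch under assumptions, not part of the original workaround: `read_as_utf8` is a hypothetical name, a StringIO stands in for the open-uri response so the snippet runs without network access, and the charset is taken from an explicit argument or from the `charset` method that open-uri responses provide.

```ruby
require 'stringio'

# Hypothetical helper: read an IO-like object fully and return a UTF-8 String.
# `declared_charset` stands in for the charset from the Content-Type header;
# when omitted, io.charset is used if available, otherwise UTF-8 is assumed.
def read_as_utf8(io, declared_charset = nil)
  charset = declared_charset ||
            (io.respond_to?(:charset) ? io.charset : nil) ||
            'UTF-8'
  body = io.read.dup.force_encoding(charset) # label the raw bytes correctly
  body.encode('UTF-8')                       # transcode only if the label differs
end

# Stand-in for a response body served as ISO-8859-1.
latin1_bytes = "Genealog\u00EDa de Jesucristo".encode('ISO-8859-1')
text = read_as_utf8(StringIO.new(latin1_bytes), 'ISO-8859-1')
puts text.encoding
puts text == "Genealog\u00EDa de Jesucristo"
```

    The returned string can then be passed to Nokogiri::HTML in place of html.read.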
    
