Python lxml can't parse Japanese in some cases [duplicate]

Posted by ⅰ亾dé卋堺 on 2020-03-23 10:44:11

Question


I am using lxml 4.5.0 to scrape data from a website.

It works well in the following example:

import requests
from io import StringIO
from lxml import etree

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://www.yahoo.co.jp')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(resp.text), parser)
    result = tree.xpath('//*[@id="tabTopics1"]/a')[0]

result.text

Here result.text gives me the right text, 'ニュース'.

But when I try another site, it fails to parse the Japanese properly:

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(resp.text), parser)
    result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]

result.text

Here result.text gives me 'å\x9b½å\x86\x85æ\x97\x85è¡\x8c', but it should be '国内旅行'.

I tried parser = etree.HTMLParser(encoding='utf-8'), but it still does not work.

How can I make lxml parse Japanese properly in this case?


Answer 1:


Using

print(resp.encoding)

you can see that requests used ISO-8859-1 to convert resp.content to resp.text.
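The ISO-8859-1 value most likely comes from a fallback. When the Content-Type header is a text type with no charset parameter, requests does not guess and uses ISO-8859-1, as its documentation describes. The sketch below is a simplified approximation of that rule using only the standard library. The function name header_charset is hypothetical and this is not requests' actual code.

```python
from email.message import Message


def header_charset(content_type):
    """Return the charset implied by a Content-Type header value.

    Simplified approximation of the header-only rule: an explicit
    charset wins, a text/* type without one falls back to ISO-8859-1,
    and anything else gives no answer.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_param('charset')
    if charset:
        return charset
    if msg.get_content_maintype() == 'text':
        return 'ISO-8859-1'
    return None


print(header_charset('text/html; charset=UTF-8'))
print(header_charset('text/html'))
```

If the server had sent a charset in the header, resp.encoding would have reflected it and resp.text would have been decoded correctly.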

Instead, you can take resp.content directly and decode it with a different encoding:

StringIO( resp.content.decode('utf-8') )
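To see why the wrong encoding produces exactly the garbage from the question, here is a small self-contained sketch that needs no network access. It encodes the expected string as UTF-8, decodes those bytes as ISO-8859-1 (which is what happened to resp.text), and then reverses the damage. The variable names are only illustrative.

```python
original = '国内旅行'

# The bytes the server actually sent (UTF-8).
raw_bytes = original.encode('utf-8')

# Decoding UTF-8 bytes as ISO-8859-1 maps every byte to one character,
# which yields the mojibake seen in the question.
mojibake = raw_bytes.decode('iso-8859-1')
print(repr(mojibake))

# ISO-8859-1 is a one-to-one byte mapping, so the original bytes can be
# recovered and decoded again with the right codec.
repaired = mojibake.encode('iso-8859-1').decode('utf-8')
print(repaired)
```

Decoding resp.content with the correct codec up front avoids the round trip entirely.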

With the chardet module you can try to detect which encoding to use:

print( chardet.detect(resp.content) )

Result

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

import requests
from lxml import etree
from io import StringIO
import chardet

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')

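    # Encoding requests chose from the HTTP headers (ISO-8859-1 here)
    print(resp.encoding)
    # Guess the real encoding from the raw bytes
    detected = chardet.detect(resp.content)
    print(detected)
    detected_encoding = detected['encoding']

    parser = etree.HTMLParser()
    # Decode the raw bytes ourselves instead of relying on resp.text
    #tree = etree.parse(StringIO(resp.content.decode('utf-8')), parser)
    tree = etree.parse(StringIO(resp.content.decode(detected_encoding)), parser)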
    result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]

result.text
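This also suggests why HTMLParser(encoding='utf-8') did not help in the question. The parser received resp.text, a str that had already been decoded incorrectly, so there were no bytes left for the parser to decode. A possible alternative, sketched here with an inline sample document instead of a live request, is to hand lxml the raw bytes together with the encoding. The sample HTML is an assumption that only mimics the XPath from the question. With a real response, resp.content would take the place of html_bytes.

```python
from io import BytesIO

from lxml import etree

# Stand-in for resp.content: UTF-8 encoded bytes (sample markup, not the real page).
html_bytes = (
    '<html><body><ul id="rt-nav-box">'
    '<li><a href="#">国内旅行</a></li>'
    '</ul></body></html>'
).encode('utf-8')

# The encoding argument tells the parser how to decode byte input.
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(BytesIO(html_bytes), parser)

link_text = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0].text
print(link_text)
```

This only works if the page really is UTF-8, as chardet reported above. For pages in other encodings the argument has to match.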

EDIT: As @usr2564301 found, an answer to the question

python requests.get() returns improperly decoded text instead of UTF-8?

shows that this can also be resolved with

resp.encoding = resp.apparent_encoding

which uses chardet to detect the encoding from the response body. Set it before reading resp.text.



Source: https://stackoverflow.com/questions/60505000/python-lxml-cant-parse-japanese-in-some-case
