Question
I am using lxml 4.5.0 to scrape data from websites.
It works well in the following example:
import requests
from lxml import etree
from io import StringIO

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://www.yahoo.co.jp')

parser = etree.HTMLParser()
tree = etree.parse(StringIO(resp.text), parser)
result = tree.xpath('//*[@id="tabTopics1"]/a')[0]
result.text
Here result.text gives me the correct text 'ニュース'.
But when I try another site, it fails to parse the Japanese properly:
chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')

parser = etree.HTMLParser()
tree = etree.parse(StringIO(resp.text), parser)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
result.text
This time result.text gives me 'å\x9b½å\x86\x85æ\x97\x85è¡\x8c', but it should be '国内旅行'.
I tried parser = etree.HTMLParser(encoding='utf-8'), but it still does not work.
How can I make lxml parse Japanese properly in this case?
Answer 1:
Using print(resp.encoding), you can see that requests used ISO-8859-1 to convert resp.content to resp.text.
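This decoding mismatch reproduces the garbled text exactly: the server sends UTF-8 bytes, and decoding them as ISO-8859-1 yields the mojibake from the question. A minimal offline sketch:

```python
# The Japanese string from the question, encoded as the server sends it (UTF-8).
utf8_bytes = '国内旅行'.encode('utf-8')

# Decoding those UTF-8 bytes as ISO-8859-1 (requests' fallback when the
# Content-Type header declares no charset) produces the garbled string.
mojibake = utf8_bytes.decode('iso-8859-1')   # 'å\x9b½å\x86\x85æ\x97\x85è¡\x8c'

# Reversing the mistaken decode recovers the original text.
restored = mojibake.encode('iso-8859-1').decode('utf-8')   # '国内旅行'
```

This also shows why HTMLParser(encoding='utf-8') could not help in the question's code: by the time resp.text is wrapped in StringIO, the bytes have already been decoded with the wrong codec.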
But you can take resp.content directly and decode it with a different encoding:
StringIO(resp.content.decode('utf-8'))
Using the chardet module, you can try to detect which encoding you should use:
print(chardet.detect(resp.content))
Result:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
import requests
from lxml import etree
from io import StringIO
import chardet

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')

print(resp.encoding)
print(chardet.detect(resp.content))

detected_encoding = chardet.detect(resp.content)['encoding']

parser = etree.HTMLParser()
#tree = etree.parse(StringIO(resp.content.decode('utf-8')), parser)
tree = etree.parse(StringIO(resp.content.decode(detected_encoding)), parser)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
result.text
EDIT: As @usr2564301 found in the answer to
python requests.get() returns improperly decoded text instead of UTF-8?
it can be resolved with
resp.encoding = resp.apparent_encoding
which uses chardet to recognize the encoding.
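A further sketch, not part of the original answer: lxml can also be fed the raw resp.content bytes instead of a decoded string, in which case the HTML parser reads the page's own <meta charset> declaration and no manual decoding is needed. Illustrated here with an inline document standing in for resp.content from a live request:

```python
from lxml import etree

# A small HTML document that declares its own encoding, standing in
# for resp.content fetched from a live site.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><a id="link">国内旅行</a></body></html>'
).encode('utf-8')

# Passing bytes (not a decoded str) lets libxml2 honor the meta charset.
tree = etree.fromstring(html_bytes, etree.HTMLParser())
print(tree.xpath('//a[@id="link"]')[0].text)  # 国内旅行
```

This sidesteps the requests decoding step entirely, which is convenient when the HTTP headers and the page's declared charset disagree.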
Source: https://stackoverflow.com/questions/60505000/python-lxml-cant-parse-japanese-in-some-case