Parsing XML and HTML pages with lxml and requests in Python

Posted by 久未见 on 2019-12-13 14:13:45

Question


I have been trying to parse XML and HTML pages using the lxml and requests packages in Python. I am using the following code for this purpose:

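import requests
from lxml import html  # html.fromstring lives in lxml.html, not lxml.etree

url = ""  # page URL (left blank in the original)
req = requests.get(url)
# req.content is the raw, undecoded response body (bytes)
tree = html.fromstring(req.content)
root = tree.xpath('')  # XPath expression (left blank in the original)
for item in root:
    print(item.text)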

This code works fine, but for some web pages the content is not displayed properly and the encoding needs to be set to UTF-8. I don't know how to set the encoding in this code.


Answer 1:


requests automatically decodes the content received from the server when you access r.text.

Important to understand:

r.content - the raw response body as bytes, not yet decoded

r.encoding - the encoding requests uses to decode the response body when you access r.text

r.text - according to the official docs, the already decoded version of r.content
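
To make the relationship between these three attributes concrete, here is a minimal sketch that uses plain bytes instead of a real HTTP response, so it runs without a network. The names raw_body, wrong_text and right_text are illustrative stand-ins for r.content and for r.text under two different values of r.encoding; they are not part of requests.

```python
# Stand-in for r.content: raw bytes as a server might send them (UTF-8 here).
raw_body = "<p>café, 你好</p>".encode("utf-8")

# Stand-in for r.text when r.encoding is wrong: the bytes decode without
# error, but every non-ASCII character turns into garbage.
wrong_text = raw_body.decode("ISO-8859-1")

# Stand-in for r.text when r.encoding matches the bytes.
right_text = raw_body.decode("utf-8")

# Same bytes, different codec, different string.
print(wrong_text == right_text)  # prints False
```

With a real response, and assuming the page really is UTF-8, the corresponding fix would be to assign r.encoding = 'utf-8' before reading r.text.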

I usually work with r.text, since it is already a Unicode string, but you can still decode the content manually using

r.content.decode(r.encoding)
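
Besides decoding by hand, another option for the code in the question is to keep passing the raw bytes to lxml and state the encoding explicitly. The sketch below assumes the page is UTF-8; parse_html_bytes and the sample markup are hypothetical names and data made up for illustration, while HTMLParser(encoding=...) and html.fromstring(..., parser=...) are existing lxml calls.

```python
from lxml import html


def parse_html_bytes(content: bytes, encoding: str = "utf-8"):
    """Parse raw HTML bytes, telling lxml which encoding to use."""
    # An explicit encoding overrides lxml's own encoding detection.
    parser = html.HTMLParser(encoding=encoding)
    return html.fromstring(content, parser=parser)


# Sample bytes standing in for req.content (hypothetical page).
sample = "<html><body><p>café</p></body></html>".encode("utf-8")
tree = parse_html_bytes(sample)
paragraphs = [p.text for p in tree.xpath("//p")]
```

In the question's code this would mean calling parse_html_bytes(req.content) in place of html.fromstring(req.content).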

Hope it helps.



Source: https://stackoverflow.com/questions/40447117/parsing-xml-and-html-page-with-lxml-and-requests-package-in-python
