lxml

HTML scraping using lxml and requests gives a unicode error [duplicate]

半世苍凉 submitted on 2019-11-27 14:32:10
Question: This question already has answers here: parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) (2 answers). Closed 4 years ago. I'm trying to use an HTML scraper like the one provided here. It works fine for the example they provide. However, when I try using it with my webpage, I receive this error - Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration. I've tried googling but couldn't find a solution. I'd
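The error is lxml refusing a Python unicode string that still carries an XML encoding declaration; handing it raw bytes instead is the standard way around it. A minimal sketch, assuming the page is fetched with requests (the URL is a placeholder):

```python
import requests
import lxml.html

# Hypothetical URL; substitute the page actually being scraped.
r = requests.get("http://example.com/page.html")

# r.text is an already-decoded unicode string; if the document declares
# its own encoding, lxml raises the ValueError above. r.content is the
# raw bytes, which lets lxml apply the declared encoding itself.
tree = lxml.html.fromstring(r.content)
print(tree.xpath("//title/text()"))
```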

parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

孤者浪人 submitted on 2019-11-27 14:30:58
I send a GET request to the CareerBuilder API: import requests url = "http://api.careerbuilder.com/v1/jobsearch" payload = {'DeveloperKey': 'MY_DEVELOPER_KEY', 'JobTitle': 'Biologist'} r = requests.get(url, params=payload) xml = r.text and get back an XML document that looks like this. However, I have trouble parsing it. Using either lxml >>> from lxml import etree >>> print etree.fromstring(xml) Traceback (most recent call last): File "<pyshell#4>", line 1, in <module> print etree.fromstring(xml) File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311) File "parser.pxi
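As in the previous question, the ValueError comes from passing r.text (a decoded unicode string whose XML prolog still declares an encoding) to etree.fromstring. A hedged sketch of the usual fix, reusing the request from the question:

```python
import requests
from lxml import etree

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVELOPER_KEY', 'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)

# Parse the raw bytes so lxml can honor the encoding declared in the
# XML prolog, rather than a pre-decoded unicode string.
root = etree.fromstring(r.content)

# Equivalent workaround when only a str is available:
# root = etree.fromstring(xml_string.encode('utf-8'))
print(etree.tostring(root, pretty_print=True).decode())
```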

Parse SGML with Open Arbitrary Tags in Python 3

六月ゝ 毕业季﹏ submitted on 2019-11-27 14:04:16
I am trying to parse a file such as: http://www.sec.gov/Archives/edgar/data/1409896/000118143112051484/0001181431-12-051484.hdr.sgml I am using Python 3 and have been unable to find a solution with existing libraries to parse an SGML file with open tags. SGML allows implicitly closed tags. When attempting to parse the example file with lxml, xml, or Beautiful Soup, I end up with implicitly closed tags being closed at the end of the file instead of at the end of the line. For example: <COMPANY>Awesome Corp <FORM> 24-7 <ADDRESS> <STREET>101 PARSNIP LN <ZIP>31337 </ADDRESS> This ends up being
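One way to sidestep XML-style parsers entirely is a small stream parser that treats a tag's value as ending at the newline, which matches the line-oriented structure of these EDGAR header files. A minimal sketch under that assumption, built on the standard library's HTMLParser (which tolerates unclosed tags):

```python
from html.parser import HTMLParser

class SGMLHeaderParser(HTMLParser):
    """Collects (tag, value) pairs, treating each tag as implicitly
    closed at the end of its line."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._pending = None  # tag still waiting for its text value

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names.
        self._pending = tag

    def handle_data(self, data):
        value = data.strip()
        if self._pending and value:
            self.records.append((self._pending, value))
            self._pending = None

parser = SGMLHeaderParser()
parser.feed("<COMPANY>Awesome Corp\n<FORM>24-7\n"
            "<ADDRESS>\n<STREET>101 PARSNIP LN\n<ZIP>31337\n</ADDRESS>\n")
print(parser.records)
# [('company', 'Awesome Corp'), ('form', '24-7'),
#  ('street', '101 PARSNIP LN'), ('zip', '31337')]
```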

Web scraping: the Beautiful Soup parsing library

梦想的初衷 submitted on 2019-11-27 14:03:07
Part 1: Introduction. 1. Overview: (1) The requests module alone gives us no way to parse the data it fetches. (2) Matching data with regular expressions is tedious. (3) Beautiful Soup is a Python library that can extract data from HTML or XML files. (4) It provides idiomatic document navigation through the parser of your choice. 2. Installation: pip install beautifulsoup4 # Installing a parser: Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed with any of: $ apt-get install python-lxml $ easy_install lxml $ pip install lxml Another available parser is the pure-Python html5lib, which parses a document the same way a browser does; it can be installed with any of: $ apt-get install python-html5lib $ easy_install html5lib $ pip install html5lib 3. Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html # Chinese documentation Part 2: Basic usage. 1. Basic usage: html_doc = """ <html>
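The excerpt cuts off at the start of its html_doc example; a minimal, self-contained sketch of the basic usage it is leading into (the markup here is invented for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""

# 'lxml' is the fast third-party parser installed above;
# the standard library's 'html.parser' works as a fallback.
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title.string)               # The Dormouse's story
print(soup.find('p', class_='title'))  # first <p class="title"> element
```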

Why doesn't xpath work when processing an XHTML document with lxml (in python)?

♀尐吖头ヾ submitted on 2019-11-27 13:58:01
I am testing against the following test document: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title>hi there</title> </head> <body> <img class="foo" src="bar.png"/> </body> </html> If I parse the document using lxml.html, I can get the IMG with an xpath just fine: >>> root = lxml.html.fromstring(doc) >>> root.xpath("//img") [<Element img at 1879e30>] However, if I parse the document as XML and try to get the IMG tag, I get an empty result:
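The empty result comes from the XHTML default namespace: in XPath 1.0 an unprefixed name like img means 'img in no namespace', so it never matches elements living in http://www.w3.org/1999/xhtml. The usual remedy is to bind that namespace to a prefix and qualify the expression with it; a sketch against the test document from the question (DOCTYPE omitted for brevity):

```python
from lxml import etree

doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>hi there</title></head>
  <body><img class="foo" src="bar.png"/></body>
</html>"""

root = etree.fromstring(doc)

# Bind the XHTML namespace to a prefix of our choosing and qualify
# every element name in the expression with it.
ns = {'x': 'http://www.w3.org/1999/xhtml'}
print(root.xpath('//x:img', namespaces=ns))
# [<Element {http://www.w3.org/1999/xhtml}img at 0x...>]
```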

Python Web Scraping (4): lxml and XPath

北城余情 submitted on 2019-11-27 12:46:15
Installation: pip install lxml. Module import: from lxml import etree. Querying with XPath: parse the HTML source to get an HTML node object, html = etree.HTML(r.text). To inspect the content of the HTML element node: print(etree.tostring(html, encoding="utf-8").decode("utf-8")). Finding nodes: xpath() returns a list of matched elements. nodename: write the node name directly to find a tag; / denotes one level of hierarchy. html.xpath("head") # find the head tag html.xpath("head/title") # find the title tag under head html.xpath("body/div") # find all divs under body as a list; bs4's dot syntax only finds the first. A leading / denotes the top level: html.xpath("/html/head") # find the head one level below the html root node. // searches for nodes from any position: html.xpath("//img") # find img tags anywhere, a list of all img tags html.xpath("//li/div") # find the div tags under every li tag. Attribute lookups use the @ symbol: html.xpath("//li[@class='column']") # list of all li elements whose class is column. If there are multiple classes
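A self-contained sketch tying the queries above together, using inline HTML in place of a fetched page:

```python
from lxml import etree

html = etree.HTML("""
<html><head><title>demo</title></head>
<body>
  <ul>
    <li class="column"><div>first</div></li>
    <li><div>second</div></li>
  </ul>
</body></html>
""")

print(html.xpath("head/title")[0].text)     # relative path from the root
print(html.xpath("//li/div/text()"))        # ['first', 'second']
print(html.xpath("//li[@class='column']"))  # attribute filter with @
```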

Python pretty XML printer with lxml

时光怂恿深爱的人放手 submitted on 2019-11-27 12:27:29
After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True) . I have the following XML: <testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests"> <testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306"> .... And I use it like this: tree = etree.parse('original.xml') root = tree.getroot() ... # modifications ... with open(FILE_NAME, "w") as f: tree.write(f, pretty_print=True) For me, this issue was not solved until I noticed this
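A commonly cited explanation is that pretty_print only inserts indentation where the tree has no existing whitespace text between elements; whitespace carried over from the 'ugly' source file blocks it. A sketch of the usual fix, stripping that whitespace at parse time with remove_blank_text:

```python
from lxml import etree

# Drop ignorable whitespace while parsing so the serializer can
# re-indent the whole tree from scratch.
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('original.xml', parser)

# ... modifications ...

tree.write('pretty.xml', pretty_print=True,
           xml_declaration=True, encoding='UTF-8')
```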

Installing lxml with pip in virtualenv Ubuntu 12.10 error: command 'gcc' failed with exit status 4

人走茶凉 submitted on 2019-11-27 12:03:08
I'm having the following error when trying to run "pip install lxml" in a virtualenv on Ubuntu 12.10 x64. I have Python 2.7. I have seen other related questions here about the same problem and tried installing python-dev, libxml2-dev and libxslt1-dev. Please take a look at the traceback, from the moment I type the command to the moment the error occurs. Downloading/unpacking lxml Running setup.py egg_info for package lxml /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url' warnings.warn(msg) Building lxml version 3.1.2. Building without
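gcc exiting with status 4 during a pip build is frequently reported on small VMs as the compiler being killed for lack of memory rather than a missing header. Two commonly suggested remedies, sketched for Debian/Ubuntu (the swap-file size and path are illustrative):

```
# Build dependencies for lxml's C extensions:
sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev

# If compilation still dies on a memory-starved VM, a temporary
# swap file can let gcc finish:
sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
pip install lxml
```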

Installing lxml, libxml2, libxslt on Windows 8.1

假如想象 submitted on 2019-11-27 11:56:40
After additional exploration, I found a solution to installing lxml with pip and wheel. Additional comments on approach welcomed. I'm finding the existing Python documentation for Linux distributions excellent. For Windows... not so much. I've configured my Linux system fine but I need some help getting a Windows 8.1 tablet ready as well. My project requires the lxml module for Python 3.4. I've found many tutorials on how to install lxml but each has failed. https://docs.python.org/3/installing/ I've downloaded the "get-pip.py" and successfully ran it from the Windows cmd line with the result:
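The route the asker found was pip plus a prebuilt wheel, which avoids compiling libxml2/libxslt on Windows entirely. A sketch of that approach; the wheel filename below is hypothetical and must match your Python version and architecture (prebuilt lxml wheels for Windows are commonly taken from the unofficial binaries page at http://www.lfd.uci.edu/~gohlke/pythonlibs/):

```
rem Make sure pip can install wheel files:
py -3.4 -m pip install wheel

rem Install a wheel downloaded for CPython 3.4, 32-bit (example name):
py -3.4 -m pip install lxml-3.4.4-cp34-none-win32.whl
```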

Python: Using xpath locally / on a specific element

南楼画角 submitted on 2019-11-27 11:54:29
I'm trying to get the links from a page with xpath. The problem is that I only want the links inside a table, but if I apply the xpath expression to the whole page I'll capture links which I don't want. For example: tree = lxml.html.parse(some_response) links = tree.xpath("//a[contains(@href, 'http://www.example.com/filter/')]") The problem is that this applies the expression to the whole document. I located the element I want, for example: tree = lxml.html.parse(some_response) root = tree.getroot() table = root[1][5] #for example links = table.xpath("//a[contains(@href, 'http://www.example.com
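The excerpt cuts off mid-expression, but the behavior it describes is the classic pitfall: an XPath beginning with // searches from the document root even when evaluated on a sub-element. Prefixing the expression with a dot anchors it to the element itself; a sketch reusing the names from the question:

```python
import lxml.html

# some_response: the response/file object from the question's own code.
tree = lxml.html.parse(some_response)
root = tree.getroot()
table = root[1][5]  # the table located earlier, as in the question

# './/a' searches only within 'table'; '//a' would scan the whole
# document regardless of which element .xpath() is called on.
links = table.xpath(".//a[contains(@href, 'http://www.example.com/filter/')]")
```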