lxml

What’s the most forgiving HTML parser in Python?

﹥>﹥吖頭↗ posted on 2019-12-10 17:18:50
Question: I have some random HTML that I parse with BeautifulSoup, but in most cases (>70%) it chokes. I tried Beautiful Soup 3.0.8 and 3.2.0 (there were some problems from 3.1.0 upwards), but the results are almost the same. Off the top of my head, I can recall several HTML parser options available in Python: BeautifulSoup, lxml, pyquery. I intend to test all of these, but I wanted to know which one came out as the most forgiving in your tests and can even try to parse bad HTML.

Answer 1: They all are.

What is the right way to use Cyrillic in the Python lxml library?

风流意气都作罢 posted on 2019-12-10 15:58:13
Question: I am trying to generate .xml files with Cyrillic symbols in them, but the result is unexpected. What is the simplest way to avoid this? Example:

    from lxml import etree
    root = etree.Element('пример')
    print(etree.tostring(root))

What I get is:

    b'<&#1087;&#1088;&#1080;&#1084;&#1077;&#1088;/>'

Instead of:

    b'<пример/>'

Answer 1: etree.tostring() without additional arguments outputs ASCII-only data as a bytes object. You could use etree.tounicode():

    >>> from lxml import etree
    >>> root = etree.Element('пример')
    >>> print(etree.tostring(root))
    b
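A minimal sketch of the encoding options, assuming Python 3. etree.tostring() accepts an encoding argument: encoding='unicode' returns a str, and encoding='utf-8' returns UTF-8 encoded bytes. The try/except fallback to the standard library's xml.etree.ElementTree is an addition of this sketch so that it also runs without lxml installed; the variable names are illustrative.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback (assumption of this sketch): the stdlib module offers the
    # same calls used below.
    import xml.etree.ElementTree as etree

root = etree.Element('пример')

# Default call: ASCII-only bytes, so the Cyrillic letters are written
# as numeric character references.
ascii_bytes = etree.tostring(root)

# Ask for text (str) instead of bytes.
as_text = etree.tostring(root, encoding='unicode')

# Or ask for bytes in an encoding that can represent Cyrillic directly.
as_utf8 = etree.tostring(root, encoding='utf-8')

print(ascii_bytes)
```

As far as I know, the lxml API documentation marks tounicode() as deprecated in favour of tostring(..., encoding='unicode'), so the second call is probably the safer spelling in new code.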

Why doesn't getparent() work as expected?

别等时光非礼了梦想. posted on 2019-12-10 15:55:05
Question: I need to manipulate the text inside one of the tags, and for that I want to get the parent tag of every text node that is found. Code:

    import lxml.etree
    import pprint

    s = '''
    <data>
      data text
      <foo>foo - <bar>bar</bar> text</foo>
      data text
      <bar>
        bar text
        <baz>baz text</baz>
        <baz>baz text</baz>
        bar text
      </bar>
      data text
    </data>
    '''
    etree = lxml.etree.fromstring(s)
    text = etree.xpath("//text()[normalize-space()]")
    pprint.pprint([(s.getparent().tag, s.strip()) for s in text])

Output:

    [('data', 'data

How to update an XML file using lxml and Python?

家住魔仙堡 posted on 2019-12-10 15:38:20
Question:

    <example>
      <login>
        <id>1</id>
        <username>kites</username>
        <password>kites</password>
      </login>
    </example>

How can I update the password using lxml? And how can I add one more record to the same file? Please provide some sample code.

Answer 1:

    example = etree.Element("example")
    login = etree.SubElement(example, "login")
    password = etree.SubElement(login, "password")
    password.text = "newPassword"

This is a good tutorial

Source: https://stackoverflow.com/questions/2108334/how-to-update-xml-file-using-lxml-and
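The answer's snippet builds a new tree from scratch rather than changing an existing one. A minimal sketch of updating in place, assuming the document from the question is available as a string: it parses the XML, changes the existing password, and appends a second login record. The second record's values are made up for illustration, and the fallback import is an addition of this sketch so that it runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback (assumption of this sketch): same calls exist in the stdlib.
    import xml.etree.ElementTree as etree

source = (
    "<example><login><id>1</id><username>kites</username>"
    "<password>kites</password></login></example>"
)
example = etree.fromstring(source)

# Update the existing <password> element in place.
example.find("login/password").text = "newPassword"

# Append one more <login> record (hypothetical values).
login = etree.SubElement(example, "login")
for tag, text in (("id", "2"), ("username", "hawks"), ("password", "secret")):
    etree.SubElement(login, tag).text = text

updated = etree.tostring(example, encoding="unicode")
print(updated)
```

For a file on disk, the same idea should work with tree = etree.parse(path), the edits applied to tree.getroot(), and tree.write(path) to save the result.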

lxml iterparse in Python can't handle namespaces

拟墨画扇 posted on 2019-12-10 13:01:27
Question:

    from lxml import etree
    import StringIO
    data = StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>')
    docs = etree.iterparse(data, tag='a')
    a, b = docs.next()

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:95348)
      File "iterparse.pxi", line 534, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:95938)
    StopIteration

Works fine
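The StopIteration is consistent with how namespaced elements are named: with a default namespace on the root, each child is reported as {http://some.random.schema}a, so a filter on the bare name 'a' matches nothing. A minimal Python 3 sketch of matching on the qualified name follows; it filters inside the loop so that it also runs on the standard library fallback, which is an addition of this sketch.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Fallback (assumption of this sketch): stdlib iterparse has no tag=
    # argument, which is why the filtering is done in the loop below.
    import xml.etree.ElementTree as etree

NS = "http://some.random.schema"
data = io.BytesIO(
    b'<root xmlns="http://some.random.schema">'
    b"<a>One</a><a>Two</a><a>Three</a></root>"
)

texts = []
for event, elem in etree.iterparse(data, events=("end",)):
    # Namespaced tags look like "{uri}localname"; compare against that form.
    if elem.tag == "{%s}a" % NS:
        texts.append(elem.text)

print(texts)
```

With lxml itself, the same qualified name can be passed directly, as in etree.iterparse(data, tag='{http://some.random.schema}a').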

lxml 3.3 with Python 3.3 on Windows 7 32-bit

本秂侑毒 posted on 2019-12-10 12:25:59
Question: I am having major issues with this install. Please provide a detailed, step-by-step guide.

Answer 1: These instructions are for Windows 7 or Windows 8 with Python 3.3. However, they should work for other versions as the releases of Python and the respective prerequisites change and evolve. Install Python 3.3: download the last release of Python 3.3 (currently 3.3.5) from the downloads page HERE. Direct link for the Win32 MSI installer -> HERE. Simply run the MSI to install Python. It will register itself

How do I skip validating the URI in lxml?

人走茶凉 posted on 2019-12-10 12:13:44
Question: I am using lxml to parse some XML files. I don't create them, I'm just parsing them. Some of the files contain invalid URIs for the namespaces. For instance: 'D:\Path\To\some\local\file.xsl'. I get an error when I try to process it:

    lxml.etree.XMLSyntaxError: xmlns:xsi: 'D:\Path\To\some\local\file.xsl' is not a valid URI

Is there an easy way to replace any invalid URIs with something (anything, such as 'http://www.googlefsdfsd.com/')? I thought of writing a regex but was hoping for an easier

lxml and fast_iter eating all the memory

南楼画角 posted on 2019-12-10 12:03:24
Question: I want to parse a 1.6 GB XML file with Python (2.7.2) using lxml (3.2.0) on OS X (10.8.2). Because I had already read about potential issues with memory consumption, I use fast_iter, but after the main loop it eats up about 8 GB of RAM, even though it doesn't keep any data from the actual XML file.

    from lxml import etree

    def fast_iter(context, func, *args, **kwargs):
        # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        # Author: Liza Daly
        for event, elem in context:
            func(elem,

How to parse an XML file with lxml and get elements and attributes?

匆匆过客 posted on 2019-12-10 11:48:35
Question: I have an XML description like this:

    <Car xmlns="http://example.com/vocab/xml/cars#">
      <dateStarted>{{date_started|escape}}</dateStarted>
      <dateSold>{{date_sold|escape}}</dateSold>
      <name type="{{name_type}}" abbrev="{{name_abbrev}}" value="{{name_value}}" >{{name|escape}}</name>
      <brandName type="{{brand_name_type}}" abbrev="{{brand_name_abbrev}}" value="{{brand_name_value}}" >{{brand_name|escape}}</brandName>
      <maxspeed>
        <value>{{speed_value}}</value>
        <unit type="{{speed_unit_type}}" value="{
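A minimal sketch of reading elements and attributes from a document shaped like the one above. Because the template in the question is cut off, the sample below is a shortened stand-in with made-up values; the prefix c and all variable names are assumptions of this sketch, as is the fallback import that lets it run without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback (assumption of this sketch): same calls exist in the stdlib.
    import xml.etree.ElementTree as etree

# Map an arbitrary prefix to the document's default namespace.
NS = {"c": "http://example.com/vocab/xml/cars#"}

doc = (
    '<Car xmlns="http://example.com/vocab/xml/cars#">'
    '<name type="full" abbrev="RS" value="1">Roadster</name>'
    "<maxspeed><value>200</value></maxspeed>"
    "</Car>"
)
car = etree.fromstring(doc)

# Elements live in the default namespace, so every path step needs the prefix.
name = car.find("c:name", namespaces=NS)
name_text = name.text            # element text
name_type = name.get("type")     # single attribute, None if missing
name_attrs = dict(name.attrib)   # all attributes as a plain dict

speed = car.find("c:maxspeed/c:value", namespaces=NS).text

print(name_text, name_type, name_attrs, speed)
```

Attributes themselves are not namespaced here, so get("type") takes the plain attribute name even though the element names need the prefix.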

Should Python 2.6 on OS X deal with multiple easy-install.pth files in $PYTHONPATH?

回眸只為那壹抹淺笑 posted on 2019-12-10 11:24:34
Question: I am running ipython from sage and am also using some packages that aren't in sage (lxml, argparse), which are installed in my home directory. I have therefore ended up with a $PYTHONPATH of $HOME/sage/local/lib/python:$HOME/lib/python. Python is reading and processing the first easy-install.pth it finds ($HOME/sage/local/lib/python/site-packages/easy-install.pth) but not the second, so eggs installed in $HOME/lib/python aren't added to the path. On reading the off-the-shelf site.py, I cannot
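For context, the site module only processes .pth files found in site directories; a directory that is merely listed in $PYTHONPATH is added to sys.path without its .pth files being read. One possible workaround, not confirmed by the excerpt above, is to register the extra directory with site.addsitedir(). The sketch below uses a temporary directory and a made-up egg name as stand-ins for $HOME/lib/python.

```python
import os
import site
import sys
import tempfile

# Hypothetical stand-in for $HOME/lib/python: a directory with a .pth file.
extra_dir = tempfile.mkdtemp()
egg_dir = os.path.join(extra_dir, "example.egg")  # made-up egg directory
os.mkdir(egg_dir)
with open(os.path.join(extra_dir, "easy-install.pth"), "w") as fh:
    fh.write("./example.egg\n")

# addsitedir() appends the directory to sys.path and also processes
# the *.pth files it contains, adding each existing path they list.
site.addsitedir(extra_dir)

print(extra_dir in sys.path, egg_dir in sys.path)
```

In practice the call could live in a sitecustomize.py or at the top of a start-up script, with the real directory in place of the temporary one.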