lxml

Parsing a large .bz2 file (40 GB) with lxml iterparse in python. Error that does not appear with uncompressed file

烈酒焚心 submitted on 2019-11-30 03:53:14
Question: I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41 GB, I don't want to decompress the file completely. So I figured out how to parse portions of the planet.osm file using bz2 and lxml, with the following code:

    from lxml import etree as et
    from bz2 import BZ2File

    path = "where/my/fileis.osm.bz2"
    with BZ2File(path) as xml_file:
        parser = et.iterparse(xml_file, events=('end',))
        for events, elem in parser:
            if elem.tag == "tag":
                continue
            if elem.tag ==
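The streaming approach described in the question can be sketched roughly as follows. This is a minimal illustration, not the poster's full code: the helper name `count_tags`, the tiny sample document, and the temporary file are made up for the example, and the import falls back to the standard library's ElementTree (which offers the same `iterparse` interface) when lxml is not installed.

```python
import bz2
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree


def count_tags(path, wanted="node"):
    """Stream a bz2-compressed XML file and count elements named `wanted`."""
    count = 0
    with bz2.open(path, "rb") as xml_file:
        for _event, elem in etree.iterparse(xml_file, events=("end",)):
            if elem.tag == wanted:
                count += 1
                # Drop the element's children and text once it has been handled.
                elem.clear()
    return count


# Hypothetical miniature stand-in for planet.osm.bz2.
sample = b"<osm><node id='1'><tag k='a' v='b'/></node><node id='2'/></osm>"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.osm.bz2")
    with bz2.open(sample_path, "wb") as fh:
        fh.write(sample)
    node_count = count_tags(sample_path)

print(node_count)
```

Clearing only the elements that have already been counted keeps their contents from piling up. The emptied elements still stay attached to the root, so a very large file may need further pruning.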

How to match a text node then follow parent nodes using XPath

混江龙づ霸主 submitted on 2019-11-30 03:39:56
I'm trying to parse some HTML with XPath. In the simplified XML example below, I want to match the string 'Text 1', then grab the contents of the corresponding content node.

    <doc>
      <block>
        <title>Text 1</title>
        <content>Stuff I want</content>
      </block>
      <block>
        <title>Text 2</title>
        <content>Stuff I don't want</content>
      </block>
    </doc>

My Python code throws a wobbly:

    >>> from lxml import etree
    >>>
    >>> tree = etree.XML("<doc><block><title>Text 1</title><content>Stuff I want</content></block><block><title>Text 2</title><content>Stuff I don't want</content></block></doc>")
    >>>
    >>> # get all
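One way to express "the block whose title is 'Text 1', then its content" is a predicate on the parent element. The sketch below uses `findall` with the limited path syntax that both lxml and the standard library understand. The variable names are illustrative, and this is not necessarily the answer given in the original thread.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

doc = etree.fromstring(
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)

# Select every <block> that has a <title> child with the text 'Text 1',
# then step down to that block's <content> child.
matches = doc.findall(".//block[title='Text 1']/content")
wanted = [node.text for node in matches]

# With lxml's full XPath engine, the "match the text, then go up" form
# would be: doc.xpath("//title[text()='Text 1']/../content/text()")
print(wanted)
```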

Write xml file using lxml library in Python

断了今生、忘了曾经 submitted on 2019-11-30 00:19:46
I'm using lxml to create an XML file from scratch, with code like this:

    from lxml import etree

    root = etree.Element("root")
    root.set("interesting", "somewhat")
    child1 = etree.SubElement(root, "test")

How do I write the root Element object to an XML file using the write() method of the ElementTree class?

Mark

You can get a string from the element and then write that (from the lxml tutorial):

    str = etree.tostring(root, pretty_print=True)

or convert to an element tree:

    et = etree.ElementTree(root)
    et.write(sys.stdout, pretty_print=True)

Here's a succinct answer:

    from lxml import etree

    root = etree.Element("root")
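To round this out, here is a hedged sketch of writing the tree to an actual file instead of standard output. The file name `output.xml` and the temporary directory are invented for the example. `pretty_print=True` is left out because it is an lxml-only argument, and the import falls back to the standard library when lxml is missing.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

root = etree.Element("root")
root.set("interesting", "somewhat")
etree.SubElement(root, "test")

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "output.xml")  # hypothetical target file
    # Wrap the element in an ElementTree so that write() is available.
    etree.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    # Read the file back to confirm that it holds what was built above.
    reloaded = etree.parse(out_path).getroot()

print(reloaded.tag, reloaded.get("interesting"), [child.tag for child in reloaded])
```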

can lxml/requests select dropdown options then parse resulting ajax?

浪子不回头ぞ submitted on 2019-11-29 23:57:16
Question: I have a site I'm trying to test, and although I can get a list of options in a dropdown, I am not sure how to select one. There is no submit button; selecting an option loads an AJAX table below. I'm just not sure whether lxml/requests can do this, or how it could be done. I would appreciate it if anyone could confirm this or knows the function that could do it. Edit: My site is internal and not accessible, but here is a sample site: https://www.tsx.com/listings/listing-with-us/listed-company

Efficient way of XML parsing in ElementTree(1.3.0) Python

僤鯓⒐⒋嵵緔 submitted on 2019-11-29 22:38:27
Question: I am trying to parse huge XML files (ranging from 20 MB to 3 GB). The files are samples coming from different instrumentation. What I am doing is finding the necessary element information in each file and inserting it into a database (Django). Below is a small part of a sample file. The namespace exists in all files. An interesting feature of the files is that they have more node attributes than text.

    <?xml VERSION="1.0" encoding="ISO-8859-1"?>
    <mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema
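For attribute-heavy, namespaced files like this, a streaming pass that yields only the attributes can keep memory use modest. The sketch below uses the standard library's ElementTree, since the question targets that. The `spectrum` element, its `id` and `index` attributes, and the helper name `iter_attributes` are assumptions for illustration only. Only the namespace URI comes from the excerpt.

```python
import io
import xml.etree.ElementTree as ET

MZML_NS = "http://psi.hupo.org/ms/mzml"  # namespace taken from the excerpt


def iter_attributes(source, local_name):
    """Yield the attribute dict of each namespaced element called `local_name`."""
    qualified = "{%s}%s" % (MZML_NS, local_name)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == qualified:
            yield dict(elem.attrib)
            # Free the element's contents once its attributes are copied.
            elem.clear()


# Hypothetical miniature document; real files are far larger.
sample = io.BytesIO(
    b'<mzML xmlns="http://psi.hupo.org/ms/mzml"><run>'
    b'<spectrum id="s1" index="0"/><spectrum id="s2" index="1"/>'
    b"</run></mzML>"
)
rows = list(iter_attributes(sample, "spectrum"))
print(rows)
```

Each yielded dictionary could then be handed to whatever database layer is in use. The Django insert itself is left out here.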

lxml install on windows 7 using pip and python 2.7

孤者浪人 submitted on 2019-11-29 21:37:19
When I try to upgrade lxml using pip on my Windows 7 machine, I get the log printed below. When I uninstall and try to install from scratch, I get the same errors. Any ideas?

    Downloading/unpacking lxml from https://pypi.python.org/packages/source/l/lxml/lxml-3.2.4.tar.gz#md5=cc363499060f615aca1ec8dcc04df331
    Downloading lxml-3.2.4.tar.gz (3.3MB): 3.3MB downloaded
    Running setup.py egg_info for package lxml
    Building lxml version 3.2.4.
    Building without Cython.
    ERROR: Nazwa 'xslt-config' nie jest rozpoznawana jako polecenie wewnętrzne lub zewnętrzne, program wykonywalny lub plik wsadowy.
    [English: 'xslt-config' is not recognized as an internal or external command, operable program or batch file.]
    ** make

Unable to pass an lxml etree object to a separate process

≡放荡痞女 submitted on 2019-11-29 21:18:50
Question: I'm working on a project to parse multiple XML files concurrently in Python using lxml. When I initialize the process, I want my main class to do some work on the XML before it passes the etree object to the process, but I am finding that when the etree object arrives in the new process, the class survives but the XML is gone from within the object, and getroot() returns None. I know that I can only pass picklable data using the queue, but is this also the case with what I pass to the process
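A workaround often suggested for this kind of problem is to pass the serialized XML bytes across the process boundary and rebuild the tree on the other side. The sketch below only simulates that boundary with `pickle`, which multiprocessing queues use internally. The helper names `to_wire` and `from_wire` and the sample document are hypothetical.

```python
import pickle

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree


def to_wire(tree):
    """Serialize a tree to plain bytes, which pickle without trouble."""
    return etree.tostring(tree.getroot())


def from_wire(data):
    """Rebuild a tree from the bytes received in the other process."""
    return etree.ElementTree(etree.fromstring(data))


tree = etree.ElementTree(etree.fromstring("<jobs><job id='1'/><job id='2'/></jobs>"))

# Stand-in for putting the payload on a queue and reading it in a worker.
payload = pickle.dumps(to_wire(tree))
restored = from_wire(pickle.loads(payload))

job_ids = [job.get("id") for job in restored.getroot()]
print(job_ids)
```

The cost of this design is one extra serialize and parse step per document, in exchange for sending only plain bytes between processes.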

Remove all javascript tags and style tags from html with python and the lxml module

旧时模样 submitted on 2019-11-29 20:42:48
I am parsing an HTML document using the http://lxml.de/ library. So far I have figured out how to strip tags from an HTML document (see "In lxml, how do I remove a tag but retain all contents?"), but the method described in that post leaves all the text, stripping the tags without removing the actual script. I have also found a class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is clear as mud how to actually use the class to clean the document. Any help, perhaps a short example, would be much appreciated!

aculich

Below is an example to do what
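Independently of the Cleaner class, the removal itself can be sketched by hand: delete each script and style element together with its contents, but keep the text that follows it. The function name `drop_elements` and the well-formed sample page are assumptions for illustration. Real-world HTML would need an HTML parser such as `lxml.html` in place of `fromstring`.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree


def drop_elements(root, tags=("script", "style")):
    """Remove elements named in `tags`, contents included, keeping trailing text."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag in tags:
                # The text after the element lives in its tail; re-attach it
                # to the preceding sibling or to the parent before removal.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


page = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>keep</p><script>alert(1)</script>tail text</body></html>"
)
root = drop_elements(etree.fromstring(page))
remaining_tags = [element.tag for element in root.iter()]
print(remaining_tags)
```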

(still) cannot properly install lxml 2.3 for python, but at least 2.2.8 works

∥☆過路亽.° submitted on 2019-11-29 20:02:29
Question: 30 Jun 2011 -- I am awarding @Pablo for this question, because of his answer. I am still unable to properly install lxml 2.3, for reasons discussed in his comments. I gather that with a little bit of work I could, but I have already spent a ridiculous amount of time on this problem. I have, however, written the code I needed and successfully installed lxml 2.2.8. The code functions with this version. Better yet, Pablo was the only one to properly diagnose the error, which was that libxslt needed to be

out of memory issue in installing packages on Ubuntu server

梦想的初衷 submitted on 2019-11-29 19:53:45
I am using an Ubuntu cloud server with a limited 512 MB of RAM and a 20 GB HDD. More than 450 MB of that RAM is already used by processes. I need to install a new package called lxml, which gets compiled using Cython during installation. This is a very heavy process, so it always exits with the error gcc: internal compiler error: Killed (program cc1), which is due to there being no RAM available for it to run. Upgrading the machine is an option, but it has its own issues, and a few of my services/websites run live from this server. But on my local machine lxml is already installed properly. And since my need is lxml only, so is it