lxml

using lxml and iterparse() to parse a big (+- 1Gb) XML file

Anonymous (unverified), submitted on 2019-12-03 09:06:55
Question: I have to parse a 1 GB XML file with a structure such as the one below and extract the text within the tags "Author" and "Content": MM/DD/YY Last Name, Name Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula. MM/DD/YY Last Name, Name Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula. [...] MM/DD/YY Last Name, Name Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula. So far I've tried two things: i) reading the whole file and going through
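For the question above, a minimal sketch of streaming the file with lxml's iterparse() might look like the following. Only the tag names "Author" and "Content" come from the question; the surrounding "Documents", "Document" and "Date" elements, the sample data and the helper name iter_author_content are assumptions made for illustration.

```python
import io
from lxml import etree


def iter_author_content(source, wanted=("Author", "Content")):
    """Yield (tag, text) pairs for the wanted tags without building the whole tree."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag in wanted:
            yield elem.tag, (elem.text or "").strip()
        # At its "end" event the element is fully parsed, so its content can be dropped.
        elem.clear()
        # Also delete already-processed siblings that the parent still references.
        parent = elem.getparent()
        if parent is not None:
            while elem.getprevious() is not None:
                del parent[0]


# Hypothetical sample: only "Author" and "Content" are taken from the question.
sample = (
    b"<Documents>"
    b"<Document><Date>MM/DD/YY</Date><Author>Last Name, Name</Author>"
    b"<Content>Lorem ipsum dolor sit amet.</Content></Document>"
    b"<Document><Date>MM/DD/YY</Date><Author>Last Name, Name</Author>"
    b"<Content>Maecenas dictum dictum vehicula.</Content></Document>"
    b"</Documents>"
)

pairs = list(iter_author_content(io.BytesIO(sample)))
for tag, text in pairs:
    print(tag, text)
```

For a real file, a path or an open binary file object would be passed instead of the in-memory buffer. Clearing each finished element and deleting its earlier siblings is what keeps memory use roughly constant.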

What are the differences between lxml and ElementTree?

对着背影说爱祢, submitted on 2019-12-03 09:06:33
Question: When it comes to generating XML data in Python, there are two libraries I often see recommended: lxml and ElementTree. From what I can tell, the two libraries are very similar to each other. They both seem to have similar module names, usage guidelines, and functionality. Even the import statements are fairly similar.

    # Importing lxml and ElementTree
    import lxml.etree
    import xml.etree.ElementTree

What are the differences between the lxml and ElementTree libraries for Python?

Answer 1: ElementTree
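As a small illustration of the overlap the question describes, the sketch below runs the same tree-building code against both modules and then checks two lxml-only features (getparent() and xpath()). It is not taken from the truncated answer; the helper name build and the element names are assumptions.

```python
import xml.etree.ElementTree as ET
from lxml import etree as LET


def build(api):
    """Build the same tiny tree with whichever ElementTree-compatible module is passed in."""
    root = api.Element("root")
    child = api.SubElement(root, "child")
    child.text = "hello"
    return root, child


std_root, std_child = build(ET)
lx_root, lx_child = build(LET)

# The shared API produces the same serialization in both libraries.
print(ET.tostring(std_root))
print(LET.tostring(lx_root))

# lxml elements know their parent and support full XPath; stdlib elements do not.
print(lx_child.getparent() is lx_root)
print(hasattr(std_child, "getparent"), hasattr(std_root, "xpath"))
```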

random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

Anonymous (unverified), submitted on 2019-12-03 09:05:37
Question: To test my web app, I am pasting some random characters from /dev/random into my web frontend. This line throws an error:

    print repr(comment)
    import html5lib
    print html5lib.parse(comment, treebuilder="lxml")

    'a\xef\xbf\xbd\xef\xbf\xbd\xc9\xb6E\xef\xbf\xbd\xef\xbf\xbd`\xef\xbf\xbd]\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd2 \x14\xef\xbf\xbd\xc7\xbe\xef\xbf\xbdy\xcb\x9c\xef\xbf\xbdi1O\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdZ\xef\xbf\xbd.\xef\xbf\xbd\x17^C'
    Unhandled Error Traceback (most recent call last): File "
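The sample in the question contains control characters such as \x14 and \x17, which lxml refuses to store in element text. One possible workaround, sketched here under the assumption of Python 3 and an already-decoded string, is to remove every character outside the XML 1.0 character range before the text reaches lxml. The helper name strip_invalid_xml_chars and the shortened sample string are hypothetical.

```python
import re
from lxml import etree

# Everything that is NOT a legal XML 1.0 character (tab, LF, CR and the allowed ranges).
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Shortened stand-in for the random input in the question (an assumption).
comment = "a\ufffd\u0276E\x14\ufffdy\x17^C"

node = etree.Element("comment")
try:
    node.text = comment  # lxml rejects the control characters
except ValueError as exc:
    print("rejected:", exc)

cleaned = strip_invalid_xml_chars(comment)
node.text = cleaned  # accepted once the control characters are gone
print(repr(node.text))
```

Whether silently dropping characters is acceptable depends on the application; replacing them with U+FFFD instead is an equally simple variation.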

pip install lxml on CentOS: Failed building wheel for lxml

Anonymous (unverified), submitted on 2019-12-03 09:02:45
Question: Running pip install lxml or pip install pyquery gives this error:

    gcc: error trying to exec 'cc1': execvp: No such file or directory
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
    Failed building wheel for lxml
    Failed to build lxml

This error also appears later on:

    gcc: error trying to exec 'cc1': execvp: No such file or directory
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
    Command "/usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-B0y8MK

ImportError: No module named lxml on Mac

Anonymous (unverified), submitted on 2019-12-03 08:59:04
Question: I am having a problem running a Python script, and it is showing this message: ImportError: No module named lxml. I suppose I have to install something called lxml, but I am new to Python and don't really know how. From what I have read in other threads, I think I have two versions of Python installed on my Mac, but I am not sure. How can I solve this issue? Python version: 2.7.6, Mac OS X 10.9.2. Thanks. Answer 1: I recently installed it using pip, but before it would all work, I needed to issue the following command as

Wildcard namespaces in lxml

Anonymous (unverified), submitted on 2019-12-03 08:46:08
Question: How do I query using XPath while ignoring the XML namespace? I am using the Python lxml library. I tried the solution from this question, but it doesn't seem to work.

    In [151]: e.find("./*[local-name()='Buckets']")
      File "<string>", line unknown
    SyntaxError: invalid predicate

Answer 1: Use e.xpath, not e.find:

    import lxml.etree as ET

    content = '''\
    <Envelope xmlns="http://www.example.com/zzz/yyy">
    <Header>
    <Version>1</Version>
    </Header>
    <Buckets>
    some stuff
    </Buckets>
    </Envelope>
    '''
    root = ET.fromstring(content)
    print(root.xpath("./*[local-name()='Buckets']"))
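Alongside the xpath() call from the answer, lxml's find() family also accepts a "{*}" namespace wildcard, which avoids the predicate that find() cannot parse. The sketch below assumes a reasonably recent lxml and reuses the answer's sample document with the whitespace inside Buckets trimmed.

```python
from lxml import etree

content = b"""\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Buckets>some stuff</Buckets>
</Envelope>
"""
root = etree.fromstring(content)

# Full XPath: match on the local name, whatever the namespace is.
by_xpath = root.xpath("./*[local-name()='Buckets']")

# ElementPath (find/findall/iter): "{*}" matches any namespace, or none.
by_wildcard = root.find("{*}Buckets")

print(by_xpath[0].tag)
print(etree.QName(by_wildcard).localname, by_wildcard.text)
```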

Equivalent to InnerHTML when using lxml.html to parse HTML

Anonymous (unverified), submitted on 2019-12-03 08:44:33
Question: I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed. I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag. A title Some text InnerHtml is therefore: A title Some text I can do it using hacks (converting to string/regexes etc.) but I'm assuming that there is a correct way to do this using the library which I am

builtins.TypeError: must be str, not bytes

Anonymous (unverified), submitted on 2019-12-03 08:41:19
Question: I've converted my scripts from Python 2.7 to 3.2, and I have a bug.

    # -*- coding: utf-8 -*-
    import time
    from datetime import date
    from lxml import etree
    from collections import OrderedDict

    # Create the root element
    page = etree.Element('results')
    # Make a new document tree
    doc = etree.ElementTree(page)
    # Add the subelements
    pageElement = etree.SubElement(page, 'Country',Tim = 'Now', name='Germany', AnotherParameter = 'Bye', Code='DE', Storage='Basic')
    pageElement = etree.SubElement(page, 'City', name='Germany', Code='PZ', Storage='Basic'
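The snippet is cut off before the failing line, so the cause can only be guessed. A common source of this error after a move to Python 3 is that etree.tostring() returns bytes by default, which a text-mode file or stream rejects. The sketch below assumes that is the problem and shows both ways around it; the in-memory streams stand in for real files opened with "wb" and "w".

```python
import io
from lxml import etree

# Same construction as in the question, shortened.
page = etree.Element("results")
doc = etree.ElementTree(page)
etree.SubElement(page, "Country", name="Germany", Code="DE", Storage="Basic")

# Option 1: keep bytes and write them to a binary stream (open(path, "wb")).
xml_bytes = etree.tostring(doc, pretty_print=True, xml_declaration=True, encoding="utf-8")
binary_out = io.BytesIO()
binary_out.write(xml_bytes)

# Option 2: ask for str and write it to a text stream (open(path, "w")).
xml_text = etree.tostring(doc, pretty_print=True, encoding="unicode")
text_out = io.StringIO()
text_out.write(xml_text)

print(type(xml_bytes).__name__, type(xml_text).__name__)
```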

lxml etree xmlparser remove unwanted namespace

Anonymous (unverified), submitted on 2019-12-03 08:28:06
Question: I have an XML doc that I am trying to parse using lxml.etree. 1 some stuff My code is:

    path = "path to xml file"
    from lxml import etree as ET
    parser = ET.XMLParser(ns_clean=True)
    dom = ET.parse(path, parser)
    dom.getroot()

When I try to get dom.getroot() I get: However I only want: When I do dom.getroot().find("Body") I get nothing returned. However, when I do dom.getroot().find("{http://www.example.com/zzz/yyy}Body") I get a result. I thought passing ns_clean=True to the parser would prevent this. Any ideas? Answer 1: import io import lxml.etree as
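The ns_clean option only removes redundant namespace declarations; it does not take elements out of their namespace, which is why find("Body") returns nothing. Two possible approaches are sketched below: searching with a prefix map, or rewriting the tags to their local names. The sample document, the prefix "z" and the helper strip_namespaces are assumptions; only the namespace URI and the Body element come from the question.

```python
from lxml import etree

xml = b"""\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>
"""
root = etree.fromstring(xml, etree.XMLParser(ns_clean=True))
print(root.tag)  # still namespace-qualified despite ns_clean=True

# Approach 1: keep the namespace and search with a prefix map.
NS = {"z": "http://www.example.com/zzz/yyy"}
body = root.find("z:Body", namespaces=NS)
print(body.text)


# Approach 2: strip namespaces from element tags in place.
def strip_namespaces(tree_root):
    for elem in tree_root.iter():
        if isinstance(elem.tag, str):  # skip comments and processing instructions
            elem.tag = etree.QName(elem).localname
    etree.cleanup_namespaces(tree_root)  # drop declarations that are no longer used


strip_namespaces(root)
print(root.tag, root.find("Body").text)
```

The second approach changes the document, and this sketch leaves namespaced attributes untouched, so the prefix map is usually the safer choice when the namespace matters downstream.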

Find python lxml version

徘徊边缘, submitted on 2019-12-03 08:08:26
Question: How can I find the installed python-lxml version on a Linux system?

    >>> import lxml
    >>> lxml.__version__
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute '__version__'
    >>> from pprint import pprint
    >>> pprint(dir(lxml))
    ['__builtins__', '__doc__', '__file__', '__name__', '__package__', '__path__', 'get_include', 'os']
    >>>

I can't seem to find it.

Answer 1: You can get the version by looking at etree:

    >>> from lxml import etree
    >>
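The answer is cut off here, so the following is a sketch of the version constants that lxml.etree exposes rather than a reconstruction of the answer's own session.

```python
from lxml import etree

# lxml's own version, as a tuple and as a string.
print("lxml:", etree.LXML_VERSION)
print("lxml (string):", etree.__version__)

# Versions of the underlying libxml2: the one loaded at runtime and the one compiled against.
print("libxml2 in use:", etree.LIBXML_VERSION)
print("libxml2 compiled against:", etree.LIBXML_COMPILED_VERSION)
```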