lxml | 易学教程

using lxml and iterparse() to parse a big (+- 1Gb) XML file


Question: I have to parse a 1 GB XML file with a structure such as below and extract the text within the tags "Author" and "Content": MM/DD/YY Last Name, Name Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula. MM/DD/YY Last Name, Name Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula. [...] MM/DD/YY Last Name, Name Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula. So far I've tried two things: i) reading the whole file and going through
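A minimal sketch of the streaming approach the title refers to, not the answer from the original thread. The record tag name `Record` and the sample document are hypothetical, since the excerpt's markup was lost; only the "Author" and "Content" tag names come from the question. The import falls back to the standard library's ElementTree, whose `iterparse()` is compatible for this use.

```python
import io

try:
    from lxml import etree  # preferred when installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same iterparse() usage


def iter_records(source, record_tag="Record"):
    """Yield (author, content) pairs one record at a time.

    `record_tag` is a hypothetical name for the element wrapping each entry.
    """
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        yield elem.findtext("Author"), elem.findtext("Content")
        # Empty the finished record so its children and text can be freed.
        # The emptied element itself stays attached to the root.
        elem.clear()


# Small in-memory stand-in for the large file (hypothetical structure).
sample = io.BytesIO(
    b"<Records>"
    b"<Record><Date>MM/DD/YY</Date><Author>Last Name, Name</Author>"
    b"<Content>Lorem ipsum dolor sit amet.</Content></Record>"
    b"<Record><Date>MM/DD/YY</Date><Author>Last Name, Name</Author>"
    b"<Content>Maecenas dictum dictum vehicula.</Content></Record>"
    b"</Records>"
)
records = list(iter_records(sample))
print(records)
```

For a real file, pass the file path or a file object opened in binary mode instead of the `BytesIO` stand-in, and consume the generator lazily rather than building a list.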

What are the differences between lxml and ElementTree?


Question: When it comes to generating XML data in Python, there are two libraries I often see recommended: lxml and ElementTree. From what I can tell, the two libraries are very similar to each other. They both seem to have similar module names, usage guidelines, and functionality. Even the import statements are fairly similar. # Importing lxml and ElementTree import lxml.etree import xml.etree.ElementTree What are the differences between the lxml and ElementTree libraries for Python? Answer 1: ElementTree

random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes


Question: I am, for the sake of testing my web app, pasting some random characters from /dev/random into my web frontend. This line throws an error: print repr(comment) import html5lib print html5lib.parse(comment, treebuilder="lxml") 'a\xef\xbf\xbd\xef\xbf\xbd\xc9\xb6E\xef\xbf\xbd\xef\xbf\xbd`\xef\xbf\xbd]\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd2 \x14\xef\xbf\xbd\xc7\xbe\xef\xbf\xbdy\xcb\x9c\xef\xbf\xbdi1O\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdZ\xef\xbf\xbd.\xef\xbf\xbd\x17^C' Unhandled Error Traceback (most recent call last): File "
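One possible workaround, sketched under the assumption that it is acceptable to drop characters XML 1.0 does not allow before the text reaches the tree builder. The helper name `strip_invalid_xml_chars` is hypothetical, and the sketch uses Python 3 `str` input rather than the Python 2 byte string shown in the question.

```python
import re

# Anything outside the XML 1.0 "Char" production: tab, LF, CR, and the
# ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF are allowed.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return `text` without NULL bytes and other characters XML 1.0 forbids."""
    return _INVALID_XML_CHARS.sub("", text)


# \x14 and \x17 also appear in the question's sample input.
cleaned = strip_invalid_xml_chars("a\x00b\x14c\x17\n\ufffd")
print(repr(cleaned))
```

Replacing the offending characters with U+FFFD instead of deleting them is an equally reasonable choice if the position of the removed input matters.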

Pip install lxml CentOS: Failed building wheel for lxml


Question: Doing pip install lxml or pip install pyquery gives this error: gcc: error trying to exec 'cc1': execvp: No such file or directory error: command 'gcc' failed with exit status 1 ---------------------------------------- Failed building wheel for lxml Failed to build lxml And also this error later on gcc: error trying to exec 'cc1': execvp: No such file or directory error: command 'gcc' failed with exit status 1 ---------------------------------------- Command "/usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-B0y8MK

ImportError: No module named lxml on Mac


Question: I am having a problem running a Python script and it is showing this message: ImportError: No module named lxml I suppose I have to install something called lxml, but I am really new to Python and I don't have much idea about that. I think I have two versions of Python installed on my Mac from what I have read in other threads, but I am not sure. How can I solve this issue? Python Version: 2.7.6 Mac OS X 10.9.2 Thanks Answer 1: I've installed recently using pip , but before it would all work, I needed to issue the following command as
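When two Python installations may be involved, as the asker suspects, a quick diagnostic is to ask the interpreter that runs the script whether it can see the module at all. This is a rough sketch for Python 3, not the command from the truncated answer; the helper name `describe_module` is hypothetical.

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether it can import `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


print(describe_module("lxml"))
```

If `found` is false, installing with the same interpreter (`python -m pip install lxml`, using the path printed under `interpreter`) avoids installing into the other Python by accident.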

Wildcard namespaces in lxml


Question: How to query using xpath ignoring the xml namespace? I am using the Python lxml library. I tried the solution from this question but it doesn't seem to work. In [151]: e.find("./*[local-name()='Buckets']") File "<string>", line unknown SyntaxError: invalid predicate Answer 1: Use e.xpath , not e.find : import lxml.etree as ET content = '''\ <Envelope xmlns="http://www.example.com/zzz/yyy"> <Header> <Version>1</Version> </Header> <Buckets> some stuff </Buckets> </Envelope> ''' root = ET.fromstring(content) print(root.xpath("./*[local-name()='Buckets']"))
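As an alternative to the `xpath()` call in the answer, `find()` accepts a `{*}` namespace wildcard in the tag name. This is a sketch that assumes a reasonably recent lxml (3.0 or later, to my knowledge) or, for the standard-library fallback, Python 3.8 or later; it reuses the sample document from the answer.

```python
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET  # "{*}" wildcards need Python 3.8+ here

content = b"""\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Buckets>
    some stuff
  </Buckets>
</Envelope>
"""

root = ET.fromstring(content)
# "{*}Buckets" matches a child named Buckets in any namespace.
buckets = root.find("{*}Buckets")
print(buckets.tag, buckets.text.strip())
```

The `local-name()` predicate remains the more flexible option when the match has to be part of a larger XPath expression.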

Equivalent to InnerHTML when using lxml.html to parse HTML


Question: I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed. I would like to know what the most sensible way in the library is to do the equivalent of Javascript's InnerHtml - that is, to retrieve or set the complete contents of a tag. A title Some text InnerHtml is therefore: A title Some text I can do it using hacks (converting to string/regexes etc) but I'm assuming that there is a correct way to do this using the library which I am

builtins.TypeError: must be str, not bytes


Question: I've converted my scripts from Python 2.7 to 3.2, and I have a bug. # -*- coding: utf-8 -*- import time from datetime import date from lxml import etree from collections import OrderedDict # Create the root element page = etree.Element('results') # Make a new document tree doc = etree.ElementTree(page) # Add the subelements pageElement = etree.SubElement(page, 'Country',Tim = 'Now', name='Germany', AnotherParameter = 'Bye', Code='DE', Storage='Basic') pageElement = etree.SubElement(page, 'City', name='Germany', Code='PZ', Storage='Basic'
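The excerpt cuts off before the failing line, so the cause here is an assumption: a common source of this error after moving to Python 3 is writing the `bytes` returned by `tostring()` to a file opened in text mode. The sketch below shows the two consistent pairings, using in-memory streams in place of real files.

```python
import io

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same calls

page = etree.Element("results")
etree.SubElement(page, "Country", name="Germany", Code="DE", Storage="Basic")

# Default serialization returns bytes: pair it with a binary stream ("wb").
xml_bytes = etree.tostring(page)
binary_out = io.BytesIO()
binary_out.write(xml_bytes)

# encoding="unicode" returns str: pair it with a text stream ("w").
xml_text = etree.tostring(page, encoding="unicode")
text_out = io.StringIO()
text_out.write(xml_text)

print(type(xml_bytes).__name__, type(xml_text).__name__)
```

With real files, that means `open(path, "wb")` for the bytes form and `open(path, "w", encoding="utf-8")` for the string form.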

lxml etree xmlparser remove unwanted namespace


Question: I have an xml doc that I am trying to parse using lxml.etree 1 some stuff My code is: path = "path to xml file" from lxml import etree as ET parser = ET.XMLParser(ns_clean=True) dom = ET.parse(path, parser) dom.getroot() When I try to get dom.getroot() I get: However I only want: When I do dom.getroot().find("Body") I get nothing returned. However, when I do dom.getroot().find("{http://www.example.com/zzz/yyy}Body") I get a result. I thought passing ns_clean=True to the parser would prevent this. Any ideas? Answer 1: import io import lxml.etree as
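For context, `ns_clean=True` only asks the parser to tidy up redundant namespace declarations; it does not take elements out of their namespace. The answer is truncated, so the following is a different, commonly used technique rather than the original one: rewrite each tag to its local name after parsing. The helper name `strip_namespaces` and the sample document are hypothetical, modeled on the namespace URI and the `Body` element mentioned in the question.

```python
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with the same calls


def strip_namespaces(root):
    """Rewrite every '{uri}local' tag in the tree to plain 'local', in place."""
    for elem in root.iter():
        # In lxml, comments and processing instructions have non-string tags.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


# Hypothetical document using the namespace URI from the question.
content = b"""\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>
"""

root = strip_namespaces(ET.fromstring(content))
print(root.tag, root.find("Body").text)
```

This discards namespace information for good, so it only suits documents where local names are unambiguous.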

Find python lxml version


Question: How can I find the installed python-lxml version in a Linux system? >>> import lxml >>> lxml.__version__ Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'module' object has no attribute '__version__' >>> from pprint import pprint >>> pprint(dir(lxml)) ['__builtins__', '__doc__', '__file__', '__name__', '__package__', '__path__', 'get_include', 'os'] >>> Can't seem to find it Answer 1: You can get the version by looking at etree : >>> from lxml import etree >>
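A library-independent way to check, sketched for Python 3.8 or later: ask the packaging metadata which version of a distribution is installed. This is not the truncated answer's method, and the helper name `installed_version` is hypothetical. On the lxml side, `etree.LXML_VERSION` and `etree.__version__` carry the same information.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("lxml"))
```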
