lxml | 易学教程

How can one replace an element with text in lxml?

Question: It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:

    input = '''<everything>
    <m>Some text before <r/></m>
    <m><r/> and some text after.</m>
    <m><r/></m>
    <m>Text before <r/> and after</m>
    <m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
    </everything>
    '''

... you could easily remove every <r>
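
A minimal sketch of one way to do the replacement, assuming the goal is to put a fixed string where each <r/> used to be. The helper name replace_with_text and the placeholder string 'DELETED' are hypothetical and do not come from the question. lxml stores the text that follows an element in that element's tail, and removing the element drops the tail with it, so the sketch first merges the replacement and the tail into the previous sibling's tail (or into the parent's text when there is no previous sibling).

```python
from lxml import etree


def replace_with_text(element, text):
    """Replace `element` with `text`, keeping the text that followed it.

    Hypothetical helper; assumes `element` has a parent (it is not the root).
    """
    parent = element.getparent()
    # The tail would be lost when the element is removed, so keep it.
    text = text + (element.tail or '')
    previous = element.getprevious()
    if previous is not None:
        # Text between siblings lives in the previous sibling's tail.
        previous.tail = (previous.tail or '') + text
    else:
        # No previous sibling: the text belongs to the parent itself.
        parent.text = (parent.text or '') + text
    parent.remove(element)


doc = etree.fromstring(
    '<everything>'
    '<m>Some text before <r/></m>'
    '<m><r/> and some text after.</m>'
    '<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>'
    '</everything>'
)
# xpath() returns a list, so the tree can be modified while looping.
for r in doc.xpath('//r'):
    replace_with_text(r, 'DELETED')
print(etree.tostring(doc).decode())
```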

error with parse function in lxml

Question: I have installed lxml 2.2.2 on Windows (I'm using Python 2.6.5). I tried this simple command:

    from lxml.html import parse
    p = parse('http://www.google.com').getroot()

but I am getting the following error:

    Traceback (most recent call last):
      File "", line 1, in
        p=parse('http://www.google.com').getroot()
      File "C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html\__init__.py", line 661, in parse
        return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
      File "lxml

LXML - Sorting Tag Order

Question: I have a legacy file format which I'm converting into XML for processing. The structure can be summarised as:

    <A>
      <A01>X</A01>
      <A02>Y</A02>
      <A03>Z</A03>
    </A>

The numerical part of the tags can go from 01 to 99 and there may be gaps. As part of the processing, certain records may have additional tags added. After the processing is completed, I'm converting the file back to the legacy format by iterwalking the tree. The files are reasonably large (~150,000 nodes). A problem with this is that some
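
The excerpt is cut off, but the title suggests the children need to be put back into tag order before the tree is walked. A minimal sketch, assuming every child is a regular element with a zero-padded tag such as A01 to A99 so that a plain string sort gives the right order. The helper name sort_children_by_tag is hypothetical.

```python
from lxml import etree


def sort_children_by_tag(parent):
    """Reorder the direct children of `parent` by tag name, in place.

    Assumes all children are elements; comments and processing
    instructions have non-string tags and would break the comparison.
    """
    parent[:] = sorted(parent, key=lambda child: child.tag)


record = etree.fromstring('<A><A03>Z</A03><A01>X</A01><A02>Y</A02></A>')
sort_children_by_tag(record)
print([child.tag for child in record])
```

Each element carries its tail with it when it moves, so any indentation whitespace in a pretty-printed file may end up in a different place after sorting.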

How to use xpath from lxml on null namespaced nodes?

Question: What is the best way to handle the lack of a namespace on some of the nodes in an XML document using lxml? Should I first modify all None-named nodes to add the "gmd" name and then change the tree attributes to name http://www.isotc211.org/2005/gmd as "gmd"? If so, is there a clean way to do this with lxml, or something else that would be relatively clean/safe?

    from lxml import etree
    nsmap = charts_tree.nsmap
    nsmap.pop(None)  # complains without this on the xpath with
    # TypeError: empty
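
One common approach leaves the document untouched and instead builds a separate prefix map for XPath, in which the default (None) namespace is given a prefix. A minimal sketch under that assumption; the helper name xpath_nsmap and the sample document are hypothetical, and the sketch assumes the chosen prefix is not already bound to a different URI.

```python
from lxml import etree


def xpath_nsmap(element, default_prefix='gmd'):
    """Copy element.nsmap, binding the default namespace to a prefix.

    XPath has no notion of a default namespace, so the None key must be
    replaced by a real prefix before the map is passed to xpath().
    """
    return {prefix or default_prefix: uri
            for prefix, uri in element.nsmap.items()}


# Hypothetical sample document, not taken from the question.
doc = etree.fromstring(
    '<MD_Metadata xmlns="http://www.isotc211.org/2005/gmd"'
    ' xmlns:gco="http://www.isotc211.org/2005/gco">'
    '<title><gco:CharacterString>Chart 1</gco:CharacterString></title>'
    '</MD_Metadata>'
)
ns = xpath_nsmap(doc)
titles = doc.xpath('//gmd:title/gco:CharacterString/text()', namespaces=ns)
print(titles)
```

Note that element.nsmap only lists the namespaces in scope at that element, so declarations made deeper in the tree would need to be added to the map by hand.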

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

Question: I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code:

    fragment = '<div>This is a text node.<br/>This is another text node.<br/><br/><span>And a child element.</span><span>Another child,<br> with two text nodes</span></div>'
    h = lxml.html.fromstring(fragment)

Output:

    > h.text_content()
    'This is a text node.This is another text node.And a child element.Another child, with two text nodes'

Answer 1: Prepending an \n character to the tail of each <br />
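
The idea in Answer 1 can be sketched as follows. The function name text_with_newlines is hypothetical, and working on a deep copy is an added assumption so that the caller's tree is left unchanged.

```python
import copy

import lxml.html


def text_with_newlines(element):
    """Return the text content of `element` with each <br> as a newline."""
    element = copy.deepcopy(element)  # do not modify the caller's tree
    for br in element.xpath('.//br'):
        # A <br> has no text of its own; what follows it is its tail.
        br.tail = '\n' + (br.tail or '')
    return element.text_content()


fragment = ('<div>This is a text node.<br/>This is another text node.<br/><br/>'
            '<span>And a child element.</span>'
            '<span>Another child,<br> with two text nodes</span></div>')
h = lxml.html.fromstring(fragment)
print(text_with_newlines(h))
```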

Setup.py: install lxml with Python2.6 on CentOS

Question: I have installed Python 2.6.6 on CentOS 5.4:

    [@SC-055 lxml-2.3beta1]$ python
    Python 2.6.6 (r266:84292, Jan 4 2011, 09:49:55)
    [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

I want to use the lxml module, but building from source failed:

    src/lxml/lxml.etree.c:157929: error: ‘xsltLibxsltVersion’ undeclared (first use in this function)
    src/lxml/lxml.etree.c:157941: error: ‘__pyx_v_4lxml_5etree_XSLT_DOC_DEFAULT_LOADER’

Close a tag with no text in lxml

Question: I am trying to output an XML file using Python and lxml. However, I notice that if a tag has no text, it is written as a self-closing tag rather than with a separate closing tag. An example of this would be:

    root = etree.Element('document')
    rootTree = etree.ElementTree(root)
    firstChild = etree.SubElement(root, 'test')

The output of this is:

    <document>
      <test/>
    </document>

I want the output to be:

    <document>
      <test>
      </test>
    </document>

So basically I want to close a tag which has no text, but is used for its attribute value. How do I do
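
To an XML parser, <test/> and <test></test> are the same element, so this only matters when a downstream consumer is strict about the form. A minimal sketch of one commonly used workaround, assuming an explicit open/close pair is all that is needed: setting the element's text to an empty string instead of leaving it as None.

```python
from lxml import etree

root = etree.Element('document')
first_child = etree.SubElement(root, 'test')

# With text left as None the element serializes as <test/>.
self_closed = etree.tostring(root)

# An empty string makes lxml write separate open and close tags.
first_child.text = ''
explicit = etree.tostring(root)

print(self_closed.decode())
print(explicit.decode())
```</document>'
</test>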

Remove all html in python?

Question: Is there a way to remove/escape HTML tags using lxml.html and not BeautifulSoup, which has some XSS issues? I tried using Cleaner, but I want to remove all HTML.

Answer 1: Try the .text_content() method on an element, probably best after using lxml.html.clean to get rid of unwanted content (script tags etc.). For example:

    from lxml import html
    from lxml.html.clean import clean_html

    tree = html.parse('http://www.example.com')
    tree = clean_html(tree)
    text = tree.getroot().text_content()

Answer 2: I
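
A variant of the same idea that works on a string and does not depend on lxml.html.clean (which recent lxml releases ship as a separate package). This is a sketch, not the answerer's code: the function name strip_all_html is hypothetical, and dropping only script and style elements is an assumption about what counts as unwanted content.

```python
from lxml import etree, html


def strip_all_html(markup):
    """Return only the text of an HTML string, without any tags."""
    tree = html.fromstring(markup)
    # Drop elements whose text is code rather than content, but keep
    # whatever text follows them (with_tail=False preserves the tail).
    etree.strip_elements(tree, 'script', 'style', with_tail=False)
    return tree.text_content()


print(strip_all_html(
    '<div><p>Hello <b>world</b>!</p><script>alert(1)</script></div>'))
```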

Beautiful Soup and Table Scraping - lxml vs html parser

Question: I'm trying to extract the HTML code of a table from a webpage using BeautifulSoup.

    <table class="facts_label" id="facts_table">...</table>

I would like to know why the code below works with "html.parser" but prints None if I change "html.parser" to "lxml".

    #! /usr/bin/python
    from bs4 import BeautifulSoup
    from urllib import urlopen

    webpage = urlopen('http://www.thewebpage.com')
    soup = BeautifulSoup(webpage, "html.parser")
    table = soup.find('table', {'class' : 'facts_label'})
    print

saving an 'lxml.etree._ElementTree' object

Question: I've spent the last couple of days getting to grips with the basics of lxml, in particular using lxml.html to parse websites and create an ElementTree of the content. Ideally, I want to save the returned ElementTree so that I can load it up and experiment with it, without having to parse the website every time I modify my script. I assumed that pickling would be the way to go; however, I'm now beginning to wonder. Although I am able to retrieve an ElementTree object after pickling... type
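
An alternative to pickling is to write the tree to disk as HTML and parse that local file on later runs. A minimal sketch under that assumption; the function names save_tree and load_tree and the file name page.html are hypothetical, and a small inline document stands in for the downloaded page.

```python
import os
import tempfile

from lxml import html


def save_tree(tree, path):
    """Serialize an lxml ElementTree to `path` as HTML."""
    tree.write(path, method='html', encoding='utf-8')


def load_tree(path):
    """Parse a previously saved HTML file back into an ElementTree."""
    return html.parse(path)


# Stand-in for a page fetched once from the network.
page = html.fromstring('<html><body><p>hi</p></body></html>').getroottree()

path = os.path.join(tempfile.mkdtemp(), 'page.html')
save_tree(page, path)
reloaded = load_tree(path)
print(reloaded.getroot().findtext('.//p'))
```

The reloaded tree comes from a fresh parse of the saved markup, so it matches the original in content rather than being the same Python objects.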