lxml

How can one replace an element with text in lxml?

ε祈祈猫儿з submitted on 2019-12-18 03:59:07
Question: It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input: input = '''<everything> <m>Some text before <r/></m> <m><r/> and some text after.</m> <m><r/></m> <m>Text before <r/> and after</m> <m><b/> Text after a sibling <r/> Text before a sibling<b/></m> </everything> ''' ... you could easily remove every <r>
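A minimal sketch of one way to do this, assuming the goal is to swap each <r> for a fixed string while keeping the text around it. The helper name replace_with_text and the string "REPLACED" are made up for illustration. In lxml the text following an element is stored in that element's tail, so the tail has to be copied to the previous sibling (or to the parent's text) before the element is removed:

```python
from lxml import etree


def replace_with_text(element, text):
    """Replace `element` with `text`, preserving the element's tail text."""
    parent = element.getparent()
    if parent is None:
        raise ValueError("cannot replace the root element")
    # The replacement string plus whatever text followed the element.
    replacement = text + (element.tail or "")
    previous = element.getprevious()
    if previous is not None:
        # Text after a sibling lives in that sibling's tail.
        previous.tail = (previous.tail or "") + replacement
    else:
        # No earlier sibling: the text belongs to the parent itself.
        parent.text = (parent.text or "") + replacement
    # remove() drops the element together with its (already copied) tail.
    parent.remove(element)


root = etree.fromstring("<m>Text before <r/> and after</m>")
for r in list(root.iter("r")):  # copy the matches before mutating the tree
    replace_with_text(r, "REPLACED")
result = etree.tostring(root, encoding="unicode")
print(result)
```

The same helper covers the other cases in the question (no text before, no text after, or a sibling element on either side), because it only ever appends to the nearest preceding text slot.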

error with parse function in lxml

雨燕双飞 submitted on 2019-12-18 03:32:44
Question: I have installed lxml 2.2.2 on Windows (I'm using Python 2.6.5). I tried this simple command: from lxml.html import parse p = parse('http://www.google.com').getroot() but I am getting the following error: Traceback (most recent call last): File "", line 1, in p=parse('http://www.google.com').getroot() File "C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html\__init__.py", line 661, in parse return etree.parse(filename_or_url, parser, base_url=base_url, **kw) File "lxml

LXML - Sorting Tag Order

纵然是瞬间 submitted on 2019-12-18 03:01:35
Question: I have a legacy file format which I'm converting into XML for processing. The structure can be summarised as: <A> <A01>X</A01> <A02>Y</A02> <A03>Z</A03> </A> The numerical part of the tags can go from 01 to 99 and there may be gaps. As part of the processing, certain records may have additional tags added. After the processing is completed, I'm converting the file back to the legacy format by iterwalking the tree. The files are reasonably large (~150,000 nodes). A problem with this is that some
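A rough sketch of sorting the children of one record by tag, assuming the problem is that added tags end up out of order. The helper name sort_children_by_tag is made up for illustration, and it assumes the parent holds only elements (no comments or processing instructions). Because the numeric part is zero-padded to two digits, plain string ordering matches numeric ordering:

```python
from lxml import etree


def sort_children_by_tag(parent):
    """Reorder the direct children of `parent` by tag name, in place."""
    # Slice assignment replaces the child list with the sorted sequence.
    parent[:] = sorted(parent, key=lambda child: child.tag)


root = etree.fromstring("<A><A03>Z</A03><A01>X</A01><A02>Y</A02></A>")
sort_children_by_tag(root)
ordered = etree.tostring(root, encoding="unicode")
print(ordered)
```

Each element's tail text moves with it, so pretty-printed input may need its whitespace normalised afterwards.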

How to use xpath from lxml on null namespaced nodes?

百般思念 submitted on 2019-12-18 01:07:10
Question: What is the best way to handle the lack of a namespace on some of the nodes in an XML document using lxml? Should I first modify all None named nodes to add the "gmd" name and then change the tree attributes to name http://www.isotc211.org/2005/gmd as "gmd"? If so, is there a clean way to do this with lxml or something else that would be relatively clean/safe? from lxml import etree nsmap = charts_tree.nsmap nsmap.pop(None) # complains without this on the xpath with # TypeError: empty
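One possible approach, sketched under the assumption that the unprefixed nodes sit in the document's default namespace (the None key in nsmap): leave the tree alone and bind that URI to a prefix only in the mapping passed to xpath(). The helper name prefixed_nsmap and the tiny sample document are invented for illustration:

```python
from lxml import etree

GMD = "http://www.isotc211.org/2005/gmd"


def prefixed_nsmap(element, default_prefix):
    """Copy element.nsmap, rebinding the default (None) namespace to a prefix."""
    nsmap = dict(element.nsmap)  # work on a copy, not the element's own mapping
    default_uri = nsmap.pop(None, None)
    if default_uri is not None:
        nsmap[default_prefix] = default_uri
    return nsmap


# Hypothetical sample: both elements are in the default namespace.
doc = etree.fromstring(
    '<MD_Metadata xmlns="%s"><language>eng</language></MD_Metadata>' % GMD
)
namespaces = prefixed_nsmap(doc, "gmd")
languages = doc.xpath("/gmd:MD_Metadata/gmd:language/text()", namespaces=namespaces)
print(languages)
```

XPath 1.0 has no notion of a default namespace, which is why the prefix is needed in the expression even though the document itself does not use one.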

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

拟墨画扇 submitted on 2019-12-17 23:44:56
Question: I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code: fragment = '<div>This is a text node.<br/>This is another text node.<br/><br/><span>And a child element.</span><span>Another child,<br> with two text nodes</span></div>' h = lxml.html.fromstring(fragment) Output: > h.text_content() 'This is a text node.This is another text node.And a child element.Another child, with two text nodes' Answer 1: Prepending an \n character to the tail of each <br />
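The idea in Answer 1 could look roughly like this. It is a sketch only; the helper name text_with_newlines is made up, and it modifies the tree in place, so calling it twice on the same element would double the newlines:

```python
import lxml.html


def text_with_newlines(element):
    """Return the text content of `element` with each <br> rendered as a newline."""
    for br in element.iter("br"):
        # Text following a <br> is stored in its tail; put the newline in front.
        br.tail = "\n" + (br.tail or "")
    return element.text_content()


fragment = (
    "<div>This is a text node.<br/>This is another text node.<br/><br/>"
    "<span>And a child element.</span>"
    "<span>Another child,<br> with two text nodes</span></div>"
)
h = lxml.html.fromstring(fragment)
text = text_with_newlines(h)
print(repr(text))
```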

Setup.py: install lxml with Python2.6 on CentOS

妖精的绣舞 submitted on 2019-12-17 22:56:46
Question: I have installed Python 2.6.6 on CentOS 5.4: [@SC-055 lxml-2.3beta1]$ python Python 2.6.6 (r266:84292, Jan 4 2011, 09:49:55) [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> I want to use the lxml module, but building it from source failed: src/lxml/lxml.etree.c:157929: error: ‘xsltLibxsltVersion’ undeclared (first use in this function) src/lxml/lxml.etree.c:157941: error: ‘__pyx_v_4lxml_5etree_XSLT_DOC_DEFAULT_LOADER’

Close a tag with no text in lxml

女生的网名这么多〃 submitted on 2019-12-17 20:52:31
Question: I am trying to output an XML file using Python and lxml. However, I notice that if a tag has no text, it is written as a self-closing tag. An example of this would be: root = etree.Element('document') rootTree = etree.ElementTree(root) firstChild = etree.SubElement(root, 'test') The output of this is: <document> <test/> </document> I want the output to be: <document> <test> </test> </document> So basically I want a separate closing tag for a tag which has no text, but is used for its attribute value. How do I do

Remove all html in python?

≡放荡痞女 submitted on 2019-12-17 19:55:30
Question: Is there a way to remove/escape HTML tags using lxml.html and not BeautifulSoup, which has some XSS issues? I tried using Cleaner, but I want to remove all HTML. Answer 1: Try the .text_content() method on an element, probably best after using lxml.html.clean to get rid of unwanted content (script tags etc.). For example: from lxml import html from lxml.html.clean import clean_html tree = html.parse('http://www.example.com') tree = clean_html(tree) text = tree.getroot().text_content() Answer 2: I
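A small offline sketch of the same idea, using a literal string instead of a URL. Here etree.strip_elements stands in for lxml.html.clean as a simple way to drop <script> and <style> bodies before extracting text; that substitution, the helper name strip_tags and the sample markup are assumptions made for illustration:

```python
import lxml.html
from lxml import etree


def strip_tags(markup):
    """Return only the text of an HTML string, without script/style bodies."""
    doc = lxml.html.fromstring(markup)
    # Remove these elements and their contents, but keep any text after them.
    etree.strip_elements(doc, "script", "style", with_tail=False)
    return doc.text_content()


plain = strip_tags("<div><p>Hello <b>world</b></p><script>alert(1)</script></div>")
print(plain)
```

If the text is later written back into an HTML page, it still needs escaping at that point (for example with the standard library's html.escape).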

Beautiful Soup and Table Scraping - lxml vs html parser

余生颓废 submitted on 2019-12-17 19:27:56
Question: I'm trying to extract the HTML code of a table from a webpage using BeautifulSoup. <table class="facts_label" id="facts_table">...</table> I would like to know why the code below works with "html.parser" and prints None if I change "html.parser" to "lxml". #! /usr/bin/python from bs4 import BeautifulSoup from urllib import urlopen webpage = urlopen('http://www.thewebpage.com') soup=BeautifulSoup(webpage, "html.parser") table = soup.find('table', {'class' : 'facts_label'}) print

saving an 'lxml.etree._ElementTree' object

风流意气都作罢 submitted on 2019-12-17 19:19:09
Question: I've spent the last couple of days getting to grips with the basics of lxml; in particular using lxml.html to parse websites and create an ElementTree of the content. Ideally, I want to save the returned ElementTree so that I can load it up and experiment with it, without having to parse the website every time I modify my script. I assumed that pickling would be the way to go, however I'm now beginning to wonder. Although I am able to retrieve an ElementTree object after pickling... type
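An alternative to pickling, sketched under the assumption that re-parsing a local file is acceptable: serialize the tree to disk once and load it back with lxml.html.parse. The helper names save_tree and load_tree, the file name and the sample markup are invented for illustration:

```python
import os
import tempfile

import lxml.html


def save_tree(tree, path):
    """Write an lxml ElementTree to `path` as HTML."""
    tree.write(path, method="html", encoding="utf-8")


def load_tree(path):
    """Parse a previously saved HTML file back into an ElementTree."""
    return lxml.html.parse(path)


# Stand-in for a tree obtained from a website.
tree = lxml.html.document_fromstring(
    "<html><body><p>Hi</p></body></html>"
).getroottree()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    save_tree(tree, path)
    reloaded = load_tree(path)  # fully parsed before the directory is removed

paragraph = reloaded.getroot().findtext(".//p")
print(paragraph)
```

This still parses on every run, but from a local file rather than over the network.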