Equivalent to InnerHTML when using lxml.html to parse HTML

Anonymous (unverified), submitted 2019-12-03 08:44:33

Question:

I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.

I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag.

 

    <body>
    <h1>A title</h1>
    <p>Some text</p>
    </body>

innerHTML is therefore:

    <h1>A title</h1>
    <p>Some text</p>

I can do it using hacks (converting to string/regexes etc) but I'm assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.

EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:

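    from io import StringIO  # Python 3; the original post used cStringIO (Python 2)

    from lxml import html

    t = html.parse(StringIO("""<body>
    <h1>A title</h1>
    <p>Some text</p>
    Untagged text
    <p>
    Unclosed p tag
    </body>"""))
    root = t.getroot()
    body = root.body
    # Leading text of <body>, then each direct child serialized with its tail text.
    print((body.text or '') + ''.join(
        html.tostring(child, encoding='unicode') for child in body.iterchildren()))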

Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
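
To make the fix-up concrete, here is a minimal sketch (the sample markup and variable names are illustrative, not from the original post) showing that lxml.html closes an unclosed p tag by the time the tree is serialized:

```python
from lxml import html

# Illustrative input: the <p> is never closed.
div = html.fromstring('<div><p>Unclosed p tag</div>')

# Serialize back to a str; the parser has already repaired the tree.
serialized = html.tostring(div, encoding='unicode')
print(serialized)
```

If the exact original markup matters, compare the serialized output against the input before relying on it.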

Answer 1:

You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node:

    >>> from io import StringIO
    >>> from lxml import etree
    >>> t = etree.parse(StringIO("""<body>
    ... <h1>A title</h1>
    ... <p>Some text</p>
    ... </body>"""))
    >>> root = t.getroot()
    >>> for child in root.iterdescendants():
    ...     print(etree.tostring(child, encoding='unicode', with_tail=False))
    ...
    <h1>A title</h1>
    <p>Some text</p>

This can be shortened as follows:

    print(''.join(etree.tostring(child, encoding='unicode') for child in root.iterdescendants()))


Answer 2:

Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:

    <body>This text is ignored
    <h1>Title</h1><p>Some text</p></body>

Text directly under the root element is ignored. I ended up doing this:

    (body.text or '') + ''.join(
        html.tostring(child, encoding='unicode') for child in body.iterchildren())


Answer 3:

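    import lxml.etree as ET
    from lxml import html

    # t is the parsed document from the question
    body = t.xpath("//body")
    for tag in body:
        # Re-parse the first and second children of <body> and query each one
        h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
        p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
        htext = h[0].text_content()
        ptext = p[0].text_content()  # the original read h[0] here by mistake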

You can also use .get('href') to read a single attribute of a tag, and .attrib to access all of its attributes.

Here the tag index is hardcoded, but you can also look the tags up dynamically.
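
As a sketch of both remarks (the sample markup, URL, and variable names are illustrative): .get() reads one attribute, .attrib exposes all of them as a mapping, and an XPath query can locate elements without hardcoded child positions.

```python
from lxml import html

# Illustrative markup, not from the original post.
doc = html.fromstring(
    '<html><body><h1>A title</h1>'
    '<p>Some <a href="https://example.com/" class="ext">link</a></p></body></html>'
)

link = doc.xpath('//a')[0]      # found by tag name, not by position
href = link.get('href')         # a single attribute; None if it is missing
attributes = dict(link.attrib)  # all attributes as a plain dict

# Collect the text of every <h1> and <p> without indexing into <body>.
texts = [el.text_content() for el in doc.xpath('//h1 | //p')]

print(href, attributes, texts)
```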


