Get all text inside a tag in lxml

感情迁移 提交于 2019-11-26 17:22:25

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

Ed Summers

Does text_content() do what you need?

Arthur Debert

Just use the node.itertext() method, as in:

 ''.join(node.itertext())

The following snippet which uses python generators works perfectly and is very efficient.

''.join(node.itertext()).strip()

anana

A version of albertov 's stringify-content that solves the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
            (node.tail,)) if chunk)
import urllib2
from lxml import etree
url = 'some_url'

getting url

test = urllib2.urlopen(url)
page = test.read()

getting all html code within including table tag

tree = etree.HTML(page)

xpath selector

table = tree.xpath("xpath_here")
res = etree.tostring(table)

res is the html code of table this was doing job for me.

so you can extract the tags content with xpath_text() and tags including their content using tostring()

div = tree.xpath("//div")
div_res = etree.tostring(div)
text = tree.xpath_text("//content") 

or text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")
div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')

this last line with strip method using is not nice, but it just works

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

or in one line

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

One of the simplest code snippets, that actually worked for me and as per documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text is

etree.tostring(html, method="text")

where etree is a node/tag whose complete text, you are trying to read. Behold that it doesn't get rid of script and style tags though.

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to.

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:

def stringify_children(node):
    """Given a LXML tag, return contents as a string

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> extract_html_content(node)
       "<strong>Sample sentence</strong> with tags."
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    node.attrib.clear()
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    return lxml.html.tostring(node)[opening_tag:closing_tag]

Unlike some of the other answers to this question this solution preserves all of tags contained within it and attacks the problem from a different angle than the other working solutions.

Here is a working solution. We can get content with a parent tag and then cut the parent tag from output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$' 
    content_with_parent = etree.tostring(parent_element)    

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return unichr(int(m.group(1)))

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip() # remove 'white' characters on margins

    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match to end tag while getting content with tags.')

    return content_without_parent

parent_element must have Element type.

Please note, that if you want text content (not html entities in text) please leave html_entities parameter as False.

lxml have a method for that:

node.text_content()
David

If this is an a tag, you can try:

node.values()
import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

print re.search("\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node), re.DOTALL).group(1) 
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!