Prevent python lxml from adding plain text a <p> tag

Question

I don't want lxml to add anything to plain text; I left those values as they are on purpose. However, lxml wraps plain text in a <p> tag. Here a value might be HTML or plain text. I need lxml to process the HTML and leave the plain text alone.

import lxml.html
mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
for text in mixed:
    html = lxml.html.fromstring(text)
    print(lxml.html.tostring(html))

The output: b'<p>plaintext</p>' b'<a>HTML</a>' b'<a>HTML</a>'

What I need is: b'plaintext' b'<a>HTML</a>' b'<a>HTML</a>'

So I have two questions:

How can I tell that a snippet is plain text, without any HTML tags (so that I don't have to pass it to lxml)? Or
How can I stop lxml from adding a <p> tag to plain text?

Answer 1:

Try the w3lib library. It saved me from having to use the "re" module when dealing with an XML page where, for some reason, the Scrapy selectors were behaving oddly.

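from scrapy.selector import HtmlXPathSelector  # older Scrapy releases; newer ones offer response.xpath()
from w3lib.html import remove_tags

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Keep only the <loc> entries whose markup matches the video pattern
    follow = hxs.xpath('//loc').re('.*type=videos.*')
    # Strip the surrounding tags, leaving just the text content
    follow = [remove_tags(x) for x in follow]
    # Note: remove_tags does not strip whitespace characters such as \n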

Source: https://stackoverflow.com/questions/39865678/prevent-python-lxml-from-adding-plain-text-a-p-tag

Tags

python

lxml
