Question
I don't want lxml to add anything to plain text; I left those values as they are on purpose. However, lxml wraps plain text in a <p> tag. Here a value might be HTML or plain text. I need lxml to process the HTML and leave the plain text alone.
import lxml.html

mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
for text in mixed:
    html = lxml.html.fromstring(text)
    print(lxml.html.tostring(html))
The output:
b'<p>plaintext</p>'
b'<a>HTML</a>'
b'<a>HTML</a>'
What I need is:
b'plaintext'
b'<a>HTML</a>'
b'<a>HTML</a>'
So I have two questions:
- How can I tell that a snippet is plain text, without any HTML tags (so that I don't have to pass it to lxml)? Or,
- How can I stop lxml from adding a <p> tag to plain text?
Answer 1:
Try the w3lib library. It saved me from having to use the "re" module when dealing with an XML page where, for some reason, the Scrapy selectors were behaving oddly.
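# HtmlXPathSelector is an older Scrapy API; assumed available in the installed version
from scrapy.selector import HtmlXPathSelector
from w3lib.html import remove_tags

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Keep only the <loc> entries that match the pattern
    follow = hxs.xpath('//loc').re('.*type=videos.*')
    # Strip the tags from each match, leaving only the text
    follow = [remove_tags(x) for x in follow]
    # Note: this won't remove escape characters such as \n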
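For the first question (telling plain text apart from HTML before handing it to lxml), here is a minimal sketch using only the standard library's html.parser. The names _TagDetector, has_html_tags and normalize are hypothetical helpers made up for illustration and are not part of lxml or w3lib. The sketch assumes that "plain text" means "the parser sees no start or end tag at all", which may be too strict or too loose for some inputs.

```python
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Records whether any start or end tag was seen while parsing."""

    def __init__(self):
        super().__init__()
        self.found_tag = False

    def handle_starttag(self, tag, attrs):
        self.found_tag = True

    def handle_endtag(self, tag):
        self.found_tag = True


def has_html_tags(text):
    """Return True if the snippet contains at least one HTML tag."""
    detector = _TagDetector()
    detector.feed(text)
    detector.close()
    return detector.found_tag


def normalize(text):
    """Hypothetical helper: only snippets that contain tags go through lxml."""
    if not has_html_tags(text):
        # Plain text is returned untouched (as bytes, to match tostring()).
        return text.encode()
    import lxml.html  # imported lazily; only needed for real HTML
    return lxml.html.tostring(lxml.html.fromstring(text))
```

With this in place, the loop from the question could call normalize(text) instead of calling lxml.html.fromstring directly, so that 'plaintext' never reaches lxml and therefore never gets wrapped in a <p> tag.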
Source: https://stackoverflow.com/questions/39865678/prevent-python-lxml-from-adding-plain-text-a-p-tag