Prevent Python lxml from wrapping plain text in a <p> tag

你。 Submitted on 2019-12-11 08:44:08

Question


I don't want lxml to add anything to plain text; I left those strings as they are on purpose. However, lxml wraps plain text in a <p> tag. Here each value might be HTML or plain text. I need lxml to process the HTML and leave the plain text alone.

import lxml.html
mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
for text in mixed:
    html = lxml.html.fromstring(text)
    print(lxml.html.tostring(html))

The output: b'<p>plaintext</p>' b'<a>HTML</a>' b'<a>HTML</a>'

What I need is: b'plaintext' b'<a>HTML</a>' b'<a>HTML</a>'

So I have two questions:

  1. How can I tell that a snippet is plain text, without any HTML tags (so that I don't have to pass it to lxml)? Or
  2. How can I stop lxml from adding a <p> tag to plain text?

Answer 1:


Try the w3lib library. It saved my butt from having to use the "re" module when dealing with an XML page where, for some dumb reason, Scrapy selectors were behaving wonkily.

from w3lib.html import remove_tags

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    follow = hxs.xpath('//loc').re('.*type=videos.*')
    follow = [remove_tags(x) for x in follow]
    # Note: remove_tags only strips tags; it won't remove characters such as \n


Source: https://stackoverflow.com/questions/39865678/prevent-python-lxml-from-adding-plain-text-a-p-tag
