Remove HTML tags in AppEngine Python Env (equivalent to Ruby's Sanitize)

爱⌒轻易说出口 提交于 2019-12-11 05:32:31

问题


I am looking for a python module that will help me get rid of HTML tags but keep the text values. I tried BeautifulSoup before and I couldn't figure out how to do this simple task. I tried searching for Python modules that could do this but they all seem to be dependent on other libraries which does not work well on AppEngine.

Below is a sample code from Ruby's sanitize library and that's what I am after in Python:

require 'rubygems'
require 'sanitize'

html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Sanitize.clean(html) # => 'foo'

Thanks for your suggestions.

-e


回答1:


>>> import BeautifulSoup
>>> html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)  
>>> bs.findAll(text=True)
[u'foo']

This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).




回答2:


If you don't want to use separate libs then you can import standard django utils. For example:

from django.utils.html import strip_tags
html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg'
stripped = strip_tags(html)
print stripped 
# you got: foo

Also its already included in Django templates, so you dont need anything else, just use filter, like this:

{{ unsafehtml|striptags }}

Btw, this is one of the fastest way.




回答3:


Using lxml:

htmlstring = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

from lxml.html import fromstring

mySearchTree = fromstring(htmlstring)

for item in mySearchTree.cssselect('a'):
    print item.text



回答4:


#!/usr/bin/python

from xml.dom.minidom import parseString

def getText(el):
    ret = ''
    for child in el.childNodes:
        if child.nodeType == 3:
            ret += child.nodeValue
        else:
            ret += getText(child)
    return ret

html = '<b>this is <a href="http://foo.com/">a link </a> and some bold text  </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
dom = parseString('<root>' + html + '</root>')
print getText(dom.documentElement)

Prints:

this is a link and some bold text followed by an image




回答5:


Late, but.

You can use Jinja2.Markup()

http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags

from jinja2 import Markup 
Markup("<div>About</div>").striptags()
u'About'


来源:https://stackoverflow.com/questions/2415008/remove-html-tags-in-appengine-python-env-equivalent-to-rubys-sanitize

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!