How to strip entire HTML, CSS and JS code or tags from HTML page in python [duplicate]

Submitted by 本小妞迷上赌 on 2019-12-12 06:24:13

Question


Possible Duplicate:
BeautifulSoup Grab Visible Webpage Text
Web scraping with Python

Say I have a very complex HTML page consisting of the usual HTML tags, with CSS and JS mixed in. We might see all the worst cases.

All I want is to strip all of the above tags and code and return the "text".

In simple terms:

<html><body>Text</body></html>

This might contain JS, CSS, etc.

I am trying to use BeautifulSoup, but it's not removing the JS from the code. Now I am thinking of using a regex, but I'm not sure how to do it.

edit1

Here is my attempt on a simple Bootstrap HTML page:

from bs4 import BeautifulSoup as bs
import requests

# MY_URL is a placeholder for the address of the page being scraped
bs(requests.get(MY_URL).text).get_text()

Returned text:

html
Home
Le styles
body {
        padding-top: 10%;
        padding-left: 30%;
      }
HTML5 shim, for IE6-8 support of HTML5 elements
[if lt IE 9]>
      <script src="http://htm...html5.js"></script>
    <![endif]
Home | Under Construction
Sample Page 1
The app
might
face some ........
Firefox
. Ple..
/container
var _gaq = _gaq || [];

  _gaq.push(['_trackPageview']);

  (function() {
    var ga = do...............
  })();

Answer 1:


Django uses this function to strip tags from text:

import re

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # force_unicode is a Django helper (django.utils.encoding in older releases)
    return re.sub(r'<[^>]*?>', '', force_unicode(value))

(You won't need the force_unicode part)
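
Note that this pattern removes only the tags themselves, so the contents of script and style elements stay in the result, which is the same leftover CSS and JS the question shows. As a rough sketch (not Django's code), one way to handle that with regexes alone is to drop those blocks and HTML comments first and only then strip the remaining tags. The helper name visible_text and the sample string are made up for illustration, and regexes are not a real HTML parser: this assumes reasonably well-formed markup and can be fooled by, for example, a ">" inside an attribute value or a "</script>" inside a JS string.

```python
import re
from html import unescape

# Blocks whose *contents* must go, not just their tags:
# <script>...</script>, <style>...</style> and <!-- comments -->.
_INVISIBLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>|<!--.*?-->",
    re.IGNORECASE | re.DOTALL,
)
_TAG = re.compile(r"<[^>]*?>")  # same pattern as strip_tags above


def strip_tags(value):
    """The function from the answer, without force_unicode."""
    return _TAG.sub("", value)


def visible_text(html):
    """Hypothetical helper: drop script/style/comment blocks, then tags."""
    text = _TAG.sub(" ", _INVISIBLE.sub(" ", html))
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(unescape(text).split())


# Made-up sample loosely modelled on the page in the question.
sample = (
    "<html><head><style>body { padding-top: 10%; }</style></head>"
    "<body><p>Home | Under Construction</p>"
    "<!--[if lt IE 9]><script src='x.js'></script><![endif]-->"
    "<script>var _gaq = _gaq || [];</script></body></html>"
)

print(strip_tags(sample))    # still contains the CSS rule and the JS statement
print(visible_text(sample))  # Home | Under Construction
```

For messy real-world pages, a proper parser is usually the safer choice; the regex version is mainly useful when the input is known to be simple.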



Source: https://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python
