How to strip all HTML, CSS and JS code or tags from an HTML page in Python [duplicate]

Question

Possible Duplicate:
BeautifulSoup Grab Visible Webpage Text
Web scraping with Python

Say I have a very complex HTML page consisting of the usual HTML tags, with CSS and JS in the middle. We might see all the worst cases.

All I want is to strip all of the above tags and code and return the "text".

In simple terms:

<html><body>Text</body></html>

The page might also contain JS, CSS, and so on.

I am trying to use BeautifulSoup, but it's not removing the JS from the output. Now I am thinking of using a regex, but I'm not sure how to do it.

edit1

Here is my attempt on a simple Bootstrap HTML page:

from bs4 import BeautifulSoup as bs
import requests

# MY_URL is a placeholder for the address of the page being scraped
bs(requests.get(MY_URL).text).get_text()

This returns the following text:

html
Home
Le styles
body {
        padding-top: 10%;
        padding-left: 30%;
      }
HTML5 shim, for IE6-8 support of HTML5 elements
[if lt IE 9]>
      <script src="http://htm...html5.js"></script>
    <![endif]
Home | Under Construction
Sample Page 1
The app
might
face some ........
Firefox
. Ple..
/container
var _gaq = _gaq || [];

  _gaq.push(['_trackPageview']);

  (function() {
    var ga = do...............
  })();

Answer 1:

Django uses this function to strip tags from text:

import re
from django.utils.encoding import force_unicode  # Django helper (older releases)

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    return re.sub(r'<[^>]*?>', '', force_unicode(value))

(You won't need the force_unicode part.)
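Note that a tag-only regex like the one above removes the markup itself but leaves whatever sits between <script> and <style> tags, which is the JS and CSS residue shown in the question's output. A minimal sketch of one way around that, using only the standard library's html.parser, follows. The class name VisibleTextExtractor and the helper visible_text are hypothetical names chosen for illustration, not part of Django or BeautifulSoup, and the sketch assumes that dropping script, style and comment content is all that "visible text" needs to mean here.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # HTML comments (including IE conditional comments) are routed to
        # handle_comment, which is a no-op here, so they never reach this method.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Hypothetical helper: text of `html` without tags, CSS, or JS."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.get_text()
```

Calling visible_text(requests.get(MY_URL).text) would then stand in for the get_text() call in the question. Real-world pages can still defeat this (for example, text hidden via CSS is kept), so treat it as a starting point rather than a complete solution.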

Source: https://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python

Tags

python

regex

html-parsing

beautifulsoup