I'm trying to remove <script> tags and their contents using BeautifulSoup. I went through the documentation, and it seems to be a really simple function to call. More information about the function is here. Here is the content of the HTML page that I have parsed so far:
<body class="pb-theme-normal pb-full-fluid"> <div class="pub_300x250 pub_300x250m pub_728x90 text-ad textAd text_ad text_ads text-ads text-ad-links" id="wp-adb-c" style="width: 1px !important; height: 1px !important; position: absolute !important; left: -10000px !important; top: -1000px !important; "> </div> <div id="pb-f-a"> </div> <div class="" id="pb-root"> <script> (function(a){ TWP=window.TWP||{}; TWP.Features=TWP.Features||{}; TWP.Features.Page=TWP.Features.Page||{}; TWP.Features.Page.PostRecommends={}; TWP.Features.Page.PostRecommends.url="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/hybrid.json?callback\x3d?"; TWP.Features.Page.PostRecommends.trackUrl="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/tracker.json?callback\x3d?"; TWP.Features.Page.PostRecommends.profileUrl="https://usersegment.wpdigital.net/usersegments"; TWP.Features.Page.PostRecommends.canonicalUrl="" })(jQuery); </script> </div> </body>
Imagine you have some web content like that in a BeautifulSoup object called soup_html. If I run soup_html.script.decompose() and then inspect soup_html, the script tags are still there. How can I get rid of the <script> tags and the content inside them?
from bs4 import BeautifulSoup

markup = 'The html above'
soup = BeautifulSoup(markup)
html_body = soup.body
soup.script.decompose()
html_body
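For reference, here is a minimal sketch of one common way to handle this. It assumes the real page contains more than one <script> tag, in which case soup.script only refers to the first one and a single decompose() call leaves the rest in place. The markup below is a shortened, hypothetical stand-in for the page above, and "html.parser" is just the parser bundled with Python; neither comes from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page above, with two <script> tags.
markup = """
<body class="pb-theme-normal pb-full-fluid">
  <div class="" id="pb-root">
    <script>var a = 1;</script>
    <p>Visible text</p>
    <script>var b = 2;</script>
  </div>
</body>
"""

soup = BeautifulSoup(markup, "html.parser")

# soup.script is only the first <script>, so loop over every match instead.
for script in soup.find_all("script"):
    script.decompose()  # removes the tag and its contents from the tree

html_body = soup.body
remaining_scripts = soup.find_all("script")
print(html_body)
```

Because decompose() modifies the tree in place, html_body reflects the removal whether it is assigned before or after the loop.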