BeautifulSoup decompose()

Anonymous (unverified), submitted on 2019-12-03 09:05:37

Question:

I'm trying to get rid of <script> tags and the content inside them using BeautifulSoup. I went to the documentation and decompose() seems to be a really simple function to call. Here is the content of the HTML page that I have parsed so far...

<body class="pb-theme-normal pb-full-fluid">
    <div class="pub_300x250 pub_300x250m pub_728x90 text-ad textAd text_ad text_ads text-ads text-ad-links" id="wp-adb-c" style="width: 1px !important;
    height: 1px !important;
    position: absolute !important;
    left: -10000px !important;
    top: -1000px !important;
    ">
    </div>
    <div id="pb-f-a">
    </div>
    <div class="" id="pb-root">
    <script>
    (function(a){
        TWP=window.TWP||{};
        TWP.Features=TWP.Features||{};
        TWP.Features.Page=TWP.Features.Page||{};
        TWP.Features.Page.PostRecommends={};
        TWP.Features.Page.PostRecommends.url="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/hybrid.json?callback\x3d?";
        TWP.Features.Page.PostRecommends.trackUrl="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/tracker.json?callback\x3d?";
        TWP.Features.Page.PostRecommends.profileUrl="https://usersegment.wpdigital.net/usersegments";
        TWP.Features.Page.PostRecommends.canonicalUrl=""
    })(jQuery);
    </script>
    </div>
</body>

Imagine you have some web content like that in a BeautifulSoup object called soup_html. If I run soup_html.script.decompose() and then inspect soup_html, the script tags are still there. How can I get rid of the <script> tags and the content inside them?

markup = 'The html above'
soup = BeautifulSoup(markup)
html_body = soup.body

soup.script.decompose()

html_body

Answer 1:

soup.script.decompose()

This removes only a single script element from the soup: soup.script returns just the first <script> element in the tree. Instead, I think you meant to decompose all of them:

for script in soup("script"):
    script.decompose()
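The difference between the two calls can be seen in a minimal sketch. The markup below is a made-up stand-in with two <script> elements, not the page from the question, and the parser name "html.parser" is chosen here only so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <script> elements (not the original page).
markup = """
<body>
  <div id="pb-root">
    <script>var a = 1;</script>
    <p>Visible text</p>
    <script>var b = 2;</script>
  </div>
</body>
"""

soup = BeautifulSoup(markup, "html.parser")

# soup.script is shorthand for soup.find("script"): only the first match.
soup.script.decompose()
remaining_after_first = len(soup.find_all("script"))

# soup("script") is shorthand for soup.find_all("script"): every match.
for script in soup("script"):
    script.decompose()
remaining_after_loop = len(soup.find_all("script"))

print(remaining_after_first, remaining_after_loop)
```

After the single call one <script> element is still in the tree; after the loop none are left, while the <p> element is untouched.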


Answer 2:

To elaborate on the answer provided by alecxe, here is the same pattern written with findAll, in this case removing every <select> element:

selects = soup.findAll('select')
for match in selects:
    match.decompose()


Answer 3:

soup.script.decompose() would only remove it from the soup variable, not from the html_body variable. You would have to remove it from the html_body variable as well. (I think.)
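This guess is easy to check. In the sketch below (made-up markup, "html.parser" assumed as the parser), html_body is not a copy: it is the same Tag object that lives inside soup, so a <script> removed through soup is gone from html_body too, and no second removal appears to be needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single <script> element.
markup = "<body><div><script>var a = 1;</script><p>Kept</p></div></body>"
soup = BeautifulSoup(markup, "html.parser")
html_body = soup.body  # a reference into the tree, not a copy

soup.script.decompose()

# Look at the tree through the html_body reference.
body_has_script = html_body.find("script") is not None
same_object = html_body is soup.body

print(body_has_script, same_object)
```

If script tags are still visible after a single decompose() call, the more likely cause is that the page holds more than one <script> element.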



Answer 4:

I was able to fix the issue with the following code...

scripts = soup.findAll(['script', 'style'])
for match in scripts:
    match.decompose()
    file_content = soup.get_text()
    # Replace non-ASCII characters with spaces
    content = re.sub(r'[^\x00-\x7f]', r' ', file_content)
# Create the 'txt' file
with open(my_params['q'] + '_' + str(count) + '.txt', 'w+') as webpage_out:
    webpage_out.write(content)
    print('The file ' + my_params['q'] + '_' + str(count) + '.txt ' + 'has been created successfully.')
    count += 1

The error was that the with open(... block was part of the for match... loop, so a file was written and count was incremented once for every tag removed instead of once per page.

Code that did not work...

scripts = soup.findAll(['script', 'style'])
for match in scripts:
    match.decompose()
    file_content = soup.get_text()
    # Replace non-ASCII characters with spaces
    content = re.sub(r'[^\x00-\x7f]', r' ', file_content)
    # Create the 'txt' file (wrongly placed inside the loop)
    with open(my_params['q'] + '_' + str(count) + '.txt', 'w+') as webpage_out:
        webpage_out.write(content)
        print('The file ' + my_params['q'] + '_' + str(count) + '.txt ' + 'has been created successfully.')
        count += 1
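For comparison with the two versions above, here is a self-contained sketch of the same idea: remove every <script> and <style> element first, extract the text once, then write one file. The function name visible_ascii_text, the sample markup, and the output file name are assumptions made for illustration; my_params and count from the answer are not reproduced.

```python
import os
import re
import tempfile

from bs4 import BeautifulSoup


def visible_ascii_text(html):
    """Return the page text without <script>/<style> content,
    with every non-ASCII character replaced by a space."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove all matching elements before extracting any text.
    for match in soup.find_all(["script", "style"]):
        match.decompose()
    text = soup.get_text()
    return re.sub(r"[^\x00-\x7f]", " ", text)


# Hypothetical sample page.
sample = (
    "<body><style>p{color:red}</style>"
    "<p>Caf\u00e9</p>"
    "<script>var x = 1;</script></body>"
)
result = visible_ascii_text(sample)

# Write exactly one file, after the loop has finished (hypothetical file name).
out_path = os.path.join(tempfile.mkdtemp(), "example_0.txt")
with open(out_path, "w") as webpage_out:
    webpage_out.write(result)
```

Calling get_text() after the loop rather than inside it also avoids re-extracting the whole page for every removed tag, and it still produces a result when the page has no <script> or <style> elements at all.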

