BeautifulSoup - getting rid of paragraph whitespace/line breaks

不打扰是莪最后的温柔 submitted on 2019-12-23 04:03:05

Question


similarlist = res.find_all_next("div", class_="result-wrapper")
for item in similarlist:
    print(item)

This returns:

<div class="result-wrapper">
<div class="row-fluid result-row">
<div class="span6 result-left">
<p>
<a class="tooltipLink warn-cs" data-original-title="Listen" href="..." rel="tooltip"><i class="..."></i></a>
<a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
</p>
</div>   
<div class="span6 result-right row-fluid">
<span class="span9">
<a class="muted-link" href="/dictionary/english-german/do-a-poo">to do a poo</a>, <a class="muted-link" href="/dictionary/english-german/pooh">to pooh</a>
</span>
</div>
</div>
</div>

When I choose to print item.get_text() instead, I get

abgeneigt machen
to disincline




abhängig machen
2137

to predicate




Absenker machen
to layer

So basically a lot of new lines between the list items that I don't need. Is this because of the <p> tags? How do I get rid of them?


Answer 1:


Yes, the HTML also contains whitespace (including newlines) between the tags, and get_text() returns it along with the visible text.

You can easily collapse all multi-line whitespace with a regular expression:

import re

re.sub(r'\n\s*\n', r'\n\n', item.get_text().strip(), flags=re.M)

This collapses any run of whitespace (newlines, spaces, tabs, etc.) between two newlines, so several consecutive blank lines become a single blank line.
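To see what the substitution does, here is a minimal sketch that runs it on a plain string. The sample text and the helper name collapse_blank_lines are made up for illustration; the string only approximates what item.get_text() might return for the page in the question.

```python
import re

# Hypothetical sample that approximates the get_text() output from the question.
raw_text = (
    "\n\nabgeneigt machen\nto disincline\n\n\n\n\n"
    "abhängig machen\n\nto predicate\n\n \t\n\n"
)


def collapse_blank_lines(text):
    """Trim both ends, then turn each run of blank lines into one blank line."""
    return re.sub(r'\n\s*\n', '\n\n', text.strip())


cleaned = collapse_blank_lines(raw_text)
print(cleaned)
```

If no blank lines should remain at all, using '\n' as the replacement instead of '\n\n' should join the entries with single newlines.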




Answer 2:


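You can use the strip() function in Python. Note that it only trims whitespace at the start and end of the string, not the blank lines in between: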

item.get_text().strip()
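Because strip() leaves the inner blank lines alone, one possible follow-up is to filter the lines afterwards. The sketch below uses a made-up sample string in place of the real item.get_text() output, and the variable names are illustrative only.

```python
# Hypothetical sample that approximates the get_text() output from the question.
raw_text = (
    "\n\nabgeneigt machen\nto disincline\n\n\n\n\n"
    "abhängig machen\n\nto predicate\n\n"
)

# strip() trims only the two ends; the blank lines in the middle survive.
trimmed = raw_text.strip()

# Keep only the lines that contain something other than whitespace.
lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
compact = "\n".join(lines)
print(compact)
```

BeautifulSoup can also do this kind of cleanup itself: item.get_text("\n", strip=True) strips each text fragment and joins the non-empty ones with a newline, and item.stripped_strings yields the same fragments one at a time.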



Source: https://stackoverflow.com/questions/24558075/beautifulsoup-getting-rid-of-paragraph-whitespace-line-breaks
