BeautifulSoup .children or .content without whitespace between tags

你离开我真会死。 提交于 2020-01-03 04:32:12

问题


I want all children of a tag without the whitespace between the tags. But BeautifulSoups .contents and .children also returns the whitespace between the tags.

html = """
<div id="list">
  <span>1</span>
  <a href="2.html">2</a>
  <a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').contents)

This prints:

['\n', <span>1</span>, '\n', <a href="2.html">2</a>, '\n', <a href="3.html">3</a>, '\n']

Same with

print(list(soup.find(id='list').children))

What I want:

[<span>1</span>, <a href="2.html">2</a>, <a href="3.html">3</a>]

Is there any way to tell BeautifulSoup to return only the tags and ignore the whitespace?

The documentation is not very helpful on this topic. The html in the example does not contain any whitespace between tags.

Indeed stripping the html of all whitespace solves my problem:

html = """<div id="list"><span>1</span><a href="2.html">2</a><a href="3.html">3</a></div>"""

This html delivers the tags without whitespace between the tags. But I hoped to use BeautifoulSoup so I do not have to mess around in the html source code. I was hoping BeautifulSoup does that for me.

Another workaround might be:

print(list(filter(lambda t: t != '\n', soup.find(id='list').contents)))

But that seems flaky because is the whitespace guaranteed to be always '\n'?


A note to the duplicate marking brigade:

There are many questions asking about BeautifulSoup and whitespace. Most are asking about getting rid of whitespace from the "rendered text".

For example:

BeautifulSoup - getting rid of paragraph whitespace/line breaks

Removing new line '\n' from the output of python BeautifulSoup

Both questions want the text without whitespace. I want the tags without whitespace. The solutions there don't apply to my question.

Another example:

Regular expression for class with whitespaces using Beautifulsoup

This question is about whitespace in the class attribute.


回答1:


BeautifulSoup has .find_all(True), which returns all tags without the whitespace between the tags:

html = """
<div id="list">
  <span>1</span>
  <a href="2.html">2</a>
  <a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').find_all(True))

Prints:

[<span>1</span>, <a href="2.html">2</a>, <a href="3.html">3</a>]

Combine with recursive=False, and you get only the direct children and not children of children.

html = """
<div id="list">
  <span>1</span>
  <a href="2.html"><b>2</b></a>
  <a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').find_all(True, recursive=False))

(i added <b> to the second child. that would be a grandchild)

Prints:

[<span>1</span>, <a href="2.html"><b>2</b></a>, <a href="3.html">3</a>]

(<b> is not separately listed. it would be if recursive=True)


Trivia: now that I have the solution I found another seemingly unrelated question and answer in StackOverflow where the solution was hidden in a comment:

Why does BeautifulSoup .children contain nameless elements as well as the expected tag(s)



来源:https://stackoverflow.com/questions/56021345/beautifulsoup-children-or-content-without-whitespace-between-tags

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!