Question
I am working on web scraping in Python using BeautifulSoup. I am trying to extract text that is in bold, in italics, or both. Consider the following HTML snippet.
<div>
<b>
<i>
HelloWorld
</i>
</b>
</div>
If I use the command sp.find_all(['i', 'b']), understandably, I get two results, one corresponding to bold and the other to italics, i.e.
['<b><i>HelloWorld</i></b>', '<i>HelloWorld</i>']
My question is: is there a way to extract the text only once and still get the tags that apply to it? My desired output is something like:
tag : text - HelloWorld, tagnames : [b,i]
Please note that comparing the text and weeding out non-unique occurrences of the text is not a feasible option, since I might have 'HelloWorld' repeated many times in the text, which I would want to extract.
Thanks!
Answer 1:
The most natural way of finding nodes that have <b>, <i>, or both among their ancestors would be XPath:
//node()[ancestor::i or ancestor::b]
Instead of node() you could use text() to find text nodes, or * to find elements, depending on the situation. This would not select any duplicates, and it does not care in what order <i> and <b> are nested.
The issue with this idea is that BeautifulSoup does not support XPath. For this reason, I would use lxml instead of BeautifulSoup for web scraping.
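A minimal sketch of the XPath idea with lxml, assuming the markup parses with lxml.html.fromstring. The function name styled_text and the sample markup are made up for illustration. The XPath finds each text node once; the tag names are then recovered by walking up from that node.

```python
from lxml import html as lxml_html

# Hypothetical sample markup: one nested case, one plain run, one single tag.
SNIPPET = "<div><b><i>HelloWorld</i></b> plain <i>again</i></div>"


def styled_text(markup):
    """Return (text, tagnames) pairs for text nodes inside <b> and/or <i>."""
    tree = lxml_html.fromstring(markup)
    results = []
    # text() selects text nodes, so each piece of text is matched exactly once.
    for text in tree.xpath("//text()[ancestor::i or ancestor::b]"):
        stripped = text.strip()
        if not stripped:
            continue  # skip whitespace-only nodes between tags
        node = text.getparent()
        if text.is_tail:
            # Tail text belongs to the parent of the element it follows.
            node = node.getparent()
        names = []
        while node is not None:
            if node.tag in ("b", "i"):
                names.append(node.tag)
            node = node.getparent()
        names.reverse()  # outermost tag first
        results.append((stripped, names))
    return results


for text, names in styled_text(SNIPPET):
    print("text - %s, tagnames : %s" % (text, names))
```

Because the lookup is per text node, repeated occurrences of the same text are all kept, each with its own tag list.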
Answer 2:
I would say that the desired output is not clearly defined. What if you have <b>foo<i>bar</i></b> (it can be even more complicated)?
Anyway, I would say that you have to implement the recursion.
Here is an example:
import bs4
html = """
<div>
<b>
<i>
HelloWorld
</i>
</b>
</div>
"""
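def recursive_find(soup):
    for child in soup.children:
        # Text nodes (NavigableString) have no find_all(), so skip them.
        if not isinstance(child, bs4.Tag):
            continue
        # Look only at the direct <i>/<b> children of this child.
        result = child.find_all(['i', 'b'], recursive=False)
        if result:
            if len(result) == 1:
                # A single styled tag: check for one styled tag nested inside it.
                result_s_result = result[0].find_all(['i', 'b'], recursive=False)
                if len(result_s_result) == 1:
                    print(result_s_result[0].contents)
            else:
                print(result)
        else:
            recursive_find(child)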
oneline_html = "".join(line.strip() for line in html.split("\n"))
soup = bs4.BeautifulSoup(oneline_html, 'html.parser')
recursive_find(soup)
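For comparison, a sketch that stays within BeautifulSoup and produces something close to the output asked for in the question. Instead of searching for tags, it visits each text node once and reads the enclosing <b>/<i> tags from its .parents. The function name text_with_style_tags and its style_tags parameter are hypothetical and not part of either answer.

```python
from bs4 import BeautifulSoup


def text_with_style_tags(markup, style_tags=("b", "i")):
    """Return (text, tagnames) pairs for text nested in any of style_tags."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    # string=True matches every text node, so no text is reported twice.
    for string in soup.find_all(string=True):
        text = string.strip()
        if not text:
            continue  # ignore whitespace between tags
        # .parents walks from the innermost enclosing tag up to the document.
        names = [p.name for p in string.parents if p.name in style_tags]
        if names:
            names.reverse()  # outermost tag first
            found.append((text, names))
    return found


sample = "<div>\n<b>\n<i>\nHelloWorld\n</i>\n</b>\n</div>"
for text, names in text_with_style_tags(sample):
    print("text - %s, tagnames : %s" % (text, names))
```

For the ambiguous case raised above, <b>foo<i>bar</i></b>, this reports foo with [b] and bar with [b, i], which is one possible reading of the requirement rather than the only one.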
Source: https://stackoverflow.com/questions/61475200/python-webscraping-beautifulsoup-avoid-repetition-in-find-all