Python web scraping with BeautifulSoup: avoid repetition in find_all()

Question

I am working on web scraping in Python using BeautifulSoup. I am trying to extract text that is bold, italic, or both. Consider the following HTML snippet.

<div>
  <b> 
    <i>
      HelloWorld
   </i>
  </b>
</div>

If I use the command sp.find_all(['i', 'b']), understandably, I get two results, one corresponding to the bold tag and the other to the italic tag, i.e.

['HelloWorld', 'HelloWorld']
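
The duplication can be reproduced with a minimal sketch like the one below. It assumes the snippet is parsed with the built-in html.parser and that sp is the resulting soup; the variable names matches, names and texts are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<div><b><i>HelloWorld</i></b></div>"
sp = BeautifulSoup(html, "html.parser")

# find_all matches the outer <b> and the nested <i> as two separate tags,
# so the same text is reported once per matching tag.
matches = sp.find_all(["i", "b"])
names = [tag.name for tag in matches]
texts = [tag.get_text(strip=True) for tag in matches]

print(names)
print(texts)
```

Each match is a separate tag object, which is why the text alone cannot tell the two results apart.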

My question is: is there a way to extract the text only once and also get the tags that apply to it? My desired output is something like:

tag : text - HelloWorld, tagnames : [b,i]

Please note that comparing the text and weeding out non-unique occurrences of the text is not a feasible option, since I might have 'HelloWorld' repeated many times in the text, which I would want to extract.

Thanks!

Answer 1:

The most natural way of finding nodes that have <i> or <b> (or both) among their ancestors would be XPath:

//node()[ancestor::i or ancestor::b]

Instead of node() you could use text() to find text nodes, or * to find elements, depending on the situation. This would not select any duplicates and it does not care in what order <i> and <b> are nested.

The issue with this idea is that BeautifulSoup does not support XPath. For this reason, I would use lxml instead of BeautifulSoup for web scraping.
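
As a rough sketch of that suggestion, the XPath can be run with lxml as shown below. The helper name styled_text and the sample markup are assumptions made for illustration; the tag names are collected by walking up from each text node, which is one possible way to get the desired "text plus tag names" output, not the only one.

```python
from lxml import html as lxml_html

snippet = "<div><b><i>HelloWorld</i></b> plain <i>HelloWorld</i></div>"
root = lxml_html.fromstring(snippet)

def styled_text(root):
    """Return (text, [tag names]) for each text node inside <b> and/or <i>."""
    results = []
    # text() selects every text node once, however deeply <b>/<i> are nested.
    for text in root.xpath("//text()[ancestor::i or ancestor::b]"):
        stripped = text.strip()
        if not stripped:
            continue
        parent = text.getparent()
        # For tail text, getparent() returns the preceding sibling element,
        # so step up once more to reach the element that contains the text.
        if text.is_tail:
            parent = parent.getparent()
        chain = [parent] + list(parent.iterancestors())
        names = sorted({el.tag for el in chain if el.tag in ("b", "i")})
        results.append((stripped, names))
    return results

found = styled_text(root)
for text, names in found:
    print("text:", text, "tagnames:", names)
```

Repeated occurrences of the same text are kept, because the selection is per text node rather than per string value.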

Answer 2:

I would say that the problem is not clearly defined. What if you have foobar (it can be even more complicated)?

Anyway, I would say that you have to implement the recursion yourself.

Here is an example:

import bs4

html = """
<div>
  <b> 
    <i>
      HelloWorld
   </i>
  </b>
</div>
"""

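def recursive_find(soup):
    for child in soup.children:
        # Text nodes (NavigableString) have no children to search; skip them.
        if not isinstance(child, bs4.Tag):
            continue
        # Look only at the direct children of this element.
        result = child.find_all(['i', 'b'], recursive=False)
        if result:
            if len(result) == 1:
                # A single <b>/<i> child: check for one more nested <b>/<i>.
                result_s_result = result[0].find_all(['i', 'b'], recursive=False)
                if len(result_s_result) == 1:
                    print(result_s_result[0].contents)
            else:
                print(result)
        else:
            # No <b>/<i> directly below this element; search one level deeper.
            recursive_find(child)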

# Strip each line and join them so that no whitespace-only text nodes sit between the tags.
oneline_html = "".join(line.strip() for line in html.split("\n"))

soup = bs4.BeautifulSoup(oneline_html, 'html.parser')

recursive_find(soup)
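
An alternative that stays within BeautifulSoup is to start from the text nodes instead of the tags, and walk up through each node's parents to collect the <b>/<i> ancestors. This is only a sketch under that assumption, not the approach of the answer above; the function name styled_strings and the wanted parameter are hypothetical.

```python
from bs4 import BeautifulSoup

def styled_strings(soup, wanted=("b", "i")):
    """Return (text, [tag names]) for each text node nested in a wanted tag."""
    results = []
    # string=True matches every text node, so each piece of text is visited once.
    for text in soup.find_all(string=True):
        stripped = text.strip()
        if not stripped:
            continue
        # .parents yields every enclosing element, innermost first.
        names = sorted({p.name for p in text.parents if p.name in wanted})
        if names:
            results.append((stripped, names))
    return results

html = "<div><b><i>HelloWorld</i></b> plain <i>HelloWorld</i></div>"
soup = BeautifulSoup(html, "html.parser")

found = styled_strings(soup)
for text, names in found:
    print("tag : text -", text, ", tagnames :", names)
```

Because the loop is driven by text nodes, nesting order does not matter and a string that appears several times in the document is reported once per occurrence.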

Source: https://stackoverflow.com/questions/61475200/python-webscraping-beautifulsoup-avoid-repetition-in-find-all

Tags

python

html

web-scraping

beautifulsoup