Python web scraping with BeautifulSoup: avoid repetition in find_all()

Posted by 坚强是说给别人听的谎言 on 2021-01-29 15:51:48

Question


I am working on web scraping in Python using BeautifulSoup. I am trying to extract text that is in bold, in italics, or both. Consider the following HTML snippet.

<div>
  <b> 
    <i>
      HelloWorld
   </i>
  </b>
</div>

If I use the command sp.find_all(['i', 'b']), understandably, I get two results, one corresponding to bold and the other to italics. i.e.

['<b><i>HelloWorld</i></b>', '<i>HelloWorld</i>']

Is there a way to extract the text only once and also get the names of the tags that enclose it? My desired output is something like:

tag : text - HelloWorld, tagnames : [b,i]

Please note that comparing the text and weeding out non-unique occurrences of the text is not a feasible option, since I might have 'HelloWorld' repeated many times in the text, which I would want to extract.

Thanks!


Answer 1:


The most natural way of finding nodes that have <b> or <i> (or both) among their ancestors would be XPath:

//node()[ancestor::i or ancestor::b]

Instead of node() you could use text() to find text nodes, or * to find elements, depending on the situation. This would not select any duplicates and it does not care in what order <i> and <b> are nested.

The issue with this idea is that BeautifulSoup does not support XPath. For this reason, I would use lxml instead of BeautifulSoup for web scraping.
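
As a minimal sketch of that idea, assuming lxml is installed: the snippet below runs the XPath with text() and then walks up from each matched text node to collect the names of its <b>/<i> ancestors. The sample markup and the variable names are illustrative and not part of the question.

```python
from lxml import html as lxml_html

# Illustrative markup (an assumption): nested, unstyled, and single-tag text.
snippet = "<div><b><i>HelloWorld</i></b> plain <i>again</i></div>"
root = lxml_html.fromstring(snippet)

# Each text node with at least one <i> or <b> ancestor is selected exactly once.
text_nodes = root.xpath("//text()[ancestor::i or ancestor::b]")

results = []
for node in text_nodes:
    if not node.strip():
        continue  # ignore whitespace-only text nodes
    element = node.getparent()
    if node.is_tail:
        # Tail text belongs to the parent of the element it follows.
        element = element.getparent()
    # Walk from the containing element up to the root, innermost first.
    chain = [element] + list(element.iterancestors())
    names = [el.tag for el in chain if el.tag in ("b", "i")]
    results.append((str(node).strip(), names[::-1]))  # outermost tag first

for text, names in results:
    print(f"text: {text}, tagnames: {names}")
```

The unstyled " plain " text is not selected, because its only ancestor is the <div>.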




Answer 2:


I would say that the desired result is not clearly defined. What if you have <b>foo<i>bar</i></b> (or something even more complicated)?

Anyway, I would say that you have to implement the recursion yourself.

Here is an example:

import bs4

html = """
<div>
  <b> 
    <i>
      HelloWorld
   </i>
  </b>
</div>
"""

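def recursive_find(soup):
    for child in soup.children:
        # Text nodes (NavigableString) have no children to search; skip them.
        if not isinstance(child, bs4.Tag):
            continue
        # Look for <i>/<b> among the direct children only.
        result = child.find_all(['i', 'b'], recursive=False)
        if result:
            if len(result) == 1:
                # A single match: check whether it wraps exactly one nested <i>/<b>.
                result_s_result = result[0].find_all(['i', 'b'], recursive=False)
                if len(result_s_result) == 1:
                    print(result_s_result[0].contents)
            else:
                print(result)
        else:
            # No direct match at this level: descend into the child.
            recursive_find(child)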

oneline_html = "".join(line.strip() for line in html.split("\n"))

soup = bs4.BeautifulSoup(oneline_html, 'html.parser')

recursive_find(soup)
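
Another possible approach that stays within BeautifulSoup is to start from the text nodes instead of the tags: each text node is visited once, and find_parents() reports which <b>/<i> tags enclose it. This is only a sketch; the helper name styled_text and the sample markup are hypothetical and not from the question.

```python
import bs4

# Illustrative markup (an assumption): the same text appears twice on purpose.
html = "<div><b><i>HelloWorld</i></b><p>plain</p><i>HelloWorld</i></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

def styled_text(soup, names=("b", "i")):
    """Hypothetical helper: return (text, enclosing tag names) pairs."""
    results = []
    for text in soup.find_all(string=True):
        stripped = text.strip()
        if not stripped:
            continue  # skip whitespace-only text nodes
        # find_parents() yields matching ancestors, closest first.
        tags = [tag.name for tag in text.find_parents(list(names))]
        if tags:
            results.append((stripped, tags[::-1]))  # outermost tag first
    return results

for text, tags in styled_text(soup):
    print(f"text: {text}, tagnames: {tags}")
```

Because the loop runs over text nodes, repeated strings such as the two 'HelloWorld' occurrences are each reported separately, and a case like <b>foo<i>bar</i></b> yields 'foo' with [b] and 'bar' with [b, i].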


Source: https://stackoverflow.com/questions/61475200/python-webscraping-beautifulsoup-avoid-repetition-in-find-all
