HTML and BeautifulSoup: how to iteratively parse when the structure is not always known beforehand?

Submitted by て烟熏妆下的殇ゞ on 2019-12-21 05:31:40

Question


I began with a simple HTML structure, something like this:

Thanks to the help of @alecxe, I was able to create this JSON dict:

{u'Outer List': {u'Inner List': [u'info 1', u'info 2', u'info 3']}}

using his code:

from bs4 import BeautifulSoup

data = """your html goes here: see the very end of post"""
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly

# the inner <ul> carries both the inner label and the list of items
inner_ul = soup.find('ul', class_='innerUl')
inner_items = [li.text.strip() for li in inner_ul.ul.find_all('li')]

# the first <span> under each <ul> is that list's label
outer_ul_text = soup.ul.span.text.strip()
inner_ul_text = inner_ul.span.text.strip()

result = {outer_ul_text: {inner_ul_text: inner_items}}
print(result)

The code is fantastic, and I've been trying to rewrite it so that it works iteratively on arbitrary input.

My 'real' HTML dataset is much larger and nastier, and I need to scale the code up in a way that I could handle something like this:

Or, maybe the data looks as such:

To make things even worse, perhaps underneath sublist we have yet another sublist! Ultimately, this is my real situation.

My problem is this: I can't find a way to generalize the aforementioned BeautifulSoup code to deal with either of the above situations (much less the 3rd 'even worse' scenario!).

How can I recursively / iteratively plumb the depths of my HTML, and extract the information, when I don't have access to the exact structure of the HTML beforehand? Is this even possible with BeautifulSoup? Surely there must be some way that I'm missing, to determine the depth first and then proceed.

Thanks a lot for making it this far!

The HTML for the last example is here:

<html>
 <body>
  <ul class="rootList">
   <li class="liItem endPlus">
    <span class="itemToBeAdded">
     Outer List
    </span>
   </li>
   <li class="noBulletsLi ">
    <ul class="innerUl">
     <li class="liItem crossPlus">
      <span class="itemToBeAdded">
       Inner List
      </span>
      <ul class="grayStarUl ">
       <li class="">
        <span class="phrasesToBeAdded">
         info 1
        </span>
       </li>
       <li class="">
        <span class="phrasesToBeAdded">
         info 2
         </span>
       </li>
       <li class="">
        <span class="phrasesToBeAdded">
         info 3
        </span>
        <ul class="grayStarUl">
         <li class="">
          <span class="phrasesToBeAdded">sublist</span>
         </li>
        </ul>
       </li>
      </ul>
     </li>
    </ul>
   </li>
  </ul>
 </body>
</html>

Answer 1:


You can write two parsers which recursively call each other:

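def parse_list(tag):
    # Only the direct <li> children; deeper lists are reached through recursion.
    # A list (not map) so the result is also a real list on Python 3.
    return [parse_list_item(li) for li in tag.find_all('li', recursive=False)]

def parse_list_item(tag):
    # Text sitting directly inside the <li>, if there is any.
    direct = tag.find(string=True, recursive=False)
    text = direct.strip() if direct else ''
    # Label from a <span> that is a direct child of this <li>, if present.
    span = tag.find('span', recursive=False)
    if span is not None:
        text = (text + '\n' + span.get_text()).strip()
    inner = tag.find('ul', recursive=False)
    if inner is None:  # no more nesting
        return text
    # more nesting: key the sub-list by its label when there is one
    return {text: parse_list(inner)} if text else parse_list(inner)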

The code above does not use any class information, and should work regardless of how deeply the inner lists are nested:

>>> parse_list(soup.find('ul'))
[u'Outer List', [{u'Inner List': [u'info 1', u'info 2', {u'info 3': [u'sublist']}]}]]
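
On the "determine the depth first" part of the question: once you have a nested result like the one above, you can walk it afterwards instead of measuring the HTML up front. This is only a rough sketch. `walk` is a hypothetical helper (not part of BeautifulSoup), and `result` is the output above pasted in literally, without the Python 2 `u` prefixes.

```python
def walk(node, path=()):
    """Yield (path, leaf) pairs from nested lists/dicts like the result above."""
    if isinstance(node, dict):
        # a dict maps a list label to its sub-list: extend the path with the label
        for key, children in node.items():
            yield from walk(children, path + (key,))
    elif isinstance(node, list):
        # a plain list adds no label of its own
        for child in node:
            yield from walk(child, path)
    else:
        # a string leaf
        yield path, node


result = ['Outer List',
          [{'Inner List': ['info 1', 'info 2', {'info 3': ['sublist']}]}]]

pairs = list(walk(result))
depth = max(len(path) for path, _ in pairs)  # number of labels above the deepest leaf

for path, leaf in pairs:
    print(' > '.join(path + (leaf,)))
print('max depth:', depth)
```

Note that a label such as 'info 3' shows up only as part of a path here, not as a leaf of its own, because it is a dict key in the nested result.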



Answer 2:


I'm a bit unsure of what you are trying to achieve, so I'm going to presume that you want to extract the data from all the spans and don't care about the structure. If you explain more precisely what you want, I'll update my answer.

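from bs4 import BeautifulSoup

# html_doc is the HTML string to parse
soup = BeautifulSoup(html_doc, 'html.parser')
# the method is find_all, and class is a reserved word, so the keyword is class_
spans = soup.find_all(class_="phrasesToBeAdded")
text = [element.get_text().strip() for element in spans]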
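
If the nesting level turns out to matter after all, one possible middle ground is to keep the flat span search but record how many `<ul>` ancestors each span has. This is only a sketch under assumptions: the `html_doc` below is a cut-down sample made up for illustration (not the asker's full page), and `items` is just a name chosen here.

```python
from bs4 import BeautifulSoup

# Made-up, cut-down sample; the real page is larger.
html_doc = """
<ul>
 <li><span class="phrasesToBeAdded">info 1</span></li>
 <li><span class="phrasesToBeAdded">info 2</span>
  <ul>
   <li><span class="phrasesToBeAdded">sublist</span></li>
  </ul>
 </li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

items = []
for span in soup.find_all("span", class_="phrasesToBeAdded"):
    # every enclosing <ul> counts as one level of nesting
    depth = len(span.find_parents("ul"))
    items.append((depth, span.get_text(strip=True)))

for depth, label in items:
    print("  " * (depth - 1) + label)
```

This does not rebuild the tree, it only tags each phrase with its level, which may be enough if you just need indentation or a depth filter.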


Source: https://stackoverflow.com/questions/22672292/html-and-beautifulsoup-how-to-iteratively-parse-when-the-structure-is-not-alway
