Beautiful Soup parsing multiple <div> and successive <p> tags into dictionary

六眼飞鱼酱① submitted on 2020-01-07 04:19:28

Question


I have multiple inline divs (which are 'headers') with paragraph tags beneath them (not inside the divs) that are, in theory, their 'children'. I would like to convert this to a dictionary, but I can't quite figure out the best way to do it. Here is roughly what the site looks like:

<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>

The Python code I have working looks like this:

import bs4 as bs

# `source` holds the page's HTML, fetched elsewhere
soup = bs.BeautifulSoup(source, 'lxml')

full_discussion = soup.find(attrs={'class': 'field field-type-text field-field-discussion'})

ava_discussion = full_discussion.find(attrs={'class': 'field-item odd'})

ava_dict = {}

# Map each header div's text to the paragraph text that follows it
for div in ava_discussion.find_all("div"):
    discussion = []

    if div.findNextSibling('p'):
        discussion.append(div.findNextSibling('p').get_text())

    location = div.get_text()

    ava_dict.update({location: {"discussion": discussion}})

However, this code only adds the FIRST <p> tag and then moves on to the next div. Ultimately, I'd like to collect every <p> that follows a div into the discussion list. Help!

UPDATE:

Adding a while loop gives me duplicates of the first <p> tag, one for each <p> tag that exists. Here is the code:

for div in ava_discussion.find_all("div"):
    ns = div.nextSibling

    discussion = []

    while ns is not None and ns.name != "div":
        if ns.name == "p":
            discussion.append(div.findNextSibling('p').get_text())
        ns = ns.nextSibling

    location = div.get_text()

    ava_dict.update({location : {"discussion": discussion}})

print(json.dumps(ava_dict, indent=2))

Answer 1:


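I wasn't adding the correct text: div.findNextSibling('p') always returns the first <p> after the div, so every pass through the loop appended the same paragraph. Appending the text of ns, the sibling currently being visited, fixes it. This code works: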

for div in ava_discussion.find_all("div"):
    ns = div.nextSibling

    discussion = []

    while ns is not None and ns.name != "div":
        if ns.name == "p":
            discussion.append(ns.get_text())
        ns = ns.nextSibling

    location = div.get_text()

    ava_dict.update({location : {"discussion": discussion}})

print(json.dumps(ava_dict, indent=2))
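
For anyone who wants to try the loop above without the original page, here is a minimal self-contained sketch run against the sample markup from the question. It makes a few assumptions that are not in the original post: it uses the built-in html.parser instead of lxml, it omits the wrapping "field-item odd" container, and divs_to_dict is a hypothetical helper name chosen only for illustration.

```python
import json

from bs4 import BeautifulSoup

# Sample markup copied from the question (wrapper container omitted).
html = """
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
"""


def divs_to_dict(container):
    """Hypothetical helper: map each div's text to the <p> texts that follow it."""
    result = {}
    for div in container.find_all("div"):
        discussion = []
        ns = div.next_sibling
        # Walk forward until the next header div or the end of the siblings.
        # Whitespace between tags shows up as strings whose .name is None.
        while ns is not None and ns.name != "div":
            if ns.name == "p":
                discussion.append(ns.get_text())
            ns = ns.next_sibling
        result[div.get_text()] = {"discussion": discussion}
    return result


soup = BeautifulSoup(html, "html.parser")
ava_dict = divs_to_dict(soup)
print(json.dumps(ava_dict, indent=2))
```

With this sample, the first key should end up with two paragraphs in its discussion list and the second key with one.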



Answer 2:


What about this?

    # Inside the `for div in ...` loop, after `discussion = []`:
    paragraphs = div.findNextSiblings('p')
    for sibling in div.findNextSiblings():
        if sibling in paragraphs:
            discussion.append(sibling.get_text())
        else:
            break

Now, who can show me how to make this more elegant? :)
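
One possibly tidier variant, offered as a sketch rather than a definitive answer, is to let itertools.takewhile stop at the first sibling tag that is not a <p>. The helper name paragraphs_after is hypothetical, the sample markup is taken from the question, and html.parser is assumed in place of lxml.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
"""


def paragraphs_after(div):
    """Hypothetical helper: text of the consecutive <p> tags right after div."""
    # find_next_siblings() with no arguments yields only tags, in document
    # order, so the whitespace strings between elements are skipped.
    siblings = div.find_next_siblings()
    return [tag.get_text() for tag in takewhile(lambda t: t.name == "p", siblings)]


soup = BeautifulSoup(html, "html.parser")
ava_dict = {
    div.get_text(): {"discussion": paragraphs_after(div)}
    for div in soup.find_all("div")
}
print(ava_dict)
```

Like the loop in this answer, it stops collecting as soon as it meets a sibling tag that is not a <p>, so paragraphs belonging to a later div are not picked up.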



Source: https://stackoverflow.com/questions/42728401/beautiful-soup-parsing-multiple-div-and-successive-p-tags-into-dictionary
