Extracting the text between two header tags using BeautifulSoup in Python

一笑奈何 submitted on 2021-02-18 18:55:47

Question


I am trying to extract the plot of a movie from its Wikipedia page in Python using BeautifulSoup. I am new to both Python and BeautifulSoup, so I am not sure how to approach it.

This is the input HTML.

<h2><span class="mw-headline" id="Plot">Plot</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=1" title="Edit section: Plot">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>A small <a href="/wiki/Pounamu" title="Pounamu">pounamu</a> stone that is    the mystical heart of the island <a href="/wiki/Goddess" title="Goddess">goddess</a> Te Fiti is stolen by the <a href="/wiki/Demigod" title="Demigod">demigod</a> <a href="/wiki/M%C4%81ui_(mythology)" title="Māui (mythology)">Maui</a>, who was planning to give it to humanity as a gift. As Maui makes his escape, he is attacked by the lava <a href="/wiki/Demon" title="Demon">demon</a> Te Kā, causing the heart of Te Fiti as well as his power-granting magical fish hook to be lost in the ocean.</p><p>A millennium later, young Moana Waialiki, daughter and heir of the chief on the small <a href="/wiki/Polynesia" title="Polynesia">Polynesian</a> island of Motunui, is chosen by the ocean to receive the heart, but drops it when her father, Chief Tui, comes to get her. He insists the island provides everything the villagers need. But years later, fish become scarce and the island's vegetation begins dying. Moana proposes going beyond the reef to find more fish. Tui rejects her request, as sailing past the reef is forbidden.</p>`
<p>Moana's grandmother Tala shows Moana a secret cave behind a waterfall, where she finds boats inside and discovers her ancestors were voyagers, sailing and discovering new islands across the world. Tala explains that they stopped voyaging because Maui stole the heart of Te Fiti, causing Te Kā and monsters to appear in the ocean. Tala then says Te Kā's darkness has been spreading from island to island, slowly killing them. Tala gives Moana the heart of Te Fiti, which she has kept safe for her granddaughter.</p>
<p>Tala falls ill and with her dying breaths tells Moana to set sail. Moana and her pet <a href="/wiki/Rooster" title="Rooster">rooster</a> Heihei depart in a <a href="/wiki/Drua" title="Drua">drua</a> to find Maui. A <a href="/wiki/Manta_ray" title="Manta ray">manta ray</a>, Tala's reincarnation, follows. After a <a href="/wiki/Typhoon" title="Typhoon">typhoon</a> wave flips her sailboat and knocks her unconscious, she awakens the next morning on an island inhabited by Maui, who traps her in a cave and takes her sailboat to search for his fishhook. After escaping and catching up to Maui, Moana tries to convince him to return the heart, but Maui refuses, fearing its power will attract dark creatures.</p>
<p>Sentient coconut pirates called Kakamora surround the boat and steal the heart, but Maui and Moana retrieve it. Maui agrees to help return the heart, but only after he reclaims his hook, which is hidden in Lalotai, the Realm of Monsters. At Lalotai, they retrieve it by tricking Tamatoa, a giant <a href="/wiki/Coconut_crab" title="Coconut crab">coconut crab</a>. Maui teaches Moana how to properly sail and navigate. They arrive at Te Fiti, where Te Kā attacks. Maui is overpowered and Te Kā severely damages his hook and repels their boat far out to sea. Fearful that returning to fight Te Kā will destroy his hook, Maui abandons Moana.</p>
<p>Distraught, Moana begs the ocean to take the heart and choose another person to return it to Te Fiti. The spirit of Tala comes to her and encourages to find her true calling within herself. Inspired, Moana retrieves the heart from the ocean and returns to Te Fiti alone. Maui, having had a change of heart, returns to distract the lava demon, and his hook is destroyed in the battle. Realizing that Te Kā is actually Te Fiti without her heart, Moana asks the ocean to clear a path for Te Kā to approach her. She sings a song, asking Te Kā to remember who she truly is, allowing Moana to restore her heart. Te Fiti returns and gives a new canoe to Moana and a new magical hook to Maui before returning to her island form.</p>
<p>In a <a href="/wiki/Post-credits_scene" title="Post-credits scene">post-credits scene</a>, Tamatoa, who has been stranded on his back during Moana and Maui's escape, grumbles to the audience that they would help him if he was a <a href="/wiki/Sebastian_(Disney)" title="Sebastian (Disney)">Jamaican crab named Sebastian</a>.</p>
<h2><span class="mw-headline" id="Cast">Cast</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=2" title="Edit section: Cast">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<div class="thumb tright">

I want to extract only the text between the two h2 tags, which is the plot. How can I do that using BeautifulSoup?

EDIT 1: This is the code I have right now.

import urllib
from BeautifulSoup import *

movie = raw_input('Enter:')
main = "http://www.wikipedia.org"
url = "http://www.wikipedia.org/wiki/"+movie+"_(disambiguation)"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve a list of the anchor tags
# Each tag is like a dictionary of HTML attributes
tags = soup('a')
for tag in tags:
    chk = tag.get('href', None)
    chk = str(chk)
    if "film" in chk:
        final = chk

html = urllib.urlopen(main+final).read()
soup = BeautifulSoup(html)
new = []
spa = soup.findAll("span",id = "Plot")
spa_1 = soup.findAllNext("p")
for i in spa_1:
    print i

I tried to find the element with id="Plot" and then print all the p tags after it.


Answer 1:


The structure of the document is something like this:

[h2] / [span id=Plot]
...
[h2]

We can search for the span with the id "Plot", then walk through the siblings that follow its parent (the H2 node), collecting their text, until we reach the next H2 header.

# collect plot in this list
plot = []

# find the node with id of "Plot"
mark = soup.find(id="Plot")

# walk through the siblings of the parent (H2) node 
# until we reach the next H2 node
for elt in mark.parent.nextSiblingGenerator():
    if elt.name == "h2":
        break
    if hasattr(elt, "text"):
        plot.append(elt.text)

# enjoy
print("".join(plot))
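For readers on Python 3, the same sibling walk can be sketched with the bs4 package (BeautifulSoup 4), where next_siblings is the current name for nextSiblingGenerator(). This is a minimal sketch rather than the answer's original code: the function name text_between_headers, its parameters, and the sample HTML are illustrative assumptions, and it assumes the markup shown in the question, where the id sits on a span inside the h2.

```python
from bs4 import BeautifulSoup, Tag


def text_between_headers(html, section_id="Plot", header="h2"):
    """Return the text between the header containing `section_id` and the next header.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Find the element carrying the id, then the header tag that wraps it.
    mark = soup.find(id=section_id)
    if mark is None:
        return ""
    start = mark.find_parent(header)
    if start is None:
        return ""

    parts = []
    for elt in start.next_siblings:
        # Skip bare strings (such as newlines between tags).
        if not isinstance(elt, Tag):
            continue
        if elt.name == header:
            break  # reached the next section header
        parts.append(elt.get_text())
    return "\n".join(parts)


# Made-up sample that mimics the structure in the question.
sample = (
    '<h2><span id="Plot">Plot</span></h2>'
    "<p>First <a href='#'>linked</a> paragraph.</p>\n"
    "<p>Second paragraph.</p>"
    '<h2><span id="Cast">Cast</span></h2>'
    "<p>Cast list.</p>"
)
print(text_between_headers(sample))
```

Joining with a newline keeps the paragraphs separate, and the isinstance check avoids relying on how string nodes respond to .name or .text.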


Source: https://stackoverflow.com/questions/42450743/extracting-the-text-between-two-header-tags-using-beautifulsoup-in-python
