Extracting text between tags using BeautifulSoup

Posted by 大城市里の小女人 on 2019-12-08 07:11:02

Question


I am trying to use BeautifulSoup to extract text from a series of web pages that all follow a similar format. The HTML for the text I want to extract is below. The actual page is here: http://www.p2016.org/ads1/bushad120215.html.

 <p><span style="color: rgb(153, 153, 153);"></span><font size="-1">      <span
 style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font><span style="color: rgb(153, 153, 153);"></span><font size="-1"><span style="font-family: Arial;"><big><span
 style="color: rgb(153, 153, 153);"></span></big></span></font><font
 size="-1"><span style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font><font size="-1"><span style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font></p>   <p><span style="color: rgb(153, 153, 153);">[Music]</span><span
 style="text-decoration: underline;"><br>
</span></p>
<p><small><span style="text-decoration: underline;">TEXT</span>: The
Medal of Honor is the highest award for valor in action against an
enemy force</small><span style="text-decoration: underline;"><br>
</span></p>
<p><span style="text-decoration: underline;">Col. Jay Vargas</span>:&nbsp;
We
were
completely
surrounded,
116 Marines locking heads with 15,000
North Vietnamese.&nbsp; Forty hours with no sleep, fighting hand to
hand.<span style="text-decoration: underline;"><br>
<span style="font-family: helvetica,sans-serif;"><br>
</span>

I'd like to find a way to iterate through all the HTML files in my folder and extract the text between all the <p> tags. Here are the relevant sections of my code:

text=[]

for page in pages:
        html_doc = codecs.open(page, 'r')
        soup = BeautifulSoup(html_doc, 'html.parser')
        for t in soup.find_all('<p>'):
            t = t.get_text()
            text.append(t.encode('utf-8'))
            print t

However, nothing is coming up. Apologies for the noob question and thanks in advance for your help.


Answer 1:


for t in soup.find_all('<p>'):

Specify just the tag name, not its markup representation:

for t in soup.find_all('p'):
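
To see why the original call returns nothing: find_all matches on the tag name, so '<p>' asks for a tag literally named "<p>", which never exists. Here is a minimal sketch of the difference, assuming Python 3 and a made-up HTML string (sample_html is not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical input, only for illustration.
sample_html = "<p>First paragraph</p><p>Second <b>paragraph</b></p>"
soup = BeautifulSoup(sample_html, "html.parser")

# No tag is named "<p>", so this result is empty.
wrong = soup.find_all("<p>")

# Matching on the bare tag name finds both paragraphs.
right = soup.find_all("p")

# get_text() also includes the text of nested tags such as <b>.
paragraph_texts = [p.get_text() for p in right]
print(len(wrong), paragraph_texts)
```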

Here is how you can narrow down the search to the dialogue paragraphs:

for span in soup.find_all("span", style="text-decoration: underline;"):
    text = span.next_sibling

    # next_sibling can be None or another tag; only plain strings have strip()
    if isinstance(text, str):
        print(span.text, text.strip())
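
Putting this together with the loop over files from the question, one possible sketch for Python 3 follows. The helper names extract_dialogue and extract_from_folder, the "*.html" glob pattern, and the whitespace cleanup are assumptions made for illustration. They are not part of the original answer.

```python
from pathlib import Path

from bs4 import BeautifulSoup

UNDERLINE = "text-decoration: underline;"


def extract_dialogue(html):
    """Return (speaker, line) pairs for underlined spans followed by text."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for span in soup.find_all("span", style=UNDERLINE):
        speaker = span.get_text(strip=True)
        sibling = span.next_sibling
        # Skip spans that hold only a <br>, and spans whose next sibling
        # is missing or is another tag rather than a plain string.
        if not speaker or not isinstance(sibling, str):
            continue
        # Collapse hard line breaks and non-breaking spaces into single
        # spaces, then drop the colon that follows the speaker label.
        line = " ".join(sibling.split()).lstrip(": ")
        if line:
            pairs.append((speaker, line))
    return pairs


def extract_from_folder(folder):
    """Map each file name in a folder to its extracted pairs."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        # Passing bytes lets BeautifulSoup work out the encoding itself.
        results[path.name] = extract_dialogue(path.read_bytes())
    return results


sample = (
    '<p><span style="text-decoration: underline;">Col. Jay Vargas</span>:&nbsp;\n'
    'We\nwere\nsurrounded.<span style="text-decoration: underline;"><br>\n'
    '</span></p>'
)
pairs = extract_dialogue(sample)
print(pairs)
```

Returning pairs instead of printing them keeps the speaker and the spoken line together, which is usually easier to work with than one flat list of paragraph strings.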


Source: https://stackoverflow.com/questions/34388284/extracting-text-between-tags-using-beautifulsoup
