BeautifulSoup Tag Removal

Submitted by 懵懂的女人 on 2019-12-24 15:48:03

Question


I am looking to parse an HTML table with Python/BeautifulSoup...

This is my first attempt at coding anything in Python, so it's probably not the most efficient.

I grabbed a function from another post here (works great for the most part), but I am running into a couple of problems.

The code I am running is here:

def strip_tags(html, invalid_tags):
    bs2 = BeautifulSoup(str(html))
    for tag in bs2.findAll(True):
        if tag.name in invalid_tags:
            s = ""      

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)
    return bs2

invalid_tags = ['td','b']

for row in bs.findAll('tr'):
    col = row.findAll('td')

for index,item in enumerate(col):
    t = item.findAll('a')
    for ta in t:
        ta.replaceWithChildren()
        col[index] == item  

for item in col:
    print(strip_tags(item.string, invalid_tags).string)

The raw data table (HTML) looks like this:

<td align="left">11/10</td>
<td>N ARMY</td>
<td>-7.5</td>
<td>NL</td>
<td><b>76-65</b></td>
<td><span style="color:green">W</span></td>
<td><span style="color:green">W</span></td>
<td></td>
<td class="cell4">50.0%</td>
<td class="cell4">76.9%</td>
<td class="cell4">37.5%</td>
<td class="cell5">37.1%</td>
<td class="cell5">90.0%</td>
<td class="cell5">29.4%</td>

When I run the strip_tags function, it works for all the tags except for the second line, where 'None' is returned as the output.

If anyone could provide any insight on why this is happening I would greatly appreciate it.

Edit: wow, thanks for everyone's quick responses. Anyhow, here is what happens when I run the code:

11/10
None
-7.5
NL
76-65
W
W
None
50.0%
76.9%
37.5%
37.1%
90.0%
29.4%

The problem lies around the second line, where it returns 'None' instead of 'N ARMY'. So yes, ideally I would like just the text that is found within the tags.


Answer 1:


If I'm understanding the output you want correctly, you shouldn't need to do any manual removing of tags -- that's why we use BeautifulSoup! ;)

What you need to call is the get_text() method on the tag instances that find_all() returns.
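As a small illustration of why get_text() is the more forgiving choice, the sketch below compares it with the .string attribute that the question's code relies on. In bs4, .string is None for a tag with no children or with more than one child, while get_text() always returns a string. The HTML snippet and the variable names are made up for this example, and the thread does not confirm that this is what produced the asker's 'None' for the second cell.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: plain text, nested tag, empty, and mixed content.
html = (
    "<table><tr>"
    "<td>N ARMY</td>"
    "<td><b>76-65</b></td>"
    "<td></td>"
    "<td>a<b>b</b></td>"
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string only works when a tag has exactly one child (followed recursively).
strings = [td.string for td in cells]

# get_text() concatenates every piece of text inside the tag.
texts = [td.get_text() for td in cells]

for s, t in zip(strings, texts):
    print(repr(s), repr(t))
```

The empty cell and the mixed-content cell are the cases where the two approaches differ.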

Using your sample html:

<table>
    <tr>
        <td align="left">11/10</td>
        <td>N ARMY</td>
        <td>-7.5</td>
        <td>NL</td>
        <td><b>76-65</b></td>
        <td><span style="color:green">W</span></td>
        <td><span style="color:green">W</span></td>
        <td></td>
        <td class="cell4">50.0%</td>
        <td class="cell4">76.9%</td>
        <td class="cell4">37.5%</td>
        <td class="cell5">37.1%</td>
        <td class="cell5">90.0%</td>
        <td class="cell5">29.4%</td>
    </tr>
</table>

A simple iteration over the tds, with a call to get_text() on each, and we're good to go!

from bs4 import BeautifulSoup

with open('test.html', 'rb') as html:  # My local version of your html file
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html.read(), 'html.parser')

for td in soup.find_all('td'):
    print(td.get_text())

This gives the output:

11/10
N ARMY
-7.5
NL
76-65
W
W

50.0%
76.9%
37.5%
37.1%
90.0%
29.4%
[Finished in 0.1s]
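If the table has several rows, the same idea extends to collecting one list of cell texts per row. This is only a sketch under assumptions: the HTML is inlined as a string instead of being read from test.html, the second row is invented sample data, and strip=True is used to trim surrounding whitespace from each cell.

```python
from bs4 import BeautifulSoup

# First row abbreviated from the question; second row is hypothetical.
html = """
<table>
  <tr><td align="left">11/10</td><td>N ARMY</td><td><b>76-65</b></td><td></td></tr>
  <tr><td align="left">11/14</td><td> HYPOTHETICAL U </td><td><b>60-70</b></td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One inner list per <tr>, one stripped string per <td>.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

for row in rows:
    print(row)
```

Empty cells come back as empty strings rather than None, so each row keeps the same number of columns.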


Source: https://stackoverflow.com/questions/15934562/beautifulsoup-tag-removal
