Question
I am looking to parse an HTML table with Python/BeautifulSoup...
This is my first attempt at coding anything in Python, so it's probably not the most efficient.
I grabbed a function from another post here (works great for the most part), but I am running into a couple of problems.
The code I am running is here:
def strip_tags(html, invalid_tags):
    bs2 = BeautifulSoup(str(html))
    for tag in bs2.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            tag.replaceWith(s)
    return bs2

invalid_tags = ['td','b']
for row in bs.findAll('tr'):
    col = row.findAll('td')
    for index,item in enumerate(col):
        t = item.findAll('a')
        for ta in t:
            ta.replaceWithChildren()
        col[index] == item
    for item in col:
        print(strip_tags(item.string,invalid_tags).string)
The raw data table (HTML) looks like this:
<td align="left">11/10</td>
<td>N ARMY</td>
<td>-7.5</td>
<td>NL</td>
<td><b>76-65</b></td>
<td><span style="color:green">W</span></td>
<td><span style="color:green">W</span></td>
<td></td>
<td class="cell4">50.0%</td>
<td class="cell4">76.9%</td>
<td class="cell4">37.5%</td>
<td class="cell5">37.1%</td>
<td class="cell5">90.0%</td>
<td class="cell5">29.4%</td>
When I run the strip_tags function, it works for all the tags except for the second line... 'None' is returned as the output.
If anyone could provide any insight on why this is happening I would greatly appreciate it.
edit: wow thanks for everyone's quick responses. anyhow, here is what happens when I run the code:
11/10 None -7.5 NL 76-65 W W None 50.0% 76.9% 37.5% 37.1% 90.0% 29.4%
The problem lies around the second line, where it returns 'None' instead of 'N ARMY'. So yes, ideally I would like just the text that is found within the tags.
Answer 1:
If I'm understanding the output you want correctly, you shouldn't need to do any manual removing of tags; that's why we use BeautifulSoup! ;)
What you need to call is the get_text() method on the tag instances that find_all() returns.
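As a rough illustration of why get_text() helps here, the sketch below compares it with the .string attribute that the question's code reads. The cells are hypothetical; in particular, the <a> link in the second cell is an assumption and does not appear in the HTML posted in the question. In bs4, .string is only set when a tag has exactly one child (followed through single nested tags), so it is None for an empty cell or a cell with mixed children, while get_text() joins all the text under the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical cells for illustration; the <a> link is an assumption,
# not something shown in the question's HTML.
html = """
<table><tr>
<td>-7.5</td>
<td><a href="#">N</a> ARMY</td>
<td><b>76-65</b></td>
<td></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string is None when a tag has no children or more than one child.
strings = [td.string for td in cells]

# get_text() concatenates every text node under the tag, however nested.
texts = [td.get_text() for td in cells]

for s, t in zip(strings, texts):
    print(repr(s), repr(t))
```

Under these assumptions, the second and fourth cells have no usable .string, but get_text() still returns their text (an empty string for the empty cell).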
Using your sample html:
<table>
<tr>
<td align="left">11/10</td>
<td>N ARMY</td>
<td>-7.5</td>
<td>NL</td>
<td><b>76-65</b></td>
<td><span style="color:green">W</span></td>
<td><span style="color:green">W</span></td>
<td></td>
<td class="cell4">50.0%</td>
<td class="cell4">76.9%</td>
<td class="cell4">37.5%</td>
<td class="cell5">37.1%</td>
<td class="cell5">90.0%</td>
<td class="cell5">29.4%</td>
</tr>
</table>
A simple iteration over the tds, and a call to get_text() and we're good to go!
from bs4 import BeautifulSoup

with open('test.html', 'rb') as html:  # My local version of your html file
    # Naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(html.read(), 'html.parser')

for td in soup.find_all('td'):
    print(td.get_text())
This gives the output:
11/10
N ARMY
-7.5
NL
76-65
W
W
50.0%
76.9%
37.5%
37.1%
90.0%
29.4%
[Finished in 0.1s]
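If the cells need to stay grouped by table row, as in the loop from the question, the same get_text() call can be applied per row. This is only a sketch: the first row reuses a few cells from the sample above, the second row is made up for illustration, and strip=True is an optional get_text() argument that trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# First row: cells taken from the sample table.
# Second row: made-up values, for illustration only.
html = """
<table>
<tr><td align="left">11/10</td><td>N ARMY</td><td><b>76-65</b></td><td></td></tr>
<tr><td align="left">11/13</td><td>EXAMPLE U</td><td><b>70-60</b></td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One list of cell texts per <tr>; empty cells come back as ''.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

for row in rows:
    print(" ".join(row))
```

Keeping each row as its own list makes it straightforward to hand the result to the csv module or to index columns by position later.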
Source: https://stackoverflow.com/questions/15934562/beautifulsoup-tag-removal