BeautifulSoup: get value in table

99封情书 submitted on 2019-12-12 09:28:17

Question


I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)" value. What I have works, but it is really ugly and surely not the best approach, so I am looking for a better way. Here is what I have:

soup = BeautifulSoup(url_opener.open(url))
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is

<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tbody><tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>

    <tr>

    <td>PILCHER DONALD L                         </td>
    </tr>

    </tbody></table>
</td>

There are a lot of questions about BeautifulSoup. I looked through them but didn't find an answer that helped me, so hopefully this is not a duplicate question.


Answer 1:


(Edit: apparently the HTML the OP posted is misleading. There is in fact no tbody tag to look for, even though he made a point of including it in that HTML. So I have changed the code to look for table instead of tbody.)

As there may be several table-rows you want (e.g., see the sibling URL to the one you give, with the last digit, 4, changed into a 5), I suggest a loop such as the following:

# locate the table containing a cell with the given text
owner = re.compile('Owner Name')
cell = soup.find(text=owner).parent
while cell.name != 'table':
    cell = cell.parent
# print all non-empty strings in the table (except for the given text)
for s in cell.findAll(text=lambda t: t.strip() and not owner.match(t)):
    print(s)

This is reasonably robust to minor changes in page structure: having located the cell of interest, it walks up through that cell's parents until it finds the table tag, then loops over all navigable strings within that table that aren't empty (or just whitespace), excluding the owner header.
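
For readers on Python 3 with the newer bs4 package, the same walk-up-then-collect idea might look like the sketch below. It is a port under assumptions, not the answerer's code: the inline HTML mirrors the snippet from the question, the second owner row is invented to show the multi-row case, and the helper name strings_in_enclosing_table is hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Inline sample mirroring the question's HTML (without tbody, per the edit
# note). The second owner row is made up to show the multi-row case.
HTML = """
<td valign="top">
  <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tr class="tableheaders"><td>Owner Name(s)</td></tr>
    <tr><td>PILCHER DONALD L                         </td></tr>
    <tr><td>EXAMPLE SECOND OWNER</td></tr>
  </table>
</td>
"""


def strings_in_enclosing_table(soup, header_pattern):
    """Return the non-empty strings of the table containing the header text."""
    header = soup.find(string=header_pattern)
    if header is None:
        return []
    # Walk up the ancestors until a <table> tag is reached.
    table = next((p for p in header.parents if p.name == "table"), None)
    if table is None:
        return []
    # stripped_strings skips whitespace-only strings and strips the rest.
    return [s for s in table.stripped_strings if not header_pattern.match(s)]


soup = BeautifulSoup(HTML, "html.parser")
owners = strings_in_enclosing_table(soup, re.compile("Owner Name"))
print(owners)
```

Returning an empty list when the header or table is missing keeps the caller from hitting an AttributeError if the page layout changes.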




Answer 2:


This is Aaron DeVore's answer from the BeautifulSoup discussion group. It works well for me.

soup = BeautifulSoup(...)
label = soup.find(text="Owner Name(s)")

Tag.string is needed to get to the actual name string:

name = label.findNext('td').string

If you're doing a bunch of them, you can even use a list comprehension:

names = [unicode(label.findNext('td').string)
         for label in soup.findAll(text="Owner Name(s)")]
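
A rough Python 3 equivalent of the label-then-next-cell approach, assuming the bs4 package (where findNext and findAll are spelled find_next and find_all), could look like this. The inline HTML is a trimmed copy of the question's snippet and the function name owner_names is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, used as a stand-in for the page.
HTML = """
<table border="1" cellpadding="1" cellspacing="0" align="right">
  <tr class="tableheaders"><td>Owner Name(s)</td></tr>
  <tr><td>PILCHER DONALD L                         </td></tr>
</table>
"""


def owner_names(soup, label_text="Owner Name(s)"):
    """Collect the text of the first <td> that follows each matching label."""
    names = []
    for label in soup.find_all(string=label_text):
        cell = label.find_next("td")
        if cell is not None:  # the label might be the last cell on the page
            names.append(cell.get_text(strip=True))
    return names


soup = BeautifulSoup(HTML, "html.parser")
names = owner_names(soup)
print(names)
```

Using get_text(strip=True) instead of .string also trims the trailing padding that the page puts after the name.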



Answer 3:


This is a slight improvement, but I couldn't figure out how to get rid of the three parents.

x[0].parent.parent.parent.findAll('td')[1].string
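
One possible way to drop the chained .parent hops, assuming Python 3 and the bs4 package rather than the older API used in this thread, is find_parent, which climbs to the nearest enclosing tag of a given name. The sketch below uses a trimmed copy of the question's HTML and illustrative variable names.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, used as a stand-in for the page.
HTML = """
<table border="1" cellpadding="1" cellspacing="0" align="right">
  <tr class="tableheaders"><td>Owner Name(s)</td></tr>
  <tr><td>PILCHER DONALD L                         </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find the header string, then jump straight to its enclosing <table>.
label = soup.find(string=re.compile("Owner Name"))
table = label.find_parent("table") if label is not None else None

# Cell 0 is the header itself; cell 1 is the first owner row.
cells = table.find_all("td") if table is not None else []
owner = cells[1].get_text(strip=True) if len(cells) > 1 else None
print(owner)
```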


Source: https://stackoverflow.com/questions/1817184/beautifulsoup-get-value-in-table
