Why is this tag empty when parsed with Beautiful Soup?

Question

I am parsing this page with Beautiful Soup:

https://au.finance.yahoo.com/q/is?s=AAPL

I am attempting to get the total revenue for 27/09/2014 (42,123,000), which is one of the first values near the top of the statement.

I inspected the element in Chrome developer tools and found that the value is in a table with the class name yfnc_tabledata1.

My Python code is as follows:

import requests
import bs4

# get the web page
page = requests.get("https://au.finance.yahoo.com/q/is?s=AAPL")

# put it into Beautiful Soup
soup = bs4.BeautifulSoup(page.content)

# select the table
tag = soup.select("table.yfnc_tabledata1")

So far so good: this grabs the table that has the needed data, but this is where I am stuck.

The chain that leads to the data I want is as follows:

tag > tbody > tr > td > table > tbody > (then the second tr)

But when I try to use this chain I get an empty element.

Can anybody help me with this?

Also, for bonus points, can anyone tell me how I can learn to extract data like this in a more general sense? I constantly need to extract data buried deep within an HTML document and can never seem to work out the correct code to get to the data I want.

Thanks a lot, any help is appreciated.

Answer 1:

Let's be specific and practical.

The idea is to find the Total Revenue label and get the next cell's text using .next_sibling:

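import re

# locate the table, then the text node holding the row label
table = soup.find("table", class_="yfnc_tabledata1")
total_revenue_label = table.find(text=re.compile(r'Total Revenue'))
# climb from the label text up to its cell, then step to the next cell
print(total_revenue_label.parent.parent.next_sibling.get_text(strip=True))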

Demo:

>>> import re
>>> import requests
>>> import bs4
>>> 
>>> page = requests.get("https://au.finance.yahoo.com/q/is?s=AAPL")
>>> soup = bs4.BeautifulSoup(page.content)
>>> 
>>> table = soup.find("table", class_="yfnc_tabledata1")
>>> total_revenue_label = table.find(text=re.compile(r'Total Revenue'))
>>> total_revenue_label.parent.parent.next_sibling.get_text(strip=True)
42,123,000
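
The same label-then-neighbour idea can be tried offline. The following is only a minimal sketch: SAMPLE_HTML is a made-up, simplified stand-in for the real page, and value_after_label is a hypothetical helper name, not part of Beautiful Soup. It uses find_parent and find_next_sibling instead of chained .parent / .next_sibling, because those skip the whitespace text nodes that sit between tags in indented HTML.

```python
import re

import bs4

# Assumption: a simplified, invented fragment shaped like the table in the question.
SAMPLE_HTML = """
<table class="yfnc_tabledata1">
  <tr>
    <td>
      <table>
        <tr>
          <td colspan="2"><strong>Total Revenue</strong></td>
          <td align="right"><strong>42,123,000</strong></td>
          <td align="right"><strong>37,432,000</strong></td>
        </tr>
      </table>
    </td>
  </tr>
</table>
"""


def value_after_label(soup, label):
    """Return the text of the first <td> after the <td> containing `label`."""
    # Find the text node that matches the label.
    text_node = soup.find(string=re.compile(label))
    if text_node is None:
        return None
    # Climb to the enclosing cell, however deeply the label is nested.
    label_cell = text_node.find_parent("td")
    if label_cell is None:
        return None
    # Step to the next <td> element, ignoring whitespace between tags.
    value_cell = label_cell.find_next_sibling("td")
    return value_cell.get_text(strip=True) if value_cell else None


soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
print(value_after_label(soup, "Total Revenue"))
```

Searching by a visible label and then moving to a neighbouring cell tends to survive layout changes better than a long chain of child selectors.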

Answer 2:

There is no <tbody> tag in the HTML.

If you look at the page with a browser (e.g. with Chrome developer tools) it looks like there is a <tbody> tag, but that tag is not in the page source: Chrome inserts it into the DOM when it parses the table.

Try omitting both tags in your search chain. I am certain the first one isn't there and (although the HTML is hard to read) I'm pretty sure the second isn't there either.

Update: Here is the HTML, beginning with the table you are interested in:

<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0">
  <TR>
    <TD>
      <TABLE width="100%" cellpadding="2" ...>
        <TR class="yfnc_modtitle1" style="border-top:none;">
          <td colspan="2" style="border-top:2px solid #000;">
            <small><span class="yfi-module-title">Period Ending</span></small>
          </td>
          <th scope="col" style="border-top:2px ...">27/09/2014</th>
          <th scope="col" style="border-top:2px ...">28/06/2014</th>
          ...

So there are no <tbody> tags.
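
To see the effect without fetching the page, here is a small sketch. It assumes a cut-down, invented fragment with the same nesting as the HTML above, and it assumes the "html.parser" parser, which does not add <tbody> elements (the html5lib parser, by contrast, builds the tree the way a browser would).

```python
import bs4

# Assumption: an invented fragment with the same nesting as the real table.
SAMPLE_HTML = """
<TABLE class="yfnc_tabledata1">
  <TR>
    <TD>
      <TABLE>
        <TR><td colspan="2">Period Ending</td><th>27/09/2014</th></TR>
        <TR><td colspan="2">Total Revenue</td><td>42,123,000</td></TR>
      </TABLE>
    </TD>
  </TR>
</TABLE>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# The chain copied from the browser's element inspector, including <tbody>.
with_tbody = soup.select(
    "table.yfnc_tabledata1 > tbody > tr > td > table > tbody > tr"
)

# The same chain with both <tbody> steps removed.
without_tbody = soup.select("table.yfnc_tabledata1 > tr > td > table > tr")

print(len(with_tbody))     # no matches: the source has no <tbody>
print(len(without_tbody))  # the rows of the inner table

# The second row of the inner table is the one the question asks about.
second_row = without_tbody[1]
print(second_row.get_text(" ", strip=True))
```

When a selector copied from the browser returns nothing, comparing it against the raw source (View Source, or printing soup.prettify()) usually shows which step does not exist.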

Answer 3:

To answer your general question:

I suggest the book "Mining the Social Web", second edition, especially chapter 5, "Mining Web Pages".

Source code for the book is available on GitHub.

Answer 4:

I think there are probably better ways of getting the data you want. It has been provided for free for a number of years by a number of institutions. For example, is the information you want in here somewhere?

http://www.afr.com/share_tables/

Source: https://stackoverflow.com/questions/27327998/why-is-this-tag-empty-when-parsed-with-beautiful-soup

Tags

python

html

beautifulsoup

html-parsing