BeautifulSoup returns None even though the element exists

Submitted by 喜欢而已 on 2020-07-06 18:47:05

Question


I have gone through most of the solutions for similar issues, but I haven't found one that works. More importantly, I haven't found an explanation of why this happens, other than cases where JavaScript or something else modifies the page being scraped.

I am trying to scrape the table for game "Officials" from the site: http://www.pro-football-reference.com/boxscores/201309050den.htm

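My code is:

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
html = urlopen(url)
bsObj = BeautifulSoup(html, "lxml")
# Look for the table whose id attribute is "officials"
officials = bsObj.findAll("table",{"id":"officials"})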

for entry in officials:
    print(str(entry))

I am just printing to the console for now, but I get an empty list with findAll or None with find. I have also tried this with the basic html.parser with no luck.

Can someone with a better understanding of HTML explain what is different about this webpage specifically? Thanks in advance!


Answer 1:


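Try this code. It loads the page in a real browser with Selenium, so the page's JavaScript runs before the HTML is handed to BeautifulSoup: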

from selenium import webdriver
import time
from bs4 import BeautifulSoup


driver = webdriver.Chrome()
url= "http://www.pro-football-reference.com/boxscores/201309050den.htm"
driver.maximize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("table",{"id":"officials"})

for entry in officials:
    print(str(entry))


driver.quit()

It will print:

<table class="suppress_all sortable stats_table now_sortable" data-cols-to-freeze="0" id="officials"><thead><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr></thead><caption>Officials Table</caption><tbody>
<tr data-row="0"><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr data-row="1"><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr data-row="2"><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr data-row="3"><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr data-row="4"><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr data-row="5"><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr data-row="6"><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</tbody></table>



Answer 2:


The table is in the source; it is just commented out. It is trivial to remove the comment delimiters with a regex:

from bs4 import BeautifulSoup
import requests
import re

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
# Use .text (str): in Python 3, re.sub with a str pattern raises TypeError on the bytes from .content
html = requests.get(url).text
bsObj = BeautifulSoup(re.sub("<!--|-->","", html), "lxml")
officials = bsObj.find_all("table",{"id":"officials"})

for entry in officials:
    print(entry)

There is only one table, so you don't need find_all and the loop is unnecessary; just use find:

In [1]: from bs4 import BeautifulSoup
   ...: import requests
   ...: import re
   ...: url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
   ...: 
   ...: html = requests.get(url).text
   ...: bsObj = BeautifulSoup(re.sub("<!--|-->","", html), "lxml")
   ...: officials = bsObj.find(id="officials")
   ...: print(officials)
   ...: 

<table class="suppress_all sortable stats_table" data-cols-to-freeze="0" id="officials"><caption>Officials Table</caption><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</table>

In [2]: 
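To see why find returns None until the comment delimiters are removed, here is a minimal offline sketch. The SAMPLE_HTML string and the find_by_id helper are made up for illustration; they only mimic the structure described above (a table wrapped in an HTML comment) and are not taken from the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_officials">
<!--
<table id="officials">
<tr><th>Referee</th><td>Walt Coleman</td></tr>
</table>
-->
</div>
"""


def find_by_id(html, element_id, strip_comments=False):
    """Parse html and look up element_id, optionally removing comment delimiters first."""
    if strip_comments:
        # Same regex as in the answer: drop "<!--" and "-->" but keep what is between them.
        html = re.sub(r"<!--|-->", "", html)
    return BeautifulSoup(html, "html.parser").find(id=element_id)


# The parser stores the commented-out markup as one Comment string, not as tags.
before = find_by_id(SAMPLE_HTML, "officials")
after = find_by_id(SAMPLE_HTML, "officials", strip_comments=True)

print(before is None)
print(after.td.get_text())
```

Note that stripping every delimiter also activates any other commented-out markup on the page, so on a real document it may be safer to search only for the element you need afterwards, as done here.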



Answer 3:


You don't see it because it's not there. Try turning JS off and opening the page in your browser; you will see it's not there, because the website does some JS DOM manipulation.

Your choices are:

  1. In your case, the HTML you want is there, just inside a comment; extract it from the comment with BeautifulSoup.
  2. Use Selenium or an equivalent tool to render the JS (that is exactly how your browser does it).
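The first option can be sketched without a regex by walking the Comment nodes that BeautifulSoup exposes and parsing each one as its own fragment. The find_in_comments helper and the sample string are hypothetical, written only to illustrate the idea; the real page may nest things differently.

```python
from bs4 import BeautifulSoup, Comment


def find_in_comments(html, element_id):
    """Return the first element with the given id found inside any HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a string subclass, so it can be selected with a string filter.
    comments = soup.find_all(string=lambda node: isinstance(node, Comment))
    for comment in comments:
        # Re-parse the comment body so its markup becomes real tags.
        fragment = BeautifulSoup(str(comment), "html.parser")
        match = fragment.find(id=element_id)
        if match is not None:
            return match
    return None


# Made-up sample that mimics a commented-out table.
sample = '<div><!-- <table id="officials"><tr><td>Walt Coleman</td></tr></table> --></div>'
table = find_in_comments(sample, "officials")
print(table)
```

Compared with stripping all delimiters, this leaves the rest of the document untouched and only parses the comments, at the cost of one extra parse per comment.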


Source: https://stackoverflow.com/questions/40146128/beautifulsoup-returns-none-even-though-the-element-exists
