Scraping table with BeautifulSoup

Submitted by 岁酱吖の on 2021-01-28 12:01:03

Question


In this first snippet, I can use BeautifulSoup to get all the info within the table of interest:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")

# Print each direct child of the giftList table
for child in soup.find("table", {"id": "giftList"}).children:
    print(child)

That prints the product lists.

I want to print the rows in the tournamentTable here (desired info is in class=deactivate, class=odd deactivate and date in class=center nob-border):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.oddsportal.com/hockey/russia/khl/results/#/page/2.html")
soup = BeautifulSoup(html, "html.parser")

# Selecting by id fails: find() returns None here
#for i in soup.find("table", {"id": "tournamentTable"}).children:
#    print(i)
for i in soup.find("table", {"class": "table-main"}).children:
    print(i)

But that prints other tables on the page. When I try to select the table of interest with {"id":"tournamentTable"}, soup.find returns None, so accessing .children fails with a NoneType error.

What am I missing that I can't access the desired table & the information within?


Answer 1:


urlopen only downloads the HTML that the server sends for a URL; it does not execute JavaScript, so you get the page as it would look with JavaScript turned off. In your case, this means the table with id="tournamentTable" never actually loads, which is why soup.find returns None for it.

You can observe this behaviour by turning off JavaScript in your browser and loading the URL.
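
The same check can be done from Python by testing the result of find() before using it. Below is a minimal sketch; the inline HTML string is a made-up stand-in for a static response that lacks the table, not markup copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML received before any JavaScript runs:
# another table is present, but the one with id="tournamentTable" is not.
STATIC_HTML = """
<html><body>
  <table class="table-main"><tr><td>other table</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# find() returns None when nothing matches, so check before touching .children
table = soup.find("table", {"id": "tournamentTable"})
if table is None:
    print("tournamentTable is not in the static HTML")
else:
    for child in table.children:
        print(child)
```

If the None branch is taken for the real page as well, the table is being added after the initial HTML is delivered, and no choice of selector will find it in that response.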

To scrape a webpage whose content is rendered by JavaScript, consider using a browser automation package such as Selenium, which loads the page in a real browser so the scripts run before you read the HTML. If you scrape regularly, you might also want to install a 'JavaScript switcher' browser plugin, which lets you toggle JavaScript on and off with ease.
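
As a rough sketch of the Selenium approach, the code below loads the page in a browser, waits for a row to appear in the table, and then hands the rendered HTML to BeautifulSoup. The function names fetch_rendered_html and extract_rows are hypothetical helpers introduced for illustration. The sketch assumes Chrome and a matching driver are available, and that the page still uses the id tournamentTable from the question.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=15):
    """Load url in a real browser and return the HTML after JavaScript has run."""
    # Imported here so extract_rows() below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until at least one row exists inside the table (id taken from the question)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "table#tournamentTable tr")
            )
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_rows(html, table_id="tournamentTable"):
    """Return the text of each cell, row by row, for the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]


# Usage (needs a browser, so it is left commented out):
# html = fetch_rendered_html("http://www.oddsportal.com/hockey/russia/khl/results/#/page/2.html")
# for row in extract_rows(html):
#     print(row)
```

Keeping the parsing step separate from the browser step means extract_rows can be tried on any saved HTML string before a browser is involved at all.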



Source: https://stackoverflow.com/questions/35119529/scraping-table-with-beautifulsoup
