Is this site not suited for web scraping using beautifulsoup?

Submitted by 醉酒当歌 on 2021-01-29 07:17:47

Question


I am trying to use BeautifulSoup to get the odds for each match on the following site:

https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches

The goal is to end up with some kind of text file containing the following:

Match1, Team1, Odds for team1 winning, Team2, Odds for team2 winning

Match2, Team1, Odds for team1 winning, Team2, Odds for team2 winning

and so on...

I am new to BeautifulSoup, so things already go wrong at a very elementary level. My approach is to "walk" through the HTML tree until I arrive at the div tag that contains all the matches. This works well until I hit a div tag with class="sgd-wrapper" (the original post linked a picture for clarification).

The following is my code; neither m1 nor m2 works. Python just returns None.

from bs4 import BeautifulSoup as bs
import requests as res

#Load the webpage content
r = res.get('https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches').text

#Convert to a beautiful soup object
soup = bs(r,'lxml')

m1 = soup.find("div", attrs={"id": "wrapper"}).find("div", attrs={"class": "page-box"}).find("div", attrs={"class": "page-area"}).find("div", attrs={"id": "oddset-nashville"}).find("div", attrs={"class": "sgd-wrapper"})
m2 = soup.find("div", attrs={"class": "sgd-wrapper"})

If I remove the last find in m1, or redefine m2, like this:

m1 = soup.find("div", attrs={"id": "wrapper"}).find("div", attrs={"class": "page-box"}).find("div", attrs={"class": "page-area"}).find("div", attrs={"id": "oddset-nashville"})
m2 = soup.find("div", attrs={"id": "oddset-nashville"})

then I get this response:

print(m1)
<div data-digital-portal-loader-url="https://assets.sb.danskespil.dk/front-end/digitalPortal.js?noCache=20201011001813" id="oddset-nashville"></div>

Can someone explain to me why this div with class="sgd-wrapper" is so special?


Answer 1:


The problem is in this line: r = res.get('https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches').text

The Python requests library only sends your HTTP/HTTPS request to the server and returns the raw HTML. It does not load further resources such as pictures and scripts, and it does not execute JavaScript. Some elements on this page are created or modified by JavaScript (for example, a script creates an element, sets its class name and inserts it into the DOM tree), so they never appear in the HTML that requests receives.

As another example, if you GET main.html via requests, main.js is never loaded or run, so the class of the div t1 is never set to sgd-wrapper:

# main.html
<html>
   <body>
      <div id="t1"></div>
      <script src="main.js"></script>
   </body>
</html>

# in main.js
document.querySelector('#t1').classList.add('sgd-wrapper');
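
To make the main.html example concrete, here is a minimal sketch that inspects the static HTML exactly as a plain HTTP client would receive it. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; the names MAIN_HTML, DivCollector and find_divs are made up for this illustration. A soup.find("div", attrs={"class": "sgd-wrapper"}) on the same string should likewise come back as None, because main.js is never executed.

```python
from html.parser import HTMLParser

# The static HTML from the example above, as the server would return it.
MAIN_HTML = """
<html>
   <body>
      <div id="t1"></div>
      <script src="main.js"></script>
   </body>
</html>
"""


class DivCollector(HTMLParser):
    """Collects the attributes of every <div> start tag in the document."""

    def __init__(self):
        super().__init__()
        self.divs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.divs.append(dict(attrs))


def find_divs(html):
    """Return one attribute dict per <div> found in the given HTML text."""
    collector = DivCollector()
    collector.feed(html)
    return collector.divs


divs = find_divs(MAIN_HTML)
print(divs)  # [{'id': 't1'}]

# No div carries the class, because the script that adds it never ran.
has_wrapper = any("sgd-wrapper" in (d.get("class") or "").split() for d in divs)
print(has_wrapper)  # False
```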

What you need is headless Chrome (for example, launch Chrome with google-chrome --headless). Then use the Chrome API to hook into the page-loading events and dump the complete rendered content.
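
A simpler variant of this idea can be sketched with Chrome's --dump-dom command-line flag, which prints the serialized DOM after the page has loaded. Note that this is not the event-hooking approach described above, and content that the page inserts late may still be missing. The binary names in CANDIDATE_BINARIES are assumptions that differ between systems, and the helper names are made up for this illustration.

```python
import shutil
import subprocess

# Assumed names for the Chrome/Chromium executable; adjust for your system.
CANDIDATE_BINARIES = ("google-chrome", "chromium", "chromium-browser")


def build_dump_command(binary, url):
    """Return the argument list that asks headless Chrome to print the DOM."""
    return [binary, "--headless", "--dump-dom", url]


def find_chrome():
    """Return the path of the first candidate binary on PATH, or None."""
    for name in CANDIDATE_BINARIES:
        path = shutil.which(name)
        if path:
            return path
    return None


def fetch_rendered_html(url, timeout=60):
    """Run headless Chrome once and return the serialized DOM as text."""
    binary = find_chrome()
    if binary is None:
        raise RuntimeError("No Chrome/Chromium binary found on PATH")
    result = subprocess.run(
        build_dump_command(binary, url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if Chrome exits with an error
    )
    return result.stdout
```

The string returned by fetch_rendered_html(url) could then be passed to bs(html, 'lxml') in place of the requests response text from the question. Whether the sgd-wrapper div is present in that output depends on how and when the site builds it.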



Source: https://stackoverflow.com/questions/64298879/is-this-site-not-suited-for-web-scraping-using-beautifulsoup
