Web scraping with Python and BeautifulSoup: What is saved by the BeautifulSoup function?

时光毁灭记忆、已成空白 submitted on 2021-01-28 14:11:37

Question


This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected in order to find where the betting rates are located in the HTML file. In the tipico webpage, they were stored in buttons of class "c_but_base c_but". By writing the following lines, the rates could therefore be saved and printed using the Beautiful Soup module:

from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.tipico.de/de/live-wetten/"

try:
    page = urllib.request.urlopen(url)
except Exception as e:
    # Stop here: without a response there is nothing to parse.
    raise SystemExit(f"An error occurred: {e}")

soup = BeautifulSoup(page, 'html.parser')

# Match every <button> whose class attribute contains this string.
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)

I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:

from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"

try:
    page = urllib.request.urlopen(url)
except Exception as e:
    # Stop here: without a response there is nothing to parse.
    raise SystemExit(f"An error occurred: {e}")

soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)

The problem is that it prints nothing: Python does not find any elements of that class (right?). I thus tried to print the soup object in order to see what exactly the BeautifulSoup function was doing. I added this line:

print(soup)

When printing it (I do not show the output of soup because it is too long), I notice that this is not the same text as what appears when I right-click and choose "Inspect" on the Winamax webpage. So what exactly is the BeautifulSoup function doing? How can I store the betting rates from the Winamax website using BeautifulSoup?

EDIT: I have never coded in HTML and I'm a beginner in Python, so some terminology might be wrong; that's why some parts are in italics.


Answer 1:


That's because the website uses JavaScript to display these details, and BeautifulSoup does not run JavaScript on its own: it only parses the HTML it is given.
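As a small illustration of why the parsed text differs from what "Inspect" shows: BeautifulSoup with 'html.parser' builds on Python's built-in html.parser module, which only reads the markup it receives and never executes scripts. The sketch below uses that module directly so that it runs without extra packages. The page source, the class name and the odd 1.85 are made up for the example and are not taken from Winamax.

```python
from html.parser import HTMLParser

# Hypothetical page source: the odds button only exists after the script has run
# in a browser. Nothing in the markup itself contains a <button> element.
PAGE_SOURCE = """
<html><body>
<div id="app"></div>
<script>
  var b = document.createElement("button");
  b.className = "odd-price";
  b.textContent = "1.85";
  document.getElementById("app").appendChild(b);
</script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts the start tags that appear in the markup handed to the parser."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


counts = count_tags(PAGE_SOURCE)
print(counts.get("button", 0))  # 0: the script is stored as text, not executed
print(counts.get("div", 0))     # 1: the empty container is all the parser sees
```

A browser would run the script and show the button in "Inspect"; a parser that works on the downloaded text cannot.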

First check whether the element you want to scrape is present in the page source; if it is, you can scrape pretty much everything! In your case the button/span tags were not in the page source (meaning they are hidden or pulled in through a script).

There is no <button> tag in the page source.
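The page-source check can also be done from Python rather than by eye. The following is a minimal sketch, assuming the source has already been downloaded as a string: fetch_page_source and tags_with_class are hypothetical helper names, and the two sample sources are invented to show both outcomes.

```python
import urllib.request
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collects the tags in the raw source whose class attribute contains a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.matches.append(tag)


def tags_with_class(page_source, wanted_class):
    finder = ClassFinder(wanted_class)
    finder.feed(page_source)
    finder.close()
    return finder.matches


def fetch_page_source(url):
    # Hypothetical helper: returns the HTML exactly as the server sends it,
    # before any JavaScript has run. It is not called in this sketch.
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Invented sources: one with the button in the markup, one that relies on a script.
static_source = '<div><button class="price odd-price">1.85</button></div>'
dynamic_source = '<div id="app"></div><script src="bundle.js"></script>'

print(tags_with_class(static_source, "odd-price"))   # ['button']
print(tags_with_class(dynamic_source, "odd-price"))  # []
```

An empty list for the real page would point to the same conclusion as the manual check: the element is created by a script, so a plain download cannot see it.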

So I suggest using Selenium as the solution, and I tried a basic scrape of the website.

Here is the code I used:

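from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

option = webdriver.ChromeOptions()
option.add_argument('--headless')  # run Chrome without opening a window
option.binary_location = r'Your chrome.exe file path'

# Selenium 4 takes the driver path through a Service object
# (the executable_path argument of webdriver.Chrome was removed).
service = Service(executable_path=r'Your chromedriver.exe file path')
browser = webdriver.Chrome(service=service, options=option)

# The browser runs the page's JavaScript, so the rendered elements exist.
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")

# Selenium 4 replacement for find_elements_by_tag_name('span').
span_tags = browser.find_elements(By.TAG_NAME, 'span')
for span_tag in span_tags:
    print(span_tag.text)

browser.quit()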

This is the output:

There is some junk data in this output, but it's up to you to figure out what you need and what you don't!
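One possible way to separate the rates from the junk is to keep only the strings that look like decimal numbers. This is a rough sketch under a stated assumption: an odd is written as a decimal such as "1,85" or "2.10". The extract_odds helper and the sample texts are invented for illustration and are not real output from the site.

```python
import re

# Assumption: an odd is a decimal number written with a comma or a dot.
ODD_PATTERN = re.compile(r"\d+[.,]\d+")


def extract_odds(texts):
    """Keep the strings that look like decimal odds and convert them to floats."""
    odds = []
    for text in texts:
        cleaned = text.strip()
        if ODD_PATTERN.fullmatch(cleaned):
            odds.append(float(cleaned.replace(",", ".")))
    return odds


# Invented stand-in for the span_tag.text values collected by the browser.
sample_texts = ["Football", "1,85", "", "Match nul", " 3.40 ", "2,10", "Plus de paris"]
print(extract_odds(sample_texts))  # [1.85, 3.4, 2.1]
```

A filter like this would also keep any other decimal number on the page, so the selection of elements may still need narrowing before the values can be trusted.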



Source: https://stackoverflow.com/questions/65509322/web-scraping-with-python-and-beautifulsoup-what-is-saved-by-the-beautifulsoup-f
