Parsing a site with BeautifulSoup

野的像风 2021-01-14 14:57

I'm trying to learn how to parse HTML with Python, and I'm currently stuck: soup.findAll returns an empty array, even though there are elements that can be found here.

2 Answers
  •  南方客 2021-01-14 15:47

    "I'm trying to learn how to parse HTML with Python"

    You happened to pick a webpage which isn't very beginner-friendly when it comes to web scraping. Broadly speaking, most webpages use one or both of these two common methods for loading and displaying data:

    • The user makes a request to a server (visits a page, for example). The server gets the necessary data from a database. The server generates an HTML response using a templating engine, and returns the response for the user's browser to render.
    • The user makes a request to a server. The server returns an HTML-skeleton response which gets populated with data dynamically by making other requests / using APIs etc.
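    With the second kind of page, a plain requests call only gets back the skeleton, so anything BeautifulSoup looks for inside the dynamically filled part comes back empty, which matches your empty soup.findAll result. A minimal sketch of that (the URL and the selector are examples only, not a known-good recipe):

    ```python
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page the way requests sees it - before any JavaScript runs.
    url = "https://www.oddsportal.com/matches/tennis/"  # example page
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

    soup = BeautifulSoup(response.text, "html.parser")

    # The match rows are injected later by JavaScript, so this finds far fewer
    # rows than the "Elements" tab in Dev Tools shows - possibly none at all.
    rows = soup.find_all("tr")
    print(len(rows))
    ```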

    The webpage you picked is of the second type. Just because you can see the elements in the "Elements" tab of Chrome's Dev Tools doesn't mean that that's what the server sent you. By looking at the "Network" tab of Chrome's Dev Tools, you can see that requests are made to these two resources:

    https://fb.oddsportal.com/ajax-next-games/2/0/1/20191114/yje3d.dat?=1574007087150
    https://fb.oddsportal.com/ajax-next-games-odds/2/0/X0/20191114/1/yje3d.dat?=1574007087151

    (The query string parameters will not be the same for you. Visiting those URLs also won't be very interesting unless you provide the right payload.)
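    If you want to fetch those resources yourself with requests, a rough sketch is below. The Referer header is an assumption about what the server checks, and the URL will be stale, so copy the current one (query string included) from your own Network tab:

    ```python
    import requests

    # Placeholder URL taken from the answer - grab a fresh one from the
    # Network tab, because the path and query string change over time.
    ajax_url = ("https://fb.oddsportal.com/ajax-next-games/"
                "2/0/1/20191114/yje3d.dat?=1574007087150")

    headers = {
        "User-Agent": "Mozilla/5.0",
        # Assumption: the server wants to see where the request came from.
        "Referer": "https://www.oddsportal.com/matches/tennis/",
    }

    response = requests.get(ajax_url, headers=headers)
    response.raise_for_status()

    # The body is JavaScript, not plain HTML or JSON - inspect its shape first.
    print(response.text[:500])
    ```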

    The first resource seems to be a jQuery script which makes a request whose response contains the HTML for your table.
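    A hedged sketch of getting at that HTML: assuming the table is embedded in the script as one quoted string (that shape is a guess, not a documented format), you can cut it out and hand it to BeautifulSoup:

    ```python
    import re
    import requests
    from bs4 import BeautifulSoup

    # Same placeholder URL and assumed headers as in the sketch above.
    ajax_url = ("https://fb.oddsportal.com/ajax-next-games/"
                "2/0/1/20191114/yje3d.dat?=1574007087150")
    headers = {"User-Agent": "Mozilla/5.0",
               "Referer": "https://www.oddsportal.com/matches/tennis/"}

    script_text = requests.get(ajax_url, headers=headers).text

    # Assumption: the table HTML sits inside the JavaScript as a quoted string.
    # Collect quoted chunks and keep the longest one that contains a <table>.
    chunks = [c for c in re.findall(r"'(.*?)'", script_text, flags=re.DOTALL)
              if "<table" in c]

    if chunks:
        table_html = max(chunks, key=len).replace("\\/", "/")  # undo escaped slashes
        soup = BeautifulSoup(table_html, "html.parser")
        for row in soup.find_all("tr"):
            print(row.get_text(" ", strip=True))
    else:
        print("No embedded <table> found - inspect script_text by hand.")
    ```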

    In that HTML you can see that they seem to have assigned a unique ID to each match; Giron Marcos vs. Holt Brandon, in this case, has the ID ATM9GmXG.

    The second resource is similar: it is also a jQuery script, which seems to make a request to their main API. The response this time is JSON, which is always desirable for web scraping, and it references the same match IDs.
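    A minimal sketch for that one, assuming the JSON object is wrapped in a callback call (something like someCallback({...});, which is an assumption): strip everything outside the outermost braces, parse it, and then look for the match IDs, since the exact structure isn't documented:

    ```python
    import json
    import re
    import requests

    # Placeholder URL again - copy the live one from the Network tab.
    odds_url = ("https://fb.oddsportal.com/ajax-next-games-odds/"
                "2/0/X0/20191114/1/yje3d.dat?=1574007087151")
    headers = {"User-Agent": "Mozilla/5.0",
               "Referer": "https://www.oddsportal.com/matches/tennis/"}

    body = requests.get(odds_url, headers=headers).text

    # Assumption: the payload looks like someCallback({...}); keep only the
    # part between the outermost braces and parse that as JSON.
    match = re.search(r"\{.*\}", body, flags=re.DOTALL)
    data = json.loads(match.group(0)) if match else {}

    # The structure isn't documented, so inspect the top-level keys and check
    # that the match ID from the first response shows up somewhere.
    print(list(data.keys()))
    print("ATM9GmXG" in body)
    ```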
