Selenium scraping JS loaded pages

Asked by 荒凉一梦 on 2021-01-28 12:18:18

Question


I'm trying to scrape some of the JS-loaded data from https://surviv.io/stats/player787, such as the number of total kills. Could someone tell me how I can scrape the JS-loaded data with Selenium? Thanks.

EDIT: Here is some of the code

from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://surviv.io/stats/player787')
b = browser.find_element_by_tag_name('tr')

The <tr> that contains the data I want is not found by Selenium.
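
A quick way to confirm that the data is injected by JavaScript is to fetch the raw HTML without a browser and check whether the stats markup is there at all. A minimal sketch, assuming the requests library is available; the player-stats-overview class name is taken from the answers below:

import requests

# fetch the page exactly as the server sends it, before any JavaScript runs
html = requests.get('https://surviv.io/stats/player787').text

# the stats table is absent from the raw HTML, so a find_element call
# made immediately after page load has nothing to grab
print('player-stats-overview' in html)  # expected: False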


Answer 1:


To get the count of kills, induce WebDriverWait for visibility_of_all_elements_located():

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get('https://surviv.io/stats/player787')

# wait up to 20 seconds for the KILLS label to render, then read the
# value in the div immediately following it
allkills = WebDriverWait(browser, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='card-mode-stat-name' and text()='KILLS']/following-sibling::div[1]")))
for item in allkills:
    print(item.text)
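
If only the first matching element is needed, the same wait can target a single node instead. A minimal variant of the approach above; the int conversion assumes the rendered text is a plain number, possibly with thousands separators:

# visibility_of_element_located resolves to one element rather than a list
kills_elem = WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='card-mode-stat-name' and text()='KILLS']/following-sibling::div[1]")))
kills = int(kills_elem.text.replace(',', ''))  # strip thousands separators, if any
print(kills)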



Answer 2:


The reason it's not finding the element is that the page isn't fully rendered. You can add a wait with Selenium so it will not move on until the specified element has been rendered.

Also, if it's in a <table> tag, let pandas do the parsing for you (it uses BeautifulSoup under the hood to pull out the <table>, <th>, <tr>, and <td> tags and returns them as a list of dataframes) once you get the rendered HTML source:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas as pd

browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get('https://surviv.io/stats/player787')

delay = 3  # seconds to wait for the stats table to render
try:
    WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'player-stats-overview')))
except TimeoutException:
    browser.close()
    raise

# pandas parses every <table> in the rendered source; the stats table is the first
df = pd.read_html(browser.page_source)[0]

print(df.loc[0, 'Kills'])

browser.close()

Output:

18884


print(df)
   Wins  Kills  Games  K/G
0   638  18884   8896  2.1
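
Since read_html returns ordinary DataFrames, the whole stats row can also be pulled out at once. A small usage note on the df above, not part of the original answer:

# convert the single stats row into a plain dict
stats = df.iloc[0].to_dict()
print(stats)  # e.g. {'Wins': 638, 'Kills': 18884, 'Games': 8896, 'K/G': 2.1}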



Answer 3:


You could avoid the overhead of a browser and simply mimic the POST request the page makes.

import requests

# the stats page populates itself from this JSON endpoint
headers = {'content-type': 'application/json; charset=UTF-8'}
payload = {"slug": "player787", "interval": "all", "mapIdFilter": "-1"}
r = requests.post('https://surviv.io/api/user_stats', headers=headers, json=payload)
data = r.json()

desired_stats = ['wins', 'kills', 'games', 'kpg']
for stat in desired_stats:
    print(stat, ':', data[stat])

For OP: the request payload is visible in the browser's network tab when you click the XHR request to the URL used in my answer (you need to scroll down in the request details to see the payload info).
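
A slightly hardened version of the same request, checking the HTTP status and tolerating a missing key. This is a sketch layered on the code above, not part of the original answer:

import requests

headers = {'content-type': 'application/json; charset=UTF-8'}
payload = {"slug": "player787", "interval": "all", "mapIdFilter": "-1"}

r = requests.post('https://surviv.io/api/user_stats', headers=headers, json=payload)
r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
data = r.json()

for stat in ['wins', 'kills', 'games', 'kpg']:
    print(stat, ':', data.get(stat, 'n/a'))  # .get() avoids a KeyError if the API changes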




Answer 4:


To scrape the values 652, 19152, 8926, 2.1, etc. from JS-loaded pages you have to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following Locator Strategies (a sketch that pairs the values with their column labels follows the list):

  • Using CSS_SELECTOR:

    driver.get('https://surviv.io/stats/player787')
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.player-stats-overview td")))])
    
  • Using XPATH:

    driver.get('https://surviv.io/stats/player787')
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='player-stats-overview']//td")))])
    
  • Console Output:

    ['652', '19152', '8926', '2.1']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

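To label the four values, the cell texts can be paired with the table headers. A hedged sketch building on the CSS_SELECTOR strategy above; it assumes the table's <th> elements hold the column names, as the DataFrame in Answer 2 suggests:

# grab header cells and value cells from the same rendered table
labels = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.player-stats-overview th")))
cells = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.player-stats-overview td")))
stats = {label.text: cell.text for label, cell in zip(labels, cells)}
print(stats)  # e.g. {'Wins': '652', 'Kills': '19152', 'Games': '8926', 'K/G': '2.1'}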

Source: https://stackoverflow.com/questions/59456651/selenium-scraping-js-loaded-pages
