How to scrape data off a page loaded via JavaScript

て烟熏妆下的殇ゞ submitted on 2020-08-08 05:13:42

Question


I want to scrape the comments off this page using BeautifulSoup - https://www.x....s.com/video_id/the-suburl

The comments are loaded on click via JavaScript. They are paginated, and each page also loads its comments on click. I want to fetch all comments, and for each comment I want the poster's profile URL, the comment text, the number of likes, the number of dislikes, and the time posted (as stated on the page).

The comments can be returned as a list of dictionaries.

How do I go about this?


Answer 1:


This script will print all comments found on the page:

import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.x......com/video_id/gggjggjj/'

# the numeric video id is the second-to-last path segment, minus the 'video' prefix
video_id = url.rsplit('/', maxsplit=2)[-2].replace('video', '')

# the comment-thread endpoint returns every comment as JSON when load_all=1 is POSTed
u = 'https://www.x......com/threads/video/ggggjggl/{video_id}/0/0'.format(video_id=video_id)
comments = requests.post(u, data={'load_all': 1}).json()

for id_ in comments['posts']['ids']:
    print(comments['posts']['posts'][id_]['date'])
    print(comments['posts']['posts'][id_]['name'])
    print(comments['posts']['posts'][id_]['url'])
    print(BeautifulSoup(comments['posts']['posts'][id_]['message'], 'html.parser').get_text())
    # ...etc.
    print('-'*80)
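Since the question asks for the comments as a list of dictionaries, the JSON payload used above can be reshaped instead of printed. This is a minimal sketch: the field names (`date`, `name`, `url`, `message`) follow the response structure used in the script above, and `sample` is a stand-in for a real response (the like/dislike counts would be pulled from their corresponding keys in the same way, once identified in the payload):

```python
from bs4 import BeautifulSoup


def comments_to_dicts(comments):
    """Turn the 'posts' payload into a list of dicts, one per comment."""
    result = []
    for id_ in comments['posts']['ids']:
        post = comments['posts']['posts'][id_]
        result.append({
            'profile_url': post['url'],
            'name': post['name'],
            # the message field contains HTML, so strip the tags
            'comment': BeautifulSoup(post['message'], 'html.parser').get_text(),
            'posted': post['date'],
        })
    return result


# a minimal stand-in for the real JSON response
sample = {'posts': {'ids': ['1'], 'posts': {'1': {
    'date': '2 years ago', 'name': 'someuser', 'url': '/profile/someuser',
    'message': '<p>Nice video!</p>'}}}}

print(comments_to_dicts(sample))
```

Feeding the `comments` object from the script above into `comments_to_dicts` gives the list of dictionaries directly.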



Answer 2:


This can be done with Selenium, which automates a real browser. Depending on your preference you can use the Chrome driver (chromedriver) or the Firefox driver (geckodriver).

Here is a guide on installing the Chrome webdriver: http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-windows-install/

Then, in your code, here is how you would set it up:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# this part may change depending on where you installed the webdriver.
# You may have to pass the path to the driver executable.
# For me the driver is in C:/bin, which is on my PATH, so I do not need to.
chrome_options = Options()

# remove this (or use '--start-maximized') if you want the browser window to open
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)

driver.get(your_url)
html = driver.page_source  # the fully rendered HTML, after JavaScript has run

Selenium has several functions you can use to perform actions such as clicking elements on the page. Once you find an element with Selenium, you can call its .click() method to interact with it. Let me know if this helps.



Source: https://stackoverflow.com/questions/63039645/how-to-scrap-data-off-page-loaded-via-javascript
