Use BeautifulSoup to obtain “View Element” code instead of “View Source” code

Question

I'm using the following code to obtain all <script>...</script> content from a webpage (see url in code):

import urllib2
from bs4 import BeautifulSoup
import re
import imp

url = "http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

script = soup.find_all("script")
print script #just to check the output of script

However, BeautifulSoup searches the page's source code (Ctrl+U in Chrome). I want it to search the element code (Ctrl+Shift+I in Chrome) instead.

I want it to do this because the piece of code I'm really interested in is in the Element code and not in the Source code.

Answer 1:

The first thing to understand is that neither BeautifulSoup nor urllib2 is a browser. urllib2 only downloads the initial "static" page; it cannot execute JavaScript as a real browser does. Hence, you will always get the "View Page Source" content.
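To make the difference concrete, here is a minimal sketch using only the standard library's html.parser as a stand-in for any static HTML parser (BeautifulSoup can use the same parser as its backend). The markup, the "player" id and the "race.mp4" file name are made up for illustration. The script body is treated as plain text, so the video element it would create in a browser never shows up as a tag:

```python
from html.parser import HTMLParser

# Hypothetical page: the <div id="player"> is empty in the served markup,
# and the script would fill it in only when a browser actually runs it.
STATIC_HTML = """
<html><body>
<div id="player"></div>
<script>
document.getElementById("player").innerHTML = '<video src="race.mp4"></video>';
</script>
</body></html>
"""


class TagCollector(HTMLParser):
    """Record every start tag the parser encounters in the markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(STATIC_HTML)
collector.close()

# The script is never executed, so no "video" tag is reported.
print(collector.tags)  # ['html', 'body', 'div', 'script']
```

This is roughly the "View Source" versus "View Element" gap: the parser sees what the server sent, not what the scripts later build.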

To solve your problem, start a real browser via Selenium, wait for the page to load, get the .page_source and pass it to BeautifulSoup to parse:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")

# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fluid-width-video-wrapper")))

# get the page source
page_source = driver.page_source

# quit() ends the whole browser session; close() only closes the current window
driver.quit()

# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
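The explicit wait in the snippet above keeps checking a condition until it holds or a timeout expires. The idea can be sketched in plain Python as follows. This is not Selenium's actual implementation; wait_until, element_ready and the timing values are hypothetical names and numbers chosen for illustration:

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Hypothetical condition that becomes true on the third check,
# standing in for "the element has appeared in the DOM".
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(element_ready, timeout=2.0, poll_interval=0.01))  # True
```

Polling like this is generally preferable to a fixed sleep, because the script continues as soon as the content is there and fails loudly if it never arrives.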

This is the general approach, but your case is a little bit different - there is an iframe element which contains the video player. If you want to access the script elements inside the iframe, you would need to switch to it and then get the .page_source:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")

# wait for the page to load, switch to iframe
wait = WebDriverWait(driver, 10)
frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[src*=video]")))
driver.switch_to.frame(frame)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".controls")))

# get the page source
page_source = driver.page_source

# quit() ends the whole browser session; close() only closes the current window
driver.quit()

# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
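The reason for switching into the frame is that the outer page only holds a reference to the framed document, not its markup. A small sketch with the standard library's html.parser illustrates this; the markup and the player.example.com URL are invented for the example and are not taken from the real page:

```python
from html.parser import HTMLParser

# Hypothetical outer page: the player lives in a separate document that the
# iframe merely points to; none of that document's markup appears here.
OUTER_HTML = """
<html><body>
<div class="fluid-width-video-wrapper">
  <iframe src="https://player.example.com/video/123"></iframe>
</div>
</body></html>
"""


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every iframe in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


collector = IframeSrcCollector()
collector.feed(OUTER_HTML)
collector.close()

print(collector.sources)  # ['https://player.example.com/video/123']
```

Any script elements belonging to the player are part of the document behind that URL, so they become reachable only after the switch.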

Source: https://stackoverflow.com/questions/36129963/use-beautifulsoup-to-obtain-view-element-code-instead-of-view-source-code

Tags

javascript

python

html

beautifulsoup