Question
I'm using the following code to obtain all <script>...</script> content from a webpage (see url in code):
import urllib2
from bs4 import BeautifulSoup
import re
import imp
url = "http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
script = soup.find_all("script")
print script #just to check the output of script
However, BeautifulSoup searches within the source code of the webpage (Ctrl+U in Chrome). I want it to search within the element code instead (Ctrl+Shift+I in Chrome).
The piece of code I'm really interested in appears only in the element code, not in the source code.
Answer 1:
The first thing to understand is that neither BeautifulSoup nor urllib2 is a browser. urllib2 only downloads the initial "static" page; it cannot execute JavaScript as a real browser would. Hence, you will always get the "View Page Source" content, never the elements that scripts add after the page loads.
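To make the difference concrete, here is a minimal sketch using a made-up HTML snippet (not the actual page). The inline script would add a div with the hypothetical id "player" when run in a browser, but BeautifulSoup only parses the markup and never runs the script, so that element is missing from the parsed tree:

```python
from bs4 import BeautifulSoup

# Made-up static HTML, standing in for what urllib2 would download.
# In a real browser, the script below would add <div id="player"> to the DOM.
STATIC_HTML = """
<html>
  <body>
    <div id="content"></div>
    <script>
      var el = document.createElement("div");
      el.id = "player";
      document.body.appendChild(el);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The script tag itself is part of the static markup, so it is found.
scripts = soup.find_all("script")

# The element the script would create is not there: nothing executed the script.
player = soup.find(id="player")

print(len(scripts))
print(player)
```

The script tag is found, but the lookup for the script-generated element returns None. That is the same gap you see between "View Page Source" and the Elements panel.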
To solve your problem, fire up a real browser via selenium, wait for the page to load, get the .page_source and pass it to BeautifulSoup to parse:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fluid-width-video-wrapper")))
# get the page source
page_source = driver.page_source
driver.quit()  # quit() ends the whole browser session; close() only closes the current window
# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
This is the general approach, but your case is slightly different: the video player lives inside an iframe element. To access the script elements inside the iframe, you need to switch to it first and then get the .page_source:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
# wait for the page to load, switch to iframe
wait = WebDriverWait(driver, 10)
frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[src*=video]")))
driver.switch_to.frame(frame)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".controls")))
# get the page source
page_source = driver.page_source
driver.quit()  # quit() ends the whole browser session; close() only closes the current window
# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
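Once you have the rendered HTML, find_all("script") returns both external scripts (those with a src attribute) and inline ones. Below is a small sketch of separating the two. It uses a made-up HTML string in place of driver.page_source so it runs without a browser; the helper name split_scripts and the sample URLs are hypothetical:

```python
from bs4 import BeautifulSoup


def split_scripts(html):
    """Return (external_srcs, inline_bodies) for all script tags in html."""
    soup = BeautifulSoup(html, "html.parser")
    external, inline = [], []
    for tag in soup.find_all("script"):
        src = tag.get("src")
        if src:
            # External script: only the URL is available in the HTML.
            external.append(src)
        else:
            # Inline script: .string holds the script body (None if empty).
            inline.append((tag.string or "").strip())
    return external, inline


# Made-up sample, standing in for driver.page_source.
SAMPLE_HTML = """
<html><head>
<script src="/static/player.js"></script>
<script>var videoUrl = "http://example.com/video.mp4";</script>
</head><body></body></html>
"""

external, inline = split_scripts(SAMPLE_HTML)
print(external)
print(inline)
```

With the real page you would pass page_source to the helper instead of SAMPLE_HTML. The inline bodies are usually where values set by JavaScript, such as a video URL, end up.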
Source: https://stackoverflow.com/questions/36129963/use-beautifulsoup-to-obtain-view-element-code-instead-of-view-source-code