Scraping dynamic content through Selenium?

Question

I'm trying to scrape dynamic content from a blog through Selenium, but it always returns unrendered JavaScript.

To test this behavior, I waited until the iframe loaded completely and printed its content, which prints fine. But when I move back to the parent frame, it just displays unrendered JavaScript again.

I'm looking for a way to print the completely rendered HTML content:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome("path to chrome driver")   
driver.get('http://justgivemechocolateandnobodygetshurt.blogspot.com/')

WebDriverWait(driver, 40).until(expected_conditions.frame_to_be_available_and_switch_to_it((By.ID, "navbar-iframe")))

# Rendered iframe HTML is printed.
content = driver.page_source
print content.encode("utf-8")

# When I switch back to parent frame it again prints non rendered JavaScript.
driver.switch_to.parent_frame()
content = driver.page_source
print content.encode("utf-8")

Answer 1:

The problem is that .page_source works only in the current context. The WebDriver specification has the notion of a "current top-level browsing context". This means that if you call it on the default content, you will not get the inner HTML of the child iframe elements. For that, you have to switch into the context of a frame and call .page_source there.

In other words, to get the complete HTML of the page, including the page source of the iframes, you have to switch into the iframe contexts one by one and get the sources separately.
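The frame-by-frame approach could be sketched roughly like this. The helper name collect_page_sources is made up for illustration, it only looks at iframes directly under the top-level document (nested iframes are not handled), and it assumes the frames have already finished loading:

```python
def collect_page_sources(driver):
    """Return [top-level source, first iframe source, second iframe source, ...].

    Hypothetical helper: `driver` is an already-created Selenium WebDriver.
    Only iframes directly inside the top-level document are visited.
    """
    # Start from the top-level document.
    driver.switch_to.default_content()
    sources = [driver.page_source]

    # "tag name" is the locator strategy string that By.TAG_NAME stands for.
    frames = driver.find_elements("tag name", "iframe")

    for frame in frames:
        # page_source now reflects the iframe's document, not the parent's.
        driver.switch_to.frame(frame)
        sources.append(driver.page_source)
        # Go back to the top-level document before visiting the next iframe.
        driver.switch_to.default_content()

    return sources
```

With the driver from the question, collect_page_sources(driver)[0] would be the parent page and the remaining entries the iframe documents, which you can then print or parse separately.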

See also:

Command Contexts
Switch To Frame

Old answer:

I would wait for at least one blog entry to be loaded before getting the page_source:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 40)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".entry-content")))

print(driver.page_source)

Source: https://stackoverflow.com/questions/36779288/scraping-dynamic-content-through-selenium

Tags

javascript

html

python-2.7

selenium

web-scraping