I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report the links it contains.
The "recommended" (however still ugly) solution could be to use explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

old_value = browser.find_element_by_id('thing-on-old-page').text
browser.find_element_by_link_text('my link').click()
WebDriverWait(browser, 3).until(
    expected_conditions.text_to_be_present_in_element(
        (By.ID, 'thing-on-new-page'),
        'expected new text'
    )
)
The naive attempt would be something like this:
import time

def wait_for(condition_function):
    start_time = time.time()
    while time.time() < start_time + 3:
        if condition_function():
            return True
        else:
            time.sleep(0.1)
    raise Exception(
        'Timeout waiting for {}'.format(condition_function.__name__)
    )
def click_through_to_new_page(link_text):
    browser.find_element_by_link_text(link_text).click()

    def page_has_loaded():
        page_state = browser.execute_script(
            'return document.readyState;'
        )
        return page_state == 'complete'

    wait_for(page_has_loaded)
Another, better approach would be this (credit to @ThomasMarks):
from selenium.common.exceptions import StaleElementReferenceException

def click_through_to_new_page(link_text):
    link = browser.find_element_by_link_text(link_text)
    link.click()

    def link_has_gone_stale():
        try:
            # poll the link with an arbitrary call
            link.find_elements_by_id('doesnt-matter')
            return False
        except StaleElementReferenceException:
            return True

    wait_for(link_has_gone_stale)
And the final example compares the ids of the page's <html> element before and after the click, as below (which should be bulletproof):
class wait_for_page_load(object):

    def __init__(self, browser):
        self.browser = browser

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        wait_for(self.page_has_loaded)
And now we can do:
with wait_for_page_load(browser):
    browser.find_element_by_link_text('my link').click()
The code samples above are from Harry's blog.
Here is the solution proposed by Tommy Beadle (using the staleness approach):
import contextlib
from selenium.webdriver import Remote
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

class MyRemote(Remote):
    @contextlib.contextmanager
    def wait_for_page_load(self, timeout=30):
        old_page = self.find_element_by_tag_name('html')
        yield
        WebDriverWait(self, timeout).until(staleness_of(old_page))
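Usage then follows the same context-manager pattern as above. A minimal sketch, assuming a Selenium server at the default address (the capabilities dict here is illustrative):

driver = MyRemote(desired_capabilities={'browserName': 'firefox'})
with driver.wait_for_page_load(timeout=10):
    driver.find_element_by_link_text('my link').click()
# If we get past the with block, the old <html> element went stale,
# i.e. a new page has replaced the one we clicked on.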
As far as I know, your readystate_complete is not doing anything, since driver.get() already checks for that condition. Anyway, I have seen it fail in many cases. One thing you could try is to route your traffic through a proxy and use it to watch for network traffic. E.g. BrowserMob has a wait_for_traffic_to_stop method:
def wait_for_traffic_to_stop(self, quiet_period, timeout):
    """
    Waits for the network to be quiet

    :Args:
     - quiet_period - number of milliseconds the network needs to be quiet for
     - timeout - max number of milliseconds to wait
    """
    r = requests.put('%s/proxy/%s/wait' % (self.host, self.port),
                     {'quietPeriodInMs': quiet_period, 'timeoutInMs': timeout})
    return r.status_code
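Wiring that into Selenium might look like the sketch below; this assumes the browsermob-proxy Python client and a local BrowserMob binary (the path is illustrative):

from browsermobproxy import Server
from selenium import webdriver

server = Server('/path/to/browsermob-proxy')  # illustrative path to the binary
server.start()
proxy = server.create_proxy()

# Point Firefox at the proxy so all page traffic flows through it
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)

driver.get(url)
# Block until no request has been seen for 2s, giving up after 30s
proxy.wait_for_traffic_to_stop(2000, 30000)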
I was in a similar situation when I wrote the screenshot system for a fairly well-known website service, and I had the same predicament: I could not know anything about the page being loaded.
After speaking with some of the Selenium developers, the answer was that various WebDriver implementations (Firefox Driver versus IEDriver for example) make different choices about when a page is considered to be loaded or not for the WebDriver to return control.
If you dig deep into the Selenium code, you can find the spots that try to make the best choices. But since a number of things can cause the looked-for state to fail, such as multiple frames where one doesn't complete in a timely manner, there are cases where the driver simply does not return.
I was told, "it's an open-source project", and that it probably won't/can't be corrected for every possible scenario, but that I could make fixes and submit patches where applicable.
In the long run, that was a bit much for me to take on, so, like you, I created my own timeout process. Since I use Java, I created a new Thread that, upon reaching the timeout, tries several things to get WebDriver to return; at times, even just pressing certain keys to get the browser to respond has worked. If it still does not return, I kill the browser and try again.
Starting the driver again has handled most cases for us, as if the second load of the browser allowed it to be in a more settled state (mind you we are launching from VMs and the browser constantly wants to check for updates and run certain routines when it hasn't been launched recently).
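In Python, the same kill-and-retry watchdog might look like this; a rough sketch of the idea rather than my actual Java code, with the timeout, retry count, and function name all illustrative:

import threading
from selenium import webdriver

def load_with_watchdog(url, timeout=60, retries=2):
    for attempt in range(retries + 1):
        driver = webdriver.Firefox()
        # If get() hangs, the watchdog thread quits the browser after
        # `timeout` seconds, which makes the blocked get() raise.
        watchdog = threading.Timer(timeout, driver.quit)
        watchdog.start()
        try:
            driver.get(url)
            watchdog.cancel()
            return driver  # success: the caller owns a responsive driver
        except Exception:
            watchdog.cancel()
            try:
                driver.quit()
            except Exception:
                pass  # the watchdog may have killed the browser already
    raise RuntimeError('Page failed to load after {} attempts'.format(retries + 1))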
Another piece of this is that we launch a known URL first and confirm some aspects of the browser, and that we are in fact able to interact with it, before continuing. With these steps together, the failure rate is pretty low: about 3% across thousands of tests on all browsers/versions/OSes (FF, IE, Chrome, Safari, Opera, iOS, Android, etc.).
Last but not least, for your case, it sounds like you only really need to capture the links on the page, not drive full browser automation. There are other approaches I might take toward that, namely cURL and Linux tools.
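For instance, if the target pages render their links without JavaScript, a plain HTTP fetch is enough. A minimal sketch using requests and lxml (both library choices are my assumptions; the answer above only names cURL and Linux tools):

import requests
from lxml import html

def capture_links(url):
    response = requests.get(url, timeout=30)
    tree = html.fromstring(response.content)
    tree.make_links_absolute(url)  # resolve relative hrefs against the page URL
    # iterlinks() yields (element, attribute, link, pos) tuples
    return [link for element, attribute, link, pos in tree.iterlinks()
            if attribute == 'href']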
If the page is still loading indefinitely, I'm guessing the readyState never reaches "complete". If you're using Firefox, you can force the page load to halt by calling window.stop():
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

try:
    driver.get(url)
    WebDriverWait(driver, 30).until(readystate_complete)
except TimeoutException:
    driver.execute_script("window.stop();")
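For completeness, readystate_complete is the custom condition from the question; a minimal sketch of what it presumably looks like (the body is my assumption):

def readystate_complete(driver):
    # WebDriverWait calls this repeatedly until it returns a truthy value
    return driver.execute_script('return document.readyState') == 'complete'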