I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report the links it finds.
I was in a similar situation. I wrote the screenshot system for a fairly well-known website service using Selenium and faced the same predicament: I could not assume anything about the page being loaded.
After speaking with some of the Selenium developers, the answer was that various WebDriver implementations (Firefox Driver versus IEDriver for example) make different choices about when a page is considered to be loaded or not for the WebDriver to return control.
If you dig deep into the Selenium code, you can find the spots that try to make the best choice, but because a number of things can cause the state being waited on to fail, such as multiple frames where one never finishes loading in a timely manner, there are cases where the driver simply never returns.
I was told, "it's an open-source project", and that it probably won't/can't be corrected for every possible scenario, but that I could make fixes and submit patches where applicable.
In the long run, that was a bit much for me to take on, so, like you, I created my own timeout process. Since I use Java, I created a new Thread that, upon reaching the timeout, tries several things to get WebDriver to return; sometimes simply sending certain keys to the browser has been enough to get it to respond. If it still does not return, I kill the browser and try again.
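The same watchdog idea translates to the asker's Python setup. Here is a minimal sketch (not my production code, and `load_page`/`restart` are hypothetical placeholders): run the possibly-hanging call in a worker, enforce a hard timeout, and on timeout kill and relaunch the browser, then retry.

```python
import concurrent.futures

def load_with_timeout(load_page, restart, timeout=30.0, retries=1):
    """Run load_page() (e.g. a wrapper around driver.get(url)) with a hard
    timeout; on timeout, call restart() (kill/relaunch the browser) and try
    again up to `retries` more times."""
    for _ in range(retries + 1):
        # One worker per attempt. A hung worker thread cannot be killed in
        # Python, which is exactly why the browser process itself gets killed.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(load_page)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            restart()
        finally:
            pool.shutdown(wait=False)  # don't block on a hung thread
    raise RuntimeError("page load kept timing out")
```

The key point is that the watchdog never trusts the driver call to come back on its own; the browser process is the only thing you can reliably terminate.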
Restarting the driver has handled most cases for us, as if the second launch of the browser left it in a more settled state (mind you, we launch from VMs, and the browser constantly wants to check for updates and run certain routines when it hasn't been launched recently).
Another piece of this is that we load a known URL first and confirm some aspects of the browser, and that we are in fact able to interact with it, before continuing. With these steps together, the failure rate is quite low: about 3% across thousands of tests on all browsers/versions/OSs (Firefox, IE, Chrome, Safari, Opera, iOS, Android, etc.).
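That health check can be a small function. A sketch, assuming a page you control with a known title (`KNOWN_URL` and the expected title here are placeholders, and `driver` is anything with the WebDriver-style `get()`/`title` interface from the Python bindings):

```python
KNOWN_URL = "http://example.com/"  # hypothetical stable page we control

def browser_is_healthy(driver, expected_title="Example Domain"):
    """Load a known page and confirm the browser actually responds before
    handing it arbitrary URLs. Any exception counts as an unhealthy browser."""
    try:
        driver.get(KNOWN_URL)
        return expected_title in driver.title
    except Exception:
        return False
```

If this returns False, we throw the browser away and launch a fresh one rather than feed an arbitrary URL to a driver that is already misbehaving.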
Last but not least, for your case, it sounds like you only really need to capture the links on the page, not drive a full browser. There are other approaches I might take for that, namely cURL and Linux tools.
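For example, instead of shelling out to cURL, you could stay in Python and parse the fetched HTML with the standard library; no browser, so nothing to hang. A sketch (this only sees links present in the static HTML, not ones added by JavaScript):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags; no browser needed."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

Pair it with `urllib.request` (or cURL) and a socket timeout, and the worst case is a timed-out HTTP request rather than a wedged WebDriver.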