Selenium WebDriver analyze large collection of links quickly


Question


I have a web page with an extremely large number of links (around 300), and I would like to collect information about these links.

Here is my code:

beginning_time = Time.now
#This gets a collection of links from the webpage
tmp = driver.find_elements(:xpath,"//a[string()]")
end_time = Time.now
puts "Execute links:#{(end_time - beginning_time)*1000} milliseconds for #{tmp.length} links"


before_loop = Time.now
#Here I iterate through the links
tmp.each do |link|
    #I am not interested in the links I can't see
    if(link.location.x < windowX and link.location.y < windowY)
        #I then insert the links into a NoSQL database, 
        #but for all purposes you could imagine this as just saving the data in a hash table.
        $elements.insert({
            "text" => link.text,
            "href" => link.attribute("href"),
            "type" => "text",
            "x" => link.location.x,
            "y" => link.location.y,
            "url" => url,
            "accessTime" => accessTime,
            "browserId" => browserId
        })
    end
end
after_loop = Time.now
puts "The loop took #{(after_loop - before_loop)*1000} milliseconds"

It currently takes about 20 ms to get the link collection and around 4000 ms (4 seconds) to retrieve the information for the links. When I separate the accessors from the NoSQL insert, I find that the insert alone takes only about 20 ms and that the majority of the time is spent in the accessors (which became much slower after being separated from the NoSQL insert, for reasons I don't understand). This makes me conclude that the accessors must be executing JavaScript.
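For reference, this is roughly how I separated the two measurements (a minimal sketch; the Benchmark usage here is my own addition, and `tmp` is the collection from above):

    require 'benchmark'

    # Time the accessors alone. Each call (text, attribute, location)
    # is a separate round-trip to the browser.
    accessor_seconds = Benchmark.realtime do
      tmp.each do |link|
        link.text
        link.attribute("href")
        link.location
      end
    end
    puts "Accessors alone: #{(accessor_seconds * 1000).round} milliseconds"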

My question is: How do I collect these links and their information more quickly?

The first solution that came to mind was to run two drivers in parallel, but WebDriver instances are not thread-safe, meaning each thread would need its own WebDriver instance, and each instance would have to navigate to the page itself. That raises the question of how to download the page source once so it can be loaded into another driver, which cannot be done through Selenium; it would have to be done in Chrome itself with desktop automation tools, adding considerable overhead.
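For illustration, a one-driver-per-thread split might look something like the sketch below (assumptions: `url`, `windowX`, and `windowY` are as above, and the even/odd split of the work is mine; note that each thread still pays the full page-load cost):

    require 'selenium-webdriver'

    threads = 2.times.map do |i|
      Thread.new do
        # Each thread needs its own WebDriver instance, since a single
        # instance must not be shared across threads.
        driver = Selenium::WebDriver.for :chrome
        driver.navigate.to url
        links = driver.find_elements(:xpath, "//a[string()]")
        links.each_with_index do |link, idx|
          next unless idx % 2 == i  # process only this thread's half
          # ... same accessor + insert work as in the loop above ...
        end
        driver.quit
      end
    end
    threads.each(&:join)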

Another alternative I heard of was to stop using ChromeDriver and just use PhantomJS, but I need to display the page in a visible browser.

Is there any other alternative that I haven't considered yet?


Answer 1:


You seem to be using WebDriver purely to execute JavaScript rather than to access the objects.

A couple of ideas to try if you drop the JavaScript route (excuse the Java, but you get the idea):

    //We have restricted via XPath, so we get fewer links back AND do not
    //have to check the text within the loop
    List<WebElement> linksWithText = driver.findElements(By.xpath("//a[text() and not(text()='')]"));

    for (WebElement link : linksWithText) {
        //Store the location details rather than re-getting them each time
        Point location = link.getLocation();
        Integer x = location.getX();
        Integer y = location.getY();

        if (x < windowX && y < windowY) {
            //Insert all info using WebDriver commands
        }
    }

I normally use remote grids, so performance is a key concern in my tests; hence I always try to restrict via CSS selectors or XPath rather than getting everything and looping.
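Translated back to the question's Ruby bindings, the same idea might look like this (a sketch, reusing `windowX`/`windowY` from the question):

    # Restrict in the XPath itself so fewer elements come back.
    links_with_text = driver.find_elements(:xpath, "//a[text() and not(text()='')]")

    links_with_text.each do |link|
      location = link.location  # fetch the location once, not four times
      next unless location.x < windowX && location.y < windowY
      # ... insert the link info as in the question's loop ...
    end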



Source: https://stackoverflow.com/questions/18630317/selenium-webdriver-analyze-large-collection-of-links-quickly
