Selenium WebDriver analyze large collection of links quickly


Question


I have a web page with an extremely large number of links (around 300), and I would like to collect information about these links.

Here is my code:

beginning_time = Time.now
#This gets a collection of links from the webpage
tmp = driver.find_elements(:xpath,"//a[string()]")
end_time = Time.now
puts "Execute links:#{(end_time - beginning_time)*1000} milliseconds for #{tmp.length} links"


before_loop = Time.now
#Here I iterate through the links
tmp.each do |link|
    #I am not interested in the links I can't see
    if(link.location.x < windowX and link.location.y < windowY)
        #I then insert the links into a NoSQL database, 
        #but for all purposes you could imagine this as just saving the data in a hash table.
        $elements.insert({
            "text" => link.text,
            "href" => link.attribute("href"),
            "type" => "text",
            "x" => link.location.x,
            "y" => link.location.y,
            "url" => url,
            "accessTime" => accessTime,
            "browserId" => browserId
        })
    end
end
after_loop = Time.now
puts "The loop took #{(after_loop - before_loop)*1000} milliseconds"

It currently takes about 20 ms to get the link collection and around 4000 ms (4 seconds) to retrieve the information for the links. When I separate the accessors from the NoSQL insert, I find that the insert alone takes only about 20 ms and that the majority of the time is spent in the accessors (which became much slower after being separated from the NoSQL insert, for reasons I don't understand). This makes me conclude that the accessors must be executing JavaScript.
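For reference, this is roughly how I separated the two measurements (a minimal sketch; the Benchmark usage here is my own addition, and `tmp` is the collection from above):

    require 'benchmark'

    # Time the accessors alone. Each call (text, attribute, location)
    # is a separate round-trip to the browser.
    accessor_seconds = Benchmark.realtime do
      tmp.each do |link|
        link.text
        link.attribute("href")
        link.location
      end
    end
    puts "Accessors alone: #{(accessor_seconds * 1000).round} milliseconds"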

My question is: How do I collect these links and their information more quickly?

The first solution that came to mind was to run two drivers in parallel, but WebDriver instances are not thread-safe, meaning each thread would need its own WebDriver instance, and each instance would have to navigate to the page itself. That raises the question of how to download the page source once so it can be loaded into another driver, which cannot be done through Selenium; it would have to be done in Chrome itself with desktop automation tools, adding considerable overhead.
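For illustration, a one-driver-per-thread split might look something like the sketch below (assumptions: `url`, `windowX`, and `windowY` are as above, and the even/odd split of the work is mine; note that each thread still pays the full page-load cost):

    require 'selenium-webdriver'

    threads = 2.times.map do |i|
      Thread.new do
        # Each thread needs its own WebDriver instance, since a single
        # instance must not be shared across threads.
        driver = Selenium::WebDriver.for :chrome
        driver.navigate.to url
        links = driver.find_elements(:xpath, "//a[string()]")
        links.each_with_index do |link, idx|
          next unless idx % 2 == i  # process only this thread's half
          # ... same accessor + insert work as in the loop above ...
        end
        driver.quit
      end
    end
    threads.each(&:join)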

Another alternative I heard of was to stop using ChromeDriver and just use PhantomJS, but I need to display the page in a visible browser.

Is there any other alternative that I haven't considered yet?


Answer 1:


You seem to be using WebDriver purely to execute JavaScript rather than to access the objects.

A couple of ideas to try if you drop the JavaScript route (excuse the Java, but you get the idea):

    //We have restricted via XPath, so we get fewer links back AND do not
    //have to check the text within the loop
    List<WebElement> linksWithText = driver.findElements(By.xpath("//a[text() and not(text()='')]"));

    for (WebElement link : linksWithText) {
        //Store the location details rather than re-getting them each time
        Point location = link.getLocation();
        Integer x = location.getX();
        Integer y = location.getY();

        if (x < windowX && y < windowY) {
            //Insert all info using WebDriver commands
        }
    }

I normally use remote grids, so performance is a key concern in my tests; hence I always try to restrict via CSS selectors or XPath rather than getting everything and looping.
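Translated back to the question's Ruby bindings, the same idea might look like this (a sketch, reusing `windowX`/`windowY` from the question):

    # Restrict in the XPath itself so fewer elements come back.
    links_with_text = driver.find_elements(:xpath, "//a[text() and not(text()='')]")

    links_with_text.each do |link|
      location = link.location  # fetch the location once, not four times
      next unless location.x < windowX && location.y < windowY
      # ... insert the link info as in the question's loop ...
    end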



Source: https://stackoverflow.com/questions/18630317/selenium-webdriver-analyze-large-collection-of-links-quickly
