Web Crawling (Ajax/JavaScript enabled pages) using Java

I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem

3 Answers

天涯浪人

2020-12-09 06:30
Hi, I found a workaround using another library. I used the Selenium WebDriver library (org.openqa.selenium.WebDriver) to extract the dynamic content. Here is the sample code.
```
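import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class CollectUrls {

    private WebDriver driver;

    public CollectUrls() {
        // Launch Firefox; wait up to 30 seconds when locating elements.
        this.driver = new FirefoxDriver();
        this.driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
    }

    protected void next(String url, List<String> argUrlsList) {
        // Load the page in a real browser so that its JavaScript runs.
        this.driver.get(url);
        // The page source as reported by the browser after loading.
        String htmlContent = this.driver.getPageSource();
    }
}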
```
Here, "htmlContent" holds the HTML you need, including the dynamically loaded content. Please let me know if you face any issues.
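
Once you have "htmlContent", one possible next step is to pull the links out of it and add them to a list such as argUrlsList. The snippet below is only a rough sketch of that idea using a regular expression from the standard library. The class name LinkExtractor and its method are made up for illustration and are not part of Selenium or crawler4j. The pattern is deliberately simple, so it can be fooled by unusual markup, and a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer or of any library.
final class LinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag (case-insensitive).
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private LinkExtractor() {
    }

    // Returns the href values found in the given HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String href = matcher.group(2).trim();
            // Skip empty links, in-page anchors and javascript: pseudo-links.
            if (!href.isEmpty() && !href.startsWith("#")
                    && !href.toLowerCase().startsWith("javascript:")) {
                links.add(href);
            }
        }
        return links;
    }
}
```

Inside next(), this could be called as argUrlsList.addAll(LinkExtractor.extractLinks(htmlContent)). Relative links such as "/page2" would still need to be resolved against the page URL before crawling them.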

Thanks, Amar

慢半拍i

2020-12-09 06:41

I found a solution for crawling dynamic web pages using Aperture and Selenium WebDriver.
Aperture is a crawling tool, and Selenium is a testing tool that can render the page as it appears in the browser's Inspect Element view.

1. Extract the Aperture core JAR file with a decompiler tool and create a simple web crawling Java program. (https://svn.code.sf.net/p/aperture/code/aperture/trunk/)
2. Download the Selenium WebDriver JAR files and add them to your program.
3. Go to the CreatedDataObjec() method in org.semanticdesktop.aperture.accessor.http.HttpAccessor (in the decompiled Aperture source).
Add the code below:

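    // Render the page in a real browser so that its JavaScript runs.
    WebDriver driver = new FirefoxDriver();
    String baseurl = uri.toString();
    driver.get(baseurl);
    String str = driver.getPageSource();
    driver.close();
    // Replace the stream with the rendered HTML.
    stream = new ByteArrayInputStream(str.getBytes());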
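
One detail worth noting: getBytes() with no argument uses the JVM's default charset, which depended on the platform before Java 18, so pages with non-ASCII text may come out differently on different machines. A small sketch of making the encoding explicit is below. The class name RenderedHtmlStreams is made up for illustration, and UTF-8 is an assumption on my part; the answer does not say which charset Aperture expects when it reads the stream.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aperture or Selenium.
final class RenderedHtmlStreams {

    private RenderedHtmlStreams() {
    }

    // Wraps rendered HTML in an in-memory stream, always encoding it as UTF-8.
    static ByteArrayInputStream toUtf8Stream(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }
}
```

With this helper, the last line of the snippet above would become stream = RenderedHtmlStreams.toUtf8Stream(str).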

终归单人心

2020-12-09 06:47

Simply put, crawler4j is a static crawler, meaning that it does not execute the JavaScript on a page. So there is no way to get the content you want by crawling the specific page you mentioned directly. There are, of course, some workarounds.

If it is just this one page you want to crawl, you could use a connection debugger. Check out this question for some tools. Find out which URL the AJAX request calls, and crawl that URL instead.
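
As a rough sketch of that approach, assuming Java 11 or later and its built-in java.net.http client: once you know the URL the page requests in the background, you can fetch it directly. The class name AjaxEndpointFetcher and the example URL in the usage note are made up for illustration. The X-Requested-With header is something many AJAX libraries send, but whether your target server needs it (or needs other headers or cookies) is something you have to check in the debugger.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; the endpoint URL must come from your own inspection of the page.
final class AjaxEndpointFetcher {

    private AjaxEndpointFetcher() {
    }

    // Builds a GET request that resembles the browser's AJAX call.
    static HttpRequest buildRequest(String endpointUrl) {
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                // Many AJAX libraries send this header; some servers check for it.
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Accept", "application/json, text/html;q=0.9")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body (often JSON or an HTML fragment).
    static String fetch(String endpointUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(endpointUrl), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

For example, AjaxEndpointFetcher.fetch("https://example.com/api/items?page=1") would return whatever that (hypothetical) endpoint sends back, which you can then parse without running any JavaScript.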

If you need to crawl various websites with dynamic content (JavaScript/AJAX), consider using a crawler that supports dynamic content, such as Crawljax (also written in Java).
