How do I save the original HTML file with Apache Nutch?


野的像风 2020-12-06 14:13

I'm new to search engines and web crawlers. I want to store all the original pages of a particular website as HTML files, but with Apache Nutch I can only get the bina

5 answers

悲&欢浪女 (original poster)

2020-12-06 14:35

In Apache Nutch 2.3.1 you can save the raw HTML by editing the Nutch source code. First, get Nutch running in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse

Once Nutch runs in Eclipse, edit the file FetcherReducer.java and add the code below to the output method, then run ant eclipse again to rebuild the class.

The raw HTML will then be written to the reprUrl column in your database (the field set by setReprUrl in the code).

// Needs imports: java.io.ByteArrayInputStream, java.nio.ByteBuffer, java.util.Scanner
if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        // Wrap only the readable region of the buffer
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(
                raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        Scanner scanner = new Scanner(arrayInputStream);
        scanner.useDelimiter("\\Z"); // read the whole stream as a single token
        String data = "";
        if (scanner.hasNext()) {
            data = scanner.next();
        }
        // Store the raw HTML in the page's reprUrl field
        fit.page.setReprUrl(StringUtil.cleanField(data));
        scanner.close();
    }
}
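Since the question asks for HTML files rather than a database column, here is a minimal standalone sketch of the same idea using only the Java standard library: take the readable bytes of a ByteBuffer (like the one returned by getContent() above) and either decode them to a String or write them unchanged to a file. The class name RawHtmlSaver, the method names decode and save, and the UTF-8 charset are assumptions made for illustration; none of them are part of Nutch, and real pages may use a different encoding.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch.
public class RawHtmlSaver {

    // Decode the readable bytes of the buffer into a String.
    // duplicate() gives an independent position, so the caller's buffer is not consumed.
    static String decode(ByteBuffer raw, Charset charset) {
        ByteBuffer view = raw.duplicate();
        return charset.decode(view).toString();
    }

    // Write the readable bytes of the buffer to the target file unchanged,
    // so the saved file matches what was fetched byte for byte.
    static Path save(ByteBuffer raw, Path target) throws IOException {
        ByteBuffer view = raw.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return Files.write(target, bytes);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for fetched page content; UTF-8 is assumed here.
        ByteBuffer raw = ByteBuffer.wrap(
                "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8));

        Path file = Files.createTempFile("page", ".html");
        save(raw, file);

        System.out.println("Saved to " + file);
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

Writing the bytes directly avoids guessing the page encoding, which matters if the goal is to keep the original file intact; decoding is only needed when the content must be handled as text.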
