How do I save the original HTML file with Apache Nutch?

野的像风 2020-12-06 14:13

I'm new to search engines and web crawlers. I want to store all the original pages of a particular website as HTML files, but with Apache Nutch I can only get the bina

5 Answers
  •  悲&欢浪女
    2020-12-06 14:35

    In Apache Nutch 2.3.1, you can save the raw HTML by editing the Nutch source code. First, get Nutch running in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse

    Once Nutch runs in Eclipse, edit FetcherReducer.java and add the code below to the output method, then run ant eclipse again to rebuild the class.

    After that, the raw HTML will be stored in the reprUrl column in your database (the code writes it through setReprUrl).

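    // Goes inside the output method of FetcherReducer.java. Needs java.io.ByteArrayInputStream,
    // java.nio.ByteBuffer and java.util.Scanner imported at the top of the file.
    if (content != null) {
        ByteBuffer raw = fit.page.getContent();
        if (raw != null) {
            // Wrap only the readable window of the buffer's backing array
            ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
            // Note: Scanner decodes with the platform default charset here
            Scanner scanner = new Scanner(arrayInputStream);
            scanner.useDelimiter("\\Z"); // read the whole stream as a single token
            String data = "";
            if (scanner.hasNext()) {
                data = scanner.next();
            }
            // Store the raw HTML in the page's reprUrl field
            fit.page.setReprUrl(StringUtil.cleanField(data));
            scanner.close();
        }
    }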
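
Since the question asks for HTML files rather than a database column, here is a minimal standalone sketch of writing such a content buffer to disk. It is not part of Nutch: the class name RawHtmlSaver, the helper names, and the file naming scheme are all hypothetical, and it only assumes you already have the page bytes as a ByteBuffer (as returned by getContent() in the snippet above).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a Nutch class.
public class RawHtmlSaver {

    /** Decodes the readable part of the buffer without moving its position. */
    public static String decode(ByteBuffer raw, Charset charset) {
        // duplicate() shares the bytes but keeps an independent position/limit
        return charset.decode(raw.duplicate()).toString();
    }

    /** Assumed naming scheme: replace characters that are unsafe in file names. */
    public static String fileNameFor(String url) {
        return url.replaceAll("[^A-Za-z0-9._-]", "_") + ".html";
    }

    /** Writes the raw bytes unchanged, so the page keeps its original encoding. */
    public static Path save(Path dir, String url, ByteBuffer raw) throws IOException {
        Files.createDirectories(dir);
        ByteBuffer view = raw.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return Files.write(dir.resolve(fileNameFor(url)), bytes);
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer raw = ByteBuffer.wrap(
                "<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_8));
        Path out = save(Files.createTempDirectory("nutch-raw"),
                "http://example.com/index.html", raw);
        System.out.println(out + " -> " + decode(raw, StandardCharsets.UTF_8));
    }
}
```

Writing the bytes as-is avoids guessing the page's charset; decode is only needed if you want the content as a String, and the charset passed to it is an assumption about the page.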
    
