nutch | 易学教程

Nutch 2.2.1 doesn't continue after Injector job

I am learning Nutch and trying to crawl as per this tutorial. I am working on an Ubuntu machine with the bash shell. When I run the script it starts, but nothing happens after these lines:

InjectorJob: starting at 2014-03-23 09:28:50
InjectorJob: Injecting urlDir: urls/seed.txt

I have waited for hours, and I tried running the same with sudo; the same issue occurs. I have also tried the default URLs given in the tutorial. What could the probable errors be?

What was missing: I had not added the proxy host and port details to nutch-site.xml, although I was accessing the network through a proxy. Setting up the same for
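The proxy fix described above can be sketched as a nutch-site.xml fragment. This is a minimal illustration, assuming the standard `http.proxy.host` and `http.proxy.port` properties from nutch-default.xml; the host name and port number are placeholders, not values from the original post.

```xml
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.proxy.host</name>
    <!-- placeholder: replace with your proxy's host name -->
    <value>proxy.example.com</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <!-- placeholder: replace with your proxy's port -->
    <value>8080</value>
  </property>
</configuration>
```

If a `<configuration>` element already exists in the file, only the two `<property>` elements need to be added inside it.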

Nutch regex-urlfilter syntax

I am running Nutch 1.6 and it crawls specific sites correctly, but I can't seem to get the syntax right for the file NUTCH_ROOT/conf/regex-urlfilter.txt. The site I want to crawl has a URL similar to this:

http://www.example.com/foo.cfm

On that page there are numerous links that match the following pattern:

http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976

I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following:

+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$

Nutch matches on the first one
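One way to debug rules like the two quoted above is to test them outside Nutch. The sketch below is a hypothetical stand-in, not Nutch's own filter class: `UrlFilterSketch` is an invented name, and it only mimics the behaviour documented in the comments of the stock regex-urlfilter.txt, where the first matching rule decides and a URL matching no rule is rejected. The dots are escaped and `(.+)*` is simplified to `.+`, which are my adjustments rather than part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): first-match-wins evaluation of +/- regex rules. */
class UrlFilterSketch {

    /** One rule line: a leading '+' accepts, a leading '-' rejects; the rest is the regex. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with '+' or '-': " + line);
            }
            this.accept = sign == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    UrlFilterSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and comments, as in the rule file format.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            rules.add(new Rule(trimmed));
        }
    }

    /** True if the first rule whose regex is found in the URL is a '+' rule; false otherwise. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        String seed = "http://www.example.com/foo.cfm";
        String deep = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976";

        UrlFilterSketch twoRules = new UrlFilterSketch(
            "+^http://www\\.example\\.com/foo\\.cfm$",
            "+^http://www\\.example\\.com/foo\\.cfm/.+$");
        System.out.println(twoRules.accepts(seed)); // true
        System.out.println(twoRules.accepts(deep)); // true

        // An earlier '-' rule wins over a later '+' rule.
        UrlFilterSketch withSkipRule = new UrlFilterSketch(
            "-[?*!@=]",
            "+^http://www\\.example\\.com/foo\\.cfm/.+$");
        System.out.println(withSkipRule.accepts(deep)); // false, because of the '=' in ID=6976
    }
}
```

Taken alone, the two `+` rules accept both URLs in this sketch. The second filter shows how rule order matters: as far as I know, the regex-urlfilter.txt shipped with Nutch 1.x contains a `-[?*!@=]` line above the accept rules, and a rule like that would reject the `ID=6976` link before any `+` rule is reached. Whether that is the cause in this particular setup is an assumption, since the question is cut off.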

Error when running Nutch: unzipBestEffort returned null

Error message:

fetch of http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html failed with: java.io.IOException: unzipBestEffort returned null

The full error output is:

2014-03-12 16:48:38,031 ERROR http.Http - Failed to get protocol output
java.io.IOException: unzipBestEffort returned null
    at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:164)
    at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)

Error when running Nutch under Cygwin: Failed to set permissions of path

Error message:

Exception in thread "main" java.io.IOException:Failed to set permissions of path:\tmp\hadoop-ysc\mapred\staging\ysc-2036315919\.staging to 0700

Official bug reference: https://issues.apache.org/jira/browse/HADOOP-7682

Solution:

1. Download and extract http://archive.apache.org/dist/hadoop/common/hadoop-1.1.2/hadoop-1.1.2.tar.gz
2. Edit hadoop-1.1.2\src\core\org\apache\hadoop\fs\FileUtil.java: search for "Failed to set permissions of path", go to line 689, and change throw new IOException to LOG.warn
3. Edit hadoop-1.1.2\build.xml: search for autoreconf and remove the 6 matching exec entries that have executable="autoreconf"
4. Download and extract Ant, then add the bin directory under the Ant directory to the PATH environment variable
5. On the Cygwin command line, change to the hadoop-1.1.2 directory and run ant
6

Integrating Nutch 1.7 with Eclipse

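1. Deploy the Nutch project into Eclipse

First, find the FAQ link on the Nutch website, http://wiki.apache.org/nutch/FAQ, open it, and follow the second link. Configure everything according to that document; wherever you get stuck, search online for the specific problem. For the integration you can also refer to http://blog.csdn.net/witsmakemen/article/details/8866235. The following prerequisites must be in place before running:

A. Install and configure Apache Ant on Windows: http://ant.apache.org/manual/index.html
B. Install Eclipse; needless to say, this is required.
C. Install svn on Linux. Purpose: check out the Nutch 1.7 source code.
D. Check out the Nutch 1.7 code on Linux:
[root@nutch-five branch-1.7]# svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.7/
E. Install the Ivy plugin on Linux. Purpose: download the JAR files dynamically according to the Ivy configuration file.
F. Build branch-1.7:
[root@nutch-five branch-1.7]# ant

2. In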

Nutch No agents listed in 'http.agent.name'

Question:

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:135)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect
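The exception message names the `http.agent.name` property, which is empty by default and is normally set in conf/nutch-site.xml. The fragment below is a minimal sketch of such a setting; the agent name `MyNutchSpider` is a placeholder, and the excerpt above does not show which fix the original poster used.

```xml
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder: any non-empty name identifying your crawler -->
    <value>MyNutchSpider</value>
  </property>
</configuration>
```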

Using Nutch crawler with Solr

Question: Am I able to integrate the Apache Nutch crawler with the Solr index server? Edit: One of our devs came up with a solution from these posts: Running Nutch and Solr; Update for Running Nutch and Solr. Answer: Yes.

Answer 1: If you're willing to upgrade to Nutch 1.0, you can use solrindex as described in this article by Lucid Imagination: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/.

Answer 2: It's still an open issue. If you're feeling adventurous you could try applying those patches yourself,

Adding the word segmenter to Luke

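word is a distributed Chinese word segmentation component implemented in Java.

1. Download http://luke.googlecode.com/files/lukeall-4.0.0-ALPHA.jar (not accessible from within China)
2. Download and extract Java中文分词组件word-1.0-bin.zip
3. Extract the 4 JAR files inside the extracted Java中文分词组件word-1.0-bin/word-1.0 folder into the current folder. Open lukeall-4.0.0-ALPHA.jar with an archive tool such as WinRAR, and drag into it every file in the current folder except the .jar, .bat, and .html files
4. Run java -jar lukeall-4.0.0-ALPHA.jar to start Luke; in the Analysis section of the Search tab you can now select the org.apdplat.word.lucene.ChineseWordAnalyzer analyzer
5. In the Plugins tab, under "Available analyzers found on the current classpath", you can also select the org.apdplat.word.lucene.ChineseWordAnalyzer analyzer

Download a Luke build with the plugin already integrated: lukeall-4.0.0-ALPHA-with-word-1.0.jar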

Challenges facing web crawlers: link construction

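Crawling and anti-crawling are like cracking and anti-cracking in the security field: they oppose each other, constrain each other, and at the same time push each other forward. Website construction has evolved from simple static sites to dynamic sites, information delivery from one-way reception by the user to two-way interaction, and content production from centralized creation by the site owner to creation by everyone.

The development of Web technology poses a great challenge to web crawlers. We take Nutch as an example to show where the difficulty lies:

1. Static websites (easy)
2. Dynamic websites without traps (hard)
3. Dynamic websites with traps (very hard)

For a static website the number of pages is finite. However the pages link to one another, and whatever their content is, the site can be crawled completely in a finite amount of time. For static websites we assume that there are no traps (no program dynamically generating an endless stream of static pages) and that the content quality is high (no keyword stuffing to boost search rankings, no large numbers of static pages with identical or near-identical content, and so on). Such a static website is the ideal crawl target.

For a dynamic website without traps, the user has to interact with the server, and the server returns results dynamically according to the parameters the user specifies. To crawl such a site, the crawler would need to enumerate every usable parameter, and very often it cannot. Suppose we want to crawl all the products on Taobao: going through its search entry point cannot cover everything, because we cannot enumerate all the products. We can instead use the category listings as the entry point and crawl page by page. This works, but it still cannot capture everything, because Taobao limits pagination, for example to 100 pages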

How should you choose a crawler framework when developing a web crawler?

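Some people ask whether they should choose Nutch, Crawler4j, WebMagic, scrapy, WebCollector, or something else when developing a web crawler. Here are some casual thoughts based on my experience. The crawlers mentioned above fall roughly into 3 categories:

1. Distributed crawlers: Nutch
2. Java single-machine crawlers: Crawler4j, WebMagic, WebCollector
3. Non-Java single-machine crawlers: scrapy

Category 1: distributed crawlers

Crawlers go distributed mainly to solve two problems:

1) Managing massive numbers of URLs
2) Network speed

The most popular distributed crawler at present is Apache Nutch. For most users, however, Nutch is the worst choice among these categories, for the following reasons:

1) Nutch is a crawler designed for search engines, whereas most users need a crawler for precise data extraction. Two thirds of the workflow Nutch runs is designed for search engines and is of little value for precise extraction. In other words, using Nutch for data extraction wastes a lot of time on unnecessary computation. And if you try to adapt Nutch to precise extraction through secondary development, you essentially have to break Nutch's framework and change it beyond recognition; anyone capable of modifying Nutch to that extent would be better off writing a new distributed crawler framework.
2) Nutch depends on Hadoop to run, and Hadoop itself consumes a lot of time. If the cluster has few machines, crawling is actually slower than with a single-machine crawler.
3) Although Nutch has a plugin mechanism, and it is promoted as a highlight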
