
Nutch 1.2 - Why won't nutch crawl url with query strings?

£可爱£侵袭症+ · Submitted on 2019-12-12 06:24:14
Question: I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filter in the crawl-urlfilter.txt file so it now looks like this:

# skip urls with these characters
#-[]
# skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed the filters and am telling Nutch to accept all URLs it finds on my website. Does anyone …
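Commenting out those two rules alone is often not enough: in a typical Nutch 1.x crawl-urlfilter.txt there is a separate character-class rule that skips probable query URLs. A hedged sketch of the relevant lines (rule text assumed from a stock default file, not quoted from the question):

```
# skip URLs containing certain characters as probable queries, etc.
# commented out here so URLs with query strings are kept:
#-[?*!@=]

# accept everything else
+.
```

If that `-[?*!@=]` line (or its twin in regex-urlfilter.txt) is still active, URLs containing `?` or `=` are dropped before fetching regardless of the other rules.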

Regular expression to match a URL with 6 or more levels

不想你离开。 · Submitted on 2019-12-12 05:25:53
Question: I am trying to match a URL with 6 or more levels or sub-paths, e.g.:

http://www.domain.com/level1/level2/level3/level4/level5/level6/level7/level8/level9/level10/level11/level12.html

I came up with the expression

^http:\/\/([a-zA-Z\.-]*)\W(\b\w+\b)

which matches level1 (demo). However, when I try to match a URL with six or more levels it doesn't seem to work:

^http:\/\/([a-zA-Z\.-]*)\W(\b\w+\b){6,}

(demo)

Answer 1: I think this is what you were trying for: ^http://([a-zA-Z.-]+)/(?:[^/]+/){6 …
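The answer's pattern is cut off above; assuming it ends in a `{6,}` bounded repetition (my completion, not the original answer's text), a quick sketch of how it behaves:

```python
import re

# Hypothetical completion of the truncated pattern: a host, then at
# least six slash-terminated path segments.
pattern = re.compile(r'^http://([a-zA-Z.-]+)/(?:[^/]+/){6,}')

deep = ('http://www.domain.com/level1/level2/level3/level4/level5/'
        'level6/level7/level8/level9/level10/level11/level12.html')
shallow = 'http://www.domain.com/level1/level2/level3.html'

# deep has 11 slash-terminated segments before the final file name,
# shallow has only 2, so only deep satisfies the {6,} repetition.
print(bool(pattern.match(deep)))
print(bool(pattern.match(shallow)))
```

The key difference from the question's attempt is that `(?:[^/]+/)` repeats a whole segment-plus-slash unit, whereas `(\b\w+\b){6,}` tries to repeat bare words with no separator in between.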

Elasticsearch indexing fails after successful Nutch crawl

匆匆过客 · Submitted on 2019-12-12 04:01:21
Question: I'm not sure why, but Nutch 1.13 is failing to index the data to ES (v2.3.3). The crawling itself is fine, but when it comes time to index to ES it gives me this error message:

Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache …
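A failed IndexingJob with a healthy crawl usually points at the indexer plugin's configuration rather than the crawl itself. A hedged nutch-site.xml sketch for the Nutch 1.x indexer-elastic plugin (the `elastic.*` property names and values here are assumptions drawn from that plugin's conventions, not from the question):

```xml
<!-- enable the elastic indexer alongside the usual plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value> <!-- transport port, not the 9200 HTTP port -->
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>
```

A mismatched cluster name or pointing at the HTTP port instead of the transport port are common causes of exactly this kind of opaque "Job failed!" error.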

Generate only unfetched urls instead of scored Nutch 2.3

半世苍凉 · Submitted on 2019-12-12 03:27:28
Question: Is there any way to generate only the un-fetched URLs, instead of generating based on score, in Nutch 2.x?
Answer 1: Well, for Nutch 1.x you could use the JEXL support that has shipped since Nutch 1.12 (I think):

$ bin/nutch generate -expr "status == db_unfetched"

With this command you ensure that only URLs with a db_unfetched status are considered when generating the segments that you want to crawl. This feature is still not available on the 2.x branch, but writing a custom GeneratorJob could do the trick.

Hadoop HBase Pseudo mode - RegionServer disconnects after some time

淺唱寂寞╮ · Submitted on 2019-12-11 21:26:47
Question: Please find the attached screenshot of the HBase master log. I have tried all sorts of settings, yet I couldn't overcome this issue. I made sure I don't have 127.0.1.1 in my /etc/hosts. I am using Apache Hadoop 0.20.205.0 and Apache HBase 0.90.6 in pseudo-distributed mode, with Nutch 2.2.1 storing crawled data in HBase via the bin/crawl all-in-one command. Please help!
Answer 1: Try killing the master, then start it up again... the dead-server state is in memory... hope …
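Region servers that register with the master under a loopback alias are a classic cause of this "disconnects after some time" pattern. A hedged /etc/hosts sketch (the hostname `hbase-host` and the address are placeholders, not taken from the question):

```
127.0.0.1      localhost
# deliberately no 127.0.1.1 alias line: a region server that
# registers under a loopback alias can later be marked dead
# by the master when heartbeats resolve differently
192.168.1.10   hbase-host
```

The machine's real hostname should resolve to a stable non-loopback address, and HBase and Hadoop should both be restarted after the change so cached resolutions are dropped.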

Sites are crawled even when the URL is removed from seed.txt (Nutch 2.1)

随声附和 · Submitted on 2019-12-11 20:34:31
Question: I performed a successful crawl with url-1 in seed.txt and I could see the crawled data in the MySQL database. When I then tried to perform another fresh crawl by replacing url-1 with url-2 in seed.txt, the new crawl started with the fetching step, and the URLs it was trying to fetch were the old ones that had been replaced in seed.txt. I am not sure where it picked up the old URL. I checked for hidden seed files but didn't find any, and there is only one file, urls/seed.txt, in NUTCH_HOME/runtime/local …
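In Nutch 2.x the injected URLs live in the storage backend, not in seed.txt, so replacing the seed file does not make the crawler forget previously injected URLs. A hedged sketch of wiping the old crawl state before re-injecting, assuming the default Gora-over-MySQL schema (the `nutch` database and `webpage` table names are assumptions):

```sql
-- drop all previously injected/fetched pages so the next
-- inject starts from a clean state
TRUNCATE TABLE nutch.webpage;
```

After clearing the table, re-running inject with the new seed.txt should fetch only url-2.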

Spell Checker in Nutch 1.0

*爱你&永不变心* · Submitted on 2019-12-11 20:11:12
Question: Can anyone tell me how to implement a spell checker in Nutch 1.0?
Answer 1: Can anyone tell me how to use the spell-check query plugin available in the contrib/web2 dir (and even the rest of the plugins too)? Is it similar to enabling the nutch-plugins? First hit on Google ("nutch spell checker"), and located on the Apache Nutch project pages...
Source: https://stackoverflow.com/questions/3115422/spell-checker-in-nutch-1-0

Apache nutch 1.15 installing and running issues

无人久伴 · Submitted on 2019-12-11 17:56:10
Question: I am trying to run Apache Nutch 1.15 (local) on Windows 10. I have followed the steps described at https://wiki.apache.org/nutch/NutchTutorial and https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial. When I try to inject the URLs using this command in Cygwin:

bin/nutch inject crawl/crawldb urls

I get this error:

Injector: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\INFO\Desktop\apache-nutch1.15\runtime\local\crawl\crawldb\.locked

when put …

Nutch not working on Windows 10

三世轮回 · Submitted on 2019-12-11 17:07:18
Question: I'm trying to crawl with Nutch, Elasticsearch, and MongoDB. I went through some useful tutorials and SO questions/answers, such as the following, but I still cannot run Nutch:

Search Engine with Apache Nutch, MongoDB and Elasticsearch
Scraping the Web with Nutch for Elasticsearch

And I saw this question: nutch-does-not-index-on-elasticsearch-correctly-using-mongodb. I built Nutch with ant, but when I run Nutch at the command prompt with the .\nutch command, it does not show anything in the command …

Nutch Solr dataimport handler?

会有一股神秘感。 · Submitted on 2019-12-11 12:53:49
Question: I have set up a Nutch crawler on top of Hadoop. Below is the software stack with respective versions: apache-nutch-2.3.1 and hbase-0.98.8-hadoop2, both on top of hadoop-2.5.2. Everything up to the data insertion into HBase works fine. The problem is that when I invoke the IndexingJob using the class org.apache.nutch.indexer.IndexingJob, the command runs successfully but no records get indexed in Solr. The Solr version is solr-5.3.1. Below is the output of the command I ran:

15/12/15 18:26:32 INFO …
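When the job "succeeds" but Solr stays empty, the indexer is often pointed at the wrong Solr core, or the Solr indexer plugin isn't enabled at all. A hedged nutch-site.xml sketch for the Nutch 2.x indexer-solr plugin (the core name `nutch` is a placeholder, and the plugin list is an assumption about a typical setup):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
</property>
<property>
  <name>solr.server.url</name>
  <!-- must name an existing core/collection, not just the Solr root -->
  <value>http://localhost:8983/solr/nutch</value>
</property>
```

It is also worth re-running the job with a matching batch id (or `-all`): indexing a batch id that doesn't match what was fetched and parsed completes "successfully" with zero documents.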