
Nutch 1.2 - Why won't nutch crawl url with query strings?

£可爱£侵袭症+ · Submitted on 2019-12-12 06:24:14
Question: I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filter in the crawl-urlfilter.txt file so it now looks like this:

# skip urls with these characters
#-[]
# skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed the filters and am telling Nutch to accept all URLs it finds on my website. Does anyone …
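Commenting out those two rules alone is often not enough: in a typical Nutch 1.x crawl-urlfilter.txt there is a separate character-class rule that skips probable query URLs. A hedged sketch of the relevant lines (rule text assumed from a stock default file, not quoted from the question):

```
# skip URLs containing certain characters as probable queries, etc.
# commented out here so URLs with query strings are kept:
#-[?*!@=]

# accept everything else
+.
```

If that `-[?*!@=]` line (or its twin in regex-urlfilter.txt) is still active, URLs containing `?` or `=` are dropped before fetching regardless of the other rules.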

Regular expression to match a URL with 6 or more levels

不想你离开。 · Submitted on 2019-12-12 05:25:53
Question: I am trying to match a URL with 6 or more levels or sub-paths, e.g.:

http://www.domain.com/level1/level2/level3/level4/level5/level6/level7/level8/level9/level10/level11/level12.html

I came up with the expression

^http:\/\/([a-zA-Z\.-]*)\W(\b\w+\b)

which matches level1 (demo). However, when I try to match a URL with six or more levels it doesn't seem to work:

^http:\/\/([a-zA-Z\.-]*)\W(\b\w+\b){6,}

(demo)

Answer 1: I think this is what you were trying for: ^http://([a-zA-Z.-]+)/(?:[^/]+/){6 …
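The answer's pattern is cut off above; assuming it ends in a `{6,}` bounded repetition (my completion, not the original answer's text), a quick sketch of how it behaves:

```python
import re

# Hypothetical completion of the truncated pattern: a host, then at
# least six slash-terminated path segments.
pattern = re.compile(r'^http://([a-zA-Z.-]+)/(?:[^/]+/){6,}')

deep = ('http://www.domain.com/level1/level2/level3/level4/level5/'
        'level6/level7/level8/level9/level10/level11/level12.html')
shallow = 'http://www.domain.com/level1/level2/level3.html'

# deep has 11 slash-terminated segments before the final file name,
# shallow has only 2, so only deep satisfies the {6,} repetition.
print(bool(pattern.match(deep)))
print(bool(pattern.match(shallow)))
```

The key difference from the question's attempt is that `(?:[^/]+/)` repeats a whole segment-plus-slash unit, whereas `(\b\w+\b){6,}` tries to repeat bare words with no separator in between.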

Elasticsearch indexing fails after successful Nutch crawl

匆匆过客 · Submitted on 2019-12-12 04:01:21
Question: I'm not sure why, but Nutch 1.13 is failing to index the data to ES (v2.3.3). The crawling itself is fine, but when it comes time to index to ES it gives me this error message:

Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache …
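A failed IndexingJob with a healthy crawl usually points at the indexer plugin's configuration rather than the crawl itself. A hedged nutch-site.xml sketch for the Nutch 1.x indexer-elastic plugin (the `elastic.*` property names and values here are assumptions drawn from that plugin's conventions, not from the question):

```xml
<!-- enable the elastic indexer alongside the usual plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value> <!-- transport port, not the 9200 HTTP port -->
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>
```

A mismatched cluster name or pointing at the HTTP port instead of the transport port are common causes of exactly this kind of opaque "Job failed!" error.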

Generate only unfetched urls instead of scored Nutch 2.3

半世苍凉 · Submitted on 2019-12-12 03:27:28
Question: Is there any way to generate only the un-fetched URLs, instead of generating based on score, in Nutch 2.x?
Answer 1: Well, for Nutch 1.x you could use the JEXL support that has shipped since Nutch 1.12 (I think):

$ bin/nutch generate -expr "status == db_unfetched"

With this command you ensure that only URLs with a db_unfetched status are considered when generating the segments that you want to crawl. This feature is still not available on the 2.x branch, but writing a custom GeneratorJob could do the trick.

Hadoop HBase Pseudo mode - RegionServer disconnects after some time

淺唱寂寞╮ · Submitted on 2019-12-11 21:26:47
Question: Please find the attached screenshot of the HBase master log. I have tried all sorts of settings, yet I couldn't overcome this issue. I made sure I don't have 127.0.1.1 in my /etc/hosts. I am using Apache Hadoop 0.20.205.0 and Apache HBase 0.90.6 in pseudo-distributed mode, with Nutch 2.2.1 storing crawled data in HBase via the bin/crawl all-in-one command. Please help!
Answer 1: Try killing the master, then start it up again... the dead-server state is in memory... hope …
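Region servers that register with the master under a loopback alias are a classic cause of this "disconnects after some time" pattern. A hedged /etc/hosts sketch (the hostname `hbase-host` and the address are placeholders, not taken from the question):

```
127.0.0.1      localhost
# deliberately no 127.0.1.1 alias line: a region server that
# registers under a loopback alias can later be marked dead
# by the master when heartbeats resolve differently
192.168.1.10   hbase-host
```

The machine's real hostname should resolve to a stable non-loopback address, and HBase and Hadoop should both be restarted after the change so cached resolutions are dropped.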

Sites are crawled even when the URL is removed from seed.txt (Nutch 2.1)

随声附和 · Submitted on 2019-12-11 20:34:31
Question: I performed a successful crawl with url-1 in seed.txt and I could see the crawled data in the MySQL database. When I then tried to perform another fresh crawl by replacing url-1 with url-2 in seed.txt, the new crawl started with the fetching step, and the URLs it was trying to fetch were the old ones that had been replaced in seed.txt. I am not sure where it picked up the old URL. I checked for hidden seed files but didn't find any, and there is only one file, urls/seed.txt, in NUTCH_HOME/runtime/local …
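In Nutch 2.x the injected URLs live in the storage backend, not in seed.txt, so replacing the seed file does not make the crawler forget previously injected URLs. A hedged sketch of wiping the old crawl state before re-injecting, assuming the default Gora-over-MySQL schema (the `nutch` database and `webpage` table names are assumptions):

```sql
-- drop all previously injected/fetched pages so the next
-- inject starts from a clean state
TRUNCATE TABLE nutch.webpage;
```

After clearing the table, re-running inject with the new seed.txt should fetch only url-2.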

Spell Checker in Nutch 1.0

*爱你&永不变心* · Submitted on 2019-12-11 20:11:12
Question: Can anyone tell me how to implement a spell checker in Nutch 1.0?
Answer 1: Can anyone tell me how to use the spell-check query plugin available in the contrib/web2 dir (and even the rest of the plugins too)? Is it similar to enabling the nutch-plugins? First hit on Google ("nutch spell checker"), and located on the Apache Nutch project pages...
Source: https://stackoverflow.com/questions/3115422/spell-checker-in-nutch-1-0

Apache nutch 1.15 installing and running issues

无人久伴 · Submitted on 2019-12-11 17:56:10
Question: I am trying to run Apache Nutch 1.15 (local) on Windows 10. I have followed the steps described at https://wiki.apache.org/nutch/NutchTutorial and https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial. When I try to inject the URLs using this command in Cygwin:

bin/nutch inject crawl/crawldb urls

I get this error:

Injector: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\INFO\Desktop\apache-nutch1.15\runtime\local\crawl\crawldb\.locked

when put …

Nutch not working on Windows 10

三世轮回 · Submitted on 2019-12-11 17:07:18
Question: I'm trying to crawl with Nutch, Elasticsearch, and MongoDB. I went through some useful tutorials and SO questions/answers, such as the following, but I still cannot run Nutch:

Search Engine with Apache Nutch, MongoDB and Elasticsearch
Scraping the Web with Nutch for Elasticsearch

And I saw this question: nutch-does-not-index-on-elasticsearch-correctly-using-mongodb. I built Nutch with ant, but when I run Nutch at the command prompt with the .\nutch command, it does not show anything in the command …

Nutch Solr dataimport handler?

会有一股神秘感。 · Submitted on 2019-12-11 12:53:49
Question: I have set up a Nutch crawler on top of Hadoop. Below is the software stack with respective versions: apache-nutch-2.3.1 and hbase-0.98.8-hadoop2, both on top of hadoop-2.5.2. Everything up to the data insertion into HBase works fine. The problem is that when I invoke the IndexingJob using the class org.apache.nutch.indexer.IndexingJob, the command runs successfully but no records get indexed in Solr. The Solr version is solr-5.3.1. Below is the output of the command I ran:

15/12/15 18:26:32 INFO …
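When the job "succeeds" but Solr stays empty, the indexer is often pointed at the wrong Solr core, or the Solr indexer plugin isn't enabled at all. A hedged nutch-site.xml sketch for the Nutch 2.x indexer-solr plugin (the core name `nutch` is a placeholder, and the plugin list is an assumption about a typical setup):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
</property>
<property>
  <name>solr.server.url</name>
  <!-- must name an existing core/collection, not just the Solr root -->
  <value>http://localhost:8983/solr/nutch</value>
</property>
```

It is also worth re-running the job with a matching batch id (or `-all`): indexing a batch id that doesn't match what was fetched and parsed completes "successfully" with zero documents.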