Nutch

Nutch crawl: no error, but the result is nothing

Submitted by 家住魔仙堡 on 2019-12-21 21:39:56
Question: I am trying to crawl some URLs with Nutch 2.1 as follows: bin/nutch crawl urls -dir crawl -depth 3 -topN 5 (see http://wiki.apache.org/nutch/NutchTutorial). There is no error, but the folders below are not created: crawl/crawldb, crawl/linkdb, crawl/segments. Can anyone help me? I have not been able to resolve this for two days. Thanks a lot! The output is as follows. FetcherJob: threads: 10 FetcherJob: parsing: false FetcherJob: resuming: false FetcherJob: timelimit set for: -1 Using queue mode: byHost
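The likely explanation, given the 2.x storage model, is that Nutch 2.1 does not write crawl/crawldb, crawl/linkdb and crawl/segments to disk the way Nutch 1.x does; the crawl data goes into the storage backend configured through Apache Gora (HBase is the backend used in the Nutch 2.x tutorial). A quick way to check whether anything was actually stored, assuming the HBase backend and the default 'webpage' table name:

# list HBase tables; the default Nutch 2.x setup stores crawl data in 'webpage'
echo "list" | hbase shell
# peek at a couple of rows (output can be large)
echo "scan 'webpage', {LIMIT => 2}" | hbase shell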

How to run the Nutch server in a distributed environment

Submitted by ╄→гoц情女王★ on 2019-12-21 21:28:00
Question: I have tested running Nutch in server mode by starting it locally with the bin/nutch startserver command. Now I wonder whether I can start Nutch in server mode on top of a Hadoop cluster (in a distributed environment) and submit crawl requests to the server using the Nutch REST API. Please help. Answer 1: After further research I got the Nutch server working in distributed mode. Steps: assume Hadoop is configured on all slave nodes, then set up Nutch on all nodes. This can help: http://wiki.apache.org
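A minimal sketch of driving that setup, assuming the default REST port 8081 and the /job/create endpoint described on the Nutch wiki; the crawl id, config id and seed path are placeholders, and the exact key for the seed argument may differ between versions:

# start the REST server on a node that has the Hadoop client configuration
bin/nutch startserver -port 8081
# submit an INJECT job over the REST API
curl -X POST http://localhost:8081/job/create \
  -H 'Content-Type: application/json' \
  -d '{"crawlId":"crawl-01","type":"INJECT","confId":"default","args":{"seedDir":"urls/"}}'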

Where is the crawled data stored when running the Nutch crawler?

Submitted by 不羁岁月 on 2019-12-21 17:58:26
Question: I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis. I followed https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr since I may need to search text in the future) and ran the crawl using a few URLs as the seed. Now I cannot find the text/HTML data on my local machine. Where can I find the data, and what is the best way to read it in text format? Versions: apache-nutch-1.9, solr-4.10.4. Answer 1: After your
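In Nutch 1.x the fetched content lives in the segments under the crawl directory, and the readseg tool can dump it to plain text. A minimal sketch (the segment timestamp is a placeholder):

# list the segments produced by the crawl
ls crawl/segments/
# dump one segment into a human-readable text file
bin/nutch readseg -dump crawl/segments/20150402123456 segment_dump
less segment_dump/dump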

Integration between Nutch 1.11 (1.x) and Solr 5.3.1 (5.x)

Submitted by 旧巷老猫 on 2019-12-21 17:37:16
Question: I have just started using Nutch 1.11 and Solr 5.3.1. I want to crawl data with Nutch, then index it and prepare it for searching with Solr. I know how to crawl data from the web using Nutch's bin/crawl command, and I successfully got a lot of data from a website onto my local machine. I also started a new Solr server locally with the command below, run from the Solr root folder: bin/solr start And I started the example "files" core under the example folder with: bin/solr create -c files -d example/files/conf And I can
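To connect the two, Nutch's indexer is pointed at the Solr core through the solr.server.url property when running bin/crawl with indexing enabled. A hedged sketch, reusing the "files" core from the question and assuming its schema accepts the fields Nutch sends (the schema.xml shipped in Nutch's conf/ directory is the usual starting point):

# crawl two rounds and index into the local Solr core (URL and core name are placeholders)
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/files urls/ crawl/ 2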

Web Crawler Algorithm: depth?

Submitted by 空扰寡人 on 2019-12-21 05:32:06
Question: I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch for example (http://wiki.apache.org/nutch/NutchTutorial): depth indicates the link depth from the root page that should be crawled. So, say I have the domain www.domain.com and want to crawl to a depth of, say, 3 -- what do I need to do? If a site could be represented as a binary tree, then it wouldn't be a problem, I think. Answer 1: Link depth means the number of "hops" a page is away from the root,
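Concretely, with Nutch's -depth 3 each round fetches one more level of links starting from the seed, roughly:

round 1: fetch the seed itself, e.g. http://www.domain.com/
round 2: fetch pages linked directly from the seed (one hop away)
round 3: fetch pages linked from the round-2 pages (two hops away)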

Nutch versus Solr

Submitted by 穿精又带淫゛_ on 2019-12-21 00:47:13
Question: I am currently collecting information on whether I should use Nutch with Solr (domain: vertical web search). Could you advise me? Answer 1: Nutch is a framework for building web crawlers and search engines. Nutch can do the whole process, from collecting the web pages to building the inverted index, and it can also push those indexes to Solr. Solr is mainly a search engine with support for faceted search and many other neat features. But Solr doesn't fetch the data; you have to feed it. So maybe the first thing
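To make the division of labour concrete: Solr only answers queries over documents it has been given, whether they come from Nutch's indexer or are posted by hand. A small sketch against a hypothetical core named "vertical", assuming it was created with the default data-driven configset:

# feed Solr a document yourself (it will not fetch anything on its own) ...
curl 'http://localhost:8983/solr/vertical/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc1","title":"fed by hand, or by Nutch"}]'
# ... then search what was fed
curl 'http://localhost:8983/solr/vertical/select?q=title:nutch&wt=json'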

Nutch not crawling URLs except the one specified in seed.txt

Submitted by 吃可爱长大的小学妹 on 2019-12-19 11:20:17
Question: I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want any page to be crawled that doesn't have "abc-def" in the URL, I have put the following line in regex-urlfilter.txt: +^https://www.mywebsite.com/abc-def/(.+)*$ When I try to run the following crawl command: /bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3
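For reference, a hedged sketch of what a restrictive regex-urlfilter.txt usually looks like for this kind of setup; rules are applied top to bottom and the first match wins, so the catch-all reject must come last and the stock accept-everything rule (+.) must not sit above it:

# keep only URLs under the abc-def section (pattern adapted from the question)
+^https://www\.mywebsite\.com/abc-def/
# reject everything else
-.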

Nutch API advice

Submitted by 喜欢而已 on 2019-12-19 02:49:26
Question: I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that can save the data to disk, and I need it to be able to recrawl only the updated resources of a site and skip the parts that have already been crawled. Does anyone have experience working with the Nutch code directly in Java, not via the command line? I would like to start simple: create a crawler (or
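For orientation, a hedged sketch of the individual steps the bin/crawl wrapper drives in Nutch 1.x; each command corresponds to a job class (Injector, Generator, Fetcher, ParseSegment, CrawlDb) that can also be invoked from Java code. Paths and the -topN value are placeholders:

bin/nutch inject crawl/crawldb urls/                        # seed the crawl database
bin/nutch generate crawl/crawldb crawl/segments -topN 50    # select URLs to fetch
SEGMENT=$(ls -d crawl/segments/* | tail -1)                 # newest segment
bin/nutch fetch $SEGMENT                                    # fetch the pages
bin/nutch parse $SEGMENT                                    # parse the fetched content
bin/nutch updatedb crawl/crawldb $SEGMENT                   # fold results back into the crawldb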

Recrawl URL with Nutch just for updated sites

Submitted by 社会主义新天地 on 2019-12-18 15:52:43
Question: I crawled one URL with Nutch 2.1, and now I want to re-crawl pages after they get updated. How can I do this? How can I know that a page has been updated? Answer 1: Simply put, you can't. You need to re-fetch a page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them within a time period. For that you need a job scheduler such as Quartz. You need to write a function that compares the pages. However, Nutch originally saves the pages as index files. In other
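On the scheduling side, Nutch itself has knobs for this. A minimal sketch for conf/nutch-site.xml, assuming the standard db.fetch.* properties apply to your 2.1 install: lower the default re-fetch interval and switch to the adaptive schedule, which shortens or lengthens a page's interval depending on whether its content signature changed between fetches.

<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value> <!-- consider pages due for re-fetch after one day instead of the 30-day default -->
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>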
