nutch | 易学教程

Using Solr for indexing HTML tags with attributes

Question: I have crawled websites using Nutch and pushed the crawled data to Solr. Now I want to search the content inside specific tags that have specific attribute values. For example:

<h><title> title to search </title></h>
<div id="abc"> content to search </div>
<div class="efg"> other content to search </div>

I have seen this question (how to parse html with nutch and index specific tag to solr?), but it is not clear enough. I want to know whether there is any plugin available or I need to
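
To make the goal of the question concrete, here is a minimal standalone sketch of the extraction itself: pulling the text out of elements that carry a given attribute value. It is not Nutch plugin code and does not talk to Solr; the class name `AttributeTextExtractor` and the helper `extract_text` are hypothetical names introduced only for this illustration, and it uses Python's standard `html.parser` rather than anything from Nutch.

```python
from html.parser import HTMLParser


class AttributeTextExtractor(HTMLParser):
    """Collect text inside elements of `tag` whose `attr` equals `value`.

    Hypothetical helper for illustration; not part of Nutch or Solr.
    """

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0    # > 0 while we are inside a matching element
        self.chunks = []  # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth > 0:
            self.depth += 1  # nested element with the same tag name
        elif dict(attrs).get(self.attr) == self.value:
            self.depth = 1   # entered a matching element

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, tag, attr, value):
    """Return the text found inside matching elements, joined by spaces."""
    parser = AttributeTextExtractor(tag, attr, value)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The markup from the question above.
page = (
    '<h><title> title to search </title></h>'
    '<div id="abc"> content to search </div>'
    '<div class="efg"> other content to search </div>'
)

print(extract_text(page, "div", "id", "abc"))
print(extract_text(page, "div", "class", "efg"))
```

The attribute comparison is an exact string match, so an element with several classes (for example class="efg xyz") would not match "efg" in this sketch. In a real setup the extracted text would still have to be mapped to its own field in the Solr schema to be searchable separately.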

How to run nutch 1.9 in eclipse on windows?

Question: I want to run Nutch 1.9 in Eclipse on Windows. I followed the tutorial at http://wiki.apache.org/nutch/RunNutchInEclipse and opened the project in Eclipse. But when I run Nutch, I get the following error:

2014-09-19 17:45:48,039 INFO crawl.Injector (Injector.java:inject(283)) - Injector: starting at 2014-09-19 17:45:48
2014-09-19 17:45:48,043 INFO crawl.Injector (Injector.java:inject(284)) - Injector: crawlDb: K:/kumar/Nutch/apache-nutch-1.9/crawlresult
2014-09-19 17:45:48,043 INFO crawl

Apache Nutch 1.12 with Apache Solr 6.2.1 give an error

Question: I am using Apache Nutch 1.12 and Apache Solr 6.2.1 to crawl data on the internet and index it, and the combination gives an error:

java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down

I have done the following, as I learned from the Nutch tutorial (https://wiki.apache.org/nutch/NutchTutorial):

- copied Nutch's schema.xml and placed it in Solr's config folder
- placed a seed URL (of a newspaper company) in urls/seed.txt of Nutch
- changed the http.content.limit value to "

run nutch2.3.1 on hadoop2

Question: I want to run Nutch 2.3.1 to crawl data on Hadoop 2. I have 3 nodes for Hadoop 2:

crawler1: master
crawler2: slave
crawler3: slave

I deployed Nutch 2.3.1 to crawler1 and ran it with the following command:

/usr/local/nutch/deploy/bin/crawl hdfs://xxx.xxx.xxx.xxx/urls/seed.txt test 5

It works and can crawl data, but it looks like the crawl job only runs on crawler1; the other nodes did not do any work for Nutch. My questions are:

- Do I need to deploy Nutch to crawler2 and crawler3?
- Do I need to run the crawl command

Apache Nutch 2.1 - How get complete source code

Question: I am trying to write my own Nutch plugin for crawling webpages. The problem is that I need to identify whether there is some special tag, e.g. on the webpage. The official documentation notes that this is possible using Document.getElementsByTagName("foo"), but this is not working for me. Do you have any idea? My second question: if I identify the tag above, I would like to get some other tags from the webpage where the tag was identified... Is there any way to store the complete source

Nutch 2.1 (HBase, SOLR) with Amazon Web Services

Question: I have tried Nutch 2.1 locally without any difficulty. I have also tried it on a 3-machine distributed cluster. We are now discussing whether to run it on Amazon Web Services. I do not have much experience with AWS. My question is: is it possible and necessary to run the Nutch 2.1 crawling and indexing parts in the cloud? What advantages and disadvantages would we have? Thanks.

Answer 1: If you have a cluster with the same capacity as that of an AWS cluster (that you plan to invest in)

Nutch job failing when sending data to Solr

Question: I've been trying various things to no avail. My configuration of Nutch/Solr is based on this: http://ubuntuforums.org/showthread.php?t=1532230 Now that I have Nutch and Solr up and running, I would like to use Solr to index the crawl data. Nutch successfully crawls the domain I specified but fails when I run the command to send that data to Solr. Here's the command:

bin/nutch solrindex http://solr:8181/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Here's the output:

Indexer:

How to crawl images in Nutch?

Question: How to crawl images in Nutch? Or is there any other open search engine that produces results with images?

Answer 1: Change regex-urlfilter.txt in conf. It contains this exclusion rule:

-\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|exe|EXE|js|JS|gif|GIF|png|PNG|jpg|JPG|jpeg|JPEG|bmp|BMP|mpg|MPG|mov|MOV)$

Delete jpeg, jpg, gif, or whichever picture types you want to crawl from that rule. Then, in suffix-urlfilter.txt in conf, add # in front of jpeg, gif, or png to comment those entries out. That worked for me!

Source: https:/
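
The effect of editing that rule can be sketched with a simplified model of the filter. This is an illustration only, not Nutch code: it assumes rules are checked in order, that the first pattern found anywhere in the URL decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected. The helper names `build_rules` and `accepts`, the shortened rule lists, and the example.com URLs are all hypothetical.

```python
import re


def build_rules(lines):
    """Parse filter lines of the form '+regex' or '-regex'.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is rejected


# Shortened version of the exclusion rule, with image suffixes still listed.
DEFAULT = build_rules([
    r"-\.(ico|ICO|css|CSS|gif|GIF|png|PNG|jpg|JPG|jpeg|JPEG|bmp|BMP)$",
    "+.",  # accept anything else
])

# Same rule after deleting the image suffixes that should be crawled.
WITH_IMAGES = build_rules([
    r"-\.(ico|ICO|css|CSS|bmp|BMP)$",
    "+.",
])

for url in ("http://example.com/logo.png", "http://example.com/site.css"):
    print(url, accepts(DEFAULT, url), accepts(WITH_IMAGES, url))
```

Under these assumptions the .png URL is rejected by the first rule set and accepted by the second, while the .css URL is rejected by both. Note the escaped dot in `\.`: an unescaped `.` would match any character before the suffix, not just a literal dot.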

Solr 6 and Nutch 2.3.1 integration

Question: According to the Nutch news, the latest version of Nutch is 2.3.1, which is compatible with Solr 4.10.3, a very old version of Solr. Can we integrate Solr 6 with Nutch 2.3.1? What would the drawbacks be if Solr 6 were integrated? Has anybody tried this?

Answer 1: This is an old question, but I just got Nutch 1.12 talking to Solr 6.3.0. The required schema/solrconfig changes should be the same for Nutch 2.x, so here's what I did: Download and extract both products into some directory, e.g. ~/mycrawler, then go

Can't access hadoop web ui for job tracker [closed]

Question: I'm trying to set up Hadoop and Nutch to run on EC2. To get started, I have followed the excellent NutchHadoopTutorial. Most everything works as it