nutch

Removing menus from HTML during crawl or indexing with Nutch and Solr

别等时光非礼了梦想 · Submitted 2019-12-03 20:06:30
I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, there are several menu structures across the site that get indexed and spoil the results of a query. Each of these menus is clearly defined in a DIV, e.g. <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and several others. I need to, at some point, delete the content in these DIVs. I am guessing that the right place is during indexing by Solr, but I cannot work out how. A pattern would look something like (<div id="calendar">).*?(<\/div>), but I cannot get that to work in
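One common reason a pattern like the one above fails is that `.*?` does not match newlines by default; with the DOTALL flag the div contents can be stripped from the page text before it reaches the indexer. A minimal sketch (the div ids come from the question; the helper name is illustrative, this is not a Nutch or Solr API, and a naive regex will not handle nested divs):

```python
import re

# Menus to drop, identified by their div ids (taken from the question).
MENU_IDS = ("RHBOX", "calendar")

def strip_menus(html: str) -> str:
    """Remove the contents of known menu divs before indexing.

    Caveat: the non-greedy .*? stops at the FIRST </div>, so this only
    works when the menu divs contain no nested <div> elements.
    """
    for div_id in MENU_IDS:
        pattern = re.compile(
            r'<div id="%s">.*?</div>' % div_id,
            re.DOTALL,  # let .*? match across newlines
        )
        html = pattern.sub("", html)
    return html

page = '<p>keep</p>\n<div id="calendar">\nMon\nTue\n</div>\n<p>keep too</p>'
print(strip_menus(page))
```

For production use, an HTML parser or a custom Nutch HtmlParseFilter plugin is more robust than a regex, but the sketch shows why the single-line pattern silently matched nothing.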

Building and configuring a Nutch search engine

前提是你 · Submitted 2019-12-03 18:00:13
Recently my company needed to build a search engine, and I discovered Apache's Nutch. After reading quite a few articles, I set up a local instance for testing. I found that crawling an intranet works fairly well, but crawling the open Internet is still problematic: for sites like Baidu and Google, the crawler can barely fetch any page information. Building and configuring the Nutch search engine. Test environment: VMware 6.0, Red Hat 5.1. Software: apache-tomcat-6.0.29.tar.gz, nutch-1.0.tar.gz, jdk-6u21-linux-i586.bin. Nutch overview: Nutch's crawler fetches web pages in two ways. One is Intranet Crawling, aimed at corporate intranets or a small number of sites, which uses the crawl command; the other is Whole-web Crawling, aimed at the entire Internet, which uses lower-level commands such as inject, generate, fetch, and updatedb. This document covers the basic usage of Intranet Crawling. Install the JDK: # cp jdk-6u21-linux-i586.bin /usr/java # cd /usr/java # chmod +x jdk-6u21-linux-i586.bin # ./jdk-6u21-linux-i586.bin # vi /etc/profile // add the following Java environment variables: JAVA_HOME=/usr/java/jdk1.6.0
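The profile additions that the walkthrough truncates typically look like the following sketch; the exact install path `jdk1.6.0_21` is an assumption based on the jdk-6u21 installer version, so adjust it to what the installer actually created:

```shell
# /etc/profile additions for the JDK (install path assumed from jdk-6u21)
JAVA_HOME=/usr/java/jdk1.6.0_21
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH
```

After editing, run `source /etc/profile` and confirm with `java -version` before building Nutch.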

Nutch does not index on Elasticsearch correctly using MongoDB

Anonymous (unverified) · Submitted 2019-12-03 08:28:06
Question: I am running Nutch 2.3.1, MongoDB 3.2.9, and Elasticsearch 2.4.1. I have followed a mix of this tutorial: https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch and this tutorial: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/ in order to create a web crawling tool using those three aforementioned pieces of software. Everything works great until it comes down to indexing... as soon as I use the index command from Nutch: # bin/nutch index elasticsearch -all this happens: IndexingJob: starting Active
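With this Nutch 2.x / Elasticsearch 2.x pairing, indexing failures are frequently down to the transport settings in conf/nutch-site.xml. A sketch of the relevant indexer-elasticsearch properties follows; the values are illustrative, the cluster name must match `cluster.name` in your elasticsearch.yml, and note that the Java transport port is 9300, not the 9200 HTTP port:

```xml
<!-- nutch-site.xml: indexer-elasticsearch settings (illustrative values) -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value> <!-- transport port, not the 9200 HTTP port -->
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value> <!-- must match cluster.name in elasticsearch.yml -->
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
```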

Nutch versus Solr

喜你入骨 · Submitted 2019-12-03 07:01:06
Currently collecting information on whether I should use Nutch with Solr (domain: vertical web search). Could you advise me? Nutch is a framework for building web crawlers and search engines. Nutch can do the whole process, from collecting the web pages to building the inverted index, and it can also push those indexes to Solr. Solr is mainly a search engine with support for faceted search and many other neat features. But Solr doesn't fetch the data; you have to feed it. So maybe the first thing you have to ask, in order to choose between the two, is whether or not you already have the data to be indexed

Apache Nutch 2.1 different batch id (null)

孤街浪徒 · Submitted 2019-12-03 06:59:26
I crawl a few sites with Apache Nutch 2.1. While crawling, I see the following message on a lot of pages, e.g.: Skipping http://www.domainname.com/news/subcategory/111111/index.html ; different batch id (null). What causes this error, and how can I resolve the problem? The pages with "different batch id (null)" are not stored in the database. The site I crawled is based on Drupal, but I have tried many other, non-Drupal sites. [Answer:] I think the message is not a problem. A batch id is not assigned to every URL, so if the batch id is null, the URL is skipped; it will be fetched once the generate step assigns a batch id to it.
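The mechanism the answer describes can be illustrated with a small sketch (a hypothetical data model, not Nutch's actual code): the fetch step only processes rows stamped with the current batch id, so rows whose batch id is still null are skipped until a later generate step claims them.

```python
# Illustrative model of Nutch 2.x batch-id filtering (not Nutch's real code).

def generate(rows, batch_id, limit):
    """Stamp up to `limit` unfetched rows with the current batch id."""
    stamped = 0
    for row in rows:
        if row["batch_id"] is None and stamped < limit:
            row["batch_id"] = batch_id
            stamped += 1
    return rows

def fetch(rows, batch_id):
    """Fetch only rows belonging to this batch; skip the rest."""
    fetched, skipped = [], []
    for row in rows:
        if row["batch_id"] == batch_id:
            fetched.append(row["url"])
        else:
            # This is the point where Nutch logs "different batch id (null)".
            skipped.append(row["url"])
    return fetched, skipped

rows = [{"url": f"http://example.com/{i}", "batch_id": None} for i in range(4)]
generate(rows, "batch-1", limit=2)
fetched, skipped = fetch(rows, "batch-1")
print(fetched)
print(skipped)
```

In other words, the skipped URLs are not lost; they simply were not part of the current generate/fetch cycle.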

How to produce massive amount of data?

孤者浪人 · Submitted 2019-12-03 06:55:09
I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, then 500 GB, and eventually reach 1-2 TB. The problem is that I don't have this amount of data, so I'm thinking of ways to produce it. The data itself can be of any kind. One idea is to take an initial set of data and duplicate it, but that's not good enough because I need files that are different from one another (identical files are ignored). Another idea is to write a program that will create files with dummy data. Any other ideas? [Answer by Iterator:] This may be a better question for the
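The dummy-data idea can be sketched as follows (file counts, sizes, and paths are illustrative); seeding the random generator differently per file guarantees the files differ, so identical-file deduplication will not discard them:

```python
import os
import random

def make_dummy_files(out_dir, n_files, size_bytes, seed=0):
    """Write n_files files of size_bytes each, all different from one another."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(n_files):
        rng = random.Random(seed + i)  # distinct seed -> distinct content
        path = os.path.join(out_dir, f"dummy_{i:05d}.txt")
        with open(path, "wb") as f:
            chunk = 64 * 1024
            remaining = size_bytes
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(bytes(rng.getrandbits(8) for _ in range(n)))
                remaining -= n
        paths.append(path)
    return paths

# e.g. 4 files of 1 MiB each; scale n_files and size_bytes up toward 20 GB
paths = make_dummy_files("/tmp/dummy_data", 4, 1024 * 1024)
```

For terabyte scale, the same generation can be run as a Hadoop job itself (in the spirit of the stock teragen example) so the data lands directly in HDFS instead of passing through one local disk.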

Adding URL parameter to Nutch/Solr index and search results

爷,独闯天下 · Submitted 2019-12-03 03:55:25
I can't find any hint on how to set up Nutch to NOT filter/remove my URL parameters. I want to crawl and index some pages where lots of content is hidden behind the same base URL (like /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3, and so on). The regex-normalize.xml only removes redundant stuff from the URL (like session IDs and a trailing ?), and regex-urlfilter.txt seems to have a wildcard for my host (+^http://$myHost/). The crawling works fine so far. Any ideas? cheers, mana EDIT: A part of the solution is hidden here: configuring nutch regex-normalize.xml # skip URLs containing certain
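The truncated comment at the end of the EDIT matches a rule that, in stock Nutch configs, lives in conf/regex-urlfilter.txt and silently drops every URL containing a query string. A sketch of the usual fix is to relax that rule so `?id=` parameters survive (the relaxed character class below is an illustrative choice, not an official recommendation):

```
# conf/regex-urlfilter.txt -- stock rule that discards query-string URLs:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
#
# Relaxed version: still reject wildcard/session-style junk,
# but allow ? and = so /news.jsp?id=1 style URLs pass the filter
-[*!@]
```

With the filter relaxed, regex-normalize.xml then decides which individual parameters (e.g. session ids) are stripped, so the two files need to agree.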

Solr indexing following a Nutch crawl fails, reports “Job Failed”

橙三吉。 · Submitted 2019-12-03 03:16:03
I have a site hosted on my local machine that I am attempting to crawl with Nutch and index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 per the instructions given on the Nutch site (http://wiki.apache.org/nutch/NutchTutorial), and I have Solr running in my browser without issue. I am running the following command: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 2 The crawl is working fine, but when it attempts to put the data into Solr, it fails with the following output: Indexer: starting at 2014-02-06 16:29:28 Indexer: deleting gone
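With this Nutch/Solr pairing, a "Job Failed" at the indexing step is very often a schema mismatch: Solr's example core does not know Nutch's fields until it is given Nutch's schema. A sketch of the tutorial-style fix (paths are illustrative and depend on where Solr and Nutch were unpacked; check hadoop.log for the underlying exception first):

```shell
# Illustrative paths -- adjust to your Solr/Nutch install locations.
cp $NUTCH_HOME/conf/schema-solr4.xml \
   $SOLR_HOME/example/solr/collection1/conf/schema.xml
# restart Solr, then index the already-crawled data into it:
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
   -linkdb crawl/linkdb crawl/segments/*
```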