nutch

SOLR4.2+NUTCH1.6

点点圈 submitted on 2019-12-05 04:40:22
1. Integrate Solr 4.2 with Nutch 1.6:
wget http://archive.apache.org/dist/lucene/solr/4.2.0/solr-4.2.0.tgz
tar -xzvf solr-4.2.0.tgz
cd solr-4.2.0/example
Copy the schema-solr4.xml file from Nutch's conf directory into solr/collection1/conf, rename it to schema.xml, and overwrite the existing file. Then edit solr/collection1/conf/schema.xml and add the following under <fields>:
<field name="_version_" type="long" indexed="true" stored="true"/>
2. Configure the "word" Chinese tokenizer for Solr 4.2. See the Solr plugin section of https://github.com/ysc/word
3. Run Solr 4.2. Start the server with:
java -jar start.jar &
The Solr 4.2 web interface is at http://host2:8983
4. Run Nutch to submit the index. Run the solrindex command:
bin/nutch solrindex http://host2:8983/solr data/crawldb -linkdb data
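Before running solrindex, a quick way to confirm the collection1 core came up with the new schema is a SolrJ ping. This is only a sketch: it assumes the solr-solrj 4.2.0 jar is on the classpath and that the host2 URL and core name above match your setup.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class SolrPingCheck {
    public static void main(String[] args) throws Exception {
        // Point at the collection1 core configured in step 1 (adjust host and port as needed).
        HttpSolrServer server = new HttpSolrServer("http://host2:8983/solr/collection1");
        SolrPingResponse ping = server.ping();
        // Status 0 means the core answered; anything else usually points to a schema or startup problem.
        System.out.println("ping status=" + ping.getStatus() + ", qtime=" + ping.getQTime() + "ms");
        server.shutdown();
    }
}

If the ping fails, check the Solr log for schema errors before looking at the Nutch side.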

How to select data from specific tags in nutch

守給你的承諾、 submitted on 2019-12-04 19:48:23
I am new to Apache Nutch and I would like to know whether it is possible to crawl a selected area of a web page. For instance, select a particular div and crawl the contents of that div only. Any help would be appreciated. Thanks!

You will have to write a plugin that implements HtmlParseFilter to achieve your goal. You will be doing some of the work yourself, such as parsing the specific section of the HTML, extracting the URLs you want, and adding them as outlinks. HtmlParseFilter implementation (the code below gives the general idea):

ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
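A slightly fuller sketch of such a filter is below. It is only an outline under a few assumptions not stated in the answer: the Nutch 1.x HtmlParseFilter interface, a hypothetical target div id ("article-body"), and a hypothetical parse-metadata key; adjust both to the pages you crawl.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch of an HtmlParseFilter that keeps only the text inside a chosen div.
public class DivExtractorFilter implements HtmlParseFilter {

  private static final String TARGET_DIV_ID = "article-body"; // hypothetical div id
  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    StringBuilder sb = new StringBuilder();
    collectDivText(doc, sb);
    Parse parse = parseResult.get(content.getUrl());
    if (parse != null && sb.length() > 0) {
      // Store the extracted text in the parse metadata under a custom key.
      parse.getData().getParseMeta().set("divText", sb.toString());
    }
    return parseResult;
  }

  // Walk the DOM; when the target div is found, append all of its text content.
  private void collectDivText(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "div".equalsIgnoreCase(node.getNodeName())
        && TARGET_DIV_ID.equals(((Element) node).getAttribute("id"))) {
      sb.append(node.getTextContent()).append(' ');
      return;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectDivText(children.item(i), sb);
    }
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

The plugin would still need its own plugin.xml descriptor and an entry in the plugin.includes property before Nutch picks it up.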

How to run nutch server on distributed environment

五迷三道 submitted on 2019-12-04 17:08:00
I have tested running Nutch in server mode by starting it locally with the bin/nutch startserver command. Now I wonder whether I can start Nutch in server mode on top of a Hadoop cluster (a distributed environment) and submit crawl requests to the server using the Nutch REST API? Please help.

From further research I've got the Nutch server working in distributed mode. Steps: assume Hadoop is configured on all slave nodes, then set up Nutch on all nodes. This can help: http://wiki.apache.org/nutch/NutchHadoopTutorial
On your namenode:
cd $NUTCH_HOME/runtime/deploy
bin/nutch startserver -port <port>
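Once the server is up, a plain HTTP GET from another machine is an easy way to confirm the REST endpoint is reachable before submitting crawl requests. The sketch below is an illustration only: namenode-host and port 8081 are placeholders for whatever you passed to startserver, and the /admin status path follows the Nutch 1.x REST API docs, so double-check it against your version.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NutchServerCheck {
    public static void main(String[] args) throws Exception {
        // Replace namenode-host and 8081 with the host and -port value used for startserver.
        URL url = new URL("http://namenode-host:8081/admin");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        System.out.println("HTTP status: " + conn.getResponseCode());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // server status output, if the endpoint responds
            }
        }
    }
}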

Extracting Fetched Web Pages from Nutch in a Map Reduce Friendly Format

自古美人都是妖i submitted on 2019-12-04 16:52:26
After a Nutch crawl in distributed (deploy) mode as follows:
bin/nutch crawl s3n://..... -depth 10 -topN 50000 -dir /crawl -threads 20
I need to extract each URL fetched along with its content in a map-reduce friendly format. Using the readseg command below, the contents are fetched, but the output format doesn't lend itself to being map-reduced:
bin/nutch readseg -dump /crawl/segments/* /output -nogenerate -noparse -noparsedata -noparsetext
Ideally the output should be in this format:
http://abc.com/1    content of http://abc.com/1
http://abc.com/2    content of http://abc.com/2
Any suggestions?
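One way to get there is to skip readseg and read the segment data directly: each segment's content directory holds Hadoop map files keyed by URL with Nutch Content values, so a small reader (or a MapReduce job over the same input) can re-emit them as URL/content pairs. The sketch below uses the older SequenceFile.Reader constructor and a made-up segment/part path; treat both as placeholders rather than the exact layout of every Nutch version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Data file inside one segment's content directory (adjust the path to your crawl).
    Path data = new Path("/crawl/segments/20130401123456/content/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      Content content = new Content();
      while (reader.next(url, content)) {
        // Emit "url <TAB> content" -- a map-reduce friendly key/value line.
        System.out.println(url + "\t" + new String(content.getContent(), "UTF-8"));
      }
    } finally {
      reader.close();
    }
  }
}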

Nutch Raw Html Saving

旧巷老猫 submitted on 2019-12-04 16:26:19
I'm trying to get the raw HTML of crawled pages into different files, each named after the URL of the page. Is it possible with Nutch to save the raw HTML pages in different files while ruling out the indexing part?

Tejas Patil: There is no direct way to do that. You will have to make a few code modifications. See this and this.

Source: https://stackoverflow.com/questions/10142592/nutch-raw-html-saving
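Whichever modification route you take, one small piece you will need is turning a URL into a safe file name and writing the page bytes out. A minimal sketch (the replacement pattern, the .html suffix, and the output directory are arbitrary choices for illustration, not anything Nutch prescribes):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawHtmlWriter {
  // Turn a URL into a file-system safe name, e.g. "http://abc.com/1" -> "http_abc.com_1.html".
  static String fileNameFor(String url) {
    return url.replaceAll("[^A-Za-z0-9.-]+", "_") + ".html";
  }

  // Write one page's raw HTML into the output directory under its URL-derived name.
  static void save(String url, byte[] rawHtml, Path outDir) throws IOException {
    Files.createDirectories(outDir);
    Files.write(outDir.resolve(fileNameFor(url)), rawHtml);
  }

  public static void main(String[] args) throws IOException {
    save("http://abc.com/1", "<html>example</html>".getBytes(StandardCharsets.UTF_8),
         Paths.get("raw-pages"));
  }
}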

Nutch crawl no error, but result is nothing

本小妞迷上赌 submitted on 2019-12-04 15:46:03
I tried to crawl some URLs with Nutch 2.1 as follows:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
http://wiki.apache.org/nutch/NutchTutorial
There is no error, but the folders below are not created:
crawl/crawldb
crawl/linkdb
crawl/segments
Can anyone help me? I have not been able to resolve this for two days. Thanks a lot! The output is as follows:
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread

How to produce a massive amount of data?

心不动则不痛 submitted on 2019-12-04 11:09:18
Question: I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, then 500 GB, and eventually reach 1-2 TB. The problem is that I don't have this amount of data, so I'm thinking of ways to produce it. The data itself can be of any kind. One idea is to take an initial set of data and duplicate it, but that's not good enough because I need files that differ from one another (identical files are ignored). Another idea is to write a program
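For what it's worth, the second idea is straightforward to sketch: generate many files of random text so no two are identical. The snippet below is only an illustration; the file count, lines per file, and output directory are arbitrary parameters to scale up toward the 20 GB-2 TB targets.

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;

public class RandomDataGenerator {
  public static void main(String[] args) throws Exception {
    Path outDir = Paths.get("generated-data");
    Files.createDirectories(outDir);
    int fileCount = 1000;          // number of files to produce
    long linesPerFile = 100000L;   // a few MB per file; raise these two numbers as needed
    Random random = new Random();

    for (int i = 0; i < fileCount; i++) {
      Path file = outDir.resolve("part-" + i + ".txt");
      try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
        for (long line = 0; line < linesPerFile; line++) {
          // Random longs make every file (and every line) different from the others.
          writer.write(Long.toHexString(random.nextLong()) + " "
                     + Long.toHexString(random.nextLong()));
          writer.newLine();
        }
      }
    }
  }
}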

Adding the JE Chinese tokenizer to Nutch 1.0

只谈情不闲聊 submitted on 2019-12-04 11:07:30
First, download the Nutch 1.0 source:
svn co http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0 ./nutch-1.0
Change the query-syntax parsing part: change the way tokenization is done (originally single-character recognition for Chinese). Modify "src/java/org/apache/nutch/analysis/NutchAnalysis.jj", line 130:
| <SIGRAM: <CJK> >
change to:
| <SIGRAM: (<CJK>)+ >
Then run "javacc":
cd nutch-1.0/src/java/org/apache
Source: oschina. Link: https://my.oschina.net/u/98576/blog/7929

Where is the crawled data stored when running nutch crawler?

巧了我就是萌 submitted on 2019-12-04 09:51:59
I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data, and do some analysis. I followed https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text later) and ran the crawl using a few URLs as the seed. Now I can't find the text/html data on my local machine. Where can I find the data, and what is the best way to read it in text format?
Versions: apache-nutch-1.9, solr-4.10.4
After your crawl is over, you could use the bin/nutch dump command to dump all the URLs fetched in plain HTML format. The
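If it is the extracted plain text you are after rather than raw HTML, that lives in each segment's parse_text directory as Hadoop map files of URL to ParseText, and a small reader can print it directly. This is only a sketch: the segment path is a placeholder and it uses the older SequenceFile.Reader constructor.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseText;

public class ParseTextReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Data file under one segment's parse_text directory (replace with your own segment).
    Path data = new Path("crawl/segments/20150401000000/parse_text/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      ParseText parseText = new ParseText();
      while (reader.next(url, parseText)) {
        // Print each fetched URL followed by its extracted plain text.
        System.out.println(url + "\n" + parseText.getText() + "\n");
      }
    } finally {
      reader.close();
    }
  }
}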

Integration between Nutch 1.11(1.x) and Solr 5.3.1(5.x)

我的梦境 submitted on 2019-12-04 09:22:41
I just started using Nutch 1.11 and Solr 5.3.1. I want to crawl data with Nutch, then index it and prepare it for searching with Solr. I know how to crawl data from the web using Nutch's bin/crawl command, and I successfully got a lot of data from a website on my local machine. I also started a new Solr server locally with the command below, run from the Solr root folder:
bin/solr start
And created the example "files" core with the command below, using the config under the example folder:
bin/solr create -c files -d example/files/conf
I can log in to the admin URL below and manage the files core:
http://localhost:8983/solr/#/files
So I believe I
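As a side note, a quick SolrJ round trip against that core can confirm it accepts and returns documents before pointing Nutch's indexing job at it. This is a rough sketch under two assumptions: solr-solrj 5.3.1 is on the classpath, and the files core's schema accepts a plain id field.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class FilesCoreCheck {
  public static void main(String[] args) throws Exception {
    // Point at the "files" core created with bin/solr create -c files
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "smoke-test-1");
    client.add(doc);
    client.commit();

    // Query the document back to confirm indexing and search both work.
    QueryResponse response = client.query(new SolrQuery("id:smoke-test-1"));
    System.out.println("Documents found: " + response.getResults().getNumFound());
    client.close();
  }
}

Delete the smoke-test document afterwards if you keep using the core for crawled data.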