nutch

How are you learning Hadoop for big data? Check out this 300-part Hadoop series

Posted by 倾然丶 夕夏残阳落幕 on 2019-12-16 08:45:20
1. HADOOP Background

1.1 What is HADOOP

HADOOP is an open-source software platform under the Apache Foundation. What HADOOP provides: distributed processing of massive data volumes on a server cluster, according to user-defined business logic. HADOOP's core components are:

- HDFS (a distributed file system)
- YARN (a resource scheduling system)
- MAPREDUCE (a distributed computation framework)

In the broad sense, HADOOP usually refers to a wider concept: the HADOOP ecosystem.

1.2 Where HADOOP came from

HADOOP originated in Nutch. Nutch's design goal was to build a large whole-web search engine, including web crawling, indexing, and querying, but as the number of crawled pages grew it ran into a severe scalability problem: how to store and index billions of web pages.

The two papers Google published in 2003 and 2004 offered a workable solution to that problem:

- the distributed file system GFS, usable for storing the massive set of web pages
- the distributed computation framework MAPREDUCE, usable for computing the index over that massive set of pages

Nutch's developers built the corresponding open-source implementations, HDFS and MAPREDUCE, which were split out of Nutch as the independent project HADOOP. In January 2008 HADOOP became an Apache top-level project and entered its period of rapid growth.

1.3 HADOOP's place in big data and cloud computing

Cloud computing is distributed computing, parallel computing, grid computing, multi-core computing, network storage, virtualization

Apache Nutch REST API

Posted by 自闭症网瘾萝莉.ら on 2019-12-13 20:04:28
Question: I'm trying to launch a crawl via the REST API. A crawl starts with injecting URLs. Using the Chrome developer tool "Advanced Rest Client" I'm trying to build this POST payload up, but the response I get is a 400 Bad Request.

POST - http://localhost:8081/job/create
Payload:

    {
      "crawl-id": "crawl-01",
      "type": "INJECT",
      "config-id": "default",
      "args": { "path/to/seedlist/directory" }
    }

My problem is in the args; I think more is needed, but I'm not sure. In the NutchRESTAPI page this is the sample it gives
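One thing stands out before even reaching the API: the args object above is not valid JSON, since it contains a bare value with no key, and that alone can produce a 400. A corrected payload might look like the sketch below; the field names crawlId/confId and the argument key seed_dir are assumptions based on the Nutch 1.x job service model and injector, so check the NutchRESTAPI wiki page for the exact names in your version.

    POST http://localhost:8081/job/create
    Content-Type: application/json

    {
      "crawlId": "crawl-01",
      "type": "INJECT",
      "confId": "default",
      "args": { "seed_dir": "/path/to/seedlist/directory" }
    }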

Nutch Elasticsearch Integration

Posted by 核能气质少年 on 2019-12-13 04:28:12
Question: I'm following this tutorial for setting up Nutch along with Elasticsearch. Whenever I try to index the data into ES, it returns an error. Following are the logs.

Command:

    bin/nutch index elasticsearch -all

Logs when I add elastic.port (9200) in conf/nutch-site.xml:

    2016-05-05 13:22:49,903 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2016-05-05 13:22:49,904 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2016-05
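For context, the indexer-elastic plugin in Nutch 1.x takes its connection settings from conf/nutch-site.xml, and the older transport-based versions of the plugin connect to Elasticsearch's transport port (typically 9300), not the HTTP port 9200, which is a common source of exactly this kind of failure. A minimal sketch of the relevant properties (names as in the 1.x indexer-elastic plugin; verify them against your version, and make sure plugin.includes actually lists indexer-elastic):

    <property>
      <name>elastic.host</name>
      <value>localhost</value>
    </property>
    <property>
      <name>elastic.port</name>
      <value>9300</value>
    </property>
    <property>
      <name>elastic.cluster</name>
      <value>elasticsearch</value>  <!-- must match cluster.name in elasticsearch.yml -->
    </property>
    <property>
      <name>elastic.index</name>
      <value>nutch</value>
    </property>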

How to read Nutch content from Java/Scala?

Posted by 前提是你 on 2019-12-13 03:48:19
Question: I'm using Nutch to crawl some websites (as a process that runs separately from everything else), while I want to use a Java (Scala) program to analyse the HTML data of the websites using Jsoup. I got Nutch to work by following the tutorial (without the script; only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory. The problem is that I cannot figure out how to actually read the website data (URLs and
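The part-00000 directory under content is a Hadoop MapFile, so the page records can be read back with the plain SequenceFile API: the keys are Text URLs and the values are org.apache.nutch.protocol.Content objects carrying the fetched bytes. A minimal Java sketch (assuming Hadoop 2.x and the Nutch jars on the classpath; the <time> placeholder stands for the actual segment name):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class SegmentDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The MapFile's sequence data lives in the "data" file inside part-00000.
            Path data = new Path("crawl/segments/<time>/content/part-00000/data");
            try (SequenceFile.Reader reader =
                     new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
                Text url = new Text();           // key: the page URL
                Content content = new Content(); // value: fetched bytes plus metadata
                while (reader.next(url, content)) {
                    String html = new String(content.getContent()); // raw HTML, ready for Jsoup
                    System.out.println(url + " -> " + content.getContent().length + " bytes");
                }
            }
        }
    }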

How to fix this error running Nutch 1.15 ERROR fetcher.Fetcher - Fetcher job did not succeed, job status:FAILED, reason: NA

Posted by 天大地大妈咪最大 on 2019-12-13 03:46:03
Question: When I'm starting a crawl using Nutch 1.15 with this:

    /usr/local/nutch/bin/crawl --i -s urls/seed.txt crawldb 5

it starts to run, and I get this error when it tries to fetch:

    2019-02-10 15:29:32,021 INFO mapreduce.Job - Running job: job_local1267180618_0001
    2019-02-10 15:29:32,145 INFO fetcher.FetchItemQueues - Using queue mode : byHost
    2019-02-10 15:29:32,145 INFO fetcher.Fetcher - Fetcher: threads: 50
    2019-02-10 15:29:32,145 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
    2019-02-10
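One detail worth checking before digging deeper: the 1.x crawl script documents its indexing flag as -i (or --index), so --i may simply be an unrecognized option. A sketch of the likely intended invocation (an assumption based on the 1.15 usage string; run bin/crawl with no arguments to see the exact options of your build):

    /usr/local/nutch/bin/crawl -i -s urls/seed.txt crawldb 5

Whatever the cause, the real failure behind "reason: NA" is usually spelled out in logs/hadoop.log next to the Fetcher error.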

Apache Nutch 2.3.1 map-reduce timeout occurred while updating the score

Posted by ぃ、小莉子 on 2019-12-13 03:22:58
Question: I have a 4-node cluster and Apache Nutch 2.3.1 is configured to crawl a few websites. After crawling, I have to change their scores a little via a custom job. In the job, the mapper just combines the documents by domain as the key. In the reducer, I sum their effective text bytes and find the average; later I assign the log of the average bytes as the score. But the reducer job took 14 hours and then a timeout occurred, while a Nutch built-in job, e.g. updatedb, finishes in 3 to 4 hours. Where is
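The reduce step described here (group by domain, sum effective text bytes, average, take the log as the score) is cheap per record, so a 14-hour run ending in a timeout usually means the task stopped reporting progress long enough to hit mapreduce.task.timeout, not that the arithmetic is slow. A minimal sketch of such a reducer (class and field names are hypothetical) that heartbeats while iterating over very large groups:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical: keys are domains, values are per-document effective text byte counts.
    public class DomainScoreReducer
            extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text domain, Iterable<LongWritable> textBytes, Context context)
                throws IOException, InterruptedException {
            long sum = 0, docs = 0;
            for (LongWritable bytes : textBytes) {
                sum += bytes.get();
                docs++;
                context.progress(); // heartbeat so a huge group doesn't trip the task timeout
            }
            double avg = (double) sum / Math.max(docs, 1);
            context.write(domain, new DoubleWritable(Math.log(avg))); // score = log(average bytes)
        }
    }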

How to consume the REST API of the Apache Nutch Docker image

Posted by 浪子不回头ぞ on 2019-12-13 02:08:24
Question: I pulled the Apache Nutch Docker image and started it with

    docker run --name my_nutch -d -p 8899:8899 -e SOLRURL=192.168.99.100:8983 -t meabed/nutch

Any action I try to consume (according to their REST API) gets a 404, for example 192.168.99.100:8899/admin. I also tried GET http://192.168.99.100:8899/nutch/#/admin. In Postman I get (for all GET REST requests; for POST I get 404):

    [
      [ "admin", "Service admin actions" ],
      [ "confs", "Configuration manager" ],
      [ "db", "DB data streaming" ],
      [ "jobs", "Job

Unable to verify crawled data stored in HBase

Posted by 故事扮演 on 2019-12-13 01:25:38
Question: I have crawled a website using Nutch with HBase as the storage back-end. I referred to this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial. The Nutch version is 2.2.1, the HBase version 0.90.4, and the Solr version 4.7.1. Here are the steps I used:

    ./runtime/local/bin/nutch inject urls
    ./runtime/local/bin/nutch generate -topN 100 -adddays 30
    ./runtime/local/bin/nutch fetch -all
    ./runtime/local/bin/nutch fetch -all
    ./runtime/local/bin/nutch updatedb
    ./runtime/local/bin/nutch solrindex http:/
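To see what actually landed in HBase, the hbase shell can list the tables and scan a few rows of the page table Nutch 2.x creates (a sketch; depending on configuration the table is typically named webpage or <crawlId>_webpage):

    hbase shell
    > list
    > scan 'webpage', {LIMIT => 5}

If the table turns out to be empty, note that the step list above has no parse step between fetch and updatedb, which the Nutch2Tutorial crawl cycle normally includes.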

Nutch in Hadoop 2.x

Posted by 假如想象 on 2019-12-13 00:29:09
Question: I have a three-node cluster running Hadoop 2.2.0 and HBase 0.98.1, and I need to use a Nutch 2.2.1 crawler on top of that. But it only supports Hadoop versions from the 1.x branch. By now I am able to submit a Nutch job to my cluster, but it fails with java.lang.NumberFormatException. So my question is pretty simple: how do I make Nutch work in my environment?

Answer 1: At the moment it's impossible to integrate Nutch 2.2.1 (Gora 0.3) with HBase 0.98.x. See: https://issues.apache.org/jira/browse/GORA

Working of the Nutch server in distributed mode

Posted by 送分小仙女□ on 2019-12-12 20:51:53
Question: I would like to know how the Nutch server actually works in a distributed environment. Does it use a listener for incoming crawl requests, or is it a continuously running server?

Answer 1: The Nutch REST API is built using the Apache CXF framework and JAX-RS. The Nutch server uses an embedded Jetty server to service the HTTP requests. You can find out more about CXF and Jetty here: http://cxf.apache.org/docs/overview.html

Source: https://stackoverflow.com/questions/39853492/working-of-nutch-server-in
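For reference, the continuously running server described in the answer is what Nutch 1.x launches with the startserver command; the process then listens for REST requests until stopped (a sketch; 8081 is the usual default port, adjust as needed):

    bin/nutch startserver -port 8081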