nutch

How are you learning Hadoop for big data? Check out this 300-part Hadoop series

Posted by 倾然丶 夕夏残阳落幕 on 2019-12-16 08:45:20
1. HADOOP Background

1.1 What is HADOOP

HADOOP is an open-source software platform under the Apache Foundation. What HADOOP provides: distributed processing of massive data volumes on a server cluster, according to user-defined business logic. HADOOP's core components are:

- HDFS (a distributed file system)
- YARN (a resource scheduling system)
- MAPREDUCE (a distributed computation framework)

In the broad sense, HADOOP usually refers to a wider concept: the HADOOP ecosystem.

1.2 Where HADOOP came from

HADOOP originated in Nutch. Nutch's design goal was to build a large whole-web search engine, including web crawling, indexing, and querying, but as the number of crawled pages grew it ran into a severe scalability problem: how to store and index billions of web pages.

The two papers Google published in 2003 and 2004 offered a workable solution to that problem:

- the distributed file system GFS, usable for storing the massive set of web pages
- the distributed computation framework MAPREDUCE, usable for computing the index over that massive set of pages

Nutch's developers built the corresponding open-source implementations, HDFS and MAPREDUCE, which were split out of Nutch as the independent project HADOOP. In January 2008 HADOOP became an Apache top-level project and entered its period of rapid growth.

1.3 HADOOP's place in big data and cloud computing

Cloud computing is distributed computing, parallel computing, grid computing, multi-core computing, network storage, virtualization

Apache Nutch REST API

Posted by 自闭症网瘾萝莉.ら on 2019-12-13 20:04:28
Question: I'm trying to launch a crawl via the REST API. A crawl starts with injecting URLs. Using the Chrome developer tool "Advanced Rest Client" I'm trying to build this POST payload up, but the response I get is a 400 Bad Request.

POST - http://localhost:8081/job/create
Payload:

    {
      "crawl-id": "crawl-01",
      "type": "INJECT",
      "config-id": "default",
      "args": { "path/to/seedlist/directory" }
    }

My problem is in the args; I think more is needed, but I'm not sure. In the NutchRESTAPI page this is the sample it gives
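One thing stands out before even reaching the API: the args object above is not valid JSON, since it contains a bare value with no key, and that alone can produce a 400. A corrected payload might look like the sketch below; the field names crawlId/confId and the argument key seed_dir are assumptions based on the Nutch 1.x job service model and injector, so check the NutchRESTAPI wiki page for the exact names in your version.

    POST http://localhost:8081/job/create
    Content-Type: application/json

    {
      "crawlId": "crawl-01",
      "type": "INJECT",
      "confId": "default",
      "args": { "seed_dir": "/path/to/seedlist/directory" }
    }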

Nutch Elasticsearch Integration

Posted by 核能气质少年 on 2019-12-13 04:28:12
Question: I'm following this tutorial for setting up Nutch along with Elasticsearch. Whenever I try to index the data into ES, it returns an error. Following are the logs.

Command:

    bin/nutch index elasticsearch -all

Logs when I add elastic.port (9200) in conf/nutch-site.xml:

    2016-05-05 13:22:49,903 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2016-05-05 13:22:49,904 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2016-05
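For context, the indexer-elastic plugin in Nutch 1.x takes its connection settings from conf/nutch-site.xml, and the older transport-based versions of the plugin connect to Elasticsearch's transport port (typically 9300), not the HTTP port 9200, which is a common source of exactly this kind of failure. A minimal sketch of the relevant properties (names as in the 1.x indexer-elastic plugin; verify them against your version, and make sure plugin.includes actually lists indexer-elastic):

    <property>
      <name>elastic.host</name>
      <value>localhost</value>
    </property>
    <property>
      <name>elastic.port</name>
      <value>9300</value>
    </property>
    <property>
      <name>elastic.cluster</name>
      <value>elasticsearch</value>  <!-- must match cluster.name in elasticsearch.yml -->
    </property>
    <property>
      <name>elastic.index</name>
      <value>nutch</value>
    </property>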

How to read Nutch content from Java/Scala?

Posted by 前提是你 on 2019-12-13 03:48:19
Question: I'm using Nutch to crawl some websites (as a process that runs separately from everything else), while I want to use a Java (Scala) program to analyse the HTML data of the websites using Jsoup. I got Nutch to work by following the tutorial (without the script; only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory. The problem is that I cannot figure out how to actually read the website data (URLs and
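The part-00000 directory under content is a Hadoop MapFile, so the page records can be read back with the plain SequenceFile API: the keys are Text URLs and the values are org.apache.nutch.protocol.Content objects carrying the fetched bytes. A minimal Java sketch (assuming Hadoop 2.x and the Nutch jars on the classpath; the <time> placeholder stands for the actual segment name):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class SegmentDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The MapFile's sequence data lives in the "data" file inside part-00000.
            Path data = new Path("crawl/segments/<time>/content/part-00000/data");
            try (SequenceFile.Reader reader =
                     new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
                Text url = new Text();           // key: the page URL
                Content content = new Content(); // value: fetched bytes plus metadata
                while (reader.next(url, content)) {
                    String html = new String(content.getContent()); // raw HTML, ready for Jsoup
                    System.out.println(url + " -> " + content.getContent().length + " bytes");
                }
            }
        }
    }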

How to fix this error running Nutch 1.15 ERROR fetcher.Fetcher - Fetcher job did not succeed, job status:FAILED, reason: NA

Posted by 天大地大妈咪最大 on 2019-12-13 03:46:03
Question: When I'm starting a crawl using Nutch 1.15 with this:

    /usr/local/nutch/bin/crawl --i -s urls/seed.txt crawldb 5

it starts to run, and I get this error when it tries to fetch:

    2019-02-10 15:29:32,021 INFO mapreduce.Job - Running job: job_local1267180618_0001
    2019-02-10 15:29:32,145 INFO fetcher.FetchItemQueues - Using queue mode : byHost
    2019-02-10 15:29:32,145 INFO fetcher.Fetcher - Fetcher: threads: 50
    2019-02-10 15:29:32,145 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
    2019-02-10
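One detail worth checking before digging deeper: the 1.x crawl script documents its indexing flag as -i (or --index), so --i may simply be an unrecognized option. A sketch of the likely intended invocation (an assumption based on the 1.15 usage string; run bin/crawl with no arguments to see the exact options of your build):

    /usr/local/nutch/bin/crawl -i -s urls/seed.txt crawldb 5

Whatever the cause, the real failure behind "reason: NA" is usually spelled out in logs/hadoop.log next to the Fetcher error.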

Apache Nutch 2.3.1 map-reduce timeout occurred while updating the score

Posted by ぃ、小莉子 on 2019-12-13 03:22:58
Question: I have a 4-node cluster and Apache Nutch 2.3.1 is configured to crawl a few websites. After crawling, I have to change their scores a little via a custom job. In the job, the mapper just combines the documents by domain as the key. In the reducer, I sum their effective text bytes and find the average; later I assign the log of the average bytes as the score. But the reducer job took 14 hours and then a timeout occurred, while a Nutch built-in job, e.g. updatedb, finishes in 3 to 4 hours. Where is
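The reduce step described here (group by domain, sum effective text bytes, average, take the log as the score) is cheap per record, so a 14-hour run ending in a timeout usually means the task stopped reporting progress long enough to hit mapreduce.task.timeout, not that the arithmetic is slow. A minimal sketch of such a reducer (class and field names are hypothetical) that heartbeats while iterating over very large groups:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical: keys are domains, values are per-document effective text byte counts.
    public class DomainScoreReducer
            extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text domain, Iterable<LongWritable> textBytes, Context context)
                throws IOException, InterruptedException {
            long sum = 0, docs = 0;
            for (LongWritable bytes : textBytes) {
                sum += bytes.get();
                docs++;
                context.progress(); // heartbeat so a huge group doesn't trip the task timeout
            }
            double avg = (double) sum / Math.max(docs, 1);
            context.write(domain, new DoubleWritable(Math.log(avg))); // score = log(average bytes)
        }
    }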

How to consume the REST API of the Apache Nutch Docker image

Posted by 浪子不回头ぞ on 2019-12-13 02:08:24
Question: I pulled the Apache Nutch Docker image and started it with

    docker run --name my_nutch -d -p 8899:8899 -e SOLRURL=192.168.99.100:8983 -t meabed/nutch

Any action I try to consume (according to their REST API) gets a 404, for example 192.168.99.100:8899/admin. I also tried GET http://192.168.99.100:8899/nutch/#/admin. In Postman I get (for all GET REST requests; for POST I get 404):

    [
      [ "admin", "Service admin actions" ],
      [ "confs", "Configuration manager" ],
      [ "db", "DB data streaming" ],
      [ "jobs", "Job

Unable to verify crawled data stored in HBase

Posted by 故事扮演 on 2019-12-13 01:25:38
Question: I have crawled a website using Nutch with HBase as the storage back-end. I referred to this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial. The Nutch version is 2.2.1, the HBase version 0.90.4, and the Solr version 4.7.1. Here are the steps I used:

    ./runtime/local/bin/nutch inject urls
    ./runtime/local/bin/nutch generate -topN 100 -adddays 30
    ./runtime/local/bin/nutch fetch -all
    ./runtime/local/bin/nutch fetch -all
    ./runtime/local/bin/nutch updatedb
    ./runtime/local/bin/nutch solrindex http:/
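To see what actually landed in HBase, the hbase shell can list the tables and scan a few rows of the page table Nutch 2.x creates (a sketch; depending on configuration the table is typically named webpage or <crawlId>_webpage):

    hbase shell
    > list
    > scan 'webpage', {LIMIT => 5}

If the table turns out to be empty, note that the step list above has no parse step between fetch and updatedb, which the Nutch2Tutorial crawl cycle normally includes.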

Nutch in Hadoop 2.x

Posted by 假如想象 on 2019-12-13 00:29:09
Question: I have a three-node cluster running Hadoop 2.2.0 and HBase 0.98.1, and I need to use a Nutch 2.2.1 crawler on top of that. But it only supports Hadoop versions from the 1.x branch. By now I am able to submit a Nutch job to my cluster, but it fails with java.lang.NumberFormatException. So my question is pretty simple: how do I make Nutch work in my environment?

Answer 1: At the moment it's impossible to integrate Nutch 2.2.1 (Gora 0.3) with HBase 0.98.x. See: https://issues.apache.org/jira/browse/GORA

Working of the Nutch server in distributed mode

Posted by 送分小仙女□ on 2019-12-12 20:51:53
Question: I would like to know how the Nutch server actually works in a distributed environment. Does it use a listener for incoming crawl requests, or is it a continuously running server?

Answer 1: The Nutch REST API is built using the Apache CXF framework and JAX-RS. The Nutch server uses an embedded Jetty server to service the HTTP requests. You can find out more about CXF and Jetty here: http://cxf.apache.org/docs/overview.html

Source: https://stackoverflow.com/questions/39853492/working-of-nutch-server-in
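For reference, the continuously running server described in the answer is what Nutch 1.x launches with the startserver command; the process then listens for REST requests until stopped (a sketch; 8081 is the usual default port, adjust as needed):

    bin/nutch startserver -port 8081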