nutch

Google Cloud Search Indexer “Indexer: java.io.IOException: Job failed!”

Submitted by 笑着哭i on 2019-12-24 06:25:54
Question: I am a young developer, relatively new to Google Cloud Platform products, and in particular to Google Cloud Search. I have also tried to follow the https://developers.google.com/cloud-search/docs/guides/apache-nutch-connector tutorial. What I did was simply reproduce the tutorial, modifying the nutch-site.xml file like this: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration>
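The nutch-site.xml in the question is cut off after the opening `<configuration>` tag. As a point of comparison, here is a minimal sketch of what a nutch-site.xml override for this setup typically looks like. The `http.agent.name` property is genuinely mandatory in Nutch; the plugin list shown, including the `indexer-google-cloud-search` id, is an assumption and should be checked against the tutorial's exact values.

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!-- Mandatory: Nutch refuses to fetch anything without an agent name -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value> <!-- hypothetical name -->
  </property>
  <!-- Assumption: plugin id of the Cloud Search indexer; verify against the tutorial -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|indexer-google-cloud-search</value>
  </property>
</configuration>
```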

Why doesn't Nutch seem to know about “Last-Modified”?

Submitted by 不羁的心 on 2019-12-24 01:25:52
Question: I set up Nutch with a db.fetch.interval.default of 60000 so that I can crawl every day. If I don't, it won't even look at my site when I crawl the next day. But when I do crawl the next day, every page it fetched yesterday is fetched again with a 200 response code, indicating that it is not sending the previous day's date in the "If-Modified-Since" header. Shouldn't it skip fetching pages that haven't changed? Is there a way to make it do that? I noticed a ProtocolStatus.NOT_MODIFIED in Fetcher.java,
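For context on the numbers in the question: `db.fetch.interval.default` is measured in seconds, so 60000 is roughly 16.6 hours (the shipped default is 2592000, i.e. 30 days). A commonly suggested companion setting is `AdaptiveFetchSchedule`, which lengthens the interval for pages that come back unchanged. A hedged nutch-site.xml sketch; note that switching the schedule class does not by itself guarantee the fetcher sends If-Modified-Since:

```xml
<configuration>
  <!-- Re-fetch interval in seconds: 60000 s ≈ 16.6 h, so "daily" crawls re-fetch everything -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>60000</value>
  </property>
  <!-- Assumption: AdaptiveFetchSchedule backs off on pages that are found unmodified -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
</configuration>
```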

Unknown issue in Nutch elastic indexer with nutch REST api

Submitted by 删除回忆录丶 on 2019-12-23 17:21:49
Question: I was trying to expose Nutch using REST endpoints and ran into an issue in the indexer phase. I'm using the Elasticsearch index writer to index docs to ES. I used the $NUTCH_HOME/runtime/deploy/bin/nutch startserver command. While indexing, an unknown exception is thrown: Error: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor; 16/10/07 16:01:47 INFO mapreduce.Job: map 100% reduce 0% 16/10/07 16:01:49 INFO mapreduce.Job: Task Id : attempt_1475748314769_0107
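That stack-trace fragment is the signature of a Guava version clash: `MoreExecutors.directExecutor()` was added in Guava 18.0, while Hadoop distributions of that era bundled Guava 11, which shadows the newer jar when Nutch runs from runtime/deploy on the Hadoop classpath. One commonly cited fix is to force a newer Guava in Nutch's ivy/ivy.xml and rebuild (the exact revision and conf mapping below are assumptions to verify against your build):

```xml
<!-- ivy/ivy.xml: pin a Guava release that contains directExecutor() (added in 18.0) -->
<dependency org="com.google.guava" name="guava" rev="18.0" conf="*->default"/>
```

Another option sometimes suggested is setting the Hadoop job property `mapreduce.job.user.classpath.first=true` so the job's own jars win over Hadoop's bundled ones; whether that is enough depends on the cluster setup.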

Using Nutch, how to crawl the dynamic content of web pages that use Ajax?

Submitted by 馋奶兔 on 2019-12-23 15:46:14
Question: I am using Apache Nutch 1.10 to crawl web pages and extract their contents. Some of the links contain dynamic content that is loaded by Ajax calls. Nutch is not able to crawl and extract this dynamic Ajax content. How can I solve this? Is there a solution? If yes, please help me with your answers. Thanks in advance. Answer 1: Most web crawler libraries do not offer JavaScript rendering out of the box. You usually have to plug in another library or product that offers
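As a concrete example of "plugging in" a rendering engine: recent Nutch 1.x releases ship a `protocol-selenium` plugin that fetches pages through a real browser so Ajax-loaded content is present in the fetched HTML. A hedged nutch-site.xml sketch, assuming that plugin is available in your Nutch build (the rest of the plugin list is illustrative, not the shipped default):

```xml
<configuration>
  <!-- Assumption: protocol-selenium is bundled with (or built into) this Nutch release.
       It replaces protocol-http so pages are fetched via a browser that executes JS. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)</value>
  </property>
</configuration>
```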

Nutch 2.1 URL injection takes forever

Submitted by 半城伤御伤魂 on 2019-12-23 15:01:04
Question: I'm trying to deploy Nutch 2.1 on Ubuntu 12.04 by following that tutorial. Everything goes well until I try to inject URLs into the database. When I type $bin/nutch inject urls and press Enter, I get InjectorJob: starting InjectorJob: urlDir: urls and it stays there (for hours) until I decide to cancel the execution. urls is a directory that contains a file with URLs. I added proxy and port details in nutch-site.xml as suggested here, but that doesn't solve it. I tried Apache Nutch 2.2.1 and the
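Two things worth noting here. First, in Nutch 2.x an InjectorJob that hangs right after "urlDir" is often the Gora storage backend (HBase, Cassandra, MongoDB, ...) being unreachable rather than a network problem on the crawl side. Second, if a proxy really is needed, Nutch has dedicated properties for it; a minimal sketch with a hypothetical proxy host:

```xml
<configuration>
  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value> <!-- hypothetical host -->
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>
</configuration>
```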

Nutch Crawling not working for particular URL

Submitted by 谁都会走 on 2019-12-23 05:41:42
Question: I am using Apache Nutch for crawling. When I crawled the page http://www.google.co.in , it crawled the page correctly and produced results. But when I add one parameter to that URL, it does not fetch any results for the URL http://www.google.co.in/search?q=bill+gates . solrUrl is not set, indexing will be skipped... crawl started in: crawl rootUrlDir = urls threads = 10 depth = 3 solrUrl=null topN = 100 Injector: starting at 2013-05-27 08:01:57 Injector: crawlDb: crawl/crawldb Injector:
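The likely culprit is Nutch's default URL filter: conf/regex-urlfilter.txt ships with a rule that rejects any URL containing query-string characters, which silently drops http://www.google.co.in/search?q=bill+gates at injection time. A sketch of the relevant part of the file and one way to relax it:

```
# conf/regex-urlfilter.txt (excerpt)
# The shipped default skips URLs containing these characters, which
# rejects anything with a ?query=string:
#   -[?*!@=]
# To crawl such URLs, remove or comment out that rule, e.g. keep only
# a catch-all accept rule at the end of the file:
+.
```

Note that dropping the rule can open the crawler up to session-id and calendar-style crawl traps, so a narrower pattern for just the sites you need is usually safer.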

Apache Nutch Command Unable to Execute

Submitted by 回眸只為那壹抹淺笑 on 2019-12-23 02:57:51
Question: I followed each and every step in the Apache Nutch wiki. I am using Mac OS X 10.8.3, my JAVA_HOME is set correctly, and I can even see the various command options when bin/nutch is executed (as described in the wiki). But when I use bin/nutch crawl urls -dir crawl -depth 3 -topN 5 , I get the following error: bin/nutch: line 104: [: too many arguments Error: Could not find or load main class Engines FYI: I have already created a urls directory in apache-nutch-1.6/urls Can anyone tell what might be the
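The stray word "Engines" in the class-not-found message hints that the install path contains a space (for example a folder named "Search Engines"), which splits unquoted shell variables inside bin/nutch. That is exactly what produces "[: too many arguments". A self-contained demonstration, using a hypothetical path:

```shell
#!/bin/sh
# Hypothetical install path with a space, like ".../Search Engines/apache-nutch-1.6"
NUTCH_DIR="/tmp/Search Engines/apache-nutch-1.6"

# Unquoted, $NUTCH_DIR expands to two words, so `[` sees too many arguments --
# the same symptom as "bin/nutch: line 104: [: too many arguments":
if [ -d $NUTCH_DIR ] 2>/dev/null; then :; fi

# Quoted, the test is well-formed:
if [ -d "$NUTCH_DIR" ]; then
  echo "directory exists"
else
  echo "directory missing"
fi
```

The practical fix suggested by this reading is to move the Nutch installation to a path without spaces.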

How to extend Nutch for article crawling

Submitted by 狂风中的少年 on 2019-12-22 03:49:37
Question: I'm looking for a framework to grab articles, and I found Nutch 2.1. Here's my plan, with a question about each step: 1. Add article list pages to url/seed.txt. Here's one problem: what I actually want indexed is the article pages, not the article list pages. But if I don't allow the list pages to be crawled, Nutch will do nothing, because the list pages are the entry points. So, how can I index only the article pages and not the list pages? 2. Write a plugin to parse out the 'author', 'date', 'article body',
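The usual answer to question 1 is to let both page types be *fetched* (so outlinks from list pages are followed) but filter at *indexing* time, e.g. with a custom IndexingFilter that returns null for list pages. The snippet below is a self-contained sketch of just that classification decision; the URL patterns are hypothetical, and a real plugin would wire this into Nutch's IndexingFilter interface rather than a main method:

```java
import java.util.List;
import java.util.regex.Pattern;

/**
 * Sketch of the decision a custom Nutch IndexingFilter could make:
 * index article pages, skip list pages (which are still crawled so
 * their outlinks are discovered). URL shapes below are hypothetical.
 */
public class ArticleUrlClassifier {
    // Hypothetical convention: articles live at /article/<id>.html
    private static final Pattern ARTICLE = Pattern.compile(".*/article/\\d+\\.html$");

    public static boolean shouldIndex(String url) {
        return ARTICLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        for (String url : List.of(
                "http://example.com/article/123.html",
                "http://example.com/list/page2.html")) {
            System.out.println(url + " -> " + (shouldIndex(url) ? "index" : "skip"));
        }
    }
}
```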

Nutch does not index on Elasticsearch correctly using MongoDB

Submitted by ぐ巨炮叔叔 on 2019-12-22 00:24:38
Question: I am running Nutch 2.3.1, MongoDB 3.2.9, and Elasticsearch 2.4.1. I have followed a mix of this tutorial: https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch and this tutorial: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/ in order to create a web crawling tool from those three pieces of software. Everything works great until it comes to indexing... as soon as I use the index command from Nutch: # bin/nutch index
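Two things commonly go wrong at this step. First, the indexer-elastic plugin in Nutch 2.3.1 talks to Elasticsearch over the *transport* protocol, and the ES client library it was built against often does not match a newer server such as 2.4.1, so a client/server version mismatch is a prime suspect. Second, the elastic.* properties must be set in nutch-site.xml; a hedged sketch (property names are an assumption to verify against the plugin's documentation):

```xml
<configuration>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.port</name>
    <value>9300</value> <!-- transport port, not the 9200 REST port -->
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value> <!-- must match cluster.name in elasticsearch.yml -->
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
</configuration>
```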

How to select data from specific tags in Nutch

Submitted by 你说的曾经没有我的故事 on 2019-12-21 22:47:27
Question: I am a newbie with Apache Nutch and I would like to know whether it's possible to crawl a selected area of a web page. For instance, select a particular div and crawl the contents of that div only. Any help would be appreciated. Thanks! Answer 1: You will have to write a plugin that extends HtmlParseFilter to achieve your goal. I reckon you will be doing some of the work yourself, like parsing the HTML's specific section, extracting the URLs that you want, and adding them as outlinks. HtmlParseFilter
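To make the answer concrete, here is a self-contained sketch of the extraction step such a plugin would perform. A real HtmlParseFilter receives an already-parsed DOM (a DocumentFragment) from Nutch and would walk that tree; plain regex is used here only to keep the sketch runnable on its own, and it will break on nested divs. The `id="content"` target is a hypothetical example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustration of the "select one div" step an HtmlParseFilter plugin
 * would perform. A real plugin should traverse the DocumentFragment
 * Nutch hands it instead of regexing raw HTML.
 */
public class DivExtractor {
    // Hypothetical target: the inner content of <div id="content">...</div>
    private static final Pattern CONTENT_DIV =
            Pattern.compile("<div id=\"content\">(.*?)</div>", Pattern.DOTALL);

    public static String extract(String html) {
        Matcher m = CONTENT_DIV.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"nav\">menu</div>"
                    + "<div id=\"content\">Article text here.</div></body></html>";
        System.out.println(extract(page)); // prints: Article text here.
    }
}
```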