nutch

Nutch on windows: ERROR crawl.Injector

岁酱吖の submitted on 2020-01-02 20:10:55
Question: I'm trying to install Nutch 1.12 on a Windows Server 2012 machine using Cygwin64 2.874. Because of my limited skills with Java and Linux, I followed the step-by-step introduction at https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Seeding_the_crawldb_with_a_list_of_URLs. The command bin/nutch inject crawl/crawldb urls throws an error because winutils.exe couldn't be found. Here is the Hadoop log: 2016-07-01 09:22:25,660 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary

Find all the web pages in a domain and its subdomains

浪尽此生 submitted on 2020-01-02 08:04:09
Question: I am looking for a way to find all the web pages and subdomains in a domain. For example, in the uoregon.edu domain, I would like to find all the web pages in this domain and in all of its subdomains (e.g., cs.uoregon.edu). I have been looking at Nutch, and I think it can do the job. However, it seems that Nutch downloads entire web pages and indexes them for later search, whereas I want a crawler that only scans a web page for URLs that belong to the same domain. Furthermore, it seems that nutch
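As a minimal sketch of the "only scan a page for URLs in the same domain" idea from the question above, the following uses only the Python standard library. It is not Nutch code and does no fetching; the names `LinkCollector`, `in_domain`, and `same_domain_links` are hypothetical, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def same_domain_links(html, base_url, domain):
    """Return absolute, de-duplicated links from html that stay inside domain."""
    parser = LinkCollector()
    parser.feed(html)
    kept = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if in_domain(absolute, domain) and absolute not in kept:
            kept.append(absolute)
    return kept


# Hypothetical sample page, assumed to have been fetched from http://uoregon.edu/
page = (
    '<a href="/about">About</a> '
    '<a href="http://cs.uoregon.edu/">CS</a> '
    '<a href="http://example.com/">Other</a>'
)
links = same_domain_links(page, "http://uoregon.edu/", "uoregon.edu")
```

A full crawler would feed each kept link back into a queue of pages to fetch; that loop, politeness delays, and robots.txt handling are deliberately left out here.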

How do we create a simple search engine using Lucene, Solr or Nutch?

自作多情 submitted on 2020-01-01 05:07:06
Question: Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic AND/OR queries, then show them the document links of all matching PDFs. Answer 1: None of the projects in the Lucene family can natively process PDFs, but there are utilities you can drop in and well-written examples on how to roll your own. Lucene will do pretty much whatever you need it to do, but

How to parse HTML with Nutch and index a specific tag to Solr?

為{幸葍}努か submitted on 2019-12-30 10:08:42
Question: I have installed Nutch and Solr to crawl a website and search in it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse meta tags plugin (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is any way (a plugin or anything else) to index another HTML tag that isn't a meta tag into Solr, like this: <div id=something> me specific tag </div> In other words, I want to add a field to Solr (something) that has the value "me specific tag" for this page. Any ideas? Answer 1: I made my
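The extraction step the question asks about (pull the text of one `<div>` by its id so it can become a field value) can be sketched outside Nutch as follows. This is only an illustration of the logic in standard-library Python, not a Nutch plugin and not the truncated answer's approach; `DivTextExtractor` and `extract_div_text` are hypothetical names.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collects the text inside the <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the target div; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1                      # nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1                       # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, target_id):
    """Return the whitespace-normalized text of the div with the given id."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = "<p>intro</p><div id=something> me specific tag </div><div>other</div>"
field_value = extract_div_text(sample, "something")
```

In a real setup the resulting string would still have to be attached to the document under a field name that the Solr schema defines; that wiring is not shown here.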

Nutch 2.2.1 doesn't continue after Injector job

删除回忆录丶 submitted on 2019-12-29 09:29:43
Question: I am learning Nutch and trying to crawl as per this tutorial. I am working on an Ubuntu machine with the bash shell. When I run the script, it starts executing, but nothing happens after this output: InjectorJob: starting at 2014-03-23 09:28:50 InjectorJob: Injecting urlDir: urls/seed.txt I have waited for hours, and I tried running the same command with sudo. The same issue occurs. I have tried with the default URLs given in the tutorial as well. What could be the probable errors? Answer 1: What was missing was I didn't

Configuring Nutch regex-normalize.xml

丶灬走出姿态 submitted on 2019-12-25 15:22:08
Question: I am using the Java-based Nutch web-search software. To prevent duplicate URLs from being returned in my search results, I am trying to remove (i.e., normalize away) 'jsessionid' expressions from the URLs being indexed when I run the Nutch crawler over my intranet. However, my modifications to $NUTCH_HOME/conf/regex-normalize.xml (made before running my crawl) do not seem to have any effect. How can I ensure that my regex-normalize.xml configuration is being
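To make the intended normalization concrete, here is a small sketch of stripping a `jsessionid` path parameter with a regular expression. It assumes the common servlet form `;jsessionid=<value>` placed before any query string or fragment. The pattern is written for Python's `re` module purely as an illustration; it is not the asker's rule, and `strip_jsessionid` is a hypothetical helper.

```python
import re

# Match ";jsessionid=" and everything up to the next "?" or "#" (or end of URL).
JSESSIONID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def strip_jsessionid(url):
    """Return the URL with any ;jsessionid=... path parameter removed."""
    return JSESSIONID.sub("", url)


# Hypothetical intranet URLs that differ only by session id
a = strip_jsessionid("http://intranet.example/page.jsp;jsessionid=ABC123?x=1")
b = strip_jsessionid("http://intranet.example/page.jsp;JSESSIONID=ZZZ999?x=1")
```

Both calls yield the same URL, which is the property that removes the duplicates. Trying a candidate pattern on sample URLs like this helps separate "the pattern is wrong" from "the configuration file is not being picked up".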

Nutch 2.2.1 & HBase - Can I create a new property in nutch-site.xml

我怕爱的太早我们不能终老 submitted on 2019-12-25 10:36:04
Question: I want to develop a topical web robot using Nutch 2.2.1, and I want to create a new property with some topic keywords, like the following: <property> <name>html.metatitle.keys</name> <value>movie,actor,firm</value> <description> </description> </property> Answer 1: There are two different solutions available for your problem: Implementing a customized HtmlParseFilter plugin to filter pages based on your desired keywords. For more information about Nutch extension points and writing customized plugin for
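As a rough illustration of how a comma-separated property like the one in the question could be read and split into keywords, the sketch below parses a Hadoop-style `<configuration>` document with Python's standard library. It is not how Nutch itself loads nutch-site.xml; `read_property` and `topic_keywords` are hypothetical helpers, and the XML is copied from the question.

```python
import xml.etree.ElementTree as ET

# The property from the question, wrapped in a <configuration> root element
SITE_XML = """<configuration>
  <property>
    <name>html.metatitle.keys</name>
    <value>movie,actor,firm</value>
    <description></description>
  </property>
</configuration>"""


def read_property(xml_text, name):
    """Return the <value> of the property with the given <name>, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def topic_keywords(xml_text, name="html.metatitle.keys"):
    """Split the comma-separated property value into a clean keyword list."""
    raw = read_property(xml_text, name) or ""
    return [k.strip() for k in raw.split(",") if k.strip()]


keywords = topic_keywords(SITE_XML)
```

A custom plugin would then compare these keywords against the page content it wants to filter; that part depends on the plugin and is not sketched here.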

Nutch 2.2.1 & HBase - Can I create a new property in nutch-site.xml

允我心安 submitted on 2019-12-25 10:32:05
Question: I want to develop a topical web robot using Nutch 2.2.1, and I want to create a new property with some topic keywords, like the following: <property> <name>html.metatitle.keys</name> <value>movie,actor,firm</value> <description> </description> </property> Answer 1: There are two different solutions available for your problem: Implementing a customized HtmlParseFilter plugin to filter pages based on your desired keywords. For more information about Nutch extension points and writing customized plugin for

Nutch Crawler error: Permission denied

随声附和 submitted on 2019-12-25 06:59:05
Question: I am trying to run a basic crawler. I got the command from the NutchTutorial: bin/crawl urls -dir crawl -depth 3 -topN 5 (after completing all the preliminary setup). I'm running on Windows, so I've installed Cygwin64 as the runtime environment. I don't see any problems when I run bin/nutch from the Nutch home directory, but when I try to run the crawl as above I get the following error: Injector: starting at 2014-11-29 11:31:35 Injector: crawlDb: -dir/crawldb Injector: urlDir: urls Injector: Converting injected

Apache Nutch crawler - keeps retrieving only a single URL

倖福魔咒の submitted on 2019-12-25 06:57:48
Question: The INJECT step keeps retrieving only a single URL when I try to crawl CNN. I'm using the default config (my nutch-site is below). What could the cause be? Shouldn't it be 10 docs according to my value? <configuration> <property> <name>http.agent.name</name> <value>crawler1</value> </property> <property> <name>storage.data.store.class</name> <value>org.apache.gora.hbase.store.HBaseStore</value> <description>Default class for storing data</description> </property> <property> <name>solr.server.url</name>