Nutch

Apache Nutch: Manipulating the DOM before parsing

Submitted by 点点圈 on 2019-12-25 06:45:59
Question: I want to remove specific elements from the page response before it is handed down to Nutch. Specifically, I want to mark parts of my pages like this:

    <div class="noindex">I shall not be indexed</div>

and remove them before the Nutch parse, so that "I shall not be indexed" is not present in the NutchDocument afterwards. I plan to surround my navigation, header, and footer content with this, because right now they are present in every document in the index. Thanks, Paul

Answer 1: You have some…
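A minimal sketch of the stripping step, using jsoup as an assumed HTML library (inside Nutch itself this would normally live in a custom HtmlParseFilter plugin; the class and selector below are illustrative, not Nutch API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper: drop every element marked class="noindex"
// from the raw HTML before the text reaches the indexer.
public class NoindexStripper {
    public static String strip(String rawHtml) {
        Document doc = Jsoup.parse(rawHtml);
        doc.select(".noindex").remove(); // removes the elements and all their children
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<div class=\"noindex\">I shall not be indexed</div><p>Keep this.</p>";
        System.out.println(strip(html)); // only the <p> survives
    }
}
```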

apache nutch to index to solr via REST

Submitted by 淺唱寂寞╮ on 2019-12-25 06:34:06
Question: I am a newbie with Apache Nutch, writing a client to use it via REST. I succeeded in all the steps (INJECT, FETCH, ...), but in the last step, when trying to index to Solr, it fails to pass the parameter. The request (formatted):

    {
      "args": {
        "batch": "1463743197862",
        "crawlId": "sample-crawl-01",
        "solr.server.url": "http://x.x.x.x:8081/solr/"
      },
      "confId": "default",
      "type": "INDEX",
      "crawlId": "sample-crawl-01"
    }

The Nutch logs: java.lang.Exception: java.lang.RuntimeException: …
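For reference, a minimal sketch of posting such a request with Java 11's built-in HTTP client; the host, port, and /job/create path are assumptions based on the default Nutch REST server layout, not details confirmed by the question:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NutchIndexJob {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint: the Nutch REST server's job-creation resource.
        String endpoint = "http://localhost:8081/job/create";
        String body = "{"
            + "\"args\": {\"batch\": \"1463743197862\","
            + "\"crawlId\": \"sample-crawl-01\","
            + "\"solr.server.url\": \"http://x.x.x.x:8081/solr/\"},"
            + "\"confId\": \"default\", \"type\": \"INDEX\","
            + "\"crawlId\": \"sample-crawl-01\"}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(endpoint))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```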

Nutch crawling stops after Injector

Submitted by 拈花ヽ惹草 on 2019-12-25 06:21:04
Question: Here is what my Cygwin screen looks like:

    cygpath: can't convert empty path
    Injector: starting at 2014-05-15 16:57:50
    Injector: crawlDb: -dir/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Patch for HADOOP-7682: Instantiating workaround file system
    Injector: total number of urls rejected by filters: 1
    Injector: total number of urls injected after normalization and filtering: 0
    Injector: Merging injected urls into crawl db.
    Injector: overwrite: false
    Injector: …
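"total number of urls rejected by filters: 1" with 0 injected suggests the seed URL never passes regex-urlfilter.txt. A hedged example of what an accepting rule can look like (example.org is a placeholder, not a host from the question):

```
# accept pages from a specific host (placeholder pattern)
+^https?://(www\.)?example\.org/
# or, as in the stock Nutch file, accept anything not rejected above
+.
```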

Web crawling tools which support interacting with target sites before beginning to crawl

Submitted by [亡魂溺海] on 2019-12-25 05:38:28
Question: I am looking for a crawler that can handle pages with Ajax and perform certain user interactions with the target site before starting to crawl it (e.g., clicking on certain menu items, filling in some forms, etc.). I tried WebDriver/Selenium (which are really web-scraping tools), and now I want to know whether any crawler is available that supports emulating certain user interactions before starting to crawl (in Java or Python or Ruby, ...). Thanks. PS - Can…
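Short of such a crawler, one common pattern is to script the interaction in Selenium and hand the harvested links to the crawler as seeds. A rough sketch (the element IDs, form field, and URL are made up for illustration):

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class InteractThenSeed {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.org/"); // placeholder target site
            // Hypothetical interactions: open a menu, submit a search form.
            driver.findElement(By.id("main-menu")).click();
            driver.findElement(By.name("q")).sendKeys("query");
            driver.findElement(By.name("q")).submit();

            // Collect the links exposed after the interaction as crawler seeds.
            List<WebElement> anchors = driver.findElements(By.tagName("a"));
            for (WebElement a : anchors) {
                String href = a.getAttribute("href");
                if (href != null) {
                    System.out.println(href); // write these to the seed list
                }
            }
        } finally {
            driver.quit();
        }
    }
}
```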

Empty Nutch crawl list

Submitted by 狂风中的少年 on 2019-12-25 03:09:30
Question: I'm trying to run a crawl using Nutch in Eclipse. I'm using a file called urls, and it contains http://www.google.com/. However, when I run the project, the Generator class tells me: "0 records selected for fetching, exiting". How can I solve this issue? I've followed this documentation:

    http://wiki.apache.org/nutch/RunNutchInEclipse1.0
    http://wiki.apache.org/nutch/NutchTutorial

Any help would be greatly appreciated.

Answer 1: I recently ran into this issue and found that most responses…
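One way to diagnose this (a sketch, assuming a Nutch 1.x checkout of that era; the checker flags have changed in later releases) is to push the seed URL through the configured filters and normalizers from the command line and see whether it survives:

```
echo "http://www.google.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
echo "http://www.google.com/" | bin/nutch org.apache.nutch.net.URLNormalizerChecker
```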

Nutch 2 with Cassandra as storage is not crawling data properly

Submitted by 我怕爱的太早我们不能终老 on 2019-12-25 03:07:43
Question: I am using Nutch 2.x with Cassandra as storage. Currently I am crawling only one website, and the data is loaded into Cassandra in byte-code format. When I use the readdb command in Nutch, I do not get any useful crawling data. Below are the details of the different files and the output I am getting.

Command to run the crawler:

    bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 3

seed.txt data:

    http://www.ft.com

…
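As a sanity check, Nutch 2.x's readdb (backed by WebTableReader) can print statistics or dump the web table in readable form. A hedged sketch (option names vary between 2.x releases, and the -crawlId value must match the id handed to bin/crawl; "myCrawl" is hypothetical):

```
bin/nutch readdb -stats -crawlId myCrawl
bin/nutch readdb -dump dump_dir -crawlId myCrawl -text
```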

Nutch does not crawl multiple sites

Submitted by 喜你入骨 on 2019-12-25 02:35:13
Question: I'm trying to crawl multiple sites using Nutch. My seed.txt looks like this:

    http://1.a.b/
    http://2.a.b/

and my regex-urlfilter.txt looks like this:

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
    # skip URLs…
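Note that the stock regex-urlfilter.txt usually ends with a host-restricting accept rule; if that final rule matches only one of the hosts, the other seed is silently dropped. A hedged example of accept rules covering both seeds (patterns illustrative, built from the seed URLs above):

```
# accept both seed hosts explicitly
+^http://1\.a\.b/
+^http://2\.a\.b/
```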

Apache Nutch not adding internal links in a web page to the fetchlist

Submitted by 对着背影说爱祢 on 2019-12-24 15:41:23
Question: I am using Apache Nutch 1.7 and I am facing a problem crawling with the URL http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. This page has many internal links and also many external links to other domains; I am only interested in the internal links. However, when this page is crawled, the internal links in it are not added for fetching in the next round (I have given a depth of 100). I have already set the db.ignore…
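The question is cut off at "db.ignore", but the two relevant properties go in nutch-site.xml. A sketch of the combination that keeps internal links while dropping external ones (property names and semantics as documented in nutch-default.xml):

```xml
<!-- sketch for nutch-site.xml -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value> <!-- keep outlinks that stay on the seed host -->
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>  <!-- drop outlinks that leave the seed host -->
</property>
```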

Apache Nutch error: Injector: java.io.IOException: (null) entry in command string: null chmod 0644

Submitted by  ̄綄美尐妖づ on 2019-12-24 09:58:28

Question: I am using Apache Nutch 1.14 on Windows 10 with Java 1.8. I have followed the steps described at https://wiki.apache.org/nutch/NutchTutorial. When I try to inject the URLs into the crawldb using this command in Cygwin:

    bin/nutch inject crawl/crawldb urls

I get the following error:

    Injector: java.io.IOException: (null) entry in command string: null chmod 0644 E:\apache-nutch-1.4\runtime\local\crawl\crawldb.locked
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    …
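This "(null) entry in command string: null chmod" failure is the classic symptom of Hadoop on Windows not finding its native winutils.exe helper. A hedged workaround sketch (the paths are placeholders, and the winutils build must match the Hadoop version bundled with Nutch):

```
# in the Cygwin shell, before running bin/nutch
export HADOOP_HOME=/cygdrive/c/hadoop        # folder assumed to contain bin/winutils.exe
export PATH="$PATH:$HADOOP_HOME/bin"
```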

UTF-8 characters not showing properly

Submitted by 天涯浪子 on 2019-12-24 07:03:56
Question: I am using Nutch 1.4 and Solr 3.3.0 to crawl and index my site, which is in French. My site used to be in iso8859-1. Currently I have two indexes under Solr: in the first one I store my old pages (in iso8859-1), and in the second one I store my new pages (in utf-8). I use the same Nutch configuration for both crawl jobs to fetch and index the old and the new pages of my site. I have not added any settings about character encodings myself (I think). I am facing a problem when searching the…
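The usual cause of this kind of mojibake is bytes written in one encoding being decoded in another. A small Java illustration (not Nutch code) of why an iso8859-1 page has to be transcoded before it can live in a utf-8 index:

```java
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    public static void main(String[] args) {
        // A French string as it would arrive from an iso8859-1 page.
        byte[] latin1Bytes = "été".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong: decoding iso8859-1 bytes as UTF-8 yields replacement garbage.
        String broken = new String(latin1Bytes, StandardCharsets.UTF_8);

        // Right: decode with the source charset, then re-encode as UTF-8.
        String fixed = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        byte[] utf8Bytes = fixed.getBytes(StandardCharsets.UTF_8);

        System.out.println("broken: " + broken);
        System.out.println("fixed:  " + new String(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```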