Nutch

Apache Nutch: Manipulating the DOM before parsing

Submitted by 点点圈 on 2019-12-25 06:45:59
Question: I want to remove specific elements from the page response before it is handed down to Nutch. Specifically, I want to mark parts of my pages like this:

    <div class="noindex">I shall not be indexed</div>

and remove them before the Nutch parse, so that "I shall not be indexed" is not present in the NutchDocument afterwards. I plan to surround my navigation, header, and footer content with this, because right now they are present in every document in the index. Thanks, Paul

Answer 1: You have some…
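A minimal sketch of the stripping step, using jsoup as an assumed HTML library (inside Nutch itself this would normally live in a custom HtmlParseFilter plugin; the class and selector below are illustrative, not Nutch API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper: drop every element marked class="noindex"
// from the raw HTML before the text reaches the indexer.
public class NoindexStripper {
    public static String strip(String rawHtml) {
        Document doc = Jsoup.parse(rawHtml);
        doc.select(".noindex").remove(); // removes the elements and all their children
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<div class=\"noindex\">I shall not be indexed</div><p>Keep this.</p>";
        System.out.println(strip(html)); // only the <p> survives
    }
}
```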

apache nutch to index to solr via REST

Submitted by 淺唱寂寞╮ on 2019-12-25 06:34:06
Question: I am a newbie with Apache Nutch, writing a client to use it via REST. I succeeded in all the steps (INJECT, FETCH, ...), but in the last step, when trying to index to Solr, it fails to pass the parameter. The request (formatted):

    {
      "args": {
        "batch": "1463743197862",
        "crawlId": "sample-crawl-01",
        "solr.server.url": "http://x.x.x.x:8081/solr/"
      },
      "confId": "default",
      "type": "INDEX",
      "crawlId": "sample-crawl-01"
    }

The Nutch logs: java.lang.Exception: java.lang.RuntimeException: …
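For reference, a minimal sketch of posting such a request with Java 11's built-in HTTP client; the host, port, and /job/create path are assumptions based on the default Nutch REST server layout, not details confirmed by the question:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NutchIndexJob {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint: the Nutch REST server's job-creation resource.
        String endpoint = "http://localhost:8081/job/create";
        String body = "{"
            + "\"args\": {\"batch\": \"1463743197862\","
            + "\"crawlId\": \"sample-crawl-01\","
            + "\"solr.server.url\": \"http://x.x.x.x:8081/solr/\"},"
            + "\"confId\": \"default\", \"type\": \"INDEX\","
            + "\"crawlId\": \"sample-crawl-01\"}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(endpoint))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```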

Nutch crawling stops after Injector

Submitted by 拈花ヽ惹草 on 2019-12-25 06:21:04
Question: Here is what my Cygwin screen looks like:

    cygpath: can't convert empty path
    Injector: starting at 2014-05-15 16:57:50
    Injector: crawlDb: -dir/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Patch for HADOOP-7682: Instantiating workaround file system
    Injector: total number of urls rejected by filters: 1
    Injector: total number of urls injected after normalization and filtering: 0
    Injector: Merging injected urls into crawl db.
    Injector: overwrite: false
    Injector: …
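"total number of urls rejected by filters: 1" with 0 injected suggests the seed URL never passes regex-urlfilter.txt. A hedged example of what an accepting rule can look like (example.org is a placeholder, not a host from the question):

```
# accept pages from a specific host (placeholder pattern)
+^https?://(www\.)?example\.org/
# or, as in the stock Nutch file, accept anything not rejected above
+.
```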

Web crawling tools which support interacting with target sites before beginning to crawl

Submitted by [亡魂溺海] on 2019-12-25 05:38:28
Question: I am looking for a crawler that can handle pages with Ajax and perform certain user interactions with the target site before starting to crawl it (e.g., clicking on certain menu items, filling in some forms, etc.). I tried WebDriver/Selenium (which are really web-scraping tools), and now I want to know whether any crawler is available that supports emulating certain user interactions before starting to crawl (in Java or Python or Ruby, ...). Thanks. PS - Can…
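Short of such a crawler, one common pattern is to script the interaction in Selenium and hand the harvested links to the crawler as seeds. A rough sketch (the element IDs, form field, and URL are made up for illustration):

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class InteractThenSeed {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.org/"); // placeholder target site
            // Hypothetical interactions: open a menu, submit a search form.
            driver.findElement(By.id("main-menu")).click();
            driver.findElement(By.name("q")).sendKeys("query");
            driver.findElement(By.name("q")).submit();

            // Collect the links exposed after the interaction as crawler seeds.
            List<WebElement> anchors = driver.findElements(By.tagName("a"));
            for (WebElement a : anchors) {
                String href = a.getAttribute("href");
                if (href != null) {
                    System.out.println(href); // write these to the seed list
                }
            }
        } finally {
            driver.quit();
        }
    }
}
```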

Empty Nutch crawl list

Submitted by 狂风中的少年 on 2019-12-25 03:09:30
Question: I'm trying to run a crawl using Nutch in Eclipse. I'm using a file called urls, and it contains http://www.google.com/. However, when I run the project, the Generator class tells me: "0 records selected for fetching, exiting". How can I solve this issue? I've followed this documentation:

    http://wiki.apache.org/nutch/RunNutchInEclipse1.0
    http://wiki.apache.org/nutch/NutchTutorial

Any help would be greatly appreciated.

Answer 1: I recently ran into this issue and found that most responses…
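One way to diagnose this (a sketch, assuming a Nutch 1.x checkout of that era; the checker flags have changed in later releases) is to push the seed URL through the configured filters and normalizers from the command line and see whether it survives:

```
echo "http://www.google.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
echo "http://www.google.com/" | bin/nutch org.apache.nutch.net.URLNormalizerChecker
```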

Nutch 2 with Cassandra as storage is not crawling data properly

Submitted by 我怕爱的太早我们不能终老 on 2019-12-25 03:07:43
Question: I am using Nutch 2.x with Cassandra as storage. Currently I am crawling only one website, and the data is loaded into Cassandra in byte-code format. When I use the readdb command in Nutch, I do not get any useful crawling data. Below are the details of the different files and the output I am getting.

Command to run the crawler:

    bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 3

seed.txt data:

    http://www.ft.com

…
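As a sanity check, Nutch 2.x's readdb (backed by WebTableReader) can print statistics or dump the web table in readable form. A hedged sketch (option names vary between 2.x releases, and the -crawlId value must match the id handed to bin/crawl; "myCrawl" is hypothetical):

```
bin/nutch readdb -stats -crawlId myCrawl
bin/nutch readdb -dump dump_dir -crawlId myCrawl -text
```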

Nutch does not crawl multiple sites

Submitted by 喜你入骨 on 2019-12-25 02:35:13
Question: I'm trying to crawl multiple sites using Nutch. My seed.txt looks like this:

    http://1.a.b/
    http://2.a.b/

and my regex-urlfilter.txt looks like this:

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
    # skip URLs…
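Note that the stock regex-urlfilter.txt usually ends with a host-restricting accept rule; if that final rule matches only one of the hosts, the other seed is silently dropped. A hedged example of accept rules covering both seeds (patterns illustrative, built from the seed URLs above):

```
# accept both seed hosts explicitly
+^http://1\.a\.b/
+^http://2\.a\.b/
```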

Apache Nutch not adding internal links in a web page to the fetchlist

Submitted by 对着背影说爱祢 on 2019-12-24 15:41:23
Question: I am using Apache Nutch 1.7 and I am facing a problem crawling with the URL http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. This page has many internal links and also many external links to other domains; I am only interested in the internal links. However, when this page is crawled, the internal links in it are not added for fetching in the next round (I have given a depth of 100). I have already set the db.ignore…
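The question is cut off at "db.ignore", but the two relevant properties go in nutch-site.xml. A sketch of the combination that keeps internal links while dropping external ones (property names and semantics as documented in nutch-default.xml):

```xml
<!-- sketch for nutch-site.xml -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value> <!-- keep outlinks that stay on the seed host -->
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>  <!-- drop outlinks that leave the seed host -->
</property>
```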

Apache Nutch error: Injector: java.io.IOException: (null) entry in command string: null chmod 0644

Submitted by  ̄綄美尐妖づ on 2019-12-24 09:58:28

Question: I am using Apache Nutch 1.14 on Windows 10 with Java 1.8. I have followed the steps described at https://wiki.apache.org/nutch/NutchTutorial. When I try to inject the URLs into the crawldb using this command in Cygwin:

    bin/nutch inject crawl/crawldb urls

I get the following error:

    Injector: java.io.IOException: (null) entry in command string: null chmod 0644 E:\apache-nutch-1.4\runtime\local\crawl\crawldb.locked
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    …
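This "(null) entry in command string: null chmod" failure is the classic symptom of Hadoop on Windows not finding its native winutils.exe helper. A hedged workaround sketch (the paths are placeholders, and the winutils build must match the Hadoop version bundled with Nutch):

```
# in the Cygwin shell, before running bin/nutch
export HADOOP_HOME=/cygdrive/c/hadoop        # folder assumed to contain bin/winutils.exe
export PATH="$PATH:$HADOOP_HOME/bin"
```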

UTF-8 characters not showing properly

Submitted by 天涯浪子 on 2019-12-24 07:03:56
Question: I am using Nutch 1.4 and Solr 3.3.0 to crawl and index my site, which is in French. My site used to be in iso8859-1. Currently I have two indexes under Solr: in the first one I store my old pages (in iso8859-1), and in the second one I store my new pages (in utf-8). I use the same Nutch configuration for both crawl jobs to fetch and index the old and the new pages of my site. I have not added any settings about character encodings myself (I think). I am facing a problem when searching the…
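The usual cause of this kind of mojibake is bytes written in one encoding being decoded in another. A small Java illustration (not Nutch code) of why an iso8859-1 page has to be transcoded before it can live in a utf-8 index:

```java
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    public static void main(String[] args) {
        // A French string as it would arrive from an iso8859-1 page.
        byte[] latin1Bytes = "été".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong: decoding iso8859-1 bytes as UTF-8 yields replacement garbage.
        String broken = new String(latin1Bytes, StandardCharsets.UTF_8);

        // Right: decode with the source charset, then re-encode as UTF-8.
        String fixed = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        byte[] utf8Bytes = fixed.getBytes(StandardCharsets.UTF_8);

        System.out.println("broken: " + broken);
        System.out.println("fixed:  " + new String(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```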