nutch

nutch + mysql integration

Submitted by 梦想的初衷 on 2019-12-08 03:43:57
Question: When Nutch finishes its cycle (that is crawl - fetch - parse - index), during the index phase I do not want Nutch to build a Lucene index; instead, I want Nutch to place all the crawled data (I believe it keeps them as NutchDocument objects) into MySQL using my own code. Is there any way to do this? Thanks
Answer 1: Create your own Java class that manages the Nutch cycle. It should be similar to org.apache.nutch.crawl.Crawl, but you will have to replace the call to the indexer with a call to your MySQL connector. Or you can call your MySQL connector during each cycle, depending on whether you want to update MySQL at the
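
A minimal sketch of the kind of MySQL connector the answer refers to, assuming a hypothetical crawled_pages table and connection settings (the MySQL JDBC driver must be on the classpath); the driving class would call write() for each crawled document instead of invoking the Lucene indexer:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch only: replaces the indexer call with a plain JDBC insert per document.
public class MysqlDocumentWriter implements AutoCloseable {
  private final Connection conn;
  private final PreparedStatement insert;

  public MysqlDocumentWriter(String jdbcUrl, String user, String password) throws Exception {
    conn = DriverManager.getConnection(jdbcUrl, user, password);
    insert = conn.prepareStatement(
        "INSERT INTO crawled_pages (url, title, content) VALUES (?, ?, ?)");
  }

  // Called once per crawled document, e.g. with values pulled from a NutchDocument.
  public void write(String url, String title, String content) throws Exception {
    insert.setString(1, url);
    insert.setString(2, title);
    insert.setString(3, content);
    insert.executeUpdate();
  }

  @Override
  public void close() throws Exception {
    insert.close();
    conn.close();
  }
}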

Facing issue in elasticsearch mapping of nutch crawled document

Submitted by ╄→гoц情女王★ on 2019-12-08 02:47:48
Question: Facing some serious issues while using Nutch and Elasticsearch for crawling. We have two data storage engines in our app: MySQL and Elasticsearch. Let's say I have 10 URLs stored in the urls table of the MySQL db. Now I want to fetch these URLs from the table at run time and write them into seed.txt for crawling. I have written all these URLs into seed.txt in one go. Now my crawling starts, and then I index these docs inside Elasticsearch in an index (let's say a url index). But I want to maintain a
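
A minimal sketch of the "read the urls table and write seed.txt" step described above, assuming a hypothetical urls table with a url column, hypothetical connection settings, and a local urls/ seed directory:

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SeedListExporter {
  public static void main(String[] args) throws Exception {
    String jdbcUrl = "jdbc:mysql://localhost:3306/crawler"; // hypothetical database
    try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT url FROM urls");
         PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("urls/seed.txt")))) {
      // Nutch expects one URL per line in the seed file.
      while (rs.next()) {
        out.println(rs.getString("url"));
      }
    }
  }
}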

Nutch: Authentication via putting a cookie in the header

Submitted by 北城以北 on 2019-12-07 21:24:47
Question: I am surprised that there is so little support or information out there for getting Nutch to crawl parts of a website that require authentication. I am aware that Apache Nutch may not currently be able to (but apparently hopes to) support HTTP POST authentication. However, all we really want to do is add a cookie to our Nutch bot's header that will allow it to access those parts of the site that way (rather than posting a username and password to a form and then receiving the
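
Nutch would need this header added inside its HTTP protocol plugin, which in practice usually means patching the plugin source; as a plain-Java illustration (not Nutch code) of the request the bot has to send, with a hypothetical session cookie and URL:

import java.net.HttpURLConnection;
import java.net.URL;

public class CookieHeaderExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical session cookie obtained out of band (e.g. by logging in with a browser).
    String sessionCookie = "JSESSIONID=abc123";

    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://example.com/protected/page").openConnection();
    conn.setRequestProperty("Cookie", sessionCookie); // attach the cookie to the request header
    System.out.println("HTTP " + conn.getResponseCode());
  }
}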

How to lookup HBase REST API (Stargate) if the row-keys are reversed urls

Submitted by 回眸只為那壹抹淺笑 on 2019-12-07 21:10:01
Question: I am using Nutch 2.2.1 + HBase 0.90.4, and I want to access the data via the HBase REST API, Stargate. If I seed Nutch with a URL (e.g. www.usatoday.com), the reversed URL becomes the HBase row key in the designated table ('webpage'). I can look up the data via the hbase shell as follows:
hbase(main):001:0> get 'webpage', 'com.usatoday.www:http/'
COLUMN        CELL
 f:fi         timestamp=1404762373394, value=\x00'\x8D\x00
 f:ts         timestamp=1404762373394, value=\x00\x00\x01G\x12\\xB5\xB3
 mk:_injmrk_  timestamp
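
One thing worth checking (an assumption, since the question is cut off here) is that the ':' and '/' in the reversed-URL row key are percent-encoded when placed in the Stargate request path. A small sketch against a hypothetical Stargate endpoint on localhost:8080:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class StargateRowLookup {
  public static void main(String[] args) throws Exception {
    // Row key exactly as shown in the hbase shell; ':' and '/' must be escaped for the REST path.
    String rowKey = "com.usatoday.www:http/";
    String encodedKey = URLEncoder.encode(rowKey, "UTF-8"); // com.usatoday.www%3Ahttp%2F

    URL url = new URL("http://localhost:8080/webpage/" + encodedKey); // hypothetical Stargate host/port
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");

    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}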

Nutch does not crawl URLs with query string parameters

Submitted by …衆ロ難τιáo~ on 2019-12-07 20:22:25
Question: I am using Nutch 1.9 and trying to crawl using the individual commands. As can be seen in the output, when going into the 2nd level the generator returned 0 records. Has anyone faced this issue? I have been stuck here for the past 2 days and have searched all possible options. Any leads/help would be much appreciated.
####### INJECT ######
Injector: starting at 2015-04-08 17:36:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db
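
A common cause of this behaviour (an assumption, since the question does not show its filter configuration) is the default rule in conf/regex-urlfilter.txt that skips any URL containing query-string characters; relaxing it lets the generator schedule parameterised URLs:

# default rule that drops URLs containing probable query-string characters:
# -[?*!@=]
# relaxed version that still skips '*', '!' and '@' but keeps ?key=value URLs:
-[*!@]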

Nutch problems executing crawl

Submitted by 旧城冷巷雨未停 on 2019-12-07 09:57:27
I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands in Windows 7. Nutch is running and I am getting results from running bin/nutch, but I keep getting error messages when I try to run a crawl. I get the following error when I try to execute a crawl with Nutch:
Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt
Failed with exit value 127.
I have my JAVA_HOME classpath set, and I have altered the host
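
Exit value 127 from a shell normally means "command not found", so one thing to check (an assumption about the cause, not a confirmed fix) is whether the Cygwin shell that runs bin/nutch can actually locate java; the JDK path below is hypothetical:

export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.8.0_66"   # hypothetical install location
export PATH="$JAVA_HOME/bin:$PATH"
which java && java -version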

How to Crawl .pdf links using Apache Nutch

Submitted by 旧街凉风 on 2019-12-07 09:48:33
Question: I got a website to crawl which includes some links to pdf files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6. I am also trying this in Java as:
ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));
Can someone help me with this?
Answer 1: If you want Nutch to crawl and index your pdf documents, you have to enable document crawling and the Tika plugin:
Document crawling
1.1 Edit regex-urlfilter.txt
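
A hedged example of what "enable the Tika plugin" can look like in conf/nutch-site.xml, assuming parse-tika is not already in your distribution's default plugin.includes; the plugin list here is illustrative rather than the exact default:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Plugin list with parse-tika enabled so PDF content can be parsed.</description>
</property>

If the suffix-exclusion rule in regex-urlfilter.txt happens to list pdf, it would also need to be removed so that .pdf links pass the filter.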

How to open an Ant project (Nutch source) in IntelliJ IDEA?

Submitted by 孤街浪徒 on 2019-12-06 21:50:57
Question: I want to open the Nutch 2.1 source (http://www.eu.apache.org/dist/nutch/2.1/) in IntelliJ IDEA. Here is an explanation of how to open it in Eclipse: http://wiki.apache.org/nutch/RunNutchInEclipse However, I am not familiar with Ant (I use Maven), and when I open the source many classes are not known by IntelliJ, e.g.:
org.apache.hadoop.mapreduce.JobContext
org.apache.gora.mapreduce.GoraMapper
How can I add them to the library, or what should I do?
Answer 1: I finally figured out how to do it. Now
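
Since the Nutch 2.1 build resolves the org.apache.hadoop and org.apache.gora jars through Ivy rather than Maven, one common route (following the RunNutchInEclipse wiki page linked above; the exact target name comes from that guide and is not verified here) is to let Ant fetch the dependencies and generate IDE project files, then import the result into IntelliJ:

cd apache-nutch-2.1
ant clean eclipse    # resolves Ivy dependencies and generates Eclipse project files

IntelliJ can then import the generated Eclipse project, which puts the Ivy-resolved jars on the module classpath.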

Extracting Fetched Web Pages from Nutch in a Map Reduce Friendly Format

Submitted by 牧云@^-^@ on 2019-12-06 09:35:28
Question: After a Nutch crawl in distributed (deploy) mode as follows:
bin/nutch crawl s3n://..... -depth 10 -topN 50000 -dir /crawl -threads 20
I need to extract each URL fetched along with its content in a map-reduce-friendly format. Using the readseg command below, the contents are fetched, but the output format doesn't lend itself to being map-reduced:
bin/nutch readseg -dump /crawl/segments/* /output -nogenerate -noparse -noparsedata -noparsetext
Ideally the output should be in this format:
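
One detail that helps here: the fetched pages already sit in segments/*/content/part-*/data as Hadoop sequence files keyed by URL, so a MapReduce job can consume them directly with SequenceFileInputFormat, or a small standalone reader can re-emit them in whatever flat layout is needed. A sketch of such a reader, assuming the Nutch and Hadoop jars from the same release are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.nutch.protocol.Content;

public class SegmentContentDumper {
  public static void main(String[] args) throws Exception {
    // args[0]: path to one segment's content data file,
    // e.g. /crawl/segments/<timestamp>/content/part-00000/data
    Configuration conf = new Configuration();
    Path data = new Path(args[0]);
    FileSystem fs = data.getFileSystem(conf);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(url, value)) {
        Content content = (Content) value;
        // Re-emit one record per page, e.g. "url <TAB> contentType <TAB> length";
        // swap this for whatever flat, splittable layout the downstream job expects.
        System.out.println(url + "\t" + content.getContentType() + "\t" + content.getContent().length);
      }
    } finally {
      reader.close();
    }
  }
}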