nutch

nutch + mysql integration

Submitted by 梦想的初衷 on 2019-12-08 03:43:57
Question: When Nutch finishes its cycle (that is crawl - fetch - parse - index), during the index phase I do not want Nutch to build a Lucene index; instead, I want Nutch to place all the crawled data (I believe it keeps them as NutchDocument objects) into MySQL using my own code. Is there any way to do this? Thanks
Answer 1: Create your own Java class that manages the Nutch cycle. It should be similar to org.apache.nutch.crawl.Crawl, but you will have to replace the call to the indexer with a call to your MySQL connector. Or you can call your MySQL connector during each cycle, depending on whether you want to update MySQL at the
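
A minimal sketch of the kind of MySQL connector the answer refers to, assuming a hypothetical crawled_pages table and connection settings (the MySQL JDBC driver must be on the classpath); the driving class would call write() for each crawled document instead of invoking the Lucene indexer:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch only: replaces the indexer call with a plain JDBC insert per document.
public class MysqlDocumentWriter implements AutoCloseable {
  private final Connection conn;
  private final PreparedStatement insert;

  public MysqlDocumentWriter(String jdbcUrl, String user, String password) throws Exception {
    conn = DriverManager.getConnection(jdbcUrl, user, password);
    insert = conn.prepareStatement(
        "INSERT INTO crawled_pages (url, title, content) VALUES (?, ?, ?)");
  }

  // Called once per crawled document, e.g. with values pulled from a NutchDocument.
  public void write(String url, String title, String content) throws Exception {
    insert.setString(1, url);
    insert.setString(2, title);
    insert.setString(3, content);
    insert.executeUpdate();
  }

  @Override
  public void close() throws Exception {
    insert.close();
    conn.close();
  }
}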

Facing issue in elasticsearch mapping of nutch crawled document

Submitted by ╄→гoц情女王★ on 2019-12-08 02:47:48
Question: Facing some serious issues while using Nutch and Elasticsearch for crawling. We have two data storage engines in our app: MySQL and Elasticsearch. Let's say I have 10 URLs stored in the urls table of the MySQL db. Now I want to fetch these URLs from the table at run time and write them into seed.txt for crawling. I have written all these URLs into seed.txt in one go. Now my crawling starts, and then I index these docs inside Elasticsearch in an index (let's say a url index). But I want to maintain a
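
A minimal sketch of the "read the urls table and write seed.txt" step described above, assuming a hypothetical urls table with a url column, hypothetical connection settings, and a local urls/ seed directory:

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SeedListExporter {
  public static void main(String[] args) throws Exception {
    String jdbcUrl = "jdbc:mysql://localhost:3306/crawler"; // hypothetical database
    try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT url FROM urls");
         PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("urls/seed.txt")))) {
      // Nutch expects one URL per line in the seed file.
      while (rs.next()) {
        out.println(rs.getString("url"));
      }
    }
  }
}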

Nutch: Authentication via putting a cookie in the header

Submitted by 北城以北 on 2019-12-07 21:24:47
Question: I am surprised that there is so little support or information out there for getting Nutch to crawl parts of a website that require authentication. I am aware that Apache Nutch may not currently be able to (but apparently hopes to) support HTTP POST authentication. However, all we really want to do is add a cookie to our Nutch bot's header that will allow it to access those parts of the site that way (rather than posting a username and password to a form and then receiving the
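
Nutch would need this header added inside its HTTP protocol plugin, which in practice usually means patching the plugin source; as a plain-Java illustration (not Nutch code) of the request the bot has to send, with a hypothetical session cookie and URL:

import java.net.HttpURLConnection;
import java.net.URL;

public class CookieHeaderExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical session cookie obtained out of band (e.g. by logging in with a browser).
    String sessionCookie = "JSESSIONID=abc123";

    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://example.com/protected/page").openConnection();
    conn.setRequestProperty("Cookie", sessionCookie); // attach the cookie to the request header
    System.out.println("HTTP " + conn.getResponseCode());
  }
}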

How to lookup HBase REST API (Stargate) if the row-keys are reversed urls

Submitted by 回眸只為那壹抹淺笑 on 2019-12-07 21:10:01
Question: I am using Nutch 2.2.1 + HBase 0.90.4, and I want to access the data via the HBase REST API, Stargate. If I seed Nutch with a URL (e.g. www.usatoday.com), the reversed URL becomes the HBase row key in the designated table ('webpage'). I can look up the data via the hbase shell as follows:
hbase(main):001:0> get 'webpage', 'com.usatoday.www:http/'
COLUMN        CELL
 f:fi         timestamp=1404762373394, value=\x00'\x8D\x00
 f:ts         timestamp=1404762373394, value=\x00\x00\x01G\x12\\xB5\xB3
 mk:_injmrk_  timestamp
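
One thing worth checking (an assumption, since the question is cut off here) is that the ':' and '/' in the reversed-URL row key are percent-encoded when placed in the Stargate request path. A small sketch against a hypothetical Stargate endpoint on localhost:8080:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class StargateRowLookup {
  public static void main(String[] args) throws Exception {
    // Row key exactly as shown in the hbase shell; ':' and '/' must be escaped for the REST path.
    String rowKey = "com.usatoday.www:http/";
    String encodedKey = URLEncoder.encode(rowKey, "UTF-8"); // com.usatoday.www%3Ahttp%2F

    URL url = new URL("http://localhost:8080/webpage/" + encodedKey); // hypothetical Stargate host/port
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");

    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}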

Nutch does not crawl URLs with query string parameters

Submitted by …衆ロ難τιáo~ on 2019-12-07 20:22:25
Question: I am using Nutch 1.9 and trying to crawl using the individual commands. As can be seen in the output, when going into the 2nd level the generator returned 0 records. Has anyone faced this issue? I have been stuck here for the past 2 days and have searched all possible options. Any leads/help would be much appreciated.
####### INJECT ######
Injector: starting at 2015-04-08 17:36:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db
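
A common cause of this behaviour (an assumption, since the question does not show its filter configuration) is the default rule in conf/regex-urlfilter.txt that skips any URL containing query-string characters; relaxing it lets the generator schedule parameterised URLs:

# default rule that drops URLs containing probable query-string characters:
# -[?*!@=]
# relaxed version that still skips '*', '!' and '@' but keeps ?key=value URLs:
-[*!@]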

Nutch problems executing crawl

Submitted by 旧城冷巷雨未停 on 2019-12-07 09:57:27
I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands in Windows 7. Nutch is running and I am getting results from running bin/nutch, but I keep getting error messages when I try to run a crawl. I get the following error when I try to execute a crawl with Nutch:
Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt
Failed with exit value 127.
I have my JAVA_HOME classpath set, and I have altered the host
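
Exit value 127 from a shell normally means "command not found", so one thing to check (an assumption about the cause, not a confirmed fix) is whether the Cygwin shell that runs bin/nutch can actually locate java; the JDK path below is hypothetical:

export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.8.0_66"   # hypothetical install location
export PATH="$JAVA_HOME/bin:$PATH"
which java && java -version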

How to Crawl .pdf links using Apache Nutch

Submitted by 旧街凉风 on 2019-12-07 09:48:33
Question: I got a website to crawl which includes some links to pdf files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6. I am also trying this in Java as:
ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));
Can someone help me with this?
Answer 1: If you want Nutch to crawl and index your pdf documents, you have to enable document crawling and the Tika plugin:
Document crawling
1.1 Edit regex-urlfilter.txt
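
A hedged example of what "enable the Tika plugin" can look like in conf/nutch-site.xml, assuming parse-tika is not already in your distribution's default plugin.includes; the plugin list here is illustrative rather than the exact default:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Plugin list with parse-tika enabled so PDF content can be parsed.</description>
</property>

If the suffix-exclusion rule in regex-urlfilter.txt happens to list pdf, it would also need to be removed so that .pdf links pass the filter.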

How to open an Ant project (Nutch source) in IntelliJ IDEA?

Submitted by 孤街浪徒 on 2019-12-06 21:50:57
Question: I want to open the Nutch 2.1 source (http://www.eu.apache.org/dist/nutch/2.1/) in IntelliJ IDEA. Here is an explanation of how to open it in Eclipse: http://wiki.apache.org/nutch/RunNutchInEclipse However, I am not familiar with Ant (I use Maven), and when I open the source many classes are not known by IntelliJ, e.g.:
org.apache.hadoop.mapreduce.JobContext
org.apache.gora.mapreduce.GoraMapper
How can I add them to the library, or what should I do?
Answer 1: I finally figured out how to do it. Now
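
Since the Nutch 2.1 build resolves the org.apache.hadoop and org.apache.gora jars through Ivy rather than Maven, one common route (following the RunNutchInEclipse wiki page linked above; the exact target name comes from that guide and is not verified here) is to let Ant fetch the dependencies and generate IDE project files, then import the result into IntelliJ:

cd apache-nutch-2.1
ant clean eclipse    # resolves Ivy dependencies and generates Eclipse project files

IntelliJ can then import the generated Eclipse project, which puts the Ivy-resolved jars on the module classpath.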

Extracting Fetched Web Pages from Nutch in a Map Reduce Friendly Format

Submitted by 牧云@^-^@ on 2019-12-06 09:35:28
Question: After a Nutch crawl in distributed (deploy) mode as follows:
bin/nutch crawl s3n://..... -depth 10 -topN 50000 -dir /crawl -threads 20
I need to extract each URL fetched along with its content in a map-reduce-friendly format. Using the readseg command below, the contents are fetched, but the output format doesn't lend itself to being map-reduced:
bin/nutch readseg -dump /crawl/segments/* /output -nogenerate -noparse -noparsedata -noparsetext
Ideally the output should be in this format:
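
One detail that helps here: the fetched pages already sit in segments/*/content/part-*/data as Hadoop sequence files keyed by URL, so a MapReduce job can consume them directly with SequenceFileInputFormat, or a small standalone reader can re-emit them in whatever flat layout is needed. A sketch of such a reader, assuming the Nutch and Hadoop jars from the same release are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.nutch.protocol.Content;

public class SegmentContentDumper {
  public static void main(String[] args) throws Exception {
    // args[0]: path to one segment's content data file,
    // e.g. /crawl/segments/<timestamp>/content/part-00000/data
    Configuration conf = new Configuration();
    Path data = new Path(args[0]);
    FileSystem fs = data.getFileSystem(conf);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(url, value)) {
        Content content = (Content) value;
        // Re-emit one record per page, e.g. "url <TAB> contentType <TAB> length";
        // swap this for whatever flat, splittable layout the downstream job expects.
        System.out.println(url + "\t" + content.getContentType() + "\t" + content.getContent().length);
      }
    } finally {
      reader.close();
    }
  }
}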