Unable to verify crawled data stored in hbase

Posted by 故事扮演 on 2019-12-13 01:25:38

Question


I have crawled a website using Nutch with HBase as the storage back end, following this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial.

The Nutch version is 2.2.1, the HBase version is 0.90.4, and the Solr version is 4.7.1.

Here are the steps I used:

./runtime/local/bin/nutch inject urls

./runtime/local/bin/nutch generate -topN 100 -adddays 30

./runtime/local/bin/nutch fetch -all

./runtime/local/bin/nutch fetch -all

./runtime/local/bin/nutch updatedb

./runtime/local/bin/nutch solrindex http://localhost:8983/solr/ -all

My url/seed.txt file contains: http://www.xyzshoppingsite.com/mobiles/

I have kept only the line below in the 'regex-urlfilter.txt' file (all other regexes are commented out).

+^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*
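As a quick sanity check on the filter line above, here is a minimal sketch that tries the same pattern against a few URLs. It uses `grep -E` as a stand-in for the Java regex engine Nutch uses, and it assumes the filter applies the pattern as an unanchored search. The `check_url` helper and the two extra test URLs are hypothetical and not part of the crawl.

```shell
# Pattern copied from regex-urlfilter.txt without the leading "+"
# (in that file, "+" marks the rule as an accept rule).
pattern='^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*'

# Print ACCEPT or REJECT for one URL.
# Assumption: grep -E behaves like the Java regex engine for this simple pattern.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "ACCEPT $1"
  else
    echo "REJECT $1"
  fi
}

check_url 'http://www.xyzshoppingsite.com/mobiles/'        # seed URL from the question
check_url 'http://www.xyzshoppingsite.com/mobile/phone-1'  # hypothetical URL
check_url 'http://www.xyzshoppingsite.com/laptops/'        # hypothetical URL
```

Because the pattern has no end anchor, it also accepts paths that merely start with /mobile, so the /mobiles/ seed URL should pass the filter under these assumptions.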

At the end of the crawl, I can see that a table named "webpage" has been created in HBase, but I am unable to verify whether all of the data has been crawled completely. A search in Solr returns nothing (0 results).

My ultimate goal is to get the complete data from all pages under mobile at the URL above.

Could you please let me know:

  • How can I verify the crawled data present in HBase?

  • The Solr log directory contains 0 files, so I have nothing to go on. How can I resolve this?

  • The output of the HBase command scan "webpage" shows only timestamp data, plus other data such as:

    'value=\x0A\x0APlease Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>Please Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>'

Why was the data crawled like this, rather than the actual contents of the page after redirection?
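The stored value above suggests that the seed URL (with the trailing slash) returns a redirect page pointing at /mobiles. One way to confirm this outside Nutch is to ask the server for the status code and redirect target directly. The following is a rough sketch that assumes `curl` is installed; `classify_status` and `probe_url` are hypothetical helper names, and the host is the placeholder used in this question.

```shell
# Classify an HTTP status code: 3xx means the server answered with a redirect.
classify_status() {
  case "$1" in
    3??) echo "redirect" ;;
    2??) echo "ok" ;;
    *)   echo "other" ;;
  esac
}

# Print the status code and redirect target for a URL.
# The body is discarded and redirects are not followed.
# Needs curl and network access, so it is only defined here, not called.
probe_url() {
  curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' "$1"
}

# Example calls (placeholder host from the question):
#   probe_url 'http://www.xyzshoppingsite.com/mobiles/'
#   classify_status 301
```

If the probe reports a 3xx status, two things may be worth trying: seeding the crawl with the redirect target itself, and checking the `http.redirect.max` property in nutch-site.xml. I am assuming here that this property behaves in Nutch 2.2.1 as it does in other Nutch releases, which I have not verified.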

Please help. Thanks in advance.

Thanks and Regards!


Answer 1:


Instead of executing all those steps, can you try the command below?

./bin/crawl url/seed.txt shoppingcrawl http://localhost:8080/solr 2

If it executes successfully, a table named shoppingcrawl_webpage will be created in HBase.

We can check by executing the command below in the HBase shell:

hbase> list

Then we can scan the specific table. In this case:

 hbase> scan 'shoppingcrawl_webpage'
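A full scan prints every cell, including raw page content, which is hard to read. As a hedged sketch, the check can be narrowed to a row count plus a small sample of one column. The assumptions here are that `hbase` is on the PATH, that the HBase shell accepts commands on standard input, and that `f:st` (fetch status) is the column name from the default Nutch 2.x HBase mapping. `table_name` and `verify_commands` are hypothetical helpers.

```shell
# Build the HBase table name Nutch uses for a given crawl ID.
# With no crawl ID the table is plain "webpage", as in the question.
table_name() {
  if [ -n "$1" ]; then
    echo "${1}_webpage"
  else
    echo "webpage"
  fi
}

# Emit HBase shell commands that count rows and show a small sample.
# Assumption: 'f:st' is the fetch-status column in the default mapping.
verify_commands() {
  table="$(table_name "$1")"
  echo "count '${table}'"
  echo "scan '${table}', {COLUMNS => ['f:st'], LIMIT => 5}"
}

# Pipe the commands into the HBase shell only when it is installed;
# otherwise just print them so they can be pasted in by hand.
if command -v hbase >/dev/null 2>&1; then
  verify_commands shoppingcrawl | hbase shell
else
  verify_commands shoppingcrawl
fi
```

A row count that stays at one or two after several rounds would suggest that only the seed (or its redirect) was stored, not the pages behind it.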


Source: https://stackoverflow.com/questions/23564206/unable-to-verify-crawled-data-stored-in-hbase
