Crawling websites which ask for authentication

Submitted by 允我心安 on 2019-12-10 12:18:54

Question


I followed https://wiki.apache.org/nutch/HttpAuthenticationSchemes to crawl a few websites by providing a username and password.

What I tried:

i) I set the auth configuration in the httpclient-auth.xml file:

<auth-configuration>
  <credentials username="xyz" password="xyz">
    <default realm="domain" />
    <authscope host="www.gmail.com" port="80"/>
  </credentials>
</auth-configuration>

ii) I enabled the protocol-httpclient plugin through the plugin.includes property in both nutch-site.xml and nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

iii) I also pointed nutch-site.xml at the auth configuration file:

<property>
  <name>http.auth.file</name>
  <value>httpclient-auth.xml</value>
  <description>Authentication configuration file for 'protocol-httpclient' plugin.
  </description>
</property>
I am not able to crawl the site, and I get no error.

Requirements: I want to crawl websites such as gmail.com or yahoomail.com, or any other site that asks for authentication.

Where am I going wrong? Am I choosing the wrong websites to crawl?

(If yes, please suggest websites that ask for authentication and I will register for them.)

(If no, how can I crawl my Gmail or Facebook accounts?)


Answer 1:


A few points that should help you resolve this issue:

1) Yes, you have chosen the wrong websites to crawl and index. Try different websites.

2) Nutch only supports NTLM, Basic, or Digest authentication. It does not support form-based authentication, and the sites you are trying to crawl use form-based authentication.

3) To implement form-based authentication, you will have to customize the Nutch code.
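
To make the difference in point 2 concrete, here is a minimal Python sketch (not Nutch code). It assumes the usual behaviour of the two mechanisms: HTTP-level schemes such as Basic are negotiated through a 401 response carrying a WWW-Authenticate header and answered with an Authorization header, whereas a form login is an ordinary POST of form fields, typically followed by a session cookie. The function names and the form field names are hypothetical and chosen only for illustration.

```python
import base64
from urllib.parse import urlencode


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value used by HTTP Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def form_login_body(fields: dict) -> str:
    """Encode login form fields as an application/x-www-form-urlencoded body.

    The field names are site-specific; "username"/"password" below are
    placeholders, not what any particular site expects.
    """
    return urlencode(fields)


def uses_http_auth(status: int, headers: dict) -> bool:
    """Return True if a response looks like an HTTP-level auth challenge.

    A challenge is a 401 status with a WWW-Authenticate header. Sites with
    a login form usually answer with a normal page or a redirect instead.
    """
    names = {name.lower() for name in headers}
    return status == 401 and "www-authenticate" in names


if __name__ == "__main__":
    # Credentials sent as a request header (Basic scheme).
    print(basic_auth_header("xyz", "xyz"))
    # Credentials sent as a POST body (form login).
    print(form_login_body({"username": "xyz", "password": "xyz"}))
    # A 401 challenge versus an ordinary HTML login page.
    print(uses_http_auth(401, {"WWW-Authenticate": 'Basic realm="domain"'}))
    print(uses_http_auth(200, {"Content-Type": "text/html"}))
```

If a site never sends a 401 challenge for the protected page, the credentials in httpclient-auth.xml are most likely never used, which would be consistent with a crawl that fails without reporting an error.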

The following two links should help you make progress on this issue:

http://technical-fundas.blogspot.in/2014/05/nutch-solr-formed-based-authentication.html

http://technical-fundas.blogspot.in/2014/06/how-to-configure-nutch-in-eclipse-for.html



Source: https://stackoverflow.com/questions/25183951/crawling-websites-which-ask-for-authentication
