Nutch: Authentication via putting a cookie in the header

Posted by 北城以北 on 2019-12-07 21:24:47

Question


I am surprised that there is so little support or information out there for getting Nutch to be able to crawl parts of a website that require authentication.

I am aware that Apache Nutch may not currently support HTTP POST authentication (although support is apparently planned).

However, all we really want to do is add a cookie to our Nutch bot's request headers so that it can access those parts of the site directly (rather than posting a username and password to a form and then receiving the cookie).

I have spent a good amount of time searching and am surprised that most discussions of this date all the way back to 2005 or 2008: here, there, everywhere.

After all these years, is there any way to work around this limitation, or is there still no way to authenticate by giving Nutch a 'prebaked' cookie so it can access the members-only parts of our site?


Answer 1:


I added custom code to the Nutch protocol-httpclient plugin to solve this issue.

I have shared the changes at the link below:

http://www.gingercart.com/Home/search-and-crawl/nutch-custom-authentication-cookies-session-management-to-crawl-secure-enterprise-websites
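
For readers unfamiliar with what a 'prebaked' cookie means at the HTTP level, here is a minimal sketch in Python. It is not Nutch code and it is not the patch linked above; it only illustrates the general idea of attaching an existing session cookie as a `Cookie` request header instead of logging in through a form. The function name `build_request_with_cookie`, the URL, the cookie name `JSESSIONID`, its value, and the agent string `MyNutchBot` are all hypothetical placeholders.

```python
import urllib.request


def build_request_with_cookie(url, cookie_name, cookie_value, agent="MyNutchBot"):
    """Build a request that carries an already-issued session cookie.

    All names and values here are illustrative assumptions, not Nutch settings.
    """
    request = urllib.request.Request(url)
    # Identify the crawler, as a bot normally would.
    request.add_header("User-Agent", agent)
    # Send the cookie directly in the request header; no login form is posted.
    request.add_header("Cookie", f"{cookie_name}={cookie_value}")
    return request


# Hypothetical members-only URL and session cookie copied from a logged-in browser.
request = build_request_with_cookie(
    "https://example.com/members/", "JSESSIONID", "abc123"
)
print(request.get_header("Cookie"))  # prints "JSESSIONID=abc123"
```

The request is only constructed here, not sent. A crawler plugin would need to do the equivalent for every fetch, and the cookie would stop working once the server expires the session.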



Source: https://stackoverflow.com/questions/17581298/nutch-authentication-via-putting-a-cookie-in-the-header
