python-scrapy: how to fetch a URL (not via following links) inside a spider?

Submitted by 魔方 西西 on 2019-12-31 07:07:20

Question


How can I fetch a URL from inside my spider and extract something from the page via HtmlXPathSelector? The URL is one I want to supply as a string inside the code, not a link to follow.

I tried something like this:

req = urllib2.Request('http://www.example.com/' + some_string + '/')
req.add_header('User-Agent', 'Mozilla/5.0')
response = urllib2.urlopen(req)
hxs = HtmlXPathSelector(response)

but at the moment it throws an exception:

[Failure instance: Traceback: <type 'exceptions.AttributeError'>: addinfourl instance has no attribute 'encoding'

Answer 1:


You will need to construct a scrapy.http.HtmlResponse object with body=urllib2.urlopen(req).read(), because HtmlXPathSelector expects a Scrapy response rather than the raw object returned by urlopen. But why exactly do you need to use urllib2 instead of returning the request with a callback?




Answer 2:


Scrapy does not make it obvious how to write unit tests. I don't recommend using Scrapy to crawl data if you want to unit-test each spider.



Source: https://stackoverflow.com/questions/4640804/python-scrapy-how-to-fetch-an-url-not-via-following-links-inside-a-spider
