Scrapy item loader to get an absolute URL from an extracted URL

Submitted by 放荡痞女 on 2019-12-23 05:26:18

Question


I am learning Scrapy, a Python framework, to scrape a few web pages I am interested in. Along the way I extract the links in a page, but in most cases those links are relative. I used urljoin_rfc, which lives in scrapy.utils.url, to get the absolute URL. It worked fine.

While learning I came across a feature called Item Loader, and now I want to do the same thing with an Item Loader. My urljoin_rfc() call is inside a user-defined function, _urljoin(url, response), and I want my loader to use that function. So in my loader class I wrote link_in = _urljoin(), and I changed the declaration to _urljoin(url, response = loader_context.response). But I get an error saying NameError: name 'loader_context' is not defined

I need help here. I am doing it this way because _urljoin() is called not only while loading but also from other parts of my code. If I am going about this badly, please let me know.


Answer 1:


If you're using _urljoin(url, response) elsewhere, you can keep it as it is, accepting a response as its second argument.

Now, processors for Item Loaders can accept a context through a parameter named loader_context, but that context is a dict of arbitrary key/values which is shared among all input and output processors (from the docs), so values are looked up by key rather than by attribute.

So you could write a wrapper function that calls your _urljoin(url, response):

def urljoin_w_context(url, loader_context):
    response = loader_context.get('response')
    return _urljoin(url, response)

and in your ItemLoader definition:

    ...
    link_in = MapCompose(urljoin_w_context)
    ...

and finally in your callback code, when you instantiate your ItemLoader, pass the response reference:

def parse_something(self, response):
    ...
    loader = ItemLoader(item, response=response)
    ...
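As background on the original NameError: a default value such as response = loader_context.response is evaluated once, when the def statement runs, and at that point no name loader_context exists. Declaring loader_context as a parameter instead lets the loader supply it at call time. The sketch below imitates that idea with the standard library only; bind_context and map_values are hypothetical helpers written for illustration, not Scrapy's actual code, and a SimpleNamespace stands in for the response.

```python
import inspect
from functools import partial
from types import SimpleNamespace
from urllib.parse import urljoin


def bind_context(func, context):
    # Hypothetical helper: pre-fill `loader_context` only for functions
    # that declare a parameter with that exact name.
    if 'loader_context' in inspect.signature(func).parameters:
        return partial(func, loader_context=context)
    return func


def map_values(func, values, context):
    # Apply one processor function to every extracted value.
    bound = bind_context(func, context)
    return [bound(value) for value in values]


def urljoin_w_context(url, loader_context):
    # The context is a dict, so the response is looked up by key.
    response = loader_context.get('response')
    return urljoin(response.url, url)


def strip_value(value):
    # A processor without `loader_context` is called with the value alone.
    return value.strip()


# Stand-in for a response; only the .url attribute matters here.
page = SimpleNamespace(url='http://example.com/docs/index.html')
context = {'response': page}

cleaned = map_values(strip_value, [' a.html ', '/b.html'], context)
links = map_values(urljoin_w_context, cleaned, context)
print(links)
```

The point of the sketch is that the wrapper never refers to a global loader_context; it only names the parameter, and whoever calls the processor decides what dict to pass in.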


Source: https://stackoverflow.com/questions/19970015/scrapy-item-loader-to-get-a-absolute-url-from-extracted-url
