Scrapy not downloading images and getting pipeline error

Submitted by 雨燕双飞 on 2019-12-24 07:26:44

Question


I have this code

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

and this is the spider, subclassed from BaseSpider. This spider is giving me a nightmare.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//strong[@class="genmed"]')
    items = []

    for site in sites[:5]:
        item = PanduItem()
        item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
        item['number_posts'] = site.select('dl/dd/h2/em').select("string()").extract()
        item['profile_link'] = site.select('a/@href').extract()

        request = Request("http://www.example/profile.php?mode=viewprofile&u=5",
                          callback=self.parseUserProfile)
        request.meta['item'] = item
        return request

def parseUserProfile(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@id="current"]')
    myurl = sites[0].select('img/@src').extract()

    item = response.meta['item']

    image_absolute_url = urljoin(response.url, myurl[0].strip())
    item['image_urls'] = [image_absolute_url]

    return item

This is the error I am getting. I am not able to find the cause. It looks like the pipeline is receiving the item, but I am not sure.

ERROR

File "/app_crawler/crawler/pipelines.py", line 9, in get_media_requests
            for image_url in item['image_urls']:
        exceptions.TypeError: 'NoneType' object has no attribute '__getitem__'

Answer 1:


You are missing a method in your pipelines.py. That file should contain three methods:

  • process_item
  • get_media_requests
  • item_completed

The item_completed method is the one that handles saving the images to a specified path. That path is set in settings.py as below:

ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/your/path/here'

Also included in settings.py, as seen above, is the line that enables the images pipeline.
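
For reference, here is a minimal sketch of what such a pipeline could look like. It assumes the legacy scrapy.contrib module paths used elsewhere in this thread, and that the item declares both an 'image_urls' and an 'images' field; it is an illustration, not the asker's exact code.

# A sketch only, based on the legacy Scrapy 0.x API referenced in this thread.
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL found on the item.
        for image_url in item.get('image_urls', []):
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # 'results' holds (success, info) tuples for each requested image;
        # keep the storage paths of the ones that downloaded successfully.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no downloaded images")
        item['images'] = image_paths  # assumes the item declares an 'images' field
        return item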

I've tried to explain this as best I understand it. For further reference, have a look at the official Scrapy documentation.




Answer 2:


Hmmm. At no point are you appending item to items (although the example code in the documentation doesn't do an append either, so I could be barking up the wrong tree).

Try adding it to parse(self, response) like so and see if this resolves the issue:

for site in sites:
    item = PanduItem()
    item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
    item['number_posts'] = site.select('dl/dd/h2/em').select("string()").extract()
    item['profile_link'] = site.select('a/@href').extract()

    items.append(item)
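
Spelled out in full, here is a sketch of the parse callback with the append in place. It keeps the question's legacy HtmlXPathSelector API; the per-profile Request step from the question is left out for brevity.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//strong[@class="genmed"]')
    items = []

    for site in sites:
        item = PanduItem()
        item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
        item['number_posts'] = site.select('dl/dd/h2/em').select("string()").extract()
        item['profile_link'] = site.select('a/@href').extract()
        items.append(item)

    # Return the collected items once the loop has finished.
    return items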



Answer 3:


Set the IMAGES_STORE setting to a valid directory that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.

For example:

IMAGES_STORE = '/path/to/valid/dir'
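
Putting the two settings together, a sketch of the relevant lines in settings.py. The pipeline path crawler.pipelines.MyImagesPipeline is an assumption based on the traceback path above; adjust it to your own project.

# settings.py -- legacy Scrapy 0.x list syntax, as used earlier in this thread
ITEM_PIPELINES = ['crawler.pipelines.MyImagesPipeline']
IMAGES_STORE = '/path/to/valid/dir'  # must be an existing, writable directory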


Source: https://stackoverflow.com/questions/13880765/scrapy-not-downloading-images-and-getting-pipeline-error
