Scrapy Linkextractor duplicating(?)

Submitted by 半世苍凉 on 2019-11-27 15:49:39

First, to set the settings, do it in the settings.py file, or specify the custom_settings attribute on the spider, like:

custom_settings = {
    'DEPTH_LIMIT': 3,
}
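
If you go the custom_settings route, it is a class attribute on the spider itself, roughly like this (a minimal sketch; the spider name, domain and start URL are placeholders):

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'                    # placeholder name
    allowed_domains = ['example.com']    # placeholder domain
    start_urls = ['http://example.com']  # placeholder start URL

    # per-spider overrides take precedence over settings.py
    custom_settings = {
        'DEPTH_LIMIT': 3,
    }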

Then, you have to make sure the spider actually reaches the parse_item method (which I suspect it doesn't, though I haven't tested it). Also, be careful about mixing the callback and follow parameters on the same rule: once a callback is set, follow defaults to False, so a single rule won't both yield items and keep following links unless you make that explicit.

Either remove follow from your rule, or add a second rule, so it's clear which links should only be followed and which should be parsed into items.
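
Something along these lines, for instance (just a sketch; the allow patterns are placeholders you'd adapt to the site's URL structure):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    # ... name, allowed_domains, start_urls as above ...

    rules = (
        # only follow listing/section pages, don't parse them as items
        Rule(LinkExtractor(allow=r'/list/'), follow=True),
        # parse article pages as items; follow defaults to False when a callback is set
        Rule(LinkExtractor(allow=r'/news/'), callback='parse_item'),
    )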

Second, in your parse_item method the xpath expressions look wrong. To get all the images, maybe you could use something like:

images = hxs.xpath('//img')

and then, to get each image URL:

allimages['image'] = image.xpath('./@src').extract()

For the news, it looks like this could work:

allnews['news_title']=new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/@href').extract()
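
Putting the pieces together, parse_item could end up roughly like this (a sketch only: it uses response.xpath rather than the old hxs selector, and the //div[@class="news"] container is a placeholder for whatever wraps each news entry on your pages):

def parse_item(self, response):
    # one dict per image on the page
    for image in response.xpath('//img'):
        allimages = {}
        allimages['image'] = image.xpath('./@src').extract()
        yield allimages

    # one dict per news block; the container xpath is a placeholder
    for new in response.xpath('//div[@class="news"]'):
        allnews = {}
        allnews['news_title'] = new.xpath('.//a/text()').extract()
        allnews['news_url'] = new.xpath('.//a/@href').extract()
        yield allnews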

Now, as I understand your problem, this isn't a LinkExtractor duplication bug, just poorly specified rules. Also make sure your xpath expressions are valid, since your question didn't suggest they needed correcting.
