Scrapy: constructing non-duplicative list of absolute paths from relative paths

余生颓废 submitted on 2019-12-31 04:07:13

Question


Question: how do I use Scrapy to build a non-duplicative list of absolute paths from the relative paths found in img src attributes?

Background: I am trying to use Scrapy to crawl a site, pull any links from img src attributes, convert the relative paths to absolute paths, and then produce the absolute paths as CSV or a list (a short urljoin sketch after the list below illustrates the conversion I mean). I plan on combining this with actually downloading the files with Scrapy while concurrently crawling for links, but I'll cross that bridge when I get to it. For reference, here are some other details about the hypothetical target site:

  • The relative paths look like img src="/images/file1.jpg", where images is a directory (www.example.com/products/images) that cannot be directly crawled for file paths.
  • The relative paths for these images do not follow any logical naming convention (i.e., they are not simply file1.jpg, file2.jpg, file3.jpg).
  • The image types differ across files, with PNG and JPG being the most common.
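
For clarity, the relative-to-absolute conversion I mean is plain URL joining against the URL of the page the src was scraped from; Scrapy's response.urljoin(link) is a shortcut for urllib.parse.urljoin(response.url, link). A minimal sketch with made-up URLs:

from urllib.parse import urljoin

# Hypothetical page URL and relative src values pulled from it.
page_url = 'https://www.example.com/products/page1.html'
print(urljoin(page_url, '/images/file1.jpg'))  # https://www.example.com/images/file1.jpg
print(urljoin(page_url, 'images/file2.png'))   # https://www.example.com/products/images/file2.png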

Problems experienced: Even after thoroughly reading the Scrapy documentation and going through a ton of fairly dated Stack Overflow questions [e.g., this question], I can't seem to get the precise output I want. I can pull the relative paths and rebuild them as absolute paths, but the output is off. Here are the issues I've noticed with my current code:

  • In the CSV output, there are both populated rows and blank rows. My best guess is that each row represents the results of scraping a particular page for relative paths, which would mean a blank row is a negative result.

  • Each non-blank row in the CSV contains a comma-delimited list of URLs, whereas I would simply like one non-duplicative URL per row (see the illustrative output after this list). The comma-delimited rows seem to confirm my suspicion about what is going on under the hood.
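
For illustration, output.csv currently looks roughly like this (the URLs are made up):

url
https://www.example.com/images/file1.jpg,https://www.example.com/images/file2.png,https://www.example.com/images/file1.jpg

https://www.example.com/images/file3.png,https://www.example.com/images/file1.jpg

What I want instead is one unique absolute URL per row.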

Current code: I run the spider from the command line with scrapy crawl relpathfinder -o output.csv -t csv.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # scrapy.contrib is deprecated; this is the current import path
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def url_join(self, response):
        item = MyItem()
        item['url'] = []
        # Pull every img src on the page and resolve it against the page URL.
        relative_url = response.xpath('//img/@src').extract()
        for link in relative_url:
            item['url'].append(response.urljoin(link))
        yield item  # one item per page; its 'url' field is a list, which the CSV exporter joins with commas

Thank you!


Answer 1:


What about:

def url_join(self, response):
    item = MyItem()
    relative_url = response.xpath('//img/@src').extract()
    for link in relative_url:
        # yield is inside the loop, so each image URL becomes its own item
        # and therefore its own row in the CSV.
        item['url'] = response.urljoin(link)
        yield item



Answer 2:


I would use an Item Pipeline to deal with duplicate items:

# file: yourproject/pipelines.py
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.url_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.url_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.url_seen.add(item['url'])
            return item

Then add this pipeline to your settings.py:

# file: yourproject/settings.py
ITEM_PIPELINES = {
    'yourproject.pipelines.DuplicatesPipeline': 300,
}

Then you just need to run your spider with scrapy crawl relpathfinder -o items.csv and the pipeline will drop duplicate items for you, so you will not see any duplicates in your CSV output.
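
If you want to sanity-check the pipeline outside of a full crawl, here is a quick sketch (assuming the DuplicatesPipeline from pipelines.py above is importable; the spider argument is unused here, so None stands in for it):

from scrapy.exceptions import DropItem
from yourproject.pipelines import DuplicatesPipeline

pipeline = DuplicatesPipeline()
pipeline.process_item({'url': 'https://www.example.com/images/file1.jpg'}, None)      # returned unchanged
try:
    pipeline.process_item({'url': 'https://www.example.com/images/file1.jpg'}, None)  # same URL again
except DropItem as exc:
    print(exc)  # Duplicate item found: {'url': 'https://www.example.com/images/file1.jpg'}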



Source: https://stackoverflow.com/questions/48051158/scrapy-constructing-non-duplicative-list-of-absolute-paths-from-relative-paths
