Scraping links with Scrapy

Posted by 好久不见 on 2019-12-14 03:04:23

Question


I am trying to scrape a Swedish real estate website, www.booli.se. However, I can't figure out how to follow the links for each house and extract, for example, price, rooms, age, etc. I only know how to scrape one page and can't seem to wrap my head around this. I am looking to do something like:

for link in website:
    follow link
    attribute1 = item.css('cssobject::text').extract()[1]
    attribute2 = item.css('cssobject::text').extract()[2]
    yield{'Attribute 1': attribute1, 'Attribute 2': attribute2}

So that I can scrape the data and output it to an Excel file. My code for scraping a single page, without following links, is as follows:

import scrapy

class BooliSpider(scrapy.Spider):
    name = "boolidata"
    start_urls = [
        'https://www.booli.se/slutpriser/lund/116978/'
    ]
    '''def parse(self, response):
        for link in response.css('.nav-list a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link),
                                 callback=self.collect_data)'''

    def parse(self, response):
        for item in response.css('li.search-list__item'):
            # The numeric indices pick out particular
            # span.search-list__row fields within each result row.
            size = item.css('span.search-list__row::text').extract()[1]
            price = item.css('span.search-list__row::text').extract()[3]
            m2price = item.css('span.search-list__row::text').extract()[4]

            yield {'Size': size, 'Price': price, 'M2price': m2price}
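
For reference, the commented-out parse above is already close to the manual pattern for this: yield one Request per extracted link, and do the field extraction in the callback. A minimal sketch of that idea (the spider name, the link selector, and the field selectors are all hypothetical and would need to be checked against the live page):

import scrapy


class BooliLinkSpider(scrapy.Spider):
    name = "boolilinks"  # hypothetical name
    start_urls = ['https://www.booli.se/slutpriser/lund/116978/']

    def parse(self, response):
        # Follow every listing link on the results page
        # ('a.search-list__link' is a hypothetical selector).
        for href in response.css('a.search-list__link::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_listing)

    def parse_listing(self, response):
        # Extract fields from the listing's detail page (hypothetical selectors).
        yield {
            'Price': response.css('span.price::text').extract_first(),
            'Rooms': response.css('span.rooms::text').extract_first(),
        }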

Thankful for any help. I'm really having trouble putting it all together and outputting the contents of each link to a cohesive output file (Excel).


Answer 1:


You could use Scrapy's CrawlSpider for following and scraping links.

Your code should look like this:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooliItem(scrapy.Item):
    size = scrapy.Field()
    price = scrapy.Field()
    m2price = scrapy.Field()


class BooliSpider(CrawlSpider):
    name = "boolidata"
    start_urls = [
        'https://www.booli.se/slutpriser/lund/116978/',
    ]

    rules = [
        Rule(
            LinkExtractor(
                allow=(r'listing url pattern here to follow'),
                deny=(r'other url patterns to deny'),
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        item = BooliItem()
        item['size'] = response.css('size selector').extract()
        item['price'] = response.css('price selector').extract()
        item['m2price'] = response.css('m2price selector').extract()

        return item
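
The allow/deny patterns and the CSS selectors above are placeholders; the real values have to be read from booli.se's page markup. Purely as an illustration of how a filled-in version fits together (every URL pattern and selector below is hypothetical), the spider might look like:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooliExampleSpider(CrawlSpider):
    name = "booliexample"  # hypothetical name
    start_urls = ['https://www.booli.se/slutpriser/lund/116978/']

    rules = [
        Rule(
            LinkExtractor(
                allow=(r'/annons/\d+',),  # hypothetical: listing detail pages
                deny=(r'/karta/',),       # hypothetical: pages to skip
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        # All selectors here are hypothetical and must be verified
        # against the live page before use.
        yield {
            'Size': response.css('span.property__size::text').extract_first(),
            'Price': response.css('span.property__price::text').extract_first(),
            'M2price': response.css('span.property__m2price::text').extract_first(),
        }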

And you can run your code via:

scrapy crawl boolidata -o booli.csv

and import the resulting CSV into Excel.
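
If you want an actual .xlsx file rather than opening the CSV in Excel by hand, a short post-processing step also works. A minimal sketch, assuming pandas (with an Excel writer such as openpyxl) is installed:

import pandas as pd

# Read the scraped CSV and write it back out as an Excel workbook.
pd.read_csv('booli.csv').to_excel('booli.xlsx', index=False)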



Source: https://stackoverflow.com/questions/49694649/scraping-links-with-scrapy
