Question
I am trying to scrape a Swedish real estate website, www.booli.se. However, I can't figure out how to follow the links for each house and extract, for example, price, rooms, age, etc. I only know how to scrape one page and can't seem to wrap my head around this. I am looking to do something like:
for link in website:
    follow link
    attribute1 = item.css('cssobject::text').extract()[1]
    attribute2 = item.css('cssobject::text').extract()[2]
    yield {'Attribute 1': attribute1, 'Attribute 2': attribute2}
so that I can scrape the data and output it to an Excel file. My code for scraping a single page, without following links, is as follows:
import scrapy

class BooliSpider(scrapy.Spider):
    name = "boolidata"
    start_urls = [
        'https://www.booli.se/slutpriser/lund/116978/'
    ]

    '''def parse(self, response):
        for link in response.css('.nav-list a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link),
                                 callback=self.collect_data)'''

    def parse(self, response):
        for item in response.css('li.search-list__item'):
            size = item.css('span.search-list__row::text').extract()[1]
            price = item.css('span.search-list__row::text').extract()[3]
            m2price = item.css('span.search-list__row::text').extract()[4]
            yield {'Size': size, 'Price': price, 'M2price': m2price}
Thankful for any help. I'm really having trouble putting it all together and outputting the contents of each followed link to a cohesive output file (Excel).
Answer 1:
You could use Scrapy's CrawlSpider to follow and scrape links.
Your code should look like this:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule  # note: scrapy.spiders, not scrapy.spider

class BooliItem(scrapy.Item):
    size = scrapy.Field()
    price = scrapy.Field()
    m2price = scrapy.Field()

class BooliSpider(CrawlSpider):
    name = "boolidata"
    start_urls = [
        'https://www.booli.se/slutpriser/lund/116978/',
    ]
    rules = [
        Rule(
            LinkExtractor(
                allow=(r'listing url pattern here to follow'),
                deny=(r'other url patterns to deny'),
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        item = BooliItem()
        item['size'] = response.css('size selector').extract()
        item['price'] = response.css('price selector').extract()
        item['m2price'] = response.css('m2price selector').extract()
        return item
And you can run your code via:
scrapy crawl boolidata -o booli.csv
(the name after "crawl" must match the spider's name attribute) and then import booli.csv into Excel.
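One practical snag with importing the CSV: booli.se is a Swedish site, so scraped values may carry formatting such as spaces as thousand separators or unit suffixes ("2 500 000 kr", "85 m²"), which Excel won't read as numbers. A small post-processing sketch, assuming hypothetical column names Size/Price/M2price, can strip that before import:

```python
import csv

def clean_number(text):
    """Keep only ASCII digits, e.g. '2 500 000 kr' -> 2500000.
    Returns None when nothing numeric remains."""
    digits = "".join(ch for ch in text if ch in "0123456789")
    return int(digits) if digits else None

def clean_csv(src, dst, columns=("Size", "Price", "M2price")):
    """Rewrite src to dst with the given columns converted to plain integers."""
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in columns:
                if row.get(col):
                    row[col] = clean_number(row[col])
            writer.writerow(row)
```

Run it after the crawl (e.g. clean_csv("booli.csv", "booli_clean.csv")) and open the cleaned file in Excel.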
Source: https://stackoverflow.com/questions/49694649/scraping-links-with-scrapy