Question
Here is the spider
import scrapy
import re

from ..items import HomedepotSpiderItem


class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/ZLINE-Kitchen-and-Bath/N-5yc1vZhsy/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&storeSelection=3304,3313,3311,3310,8560&experienceName=default']

    def parse(self, response):
        items = HomedepotSpiderItem()

        # get the model names
        productName = response.css('.pod-plp__description.js-podclick-analytics').css('::text').getall()
        productName = [x.strip(' ') for x in productName if len(x.strip())]
        productName = [x.strip('\n') for x in productName if len(x.strip())]
        productName = [x.strip('\t') for x in productName if len(x.strip())]
        productName = [x.strip(',') for x in productName if len(x.strip())]
        # productName = productName[0].split(',')  # tried to split the list into individual elements

        productSKU = response.css('.pod-plp__model::text').getall()

        # get rid of all the stuff I don't need
        productSKU = [x.strip(' ') for x in productSKU]         # whitespace
        productSKU = [x.strip('\n') for x in productSKU]
        productSKU = [x.strip('\t') for x in productSKU]
        productSKU = [x.strip(' Model# ') for x in productSKU]  # gets rid of the "Model#" label
        productSKU = [x.strip('\xa0') for x in productSKU]      # gets rid of non-breaking spaces

        # get the price
        productPrice = response.css('.price__numbers::text').getall()

        # get rid of all the stuff I don't need
        productPrice = [x.strip(' ') for x in productPrice if len(x.strip())]
        productPrice = [x.strip('\n') for x in productPrice if len(x.strip())]
        productPrice = [x.strip('\t') for x in productPrice if len(x.strip())]
        productPrice = [x.strip('$') for x in productPrice if len(x.strip())]

        # all prices are printing out twice, so take every other price
        productPrice = productPrice[::2]

        items['productName'] = productName
        items['productSKU'] = productSKU
        items['productPrice'] = productPrice

        yield items
items.py
import scrapy


class HomedepotSpiderItem(scrapy.Item):
    # create items
    productName = scrapy.Field()
    productSKU = scrapy.Field()
    productPrice = scrapy.Field()
    # productNumRating = scrapy.Field()
My Issue
I'm practicing with Scrapy right now and extracted all of this data from Home Depot's website using CSS selectors. After extracting it, I manually stripped out everything I didn't need, and it looked fine in the terminal. However, after exporting everything to Excel, all of my extracted data ends up in a single cell per column, e.g. every product name goes into one cell. I looked into the Scrapy documentation and saw that .getall() returns everything as a list, so I tried splitting the list into individual elements, thinking that would fix it, but that got rid of the data I had scraped.
Any help would be appreciated and let me know if there is any clarification that is needed!
Edit: I'm exporting to Excel using:

scrapy crawl homeDepotCrawl -o test.csv -t csv
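
For reference, this is roughly the shape of the single item that parse() yields; the values below are made-up placeholders, not real scraped data. Since every field holds a whole list, and the CSV feed exporter writes one row per yielded item, each list gets packed into one cell:

# made-up placeholder values, just to show the structure of the one item being yielded
single_item = {
    'productName': ['ZLINE Range Hood A', 'ZLINE Range Hood B', 'ZLINE Range Hood C'],
    'productSKU': ['KB-30', 'KB-36', 'KB-48'],
    'productPrice': ['299.95', '349.95', '499.95'],
}
# one yielded item -> one CSV row, so each list above lands in a single cell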
Answer 1:
The problem is that you are loading all of the products into one scrapy.Item instance. See the code comments below for more details.
It is also worth noting that you can use item loaders [1] or an item pipeline [2] to clean the fields instead of repeating so much code. Once you deal with one item at a time you won't need all of those list comprehensions; even a simple cleanup function that every value runs through would be better. A sketch of the item-loader approach follows the links below.
[1] https://docs.scrapy.org/en/latest/topics/loaders.html
[2] https://docs.scrapy.org/en/latest/topics/item-pipeline.html
[3] https://docs.scrapy.org/en/latest/topics/items.html
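
As an illustration of [1], here is a minimal item-loader sketch. It is not tested against the live page: the selectors and the clean_text() rules are lifted from the question's code, so treat them as assumptions about the markup, and the processors may need per-field tuning (for example, joining the split-up price pieces).

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst  # itemloaders.processors on newer Scrapy

from ..items import HomedepotSpiderItem


def clean_text(value):
    # one place for all the stripping: non-breaking spaces, the "Model#" label,
    # surrounding whitespace, commas and "$" signs
    return value.replace('\xa0', ' ').replace('Model#', '').strip(' \n\t,$')


class HomedepotLoader(ItemLoader):
    default_item_class = HomedepotSpiderItem
    default_input_processor = MapCompose(clean_text)
    default_output_processor = TakeFirst()


# inside the spider:
def parse(self, response):
    for product in response.css('.plp-pod'):
        loader = HomedepotLoader(selector=product)
        loader.add_css('productName', '.pod-plp__description.js-podclick-analytics ::text')
        loader.add_css('productSKU', '.pod-plp__model::text')
        loader.add_css('productPrice', '.price__numbers::text')
        yield loader.load_item()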
import scrapy
import re

from ..items import HomedepotSpiderItem


class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/ZLINE-Kitchen-and-Bath/N-5yc1vZhsy/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&storeSelection=3304,3313,3311,3310,8560&experienceName=default']

    def parse(self, response):
        '''
        Notice that when we set the items variable we are not using .get or .extract yet.
        We collect the top level of each item into a list of selectors,
        then loop through those selectors, creating a new scrapy.Item instance for each
        selector/item on the page.
        The "for product in items" loop steps through each item selector individually.
        You can then chain .css onto the product variable to access each section of each
        item individually and export them separately.
        This gives you a new row for each item.
        '''
        items = response.css('.plp-pod')
        for product in items:
            # Create a new scrapy.Item for each product in our selector list.
            item = HomedepotSpiderItem()
            item['productName'] = product.css('.pod-plp__description.js-podclick-analytics::text').get()
            # Notice we are yielding item inside of the loop.
            yield item
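
To also fill in productSKU and productPrice, the same per-product idea extends naturally: pass each extracted string through one small cleanup helper instead of repeating the list comprehensions. Below is a sketch of a drop-in parse() for the spider above; the SKU and price selectors are copied from the question, and whether they still match the live page is an assumption.

def clean(value):
    # normalize one scraped string: drop non-breaking spaces, the "Model#" label,
    # surrounding whitespace, commas and "$" signs; missing values stay None
    if value is None:
        return None
    return value.replace('\xa0', ' ').replace('Model#', '').strip(' \n\t,$')

# inside HomedepotcrawlSpider:
def parse(self, response):
    for product in response.css('.plp-pod'):
        item = HomedepotSpiderItem()
        # join all text nodes of the description so the full product name survives
        name_parts = product.css('.pod-plp__description.js-podclick-analytics ::text').getall()
        item['productName'] = clean(' '.join(t.strip() for t in name_parts if t.strip()))
        item['productSKU'] = clean(' '.join(product.css('.pod-plp__model::text').getall()))
        # .get() takes only the first price node, which sidesteps the duplicated
        # prices the question worked around with productPrice[::2]
        item['productPrice'] = clean(product.css('.price__numbers::text').get())
        yield item

Centralizing the stripping in clean() keeps the field logic readable, and it is easy to move that function into an item pipeline [2] later if the cleanup grows.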
Source: https://stackoverflow.com/questions/60122716/scrapy-using-css-to-extract-data-and-excel-export-everything-into-one-cell