Question
My Spider:
import scrapy


class LinkSpider(scrapy.Spider):
    name = "page"
    start_urls = [
        'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1'
    ]

    def parse(self, response):
        yield {
            'ItemSKU': response.xpath('//span[@class="sn_p01_pno"]/text()').getall(),
            'title': response.xpath('//div[@class="sn_p01_desc h4 col-12 pl-0 pl-sm-3 pull-left"]/text()').getall(),
            'ItemEAN': response.xpath('//div[@class="productean"]/text()').getall(),
            'Delivery_Status': response.xpath('//div[@class="availabilitydeliverytime"]/text()').getall()
        }
How can I separate my values so that I get a CSV file whose first line is the header itemsku,title,itemean,deliverystatus, with the correct values in the rows below? Right now my CSV stores everything in one line: all the SKUs together, all the titles together, and so on.
Answer 1:
getall() grabs a list of all the values matched by the XPath selector, so it puts the values for the key 'ItemSKU', for example, into a single list: [item1, item2, item3, item4].
To get one value per row you need to use the get() method. But before that, inspect the HTML in a bit more detail: the page you're scraping keeps each product's information inside a separate 'card'. The most logical way to get the individual data you want (ItemSKU etc.) is to loop over each card and yield one dictionary per card, built with get().
Also check your XPath selectors, because productean is not on that page. It's inside each item's own link, not on the listing page you're scraping.
Code Example
import scrapy


class LinkSpider(scrapy.Spider):
    name = "link"
    allowed_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']

    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            yield {
                'title': a.xpath('.//div[@class="sn_p01_desc h4 col-12 pl-0 pl-sm-3 pull-left"]/text()').get().strip(),
                'links': a.xpath('@href').get(),
                'ItemSKU': a.xpath('.//span[@class="sn_p01_pno"]/text()').get().strip(),
                'ItemEAN': a.xpath('.//div[@class="productean"]/text()').get(),
                'Delivery_Status': a.xpath('.//div[@class="availabilitydeliverytime"]/text()').get().strip().replace('/', '')
            }
Explanation
Every item on the page is within a card. The card variable's XPath selector gives us a list of these cards, which we can loop over.
We then loop over each card. Note that we use a.xpath instead of response.xpath, because we want to select data from within each card.
Also notice that each XPath selector starts with .// instead of //. The .// prefix tells Scrapy to look only within the current card's child elements rather than the whole HTML document; remember that // searches the entire document. This is called a relative XPath, and it's important to use one for every selector when you have a list of 'cards' and want data from each individual card.
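The relative-lookup idea can be sketched without Scrapy, using only the standard library on a hypothetical miniature of the page (the markup below is invented, not the real site's HTML):

```python
import xml.etree.ElementTree as ET

# Invented miniature of the listing page: two product "cards".
html = """
<root>
  <a class="productlink"><span class="pno">SKU-1</span></a>
  <a class="productlink"><span class="pno">SKU-2</span></a>
</root>
"""

root = ET.fromstring(html)
cards = root.findall('.//a')  # one element per card, like response.xpath(...)

# Relative lookup: './/span' on a card only sees that card's descendants,
# so each extracted value stays tied to its own card.
per_card = [card.find('.//span').text for card in cards]
print(per_card)  # ['SKU-1', 'SKU-2']

# Searching from the document root instead returns every span in one flat
# list, which is how the original getall() version lost the grouping.
flat = [s.text for s in root.findall('.//span')]
print(flat)  # same values, but with no link back to individual cards
```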
Please note that we use the strip() string method to remove the leading and trailing whitespace these values otherwise carry when you run this code.
Also note that the delivery status data contains a '/', so the replace string method is used to replace it with '' and get rid of it.
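In isolation, the chained cleanup looks like this (the raw string is an invented sample of what the selector might return, not actual site data):

```python
# Invented sample of a raw Delivery_Status text node.
raw = ' /1-2 Wochen\n'
# strip() removes the outer whitespace, then replace() drops the slash.
clean = raw.strip().replace('/', '')
print(clean)  # 1-2 Wochen
```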
Answer 2:
You need to find all the tags (rows) that contain the required information, and then loop through them:
def parse(self, response):
    for a in response.css('a.productlink'):  # all rows
        yield {
            'ItemSKU': a.xpath('.//span[@class="sn_p01_pno"]/text()').get(),
            'title': a.xpath('.//div[@class="sn_p01_desc h4 col-12 pl-0 pl-sm-3 pull-left"]/text()').get(),
            'ItemEAN': a.xpath('.//div[@class="productean"]/text()').get(),
            'Delivery_Status': a.xpath('.//div[@class="availabilitydeliverytime"]/text()').get()
        }
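Either answer yields one dict per product card, and Scrapy's built-in feed export then writes one CSV row per dict (e.g. scrapy crawl page -o items.csv). What that export does can be sketched with the standard csv module; the row values here are invented placeholders:

```python
import csv
import io

# Invented placeholder rows, shaped like the dicts the spider yields.
rows = [
    {'ItemSKU': 'SKU-1', 'title': 'Zweig A', 'ItemEAN': '4001111111111', 'Delivery_Status': '1-2 Wochen'},
    {'ItemSKU': 'SKU-2', 'title': 'Zweig B', 'ItemEAN': '4002222222222', 'Delivery_Status': '1 Woche'},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['ItemSKU', 'title', 'ItemEAN', 'Delivery_Status'])
writer.writeheader()    # header line: ItemSKU,title,ItemEAN,Delivery_Status
writer.writerows(rows)  # one line per item, values aligned by key
print(buf.getvalue())
```

Because each dict describes exactly one card, every row lines up under the header, which is the behaviour the question was missing.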
Source: https://stackoverflow.com/questions/63111320/python3-scraping-all-informations-of-one-page