Scrapy Extract ld+JSON

Submitted by 吃可爱长大的小学妹 on 2021-02-08 09:48:20

Question


How do I extract the name and url of each product from the ld+json script shown under "Data to Extract"?

quotes_spiders.py

import scrapy
import json

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://www.lazada.com.my/shop-power-banks2/?price=1572-1572"]

    def parse(self, response):
        data = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
        # How do I extract the name and url from data?
        yield data

Data to Extract

<script type="application/ld+json">{"@context":"https://schema.org","@type":"ItemList","itemListElement":[{"@type":"Product","image":"http://my-live-02.slatic.net/p/2/test-product-0601-7378-08684315-8be741b9107b9ace2f2fe68d9c9fd61a-webp-catalog_233.jpg","name":"test product 0601","offers":{"@type":"Offer","availability":"https://schema.org/InStock","price":"99999.00","priceCurrency":"RM"},"url":"http://www.lazada.com.my/test-product-0601-51348680.html?ff=1"}]}</script>

Answer 1:


This line of code returns a dictionary with the data you want:

data = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())

You can then index into it like any other dictionary:

name = data['itemListElement'][0]['name']
url = data['itemListElement'][0]['url']

Because itemListElement is a list, index 0 only gives you the first product. Check that you are indexing the product you actually want, or loop over the list to collect all of them.
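To get every product rather than only the first, you can loop over itemListElement. The following is a minimal sketch, not the answerer's code: the helper name iter_products is hypothetical, and SAMPLE is the payload from the question with the image and offers fields left out for brevity.

```python
import json

# Payload from the question, trimmed to the fields used here.
SAMPLE = (
    '{"@context":"https://schema.org","@type":"ItemList",'
    '"itemListElement":[{"@type":"Product","name":"test product 0601",'
    '"url":"http://www.lazada.com.my/test-product-0601-51348680.html?ff=1"}]}'
)


def iter_products(data):
    """Yield a {'name', 'url'} dict for each Product in an ItemList dict.

    Hypothetical helper: .get() is used so that a missing key yields None
    instead of raising KeyError.
    """
    for element in data.get("itemListElement", []):
        if element.get("@type") != "Product":
            continue  # skip anything in the list that is not a Product
        yield {"name": element.get("name"), "url": element.get("url")}


products = list(iter_products(json.loads(SAMPLE)))
```

Inside the spider's parse method, the same helper could be used as `yield from iter_products(data)` after the json.loads call, so that each product becomes its own item.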




Answer 2:


A really easy solution for this would be to use https://github.com/scrapinghub/extruct. It handles all the hard parts of extracting structured data.
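As a rough sketch of what that could look like: extruct.extract takes the page HTML and returns a dictionary keyed by syntax, with the 'json-ld' entry holding a list of parsed blocks. The two helper names below are hypothetical, and the code assumes the page carries an ItemList block like the one in the question.

```python
def products_from_jsonld(blocks):
    """Collect name/url pairs from a list of parsed JSON-LD blocks.

    `blocks` is assumed to have the shape of extruct's 'json-ld' output:
    a list of dicts, one per <script type="application/ld+json"> tag.
    """
    found = []
    for block in blocks:
        if block.get("@type") != "ItemList":
            continue  # ignore unrelated JSON-LD blocks on the page
        for element in block.get("itemListElement", []):
            found.append({"name": element.get("name"), "url": element.get("url")})
    return found


def extract_products(html, page_url):
    """Run extruct on the page HTML and return the products it finds."""
    import extruct  # third-party package: pip install extruct

    data = extruct.extract(html, base_url=page_url, syntaxes=["json-ld"])
    return products_from_jsonld(data["json-ld"])
```

In a Scrapy callback this could be called as `extract_products(response.text, response.url)`. Unlike extract_first(), this approach looks at every ld+json script on the page rather than only the first one.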



Source: https://stackoverflow.com/questions/44939247/scrapy-extract-ldjson
