Scrapy parse javascript

眉间皱痕 submitted on 2021-02-18 11:22:39

Question


I have some JavaScript on the page like this:

new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",

I want to get "185310341". I have been searching on Google for a few hours but couldn't find anything; I hope you can help me. How can I scrape that JavaScript and get that id?

I tried this code:

id = sel.search('"id":(.*?),',text).group(1)
print id

but I got:

exceptions.AttributeError: 'Selector' object has no attribute 'search'

Answer 1:


Scrapy selectors have built-in support for regular expressions:

sel.xpath('<xpath_to_find_the_element_text>').re(r'"id":(\d+)')

A demo of this regular expression at work:

>>> import re
>>> s = 'new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",'
>>> re.search(r'"id":(\d+)', s).group(1)
'185310341'
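
The demo calls `.group(1)` directly on the result of `re.search`, which returns `None` when nothing matches, so a page without the expected script would raise an `AttributeError`. A minimal sketch of a more defensive helper follows. The names `PRODUCT_ID_RE` and `extract_product_id` are made up for illustration, and the pattern assumes the id directly follows `product: {` as in the question's snippet:

```python
import re

# Anchoring on `product:` avoids matching other "id" keys in the script
# (for example variant ids). This layout is an assumption taken from the
# snippet in the question.
PRODUCT_ID_RE = re.compile(r'product:\s*\{\s*"id"\s*:\s*(\d+)')


def extract_product_id(script_text):
    """Return the product id as an int, or None when the pattern is absent."""
    match = PRODUCT_ID_RE.search(script_text)
    return int(match.group(1)) if match else None


# Raw string so that \u0026 stays as it appears in the page source.
sample = r'new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",'

print(extract_product_id(sample))        # 185310341
print(extract_product_id("var x = 1;"))  # None
```

In a spider, the string passed in would be the script text extracted with the selector, rather than the hard-coded `sample`.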



Answer 2:


An alternative to the regex approach is to use a JavaScript parser, convert the output of that parser to an XML document, and query it with XPath.

That's what is implemented in js2xml, which uses slimit and lxml (disclaimer: I wrote js2xml; warning: not stable).

In your case, check this sample scrapy shell session, using js2xml.jsonlike.getall():

paul:~$ scrapy shell http://2loom.com/products/2loom-design-siyah-beyaz-kalpli
2014-05-19 16:12:00+0200 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
2014-05-19 16:12:00+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-05-19 16:12:00+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled item pipelines: 
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-19 16:12:00+0200 [default] INFO: Spider opened
2014-05-19 16:12:01+0200 [default] DEBUG: Crawled (200) <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f8552946610>
[s]   item       {}
[s]   request    <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s]   response   <200 http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x7f8552384b90>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
/usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.
  warn("The top-level `frontend` package has been deprecated. "

In [1]: scripts = response.selector.xpath('//script/text()').extract()

In [2]: import js2xml, js2xml.jsonlike

In [3]: js = js2xml.parse(scripts[-1])

In [4]: js2xml.jsonlike.getall(js)
Out[4]: 
[{'onVariantSelected': 'selectCallback',
  'product': {'available': True,
   'compare_at_price': None,
   'compare_at_price_max': 0,
   'compare_at_price_min': 0,
   'compare_at_price_varies': False,
   'content': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
   'created_at': '2013-11-29T13:37:11+02:00',
   'description': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
   'featured_image': '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
   'handle': '2loom-design-siyah-beyaz-kalpli',
   'id': 185310341,
   'images': ['//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
    '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwarte_hartjes_ak_girls.jpg?v=1389259259',
    '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_boys.jpg?v=1389259264',
    '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwartje_hartjes_ak_boys.jpg?v=1389259264'],
   'options': ['Size'],
   'price': 15900,
   'price_max': 15900,
   'price_min': 15900,
   'price_varies': False,
   'published_at': '2013-11-29T13:34:20+02:00',
   'tags': [u'2\xb7Loom',
    'Beyaz',
    'Design',
    'Ekrek',
    u'Kad\u0131n',
    'Kalpli',
    'Lacivert'],
   'title': '10. Design | Siyah & beyaz kalpli',
   'type': '2 Loom Limiteds',
   'variants': [{'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584985,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 3,
     'option1': 'XS (34-36: 1.60m-1.70m)',
     'option2': None,
     'option3': None,
     'options': ['XS (34-36: 1.60m-1.70m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-XS',
     'taxable': True,
     'title': 'XS (34-36: 1.60m-1.70m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584989,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 3,
     'option1': 'S (36-38: 1.65m-1.75m)',
     'option2': None,
     'option3': None,
     'options': ['S (36-38: 1.65m-1.75m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-S',
     'taxable': True,
     'title': 'S (36-38: 1.65m-1.75m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584997,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 7,
     'option1': 'M (38-40: 1.70m-1.80m)',
     'option2': None,
     'option3': None,
     'options': ['M (38-40: 1.70m-1.80m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-M',
     'taxable': True,
     'title': 'M (38-40: 1.70m-1.80m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424585001,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 7,
     'option1': 'L (40-42: 1.75m-1.85m)',
     'option2': None,
     'option3': None,
     'options': ['L (40-42: 1.75m-1.85m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-L',
     'taxable': True,
     'title': 'L (40-42: 1.75m-1.85m)',
     'weight': 0}],
   'vendor': u'2\xb7Loom'}}]

In [5]: 
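
If the object literal after `product:` happens to be valid JSON (it looks that way in the question's snippet, but that is an assumption about the page), the standard library can decode it without a JavaScript parser. The sketch below is not part of js2xml. `extract_product` is a hypothetical helper, and `sample` is a shortened stand-in for the real script, not the actual page source:

```python
import json
import re


def extract_product(script_text):
    """Decode the JSON object that follows `product:`; return None if absent."""
    marker = re.search(r'product:\s*(?=\{)', script_text)
    if marker is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever JavaScript follows it.
        product, _end = json.JSONDecoder().raw_decode(script_text, marker.end())
    except json.JSONDecodeError:
        # The literal was not strict JSON (e.g. unquoted keys).
        return None
    return product


# Shortened, assumed stand-in for the script text on the page.
sample = (
    'new Shopify.OptionSelectors("product-select", { product: '
    r'{"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli","options":["Size"]}'
    ', onVariantSelected: selectCallback });'
)

product = extract_product(sample)
print(product["id"])     # 185310341
print(product["title"])  # 10. Design | Siyah & beyaz kalpli
```

This only works for literals that are strict JSON. For arbitrary JavaScript, such as unquoted keys or single-quoted strings, a real parser like the one used above is the safer choice.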


Source: https://stackoverflow.com/questions/23662069/scrapy-parse-javascript
