Using scrapy-splash to Simulate Clicks on JD.com and Scrape Data


This is my first blog post, so please point out anything that could be written better; let's improve together!

The scrapy-splash module is built on Splash, a JavaScript rendering service. Splash is a lightweight browser that exposes an HTTP API; it is implemented in Python on top of Twisted and Qt, which give the service asynchronous processing capability so it can make full use of WebKit's concurrency. Splash's main features are listed below (a minimal call against its HTTP API is sketched after the list):

  • process multiple web pages in parallel
  • return results as HTML and/or rendered screenshots
  • turn off image loading, or apply Adblock Plus rules, to speed up rendering
  • run custom JavaScript in the page context
  • script the browser with Lua
  • develop Splash Lua scripts in Splash-Jupyter Notebooks
  • obtain detailed rendering information in HAR format
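
To get a feel for the HTTP API before involving Scrapy at all, here is a minimal sketch using the render.html endpoint. The address http://localhost:8050 is an assumption; substitute wherever your Splash container is actually listening (for example the Docker Toolbox IP used in settings.py later).

import requests

# render.html returns the page's HTML after JavaScript has executed.
# 'wait' gives the page a couple of seconds to finish rendering.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://search.jd.com', 'wait': 2},
)
print(resp.status_code)
print(len(resp.text))  # size of the rendered HTML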

Reference: https://www.cnblogs.com/jclian91/p/8590617.html

Setup

  • the Scrapy framework
  • a Splash instance: Windows users install Docker inside a virtual machine (e.g. Docker Toolbox), Linux users install Docker directly; then run the official image, e.g. docker run -p 8050:8050 scrapinghub/splash

Page Analysis

First, open https://search.jd.com/

After clicking search, you can see that JD loads the book data with JavaScript: scrolling the mouse down loads more books. (The same data can also be obtained through JD's own API.)
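
You can confirm this without opening a browser. The sketch below (the URL parameters mirror the ones the spider uses later, and the gl-item class is taken from the page as it looked at the time; JD may block requests without fuller browser headers) fetches one results page with plain requests. Only the first batch of items appears in the raw HTML, because the rest are injected by JavaScript after the scroll.

import requests

# Fetch one search-results page without executing any JavaScript.
url = 'https://search.jd.com/Search?keyword=python3.7&enc=utf-8&page=1'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

# Each product card is an <li class="gl-item">; without JS rendering,
# roughly only the first half of the 60 items per page is present.
print(html.count('gl-item'))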

First, simulate the search. Inspecting the page shows the search box (id keyword) and the submit button (class input_submit).

Then simulate the scroll-down; here the bottom-search element at the foot of the page is chosen as the scroll target.

Start Crawling

The Lua script that simulates the click (the page count is extracted later from the rendered HTML):

function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  -- type the keyword into the search box
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  -- click the search button
  local form = splash:select('.input_submit')
  form:click()
  splash:wait(2)
  -- scroll to the bottom so the lazy-loaded items are rendered
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
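
Before wiring these scripts into Scrapy, it is worth pasting them into the Splash web UI (browse to your Splash address, e.g. http://192.168.99.100:8050) and running them against https://search.jd.com to check that the selectors still match the live page.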

Along the same lines, a second script only simulates the scroll (used for the paginated result URLs):

function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  -- scroll to the bottom so the lazy-loaded items are rendered
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end

Pick out the elements you want to extract by inspecting the page. Here is the spider's source:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from ..items import JdsplashItem


lua_script = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  local form = splash:select('.input_submit')
  form:click()
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''

lua_script2 = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''


class JdBookSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com']

    def start_requests(self):
        # open the search page and run the click-and-scroll script
        for each in self.start_urls:
            yield SplashRequest(each, callback=self.parse, endpoint='execute',
                                args={'lua_source': lua_script})

    def parse(self, response):
        price = response.css('div.gl-i-wrap div.p-price i::text').getall()
        # the second-to-last link in the pager holds the total page count
        page_num = response.xpath("//span[@class= 'p-num']/a[last()-1]/text()").get()
        # XPath's string(arg) function returns the string value of its argument
        # (a number, boolean or node-set); here it flattens the <em> node,
        # which contains nested markup, into plain text
        name = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        # comment = response.css('div.gl-i-wrap div.p-commit').xpath('string(.//strong)').getall()
        comment = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        publishstore = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        href = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        for each in zip(name, price, comment, publishstore, href):
            # create a fresh item per product so yielded items stay independent
            item = JdsplashItem()
            item['name'] = each[0]
            item['price'] = each[1]
            item['comment'] = each[2]
            item['p_store'] = each[3]
            item['href'] = each[4]
            yield item
        # from the second visible page onwards; JD serves each visible page of
        # 60 items as two internal pages, so page = 2n+1 and s (item offset) = 60n
        url = 'https://search.jd.com/Search?keyword=python3.7&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0'
        for each_page in range(1, int(page_num)):
            yield SplashRequest(url % (each_page * 2 + 1, each_page * 60),
                                callback=self.s_parse, endpoint='execute',
                                args={'lua_source': lua_script2})

    def s_parse(self, response):
        price = response.css('div.gl-i-wrap div.p-price i::text').getall()
        name = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        comment = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        publishstore = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        href = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        for each in zip(name, price, comment, publishstore, href):
            item = JdsplashItem()
            item['name'] = each[0]
            item['price'] = each[1]
            item['comment'] = each[2]
            item['p_store'] = each[3]
            item['href'] = each[4]
            yield item
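Since parse and s_parse extract the same fields, the duplication could be factored into a shared helper; a sketch (the method name extract_books is my own, not from the original code):

    def extract_books(self, response):
        # shared extraction logic for the first page and all later pages
        names = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        prices = response.css('div.gl-i-wrap div.p-price i::text').getall()
        comments = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        stores = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        hrefs = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        for name, price, comment, store, href in zip(names, prices, comments, stores, hrefs):
            yield JdsplashItem(name=name, price=price, comment=comment,
                               p_store=store, href=href)

Both callbacks would then reduce to yield from self.extract_books(response) plus, in parse, the pagination logic.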

Configuration of the other files:

items.py:

import scrapy


class JdsplashItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    p_store = scrapy.Field()
    comment = scrapy.Field()
    href = scrapy.Field()

settings.py:

# Splash server address (here the Docker Toolbox VM's IP; use your own)
SPLASH_URL = 'http://192.168.99.100:8050'

# enable the two Splash downloader middlewares and move
# HttpCompressionMiddleware to run after them
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
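
Not strictly needed for this spider, but the scrapy-splash README also recommends a spider middleware plus Splash-aware dedup and cache classes, so that requests carrying the same Splash arguments are deduplicated correctly:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'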

Finally, run the spider (e.g. scrapy crawl jd -o books.csv) and you can see the book data being scraped.
