scrapy

Crawling JS-generated dynamic pages with scrapy-splash

Submitted by 怎甘沉沦 on 2020-01-26 23:14:10
Nowadays, to speed up page loading, large parts of a page are generated with JavaScript. This is a major problem for a Scrapy crawler: Scrapy has no JS engine, so it only fetches static pages and cannot obtain content that is generated dynamically by JS.

Solution: use third-party middleware that provides a JS rendering service, such as scrapy-splash, which relies on WebKit (or WebKit-based libraries).

Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. Splash is implemented in Python on top of Twisted and Qt; Twisted and Qt give the service asynchronous processing capability so that it can exploit WebKit's concurrency.

Here is how to use scrapy-splash.

Install the scrapy-splash library with pip:

$ pip install scrapy-splash

scrapy-splash talks to the Splash HTTP API, so it needs a running Splash instance. Splash is usually run with Docker, so Docker has to be installed first. Install Docker and start it, then pull the image:

$ docker pull scrapinghub/splash

Run scrapinghub/splash with Docker:

$ docker run -p 8050:8050
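The post above is cut off at the docker run command. As a hedged illustration only — the spider name and target URL below are placeholders, not the author's code — wiring scrapy-splash into a project usually looks roughly like this. In settings.py (the keys follow the scrapy-splash README):

    # settings.py -- point Scrapy at the Splash instance on localhost:8050
    SPLASH_URL = 'http://localhost:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

And a minimal spider that sends its requests through Splash:

    import scrapy
    from scrapy_splash import SplashRequest

    class JsPageSpider(scrapy.Spider):              # hypothetical spider name
        name = 'js_page'
        start_urls = ['https://example.com']        # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                # wait 0.5 s so the page's JS has time to run before the HTML is returned
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # response.text now contains the JS-rendered HTML
            yield {'title': response.css('title::text').get()}

Requests sent through SplashRequest come back with the rendered DOM, so the usual CSS/XPath selectors work on content that plain Scrapy would never see.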

Scrapy spider templates -- XMLFeedSpider

Submitted by 懵懂的女人 on 2020-01-26 09:42:28
XMLFeedSpider is mainly used for crawling RSS feeds. RSS is an XML-based content syndication technology. In the final subsection of this article I will use the RSS feed of the Economic Observer (经济观察网) as an example to explain its concrete usage. First, let's look at the commonly used attributes of XMLFeedSpider.

0. Common attributes

iterator: the iterator used to parse the RSS source. Three iterators are available:
- iternodes: a high-performance, regex-based iterator; this is the default.
- html: loads the whole DOM for analysis, which can cause performance problems on large documents; its only advantage is that it is useful for handling malformed tags.
- xml: similar to the html iterator.
itertag: the name of the node to iterate over.
namespaces: the namespaces needed when processing the document.

1. Common methods

adapt_response(response): triggered before the Response is parsed; mainly used to modify the content of the Response, and must return a Response.
parse_node(response, selector): triggered when a node matching itertag is crawled, to process its data. This method must be implemented in the project code, otherwise the spider does not work, and it must return an Item, a Request, or an iterable containing either.
process_results(response, results): triggered when the crawl results are returned, used to make a final modification to the results before they are handed over to the framework core.

Example

Below we crawl the RSS feed of the Economic Observer to see
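The post breaks off before the example. As a hedged sketch only — the feed URL and field names below are placeholders, not the Economic Observer example the author intended — an XMLFeedSpider for an RSS feed typically looks roughly like this:

    from scrapy.spiders import XMLFeedSpider

    class RssSpider(XMLFeedSpider):
        name = 'rss_example'                           # hypothetical name
        start_urls = ['https://example.com/rss.xml']   # placeholder feed URL
        iterator = 'iternodes'                         # the default, regex-based iterator
        itertag = 'item'                               # RSS entries live in <item> nodes

        def parse_node(self, response, node):
            # called once for every <item> node; must return Item(s), Request(s) or dicts
            yield {
                'title': node.xpath('title/text()').get(),
                'link': node.xpath('link/text()').get(),
                'pubDate': node.xpath('pubDate/text()').get(),
            }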

Scrapy Tutorial

Submitted by 那年仲夏 on 2020-01-26 08:27:21
Keywords: scrapy, tutorial, crawler, Spider. Author: http://www.cnblogs.com/txw1958/ Source: http://www.cnblogs.com/txw1958/archive/2012/07/16/scrapy-tutorial.html

In this tutorial we assume that Scrapy is already installed; if not, please refer to the installation guide. We will use the Open Directory Project (dmoz) as the scraping example.

This tutorial walks you through the following tasks:
- create a new Scrapy project
- define the Items to extract
- write a Spider to crawl the site and extract the Items
- write an Item Pipeline to store the extracted Items

Scrapy is written in Python. If you are new to Python, you may want to start by learning Python in order to get the most out of Scrapy. If you are familiar with other programming languages and want to pick up Python quickly, Dive Into Python is recommended. If you are new to programming and want to start with Python, see the list of Python resources for non-programmers below.

Creating a new project

Before scraping, you need to create a new Scrapy project. Change into the directory where you want to keep your code, then run:

Microsoft Windows XP [Version 5.1.2600] (C) Copyright 1985-2001 Microsoft Corp.
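The console output above is truncated right after the Windows banner. As a hedged reconstruction — the project name 'tutorial' and the item fields come from the classic dmoz tutorial this post follows, not from the truncated text itself — the step usually looks like this:

    C:\> scrapy startproject tutorial

which generates a project skeleton:

    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

The Item to extract is then defined in items.py, roughly:

    # tutorial/items.py -- the fields the dmoz example extracts
    from scrapy.item import Item, Field

    class DmozItem(Item):
        title = Field()
        link = Field()
        desc = Field()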

How to decrease the bandwidth of scraping pages via Scrapy

Submitted by 若如初见. on 2020-01-26 03:58:04
Question: I am using Scrapy with a Luminati proxy to scrape thousands of Amazon pages, but I noticed that my scraping bandwidth consumption is very high. I am scraping the whole page right now, and I am wondering whether it is possible to remove/block images, CSS and JS, because I am only dealing with the HTML code and want to keep the scraping bandwidth as low as possible. Thank you for looking into my problem :)

Source: https://stackoverflow.com/questions/59541755/how-to-decrease-the-bandwidth-of-scraping-pages-via-scrapy

[scrapy] [Part 5] Scrapy project 2

Submitted by 人盡茶涼 on 2020-01-26 00:20:44
1. Multiple URLs

Building on the earlier work, add several URLs to crawl. The first way is to list every site to be crawled directly in start_urls:

# in basic.py
start_urls = {
    'url-1',
    'url-2',
    'url-3',
}

The second way reads them from a file:

start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

2. Two-way crawling (horizontal and vertical)

Horizontal crawling follows the "next page" link or works through several URLs of the same kind; vertical crawling follows a specific target inside a page, e.g. from a listing page down into each detail page (see the sketch after this entry).

Example: first make a copy of the spider file written earlier, basic.py:

cp basic.py manual.py

[To be continued]

Source: CSDN Author: mkczc Link: https://blog.csdn.net/kidcad/article/details/104066229
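The post stops before manual.py gets any content. As a hedged sketch only — the selectors and the listing URL are placeholders, not the site the author had in mind — combining horizontal and vertical crawling in one spider usually looks roughly like this:

    import scrapy

    class ManualSpider(scrapy.Spider):                       # mirrors the copied manual.py
        name = 'manual'
        start_urls = ['https://example.com/list?page=1']     # placeholder listing URL

        def parse(self, response):
            # vertical: follow every item on the listing page to its detail page
            for href in response.css('a.item-link::attr(href)').getall():
                yield response.follow(href, callback=self.parse_item)

            # horizontal: follow the "next page" link to the next listing page
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

        def parse_item(self, response):
            # extract the target fields from the detail page
            yield {
                'title': response.css('h1::text').get(),
                'url': response.url,
            }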

Scrapy with Selenium does not detect HTML element loaded dynamically

Submitted by 纵然是瞬间 on 2020-01-25 09:30:09
Question: I am using Scrapy with Selenium to scrape content from this page: https://nikmikk.itch.io/door-knocker In it, there is a table under the div with class .game_info_panel_widget, where the first row, "Published 62 days ago", seems to be loaded dynamically. I have tried fetching the page as Scrapy sees it, but cannot find that row in the HTML:

scrapy fetch --nolog https://nikmikk.itch.io/door-knocker > test.html

Here is what I see in test.html: the first table row is the Status row, not the Published row

Scrapy doesn't call callback function even with no filter

Submitted by 老子叫甜甜 on 2020-01-25 09:13:16
Question: I have this code to crawl the details page:

yield Request(flexibleItem[self.linkAttributeName], callback=self.parseDetails, dont_filter=True)

There is no error in the sub-URL, because I tested it with the same "GET" method. I don't get any error; Python simply ignores the callback function. It is a very large project running on a server, so I can't share the code, but here is the main architecture of what I am doing. The output is: in start request TRUE oooo

def start_requests(self):
    print "in

Why is Scrapy skipping some URL's but not others?

Submitted by 无人久伴 on 2020-01-25 08:05:31
Question: I am writing a Scrapy crawler to grab info on shirts from Amazon. The crawler starts on an Amazon results page for some search, "funny shirts" for example, and collects all the result item containers. It then parses each result item, collecting data on the shirts. I use ScraperAPI and scrapy-user-agents to dodge Amazon. The code for my spider is:

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    page_number = 2
    keyword_file = open("keywords.txt", "r+")
    all_key_words = keyword

Scrapy: How to export Json from script

Submitted by 大城市里の小女人 on 2020-01-25 06:49:07
Question: I created a web crawler with Scrapy, but I have a problem with the phone number because it is inside a script. The script is:

<script data-n-head="true" type="application/ld+json">{"@context":"http://schema.org","@type":"LocalBusiness","name":"Clínica Dental Reina Victoria 23","description":".TU CLÍNICA DENTAL DE REFERENCIA EN MADRID","logo":"https://estaticos.qdq.com/CMS/directory/logos/c/l/clinica-dental-reina-victoria.png","image":"https://estaticos.qdq.com/coverphotos/098/535

python scrapy how to code the parameter instead of using cmd: use Custom code in Scrapy

Submitted by 亡梦爱人 on 2020-01-25 04:51:11
Question: I am using Scrapy 0.20 with Python 2.7. I used to pass this on the command line:

-s JOBDIR=crawls/somespider-1

to handle the duplicated items. Please note that I have already made the changes in the settings. I don't want to use that on the command line; is there any way I can set it in code, inside my spider? Thanks.

Answer 1: It's easy. Use DropItem in pipelines.py to drop the item, and you can use a custom command to set the parameter inside the program. Here is an example of custom code in Scrapy using the custom command (say: scrapy
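The answer is cut off before its example. As a hedged sketch only — not necessarily what the answerer had in mind, and it uses custom_settings, which was added in Scrapy 1.0 and so is not available in the asker's Scrapy 0.20 — one way to set JOBDIR from inside the spider instead of on the command line is:

    import scrapy

    class SomeSpider(scrapy.Spider):              # hypothetical spider name
        name = 'somespider'
        start_urls = ['https://example.com']      # placeholder URL

        # per-spider settings override the project settings, so the persistent
        # job directory no longer has to be passed with -s on the command line
        custom_settings = {
            'JOBDIR': 'crawls/somespider-1',
        }

        def parse(self, response):
            yield {'url': response.url}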