scrapy

How to use python requests with scrapy?

自闭症网瘾萝莉.ら submitted on 2021-01-21 11:55:56

Question: I am trying to use requests to fetch a page and then pass the response object to a parser, but I ran into a problem:

    def start_requests(self):
        yield self.parse(requests.get(url))

    def parse(self, response):
        #pass

    builtins.AttributeError: 'generator' object has no attribute 'dont_filter'

Answer 1: You first need to download the page's response and then convert that string into an HtmlResponse object:

    from scrapy.http import HtmlResponse

    resp = requests.get(url)
    response = HtmlResponse(url="", body=resp.text,
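The answer's snippet is cut off above. For a runnable version of the same pattern, note that HtmlResponse requires an encoding argument when the body is a str. Below is a minimal sketch (the spider name and URLs are placeholder assumptions, not from the question):

    import requests
    import scrapy
    from scrapy.http import HtmlResponse

    class RequestsSpider(scrapy.Spider):
        name = "requests_spider"
        start_urls = ["https://example.com"]  # hypothetical URL

        def parse(self, response):
            # Fetch another page with requests and wrap the text in an
            # HtmlResponse so the usual Scrapy selectors work on it.
            resp = requests.get("https://example.com/other")  # hypothetical URL
            page = HtmlResponse(url=resp.url, body=resp.text, encoding="utf-8")
            yield {"title": page.css("title::text").get()}

The original error, incidentally, comes from yielding the result of self.parse() (a generator) from start_requests, which must yield Request objects.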

How to scrape dynamic content from a website?

不想你离开。 submitted on 2021-01-21 06:08:22

Question: I'm using Scrapy to scrape data from Amazon's books section, but I've discovered that some of the data on the page is loaded dynamically. I want to know how such dynamic data can be extracted from the website. Here's what I've tried so far:

    import scrapy
    from ..items import AmazonsItem

    class AmazonSpiderSpider(scrapy.Spider):
        name = 'amazon_spider'
        start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

        def parse(self, response):
            items
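No answer is included in this excerpt. One common approach for dynamically loaded pages (an assumption, not the asker's solution) is to let a real browser render the page and then hand the HTML to Scrapy's selectors; the CSS selector below is hypothetical:

    from scrapy.selector import Selector
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://www.amazon.in/s?k=agatha+christie+books")
    # Parse the fully rendered HTML with a Scrapy selector.
    sel = Selector(text=driver.page_source)
    titles = sel.css("span.a-text-normal::text").getall()  # hypothetical selector
    driver.quit()
    print(titles)

An alternative is to find the underlying Ajax/JSON endpoint in the browser's network tab and request it directly, which is usually faster than driving a browser.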

Passing an argument to a callback function

删除回忆录丶 submitted on 2021-01-20 17:47:07

Question:

    def parse(self, response):
        for sel in response.xpath('//tbody/tr'):
            item = HeroItem()
            item['hclass'] = response.request.url.split("/")[8].split('-')[-1]
            item['server'] = response.request.url.split('/')[2].split('.')[0]
            item['hardcore'] = len(response.request.url.split("/")[8].split('-')) == 3
            item['seasonal'] = response.request.url.split("/")[6] == 'season'
            item['rank'] = sel.xpath('td[@class="cell-Rank"]/text()').extract()[0].strip()
            item['battle_tag'] = sel.xpath('td[@class="cell-BattleTag"
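The excerpt ends mid-statement, but the question in the title, passing an argument to a callback, is usually solved in Scrapy with Request.cb_kwargs. A minimal sketch (URLs and selectors are hypothetical, not from the question):

    import scrapy

    class HeroSpider(scrapy.Spider):
        name = "hero_bot"
        start_urls = ["https://example.com/heroes"]  # hypothetical URL

        def parse(self, response):
            # Build the partially filled item here, then pass it to the next
            # callback through cb_kwargs instead of a shared variable.
            item = {"server": response.url.split("/")[2].split(".")[0]}
            yield scrapy.Request(
                response.urljoin("hero/1"),  # hypothetical detail page
                callback=self.parse_detail,
                cb_kwargs={"item": item},
            )

        def parse_detail(self, response, item):
            # cb_kwargs entries arrive as named parameters.
            item["rank"] = response.css("td.cell-Rank::text").get()
            yield item

On older Scrapy versions, Request(meta={...}) and response.meta serve the same purpose.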

Is there a way to extract text along with text-links in Scrapy using CSS?

天大地大妈咪最大 submitted on 2021-01-20 13:26:31

Question: I'm brand new to Scrapy. I have learned how to use response.css() to read specific parts of a web page, and I'm avoiding learning the XPath system; it seems to do the exact same thing, just in a different format (correct me if I'm wrong). The site I'm scraping has long paragraphs of text with an occasional linked phrase right in the middle. "This sentence with a link to a picture of a dog" is an example. I'm not sure if there is a way to have a spider read the text, with links in place (I
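This can be done in CSS alone. A descendant ::text selector ("p ::text", with the space) returns the paragraph's own text nodes and the link's text in document order, so joining them keeps the linked words in place. A minimal sketch with made-up HTML:

    from scrapy.http import HtmlResponse

    html = '<p>This sentence with a link to a <a href="/dog.jpg">picture of a dog</a> is an example.</p>'
    response = HtmlResponse(url="https://example.com", body=html, encoding="utf-8")

    # "p::text" would skip the <a> text; "p ::text" includes every descendant text node.
    print("".join(response.css("p ::text").getall()))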

Writing crawlers with the Scrapy framework, plus solutions for Ajax dynamic loading and anti-crawling problems

亡梦爱人 submitted on 2021-01-20 09:49:04

Writing crawlers with the Scrapy framework, plus solutions for Ajax dynamic loading and anti-crawling problems. Reference articles: (1) an article of the same title; (2) https://www.cnblogs.com/viczhangyuetao/p/8031528.html. Noted here for future reference. Source: oschina. Link: https://my.oschina.net/u/4438370/blog/4914949

The Growth Path of a Web-Crawler Engineer

折月煮酒 submitted on 2021-01-15 06:21:39

In today's age of big data, web crawlers have become an important means of acquiring data, but learning them well is not simple. There are a great many topics and directions, spanning computer networking, backend programming, frontend development, app development and reverse engineering, network security, databases, automated operations, machine learning, data analysis, and more. Like a large net, it covers most of today's mainstream tech stacks. Precisely because the material is so diverse, what needs to be learned becomes scattered and messy; many beginners cannot find a concrete direction, and when they run into anti-crawling measures or JS rendering during their studies, they don't know how to handle them. Based on years of crawling experience, this post outlines what a beginner needs to master.

Choosing a language: C has a long history and Java dominates today, and most beginners will have touched both at university. But each has drawbacks: C is hard to learn, while Java is complex and not especially efficient. Python is just right, so everything in this post uses Python as the development language.

Getting started: ordinary websites often carry no anti-crawling measures at all. Take some blog site: to crawl the whole site, follow the list pages through to the article pages, then pull down each article's date, author, body, and so on. How do you write the code? Python's requests library and friends are enough: write the basic logic to fetch each article's source, parse it with XPath, BeautifulSoup, PyQuery, or regular expressions (or crude string matching) to extract what you want, and write the results to a text file. The code is simple, just a few method calls, as in the sketch below.
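A minimal sketch of that basic flow, assuming a hypothetical blog layout (the URL and all selectors are made up for illustration):

    import requests
    from lxml import etree
    from urllib.parse import urljoin

    BASE = "https://blog.example.com"  # hypothetical blog

    # Fetch the list page and follow each article link it contains.
    list_tree = etree.HTML(requests.get(BASE + "/list").text)
    for href in list_tree.xpath('//a[@class="post-title"]/@href'):  # hypothetical selector
        article = etree.HTML(requests.get(urljoin(BASE, href)).text)
        record = {
            "title": article.xpath("string(//h1)").strip(),
            "author": article.xpath('string(//span[@class="author"])').strip(),  # hypothetical
            "body": article.xpath('string(//div[@class="content"])').strip(),    # hypothetical
        }
        # Append each parsed article to a plain text file.
        with open("articles.txt", "a", encoding="utf-8") as f:
            f.write(str(record) + "\n")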

Unable to make my script stop when some urls are scraped

依然范特西╮ submitted on 2021-01-14 17:23:28

Question: I've created a script in Scrapy to parse the titles of different sites listed in start_urls. The script is doing its job flawlessly. What I wish to do now is have my script stop after two of the urls are parsed, no matter how many urls there are. So far I've tried:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class TitleSpider(scrapy.Spider):
        name = "title_bot"
        start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]

        def parse(self, response):
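The excerpt stops before the attempted solution. A common way to do this (a sketch under that assumption, not the asker's code) is to count parsed responses and raise CloseSpider once the limit is hit:

    import scrapy
    from scrapy.exceptions import CloseSpider

    class TitleSpider(scrapy.Spider):
        name = "title_bot"
        start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
        parsed = 0  # responses handled so far

        def parse(self, response):
            self.parsed += 1
            yield {"title": response.css("title::text").get()}
            if self.parsed >= 2:
                # Gracefully shut the spider down after two pages.
                raise CloseSpider("two urls parsed")

The built-in CLOSESPIDER_PAGECOUNT setting can achieve a similar effect without custom counting.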

A Data Collection and Parsing Case Study: The 2020 Blog Star Contest

别来无恙 submitted on 2021-01-14 13:41:19

1. Blog Star
The 2020 Blog Star contest has begun. Under the rules, voting runs for a while, but the activity page has no live leaderboard, so this post uses a crawler to collect and sort the data so the standings can be viewed directly. I was also fortunate enough to be shortlisted in the Blog Star TOP 200; if you still have spare votes, please don't miss the chance to vote. Clicking "read the original" will cast your precious votes for me, for which I am most grateful.

2. Approach

1. Identify the data source
First we need to get the data from the page. Since the data changes on every refresh, it generally comes from an Ajax request, and we need the browser's developer tools to inspect the network traffic.

How to open the developer tools: when analyzing a page, the browser's developer tools are indispensable. I use Firefox Developer Edition as the example; other browsers are mostly shells around the Gecko (Firefox), Blink (Chrome), WebKit (Safari), or Trident (IE) engines, so opening the tools works much the same way. Right-click a blank area of the page and choose "Inspect Element".

Finding the data source: the page is https://bss.csdn.net/m/topic/blog_star2020. Open it, switch to the Network tab, and reload. Sorting by type filters out requests for static resources. The data turns out to come from a getUser endpoint, and the returned payload can be parsed as JSON.

2. Implementation steps
Once the data source is found, the steps become fairly clear; a sketch of the fetch-and-sort step follows.
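A minimal sketch of that step, fetching the getUser data and sorting it into a leaderboard. The exact endpoint path and JSON field names below are assumptions; read the real ones off the Network tab:

    import requests

    # Hypothetical endpoint path under the activity page; confirm in the Network tab.
    API = "https://bss.csdn.net/m/topic/blog_star2020/getUser"

    payload = requests.get(API).json()
    candidates = payload.get("data", [])  # hypothetical field name

    # Sort candidates by vote count, highest first (field name is an assumption).
    ranking = sorted(candidates, key=lambda c: int(c.get("vote_num", 0)), reverse=True)

    for pos, c in enumerate(ranking, start=1):
        print(pos, c.get("username"), c.get("vote_num"))  # hypothetical fields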