scrapy

Trouble running a parser created using Scrapy with Selenium

Posted by 懵懂的女人 on 2019-12-24 02:11:11
Question: I've written a scraper in Python Scrapy in combination with Selenium to scrape some titles from a website. The CSS selectors defined within my scraper are flawless. I want my scraper to keep clicking on the next page and parse the information embedded in each page. It does fine for the first page, but when the Selenium part comes into play, the scraper keeps clicking on the same link over and over again. As this is my first time working with Selenium along with Scrapy, I don't
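One common cause, assuming a standard Selenium setup: the "next" link is located once and then clicked repeatedly after the DOM has changed. A minimal sketch of the usual fix, re-locating the link on every page and waiting for the old element to go stale before parsing again (the URL and CSS selector here are hypothetical):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://example.com/listing")  # placeholder URL

while True:
    # ... extract titles from driver.page_source here ...
    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, "a.next")  # hypothetical selector
    except NoSuchElementException:
        break  # no next link, last page reached
    next_btn.click()
    # wait until the clicked element is detached, i.e. the next page has actually loaded
    WebDriverWait(driver, 10).until(EC.staleness_of(next_btn))

driver.quit()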

Pass the URL consumed from RabbitMQ into the parse method in Scrapy

Posted by 拥有回忆 on 2019-12-24 01:49:32
Question: I am using Scrapy to consume messages (URLs) from RabbitMQ, but when I use yield to call the parse method, passing my URL as a parameter, the program never enters the callback method. Below is the code of my spider:

# -*- coding: utf-8 -*-
import scrapy
import pika
from scrapy import cmdline
import json

class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'
    allowed_domains = []
    start_urls = []

    def callback(self, ch, method, properties, body):
        print(" [x]
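A likely explanation is that the yield happens inside the pika callback, which Scrapy never iterates, so the request is never scheduled. A minimal sketch of one workaround, assuming a queue named "urls" holding JSON messages with a "url" key: pull messages with basic_get inside start_requests(), which Scrapy does iterate.

import json
import pika
import scrapy

class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'

    def start_requests(self):
        # drain the queue synchronously instead of registering a pika callback
        conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
        channel = conn.channel()
        while True:
            method, properties, body = channel.basic_get(queue='urls', auto_ack=True)
            if body is None:
                break  # queue is empty
            url = json.loads(body)['url']  # assumed message shape
            yield scrapy.Request(url, callback=self.parse)
        conn.close()

    def parse(self, response):
        self.logger.info('Parsed %s', response.url)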

Scrapy log output settings

Posted by 大憨熊 on 2019-12-24 01:28:59
When using the Scrapy framework, it prints a lot of debugging information, which looks cluttered, so we can tune the log output. Set the output level in settings:

LOG_LEVEL = "DEBUG"  # output level
LOG_STDOUT = True    # whether to redirect standard output into the log

The available level settings:

CRITICAL - critical errors
ERROR - ordinary errors
WARNING - warning messages
INFO - informational messages (recommended for production)
DEBUG - debug messages (recommended for development)

Source: CSDN  Author: BeefpasteC  Link: https://blog.csdn.net/ybw_2569/article/details/103669784
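A minimal settings.py sketch combining the options above with two related standard Scrapy settings (LOG_FILE and LOG_FORMAT):

# settings.py
LOG_LEVEL = "INFO"       # recommended for production; use "DEBUG" during development
LOG_STDOUT = True        # redirect print() output into the Scrapy log
LOG_FILE = "scrapy.log"  # write the log to a file instead of stderr
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"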

Scrape products with Scrapy and follow pagination

Posted by 徘徊边缘 on 2019-12-24 01:18:20
Question: I am trying to scrape data using Scrapy from Alibaba's Agriculture and Growing Media category. You can click here to view the page. The data I want to scrape from the page are Product_name, Price, Min_order, Company Name, and Url of image. The picture shows what I want to scrape. My Python code:

# -*- coding: utf-8 -*-
import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media
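A minimal sketch of the usual follow-pagination pattern for this kind of spider; the CSS selectors below are hypothetical and would need to be matched to Alibaba's actual markup:

import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media']

    def parse(self, response):
        for product in response.css('div.item-main'):  # hypothetical selector
            yield {
                'Product_name': product.css('h2.title a::text').get(),
                'Price': product.css('.price b::text').get(),
                'Min_order': product.css('.min-order::text').get(),
                'Company_name': product.css('.company a::text').get(),
                'Image_url': product.css('img::attr(src)').get(),
            }
        # follow the "next" link until pagination runs out
        next_page = response.css('a.next::attr(href)').get()  # hypothetical selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)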

Can I use Selenium with Scrapy in Python without an actual browser opening?

Posted by 廉价感情. on 2019-12-24 01:10:39
Question: I want to do some web crawling with Scrapy and Python. I have found a few code examples on the internet where they use Selenium with Scrapy. I don't know much about Selenium; I only know that it automates some web tasks, and that the browser actually opens and does things. But I don't want the actual browser to open; I want everything to happen from the command line. Can I do that with Selenium and Scrapy?

Answer 1: You can use Selenium with PyVirtualDisplay, at least on Linux.

from pyvirtualdisplay import Display
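A fuller sketch of that answer's approach (the display size and URL are illustrative); note that on recent Selenium versions the browser's own headless mode is a simpler alternative:

from pyvirtualdisplay import Display
from selenium import webdriver

# start an invisible X display (Linux; requires Xvfb)
display = Display(visible=0, size=(1024, 768))
display.start()

driver = webdriver.Firefox()  # the browser opens inside the virtual display
driver.get("https://example.com")  # placeholder URL
print(driver.title)
driver.quit()

display.stop()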

Is it possible to remove requests from Scrapy's scheduler queue?

Posted by 删除回忆录丶 on 2019-12-24 00:47:44
Question: Is it possible to remove requests from Scrapy's scheduler queue? I have a working routine that limits crawling to a certain domain for a set amount of time. It works in the sense that it will not yield any more links once the time limit is hit, but since the queue can already contain thousands of requests for the domain, I'd like to remove them from the scheduler queue once the time limit is hit.

Answer 1: Okay, so I ended up following the suggestion from @rickgh12hs and wrote my own Downloader
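A sketch of what such a downloader middleware might look like, assuming a fixed per-domain time budget: once the budget is spent, matching requests are dropped with IgnoreRequest as they leave the scheduler, so the queued backlog is discarded instead of downloaded. It would be enabled via DOWNLOADER_MIDDLEWARES in settings.

import time
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

class DomainTimeLimitMiddleware:
    TIME_LIMIT = 300  # seconds per domain; illustrative value

    def __init__(self):
        self.first_seen = {}

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        start = self.first_seen.setdefault(domain, time.time())
        if time.time() - start > self.TIME_LIMIT:
            # drop the request instead of downloading it
            raise IgnoreRequest(f'time budget exhausted for {domain}')
        return None  # let the request proceed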

Timeout when installing Python packages

Posted by 走远了吗. on 2019-12-24 00:09:28
Problem:

D:\code_python\scriptLearning> pip install scrapy
Collecting scrapy
  Downloading https://files.pythonhosted.org/packages/3b/e4/69b87d7827abf03dea2ea984230d50f347b00a7a3897bc93f6ec3dafa494/Scrapy-1.8.0-py2.py3-none-any.whl (238kB)
     |████████████████████████████████| 245kB 105kB/s
Requirement already satisfied: zope.interface>=4.1.3 in d:\programdata\anaconda3\lib\site-packages (from scrapy) (4.5.0)
Requirement already satisfied: cssselect>=0.9.1 in d:\programdata\anaconda3\lib\site-packages (from scrapy) (1.1.0)
Collecting
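Two common workarounds for this kind of timeout: raise pip's socket timeout, or install from a nearby PyPI mirror (the Tsinghua mirror shown here is one widely used option in China).

pip install scrapy --default-timeout=100
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple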

Scrapy: crawl 1 level deep on offsite links

Posted by [亡魂溺海] on 2019-12-24 00:06:53
Question: In Scrapy, how would I go about having Scrapy crawl only 1 level deep for all links outside the allowed domains? Within the crawl, I want to be able to make sure all outbound links within the site are working and not 404'd. I do not want it to crawl the whole site of the non-allowed domain. I am currently handling allowed-domain 404s. I know that I can set a DEPTH_LIMIT of 1, but that will affect the allowed domain as well. My code:

from scrapy.selector import Selector
from scrapy.spiders
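One way to get this behavior without DEPTH_LIMIT, sketched below: route offsite links to a dedicated callback that only records the response status and never extracts further links, so only the allowed domain is crawled recursively (the domain and selectors are placeholders):

import scrapy
from urllib.parse import urlparse

class LinkCheckSpider(scrapy.Spider):
    name = 'linkcheck'
    # no allowed_domains, so the offsite middleware does not filter anything
    start_urls = ['https://example.com/']  # placeholder site
    handle_httpstatus_list = [404]  # let 404 responses reach the callbacks
    allowed = {'example.com'}

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if urlparse(url).netloc in self.allowed:
                yield scrapy.Request(url, callback=self.parse)  # keep crawling on-site
            else:
                yield scrapy.Request(url, callback=self.check_offsite)  # fetch once, never recurse

    def check_offsite(self, response):
        if response.status == 404:
            self.logger.warning('Broken outbound link: %s', response.url)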

How to get the comments in an HTML page while scraping?

Posted by 生来就可爱ヽ(ⅴ<●) on 2019-12-23 23:36:36
Question: Here's the issue: I'm trying to scrape a Facebook About page for the birthday, and when I view the page source in the browser, it shows the birthday as an HTML comment inside a div with class="hidden_elem". Perhaps because of this, when I look at the source of this page in my GET request (using Selenium, Scrapy, or requests), all I get is a div with class="hidden_elem", and the comment is nowhere to be seen, let alone parsed for info. So how to get this text
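A sketch of one way to read such a comment with Scrapy's selectors, assuming the markup really is present inside the hidden_elem div in the raw response: select the comment node with XPath, strip the comment delimiters, and re-parse the inner markup.

from scrapy import Selector

def extract_hidden(response):
    # the comment node, serialized as "<!-- ... -->"
    comment = response.xpath('//div[@class="hidden_elem"]/comment()').get()
    if comment is None:
        return None
    inner_html = comment.strip()[4:-3]  # drop the leading "<!--" and trailing "-->"
    hidden = Selector(text=inner_html)
    return hidden.css('div::text').getall()  # hypothetical inner selector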

Trouble using Scrapy with the JavaScript __doPostBack method

Posted by 时间秒杀一切 on 2019-12-23 23:13:52
Question: Trying to automatically grab the search results from a public search, but running into some trouble. The URL is of the form http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting. As I click through the pages, after visiting this page, it changes slightly to http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=2. The problem is, if I then try to directly visit the second link without first visiting the first link, I am redirected to the first link. My current attempt at
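A sketch of the usual ASP.NET workaround: because __doPostBack submits the page's form, re-post that form with __EVENTTARGET set instead of requesting the page-2 URL directly, so the server-side state carries over. The field names follow the ASP.NET convention; the pager control id is hypothetical.

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'search'
    start_urls = ['http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting']

    def parse(self, response):
        # ... extract the search results on the current page here ...

        # re-submit the ASP.NET form to trigger the __doPostBack pagination
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                '__EVENTTARGET': 'ctl00$Pager$Next',  # hypothetical control id
                '__EVENTARGUMENT': '',
            },
            callback=self.parse,
            dont_click=True,
        )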