scrapy

Trouble running a parser created using Scrapy with Selenium

Posted by 懵懂的女人 on 2019-12-24 02:11:11
Question: I've written a scraper in Python Scrapy in combination with Selenium to scrape some titles from a website. The CSS selectors defined within my scraper are flawless. I want my scraper to keep clicking on the next page and parse the information embedded in each page. It does fine for the first page, but when the Selenium part comes into play, the scraper keeps clicking on the same link over and over again. As this is my first time working with Selenium along with Scrapy, I don't
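One common cause, assuming a standard Selenium setup: the "next" link is located once and then clicked repeatedly after the DOM has changed. A minimal sketch of the usual fix, re-locating the link on every page and waiting for the old element to go stale before parsing again (the URL and CSS selector here are hypothetical):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://example.com/listing")  # placeholder URL

while True:
    # ... extract titles from driver.page_source here ...
    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, "a.next")  # hypothetical selector
    except NoSuchElementException:
        break  # no next link, last page reached
    next_btn.click()
    # wait until the clicked element is detached, i.e. the next page has actually loaded
    WebDriverWait(driver, 10).until(EC.staleness_of(next_btn))

driver.quit()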

Pass the URL consumed from RabbitMQ into the parse method in Scrapy

Posted by 拥有回忆 on 2019-12-24 01:49:32
Question: I am using Scrapy to consume messages (URLs) from RabbitMQ, but when I use yield to call the parse method, passing my URL as a parameter, the program never enters the callback method. Below is the code of my spider:

# -*- coding: utf-8 -*-
import scrapy
import pika
from scrapy import cmdline
import json

class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'
    allowed_domains = []
    start_urls = []

    def callback(self, ch, method, properties, body):
        print(" [x]
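A likely explanation is that the yield happens inside the pika callback, which Scrapy never iterates, so the request is never scheduled. A minimal sketch of one workaround, assuming a queue named "urls" holding JSON messages with a "url" key: pull messages with basic_get inside start_requests(), which Scrapy does iterate.

import json
import pika
import scrapy

class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'

    def start_requests(self):
        # drain the queue synchronously instead of registering a pika callback
        conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
        channel = conn.channel()
        while True:
            method, properties, body = channel.basic_get(queue='urls', auto_ack=True)
            if body is None:
                break  # queue is empty
            url = json.loads(body)['url']  # assumed message shape
            yield scrapy.Request(url, callback=self.parse)
        conn.close()

    def parse(self, response):
        self.logger.info('Parsed %s', response.url)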

Scrapy log output settings

Posted by 大憨熊 on 2019-12-24 01:28:59
When using the Scrapy framework, it prints a lot of debugging information, which looks cluttered, so we can tune the log output. Set the output level in settings:

LOG_LEVEL = "DEBUG"  # output level
LOG_STDOUT = True    # whether to redirect standard output into the log

The available level settings:

CRITICAL - critical errors
ERROR - ordinary errors
WARNING - warning messages
INFO - informational messages (recommended for production)
DEBUG - debug messages (recommended for development)

Source: CSDN  Author: BeefpasteC  Link: https://blog.csdn.net/ybw_2569/article/details/103669784
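A minimal settings.py sketch combining the options above with two related standard Scrapy settings (LOG_FILE and LOG_FORMAT):

# settings.py
LOG_LEVEL = "INFO"       # recommended for production; use "DEBUG" during development
LOG_STDOUT = True        # redirect print() output into the Scrapy log
LOG_FILE = "scrapy.log"  # write the log to a file instead of stderr
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"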

Scrape products with Scrapy and follow pagination

Posted by 徘徊边缘 on 2019-12-24 01:18:20
Question: I am trying to scrape data using Scrapy from Alibaba's Agriculture and Growing Media category. You can click here to view the page. The data I want to scrape from the page are Product_name, Price, Min_order, Company Name, and Url of image. The picture shows what I want to scrape. My Python code:

# -*- coding: utf-8 -*-
import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media
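A minimal sketch of the usual follow-pagination pattern for this kind of spider; the CSS selectors below are hypothetical and would need to be matched to Alibaba's actual markup:

import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media']

    def parse(self, response):
        for product in response.css('div.item-main'):  # hypothetical selector
            yield {
                'Product_name': product.css('h2.title a::text').get(),
                'Price': product.css('.price b::text').get(),
                'Min_order': product.css('.min-order::text').get(),
                'Company_name': product.css('.company a::text').get(),
                'Image_url': product.css('img::attr(src)').get(),
            }
        # follow the "next" link until pagination runs out
        next_page = response.css('a.next::attr(href)').get()  # hypothetical selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)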

Can I use Selenium with Scrapy in Python without an actual browser opening?

Posted by 廉价感情. on 2019-12-24 01:10:39
Question: I want to do some web crawling with Scrapy and Python. I have found a few code examples on the internet where they use Selenium with Scrapy. I don't know much about Selenium; I only know that it automates some web tasks, and that the browser actually opens and does things. But I don't want the actual browser to open; I want everything to happen from the command line. Can I do that with Selenium and Scrapy?

Answer 1: You can use Selenium with PyVirtualDisplay, at least on Linux.

from pyvirtualdisplay import Display
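A fuller sketch of that answer's approach (the display size and URL are illustrative); note that on recent Selenium versions the browser's own headless mode is a simpler alternative:

from pyvirtualdisplay import Display
from selenium import webdriver

# start an invisible X display (Linux; requires Xvfb)
display = Display(visible=0, size=(1024, 768))
display.start()

driver = webdriver.Firefox()  # the browser opens inside the virtual display
driver.get("https://example.com")  # placeholder URL
print(driver.title)
driver.quit()

display.stop()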

Is it possible to remove requests from Scrapy's scheduler queue?

Posted by 删除回忆录丶 on 2019-12-24 00:47:44
Question: Is it possible to remove requests from Scrapy's scheduler queue? I have a working routine that limits crawling to a certain domain for a set amount of time. It works in the sense that it will not yield any more links once the time limit is hit, but since the queue can already contain thousands of requests for the domain, I'd like to remove them from the scheduler queue once the time limit is hit.

Answer 1: Okay, so I ended up following the suggestion from @rickgh12hs and wrote my own Downloader
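A sketch of what such a downloader middleware might look like, assuming a fixed per-domain time budget: once the budget is spent, matching requests are dropped with IgnoreRequest as they leave the scheduler, so the queued backlog is discarded instead of downloaded. It would be enabled via DOWNLOADER_MIDDLEWARES in settings.

import time
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

class DomainTimeLimitMiddleware:
    TIME_LIMIT = 300  # seconds per domain; illustrative value

    def __init__(self):
        self.first_seen = {}

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        start = self.first_seen.setdefault(domain, time.time())
        if time.time() - start > self.TIME_LIMIT:
            # drop the request instead of downloading it
            raise IgnoreRequest(f'time budget exhausted for {domain}')
        return None  # let the request proceed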

Timeout when installing Python packages

Posted by 走远了吗. on 2019-12-24 00:09:28
Problem:

D:\code_python\scriptLearning> pip install scrapy
Collecting scrapy
  Downloading https://files.pythonhosted.org/packages/3b/e4/69b87d7827abf03dea2ea984230d50f347b00a7a3897bc93f6ec3dafa494/Scrapy-1.8.0-py2.py3-none-any.whl (238kB)
     |████████████████████████████████| 245kB 105kB/s
Requirement already satisfied: zope.interface>=4.1.3 in d:\programdata\anaconda3\lib\site-packages (from scrapy) (4.5.0)
Requirement already satisfied: cssselect>=0.9.1 in d:\programdata\anaconda3\lib\site-packages (from scrapy) (1.1.0)
Collecting
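Two common workarounds for this kind of timeout: raise pip's socket timeout, or install from a nearby PyPI mirror (the Tsinghua mirror shown here is one widely used option in China).

pip install scrapy --default-timeout=100
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple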

Scrapy: crawl 1 level deep on offsite links

Posted by [亡魂溺海] on 2019-12-24 00:06:53
Question: In Scrapy, how would I go about having Scrapy crawl only 1 level deep for all links outside the allowed domains? Within the crawl, I want to be able to make sure all outbound links within the site are working and not 404'd. I do not want it to crawl the whole site of the non-allowed domain. I am currently handling allowed-domain 404s. I know that I can set a DEPTH_LIMIT of 1, but that will affect the allowed domain as well. My code:

from scrapy.selector import Selector
from scrapy.spiders
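One way to get this behavior without DEPTH_LIMIT, sketched below: route offsite links to a dedicated callback that only records the response status and never extracts further links, so only the allowed domain is crawled recursively (the domain and selectors are placeholders):

import scrapy
from urllib.parse import urlparse

class LinkCheckSpider(scrapy.Spider):
    name = 'linkcheck'
    # no allowed_domains, so the offsite middleware does not filter anything
    start_urls = ['https://example.com/']  # placeholder site
    handle_httpstatus_list = [404]  # let 404 responses reach the callbacks
    allowed = {'example.com'}

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if urlparse(url).netloc in self.allowed:
                yield scrapy.Request(url, callback=self.parse)  # keep crawling on-site
            else:
                yield scrapy.Request(url, callback=self.check_offsite)  # fetch once, never recurse

    def check_offsite(self, response):
        if response.status == 404:
            self.logger.warning('Broken outbound link: %s', response.url)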

How to get the comments in an HTML page while scraping?

Posted by 生来就可爱ヽ(ⅴ<●) on 2019-12-23 23:36:36
Question: Here's the issue: I'm trying to scrape a Facebook About page for the birthday, and when I view the page source in the browser, it shows the birthday as an HTML comment inside a div with class="hidden_elem". Perhaps because of this, when I look at the source of this page in my GET request (using Selenium, Scrapy, or requests), all I get is a div with class="hidden_elem", and the comment is nowhere to be seen, let alone parsed for info. So how to get this text
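A sketch of one way to read such a comment with Scrapy's selectors, assuming the markup really is present inside the hidden_elem div in the raw response: select the comment node with XPath, strip the comment delimiters, and re-parse the inner markup.

from scrapy import Selector

def extract_hidden(response):
    # the comment node, serialized as "<!-- ... -->"
    comment = response.xpath('//div[@class="hidden_elem"]/comment()').get()
    if comment is None:
        return None
    inner_html = comment.strip()[4:-3]  # drop the leading "<!--" and trailing "-->"
    hidden = Selector(text=inner_html)
    return hidden.css('div::text').getall()  # hypothetical inner selector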

Trouble using Scrapy with the JavaScript __doPostBack method

Posted by 时间秒杀一切 on 2019-12-23 23:13:52
Question: Trying to automatically grab the search results from a public search, but running into some trouble. The URL is of the form http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting. As I click through the pages, after visiting this page, it changes slightly to http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=2. The problem is, if I then try to directly visit the second link without first visiting the first link, I am redirected to the first link. My current attempt at
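A sketch of the usual ASP.NET workaround: because __doPostBack submits the page's form, re-post that form with __EVENTTARGET set instead of requesting the page-2 URL directly, so the server-side state carries over. The field names follow the ASP.NET convention; the pager control id is hypothetical.

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'search'
    start_urls = ['http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting']

    def parse(self, response):
        # ... extract the search results on the current page here ...

        # re-submit the ASP.NET form to trigger the __doPostBack pagination
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                '__EVENTTARGET': 'ctl00$Pager$Next',  # hypothetical control id
                '__EVENTARGUMENT': '',
            },
            callback=self.parse,
            dont_click=True,
        )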