scrapy

Scrapy pipeline SQLAlchemy Check if item exists before entering to DB?

冷暖自知 posted on 2019-12-30 07:55:05
Question: I'm writing a Scrapy spider to crawl YouTube videos and capture the name, subscriber count, link, etc. I copied this SQLAlchemy code from a tutorial and got it working, but every time I run the crawler I get duplicated info in the DB. How do I check whether the scraped data is already in the DB and, if so, skip inserting it? Here is my pipeline.py code:

```python
# -*- coding: utf-8 -*-
# Define your item ...
from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table
```
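
A minimal sketch of the usual fix: query for an existing row before adding a new one. It assumes the Channels model from the question has a unique link column (the field name is a guess) and that db_connect() and create_channel_table() come from the poster's models module.

```python
from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table


class YoutubeChannelPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_channel_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        try:
            # Only insert when no row with this link exists yet.
            exists = session.query(Channels).filter_by(link=item['link']).first()
            if exists is None:
                session.add(Channels(**item))
                session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
        return item
```

A unique constraint on the column plus catching IntegrityError is a sturdier variant, since it also protects against races between concurrent inserts.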

GtkWarning: could not open display

妖精的绣舞 posted on 2019-12-30 07:29:10
Question: I am trying to run a spider on a VPS (using scrapyjs, which uses python-gtk2). On running the spider I get the error:

/root/myporj/venv/local/lib/python2.7/dist-packages/gtk-2.0/gtk/__init__.py:57: GtkWarning: could not open display

How do I run this in a headless setup?

Answer 1: First of all, you didn't specify whether you have a desktop environment (or X) installed on your server. Regardless of that, you can achieve a headless setup for your spider by using Xvfb: Xvfb, or X virtual framebuffer, is ...
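
A sketch of one way to get a virtual display from inside Python, using the pyvirtualdisplay package (a thin wrapper around Xvfb). The package and the geometry are assumptions rather than part of the original answer; running the whole spider under xvfb-run works just as well.

```python
# pip install pyvirtualdisplay  (the xvfb system package must also be installed)
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1024, 768))
display.start()  # starts an Xvfb server and points DISPLAY at it

# ... import and run the gtk/scrapyjs-dependent code here ...

display.stop()
```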

xpath: string manipulation

人走茶凉 posted on 2019-12-30 07:28:11
Question: In my Scrapy project I was able to isolate some particular fields; one of the fields returns something like:

[Rank Info] on 2013-06-27 14:26 Read 174 Times

which was selected by the expression (//td[@class="show_content"]/text())[4]. I usually do post-processing to extract the datetime information, i.e. 2013-06-27 14:26. Now that I've learned a little more about XPath substring manipulation, I am wondering whether it is even possible to extract that piece of information in the first place, i.e. in ...
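
Yes: substring-before() and substring-after() are plain XPath 1.0 string functions, so the date can be carved out inside the expression itself. A small sketch against the sample text from the question; the " on " and " Read" delimiters are assumptions about the page's exact wording.

```python
from scrapy import Selector

html = ('<table><tr><td class="show_content">'
        '[Rank Info] on 2013-06-27 14:26 Read 174 Times'
        '</td></tr></table>')
sel = Selector(text=html)

# Nested string functions: everything after " on " and before " Read".
timestamp = sel.xpath(
    'substring-before(substring-after('
    '//td[@class="show_content"]/text(), " on "), " Read")'
).get()
print(timestamp)  # 2013-06-27 14:26
```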

Yield multiple items using scrapy

走远了吗 posted on 2019-12-30 07:22:08
Question: I'm scraping data from the following URL: http://www.indexmundi.com/commodities/?commodity=gasoline. There are two sections that contain a price: "Gulf Coast Gasoline Futures End of Day Settlement Price" and "Gasoline Daily Price". I want to scrape data from both sections as two different items. Here is the code I've written:

```python
if dailyPrice:
    item['description'] = u''.join(dailyPrice.xpath(".//h1/text()").extract())
    item['price'] = u''.join(dailyPrice.xpath(".//span/text()").extract())
    item[
```
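
The short answer is to build one item per section and yield each of them; a single return ends the callback after the first item. A sketch under the assumption that each section can be located by a container selector. The container XPaths below are guesses and need to be matched against the real page.

```python
import scrapy


class GasolineItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()


class GasolineSpider(scrapy.Spider):
    name = 'gasoline'
    start_urls = ['http://www.indexmundi.com/commodities/?commodity=gasoline']

    def parse(self, response):
        daily_price = response.xpath('//div[@id="gasoline-daily-price"]')   # guessed selector
        futures_price = response.xpath('//div[@id="gulf-coast-futures"]')   # guessed selector

        for section in (daily_price, futures_price):
            if not section:
                continue
            item = GasolineItem()
            item['description'] = u''.join(section.xpath('.//h1/text()').extract())
            item['price'] = u''.join(section.xpath('.//span/text()').extract())
            item['url'] = response.url
            yield item  # yield, not return, so both items reach the pipelines
```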

Scrapy: scraping data from Pagination

北战南征 posted on 2019-12-30 06:57:17
Question: So far I have scraped data from one page; I want to continue until the end of the pagination. Click here to view the page. There seems to be a problem because the href contains a JavaScript element:

<a href="javascript:void(0)" class="next" data-role="next" data-spm-anchor-id="a2700.galleryofferlist.pagination.8">Next</a>

My code:

```python
# -*- coding: utf-8 -*-
import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com
```
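
Since the "Next" anchor is javascript:void(0), there is nothing to follow directly; a common workaround is to build the next-page URL yourself and stop when a page comes back empty. A sketch, assuming the listing accepts a page number in the query string; the parameter name and the offer selector are guesses, and the start URL is a placeholder for the listing URL in the question.

```python
import scrapy
from w3lib.url import add_or_replace_parameter


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/...']  # the gallery offer list URL from the question

    def parse(self, response, page=1):
        offers = response.xpath('//div[contains(@class, "organic-gallery-offer")]')  # guessed selector
        for offer in offers:
            yield {'title': offer.xpath('.//h2//text()').get()}

        # Keep paging as long as the current page still listed offers.
        if offers:
            next_url = add_or_replace_parameter(response.url, 'page', str(page + 1))
            yield scrapy.Request(next_url, callback=self.parse, cb_kwargs={'page': page + 1})
```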

Running Multiple spiders in scrapy

霸气de小男生 posted on 2019-12-30 06:21:49
Question: In Scrapy, suppose I have two URLs that contain different HTML. I want to write two individual spiders, one for each, and run both spiders at once. Is it possible in Scrapy to run multiple spiders at once? And after writing multiple spiders, how can we schedule them to run every 6 hours (maybe like cron jobs)? I have no idea how to do the above; can you suggest how to perform these things, with an example? Thanks in advance.

Answer 1: It would probably be easiest to just run ...
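
Running several spiders from one script is supported through CrawlerProcess; a minimal sketch, where the import paths and spider class names are placeholders for the two spiders in question:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.first import FirstSpider    # placeholder import paths
from myproject.spiders.second import SecondSpider

process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)
process.crawl(SecondSpider)
process.start()  # blocks here until both spiders have finished
```

For the every-6-hours part, the simplest route is to have cron (or any other scheduler) invoke this script, since Scrapy itself does not ship a scheduler.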

pyconfig.h missing during “pip install cryptography”

大城市里の小女人 posted on 2019-12-30 05:48:45
Question: I want to set up scrapy-cluster following this link: scrapy-cluster. Everything is OK until I run this command:

pip install -r requirements.txt

The requirements.txt looks like:

```
cffi==1.2.1
characteristic==14.3.0
ConcurrentLogHandler>=0.9.1
cryptography==0.9.1
...
```

I guess the above command installs the packages listed in requirements.txt, but I don't want it to pin the versions, so I changed it to this:

```
cat requirements.txt | while read line; do pip install ${line%%[>=]*} --user; done
```

When installing ...
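
A sketch of the same loop rewritten in Python, in case the shell quoting around ${line%%[>=]*} is part of the trouble; it strips any version specifier before installing. (The pyconfig.h error itself usually points at missing Python development headers on the system rather than at the loop.)

```python
import re
import subprocess
import sys

with open('requirements.txt') as fh:
    for line in fh:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Keep only the package name, dropping ==, >=, <=, ~= ... specifiers.
        name = re.split(r'[<>=!~]', line, maxsplit=1)[0].strip()
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--user', name])
```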

Splash lua script to do multiple clicks and visits

隐身守侯 posted on 2019-12-30 03:27:08
Question: I'm trying to crawl Google Scholar search results and get the BibTeX format of each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script which will click the "Cite" link and load the modal window before getting the href of the BibTeX format of the citation. But seeing that there are multiple search results, and hence multiple "Cite" links, I need to click them all and load the individual BibTeX pages. Here's what I have: import scrapy from ...
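
One way to avoid a single Lua script that has to click everything is to issue one Splash request per result and have the script click only the n-th "Cite" link. The sketch below leans on Splash's element API (splash:select_all, mouse_click); the a.gs_or_cit and a.gs_citi class names, the fixed result count, and the search URL are guesses about Google Scholar's markup and need checking against the live page.

```python
import scrapy
from scrapy_splash import SplashRequest

CLICK_NTH_CITE = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    -- Lua tables are 1-indexed; click only the requested "Cite" link.
    local links = splash:select_all('a.gs_or_cit')
    if links[args.index] then
        links[args.index]:mouse_click()
        assert(splash:wait(2))
    end
    return {html = splash:html()}
end
"""


class ScholarSpider(scrapy.Spider):
    name = 'scholar'

    def start_requests(self):
        url = 'https://scholar.google.com/scholar?q=scrapy'
        for index in range(1, 11):  # one request per result on the page (count assumed)
            yield SplashRequest(
                url,
                self.parse_modal,
                endpoint='execute',
                args={'lua_source': CLICK_NTH_CITE, 'index': index},
            )

    def parse_modal(self, response):
        # The citation modal is in the rendered HTML; pull the BibTeX link out of it.
        sel = scrapy.Selector(text=response.data['html'])
        for href in sel.css('a.gs_citi::attr(href)').getall():
            yield {'bibtex_url': response.urljoin(href)}
```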

Setting Scrapy proxy middleware to rotate on each request

不羁岁月 posted on 2019-12-30 03:10:14
Question: This question necessarily comes in two forms, because I don't know the better route to a solution. A site I'm crawling often kicks me to a redirected "User Blocked" page, but the frequency (by requests/time) seems random, and they appear to have a blacklist blocking many of the "open" proxies on the list I'm using through Proxymesh. So... when Scrapy receives a "Redirect" to its request (e.g. DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)), does ...
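
For the rotation half of the question, a small downloader middleware that picks a fresh proxy for every request, and swaps proxies when the block page shows up, is usually enough. A sketch; the proxy list and the you_got_blocked check are placeholders for the poster's Proxymesh endpoints and block URL.

```python
import random

# Placeholder endpoints; in practice these would be the Proxymesh gateways.
PROXY_LIST = [
    'http://user:pass@proxy-one.example.com:31280',
    'http://user:pass@proxy-two.example.com:31280',
]


class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # A new proxy for every outgoing request.
        request.meta['proxy'] = random.choice(PROXY_LIST)

    def process_response(self, request, response, spider):
        # If we landed on the block page, retry the original URL through another proxy.
        if 'you_got_blocked' in response.url:
            retry = request.replace(dont_filter=True)
            retry.meta['proxy'] = random.choice(PROXY_LIST)
            return retry
        return response
```

Enable it in settings.py with a priority below the built-in HttpProxyMiddleware (750), e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomProxyMiddleware': 350}, so meta['proxy'] is already set when the proxy is applied.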

Scrapy Splash Screenshots?

喜你入骨 posted on 2019-12-30 01:25:15
Question: I'm trying to scrape a site while taking a screenshot of every page. So far, I have managed to piece together the following code:

```python
import json
import base64
import scrapy
from scrapy_splash import SplashRequest


class ExtractSpider(scrapy.Spider):
    name = 'extract'

    def start_requests(self):
        url = 'https://stackoverflow.com/'
        splash_args = {
            'html': 1,
            'png': 1
        }
        yield SplashRequest(url, self.parse_result, endpoint='render.json', args=splash_args)

    def parse_result(self, response):
        png_bytes =
```
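
The part the snippet stops at is just base64-decoding the 'png' key of the render.json payload and writing it to disk. A sketch of a completed parse_result for the spider above; the filename scheme is made up.

```python
import base64

    def parse_result(self, response):
        # render.json responses are exposed by scrapy-splash as parsed JSON in response.data
        data = response.data
        png_bytes = base64.b64decode(data['png'])

        filename = 'screenshot-{}.png'.format(
            response.url.replace('://', '-').replace('/', '_'))
        with open(filename, 'wb') as fh:
            fh.write(png_bytes)

        # The rendered HTML is also there for normal parsing / link extraction.
        html = data['html']
        self.logger.info('Saved %s (%d bytes of PNG)', filename, len(png_bytes))
```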