scrapy

Scrapy pipeline SQLAlchemy Check if item exists before entering to DB?

冷暖自知 posted on 2019-12-30 07:55:05
Question: I'm writing a Scrapy spider to crawl YouTube videos and capture the name, subscriber count, link, etc. I copied this SQLAlchemy code from a tutorial and got it working, but every time I run the crawler I get duplicated info in the DB. How do I check whether the scraped data is already in the DB and, if so, skip inserting it? Here is my pipeline.py code:

```python
# -*- coding: utf-8 -*-
# Define your item ...
from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table
```
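
A minimal sketch of the usual fix: query for an existing row before adding a new one. It assumes the Channels model from the question has a unique link column (the field name is a guess) and that db_connect() and create_channel_table() come from the poster's models module.

```python
from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table


class YoutubeChannelPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_channel_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        try:
            # Only insert when no row with this link exists yet.
            exists = session.query(Channels).filter_by(link=item['link']).first()
            if exists is None:
                session.add(Channels(**item))
                session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
        return item
```

A unique constraint on the column plus catching IntegrityError is a sturdier variant, since it also protects against races between concurrent inserts.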

GtkWarning: could not open display

妖精的绣舞 posted on 2019-12-30 07:29:10
Question: I am trying to run a spider on a VPS (using scrapyjs, which uses python-gtk2). On running the spider I get the error:

/root/myporj/venv/local/lib/python2.7/dist-packages/gtk-2.0/gtk/__init__.py:57: GtkWarning: could not open display

How do I run this in a headless setup?

Answer 1: First of all, you didn't specify whether you have a desktop environment (or X) installed on your server. Regardless of that, you can achieve a headless setup for your spider by using Xvfb: Xvfb, or X virtual framebuffer, is ...
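
A sketch of one way to get a virtual display from inside Python, using the pyvirtualdisplay package (a thin wrapper around Xvfb). The package and the geometry are assumptions rather than part of the original answer; running the whole spider under xvfb-run works just as well.

```python
# pip install pyvirtualdisplay  (the xvfb system package must also be installed)
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1024, 768))
display.start()  # starts an Xvfb server and points DISPLAY at it

# ... import and run the gtk/scrapyjs-dependent code here ...

display.stop()
```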

xpath: string manipulation

人走茶凉 posted on 2019-12-30 07:28:11
Question: In my Scrapy project I was able to isolate some particular fields; one of the fields returns something like:

[Rank Info] on 2013-06-27 14:26 Read 174 Times

which was selected by the expression (//td[@class="show_content"]/text())[4]. I usually do post-processing to extract the datetime information, i.e. 2013-06-27 14:26. Now that I've learned a little more about XPath substring manipulation, I am wondering whether it is even possible to extract that piece of information in the first place, i.e. in ...
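
Yes: substring-before() and substring-after() are plain XPath 1.0 string functions, so the date can be carved out inside the expression itself. A small sketch against the sample text from the question; the " on " and " Read" delimiters are assumptions about the page's exact wording.

```python
from scrapy import Selector

html = ('<table><tr><td class="show_content">'
        '[Rank Info] on 2013-06-27 14:26 Read 174 Times'
        '</td></tr></table>')
sel = Selector(text=html)

# Nested string functions: everything after " on " and before " Read".
timestamp = sel.xpath(
    'substring-before(substring-after('
    '//td[@class="show_content"]/text(), " on "), " Read")'
).get()
print(timestamp)  # 2013-06-27 14:26
```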

Yield multiple items using scrapy

走远了吗 posted on 2019-12-30 07:22:08
Question: I'm scraping data from the following URL: http://www.indexmundi.com/commodities/?commodity=gasoline. There are two sections that contain a price: "Gulf Coast Gasoline Futures End of Day Settlement Price" and "Gasoline Daily Price". I want to scrape data from both sections as two different items. Here is the code I've written:

```python
if dailyPrice:
    item['description'] = u''.join(dailyPrice.xpath(".//h1/text()").extract())
    item['price'] = u''.join(dailyPrice.xpath(".//span/text()").extract())
    item[
```
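
The short answer is to build one item per section and yield each of them; a single return ends the callback after the first item. A sketch under the assumption that each section can be located by a container selector. The container XPaths below are guesses and need to be matched against the real page.

```python
import scrapy


class GasolineItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()


class GasolineSpider(scrapy.Spider):
    name = 'gasoline'
    start_urls = ['http://www.indexmundi.com/commodities/?commodity=gasoline']

    def parse(self, response):
        daily_price = response.xpath('//div[@id="gasoline-daily-price"]')   # guessed selector
        futures_price = response.xpath('//div[@id="gulf-coast-futures"]')   # guessed selector

        for section in (daily_price, futures_price):
            if not section:
                continue
            item = GasolineItem()
            item['description'] = u''.join(section.xpath('.//h1/text()').extract())
            item['price'] = u''.join(section.xpath('.//span/text()').extract())
            item['url'] = response.url
            yield item  # yield, not return, so both items reach the pipelines
```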

Scrapy: scraping data from Pagination

北战南征 posted on 2019-12-30 06:57:17
Question: So far I have scraped data from one page; I want to continue until the end of the pagination. Click here to view the page. There seems to be a problem because the href contains a JavaScript element:

<a href="javascript:void(0)" class="next" data-role="next" data-spm-anchor-id="a2700.galleryofferlist.pagination.8">Next</a>

My code:

```python
# -*- coding: utf-8 -*-
import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com
```
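
Since the "Next" anchor is javascript:void(0), there is nothing to follow directly; a common workaround is to build the next-page URL yourself and stop when a page comes back empty. A sketch, assuming the listing accepts a page number in the query string; the parameter name and the offer selector are guesses, and the start URL is a placeholder for the listing URL in the question.

```python
import scrapy
from w3lib.url import add_or_replace_parameter


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/...']  # the gallery offer list URL from the question

    def parse(self, response, page=1):
        offers = response.xpath('//div[contains(@class, "organic-gallery-offer")]')  # guessed selector
        for offer in offers:
            yield {'title': offer.xpath('.//h2//text()').get()}

        # Keep paging as long as the current page still listed offers.
        if offers:
            next_url = add_or_replace_parameter(response.url, 'page', str(page + 1))
            yield scrapy.Request(next_url, callback=self.parse, cb_kwargs={'page': page + 1})
```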

Running Multiple spiders in scrapy

霸气de小男生 posted on 2019-12-30 06:21:49
Question: In Scrapy, suppose I have two URLs that contain different HTML. I want to write two individual spiders, one for each, and run both spiders at once. Is it possible in Scrapy to run multiple spiders at once? And after writing multiple spiders, how can we schedule them to run every 6 hours (maybe like cron jobs)? I have no idea how to do the above; can you suggest how to perform these things, with an example? Thanks in advance.

Answer 1: It would probably be easiest to just run ...
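
Running several spiders from one script is supported through CrawlerProcess; a minimal sketch, where the import paths and spider class names are placeholders for the two spiders in question:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.first import FirstSpider    # placeholder import paths
from myproject.spiders.second import SecondSpider

process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)
process.crawl(SecondSpider)
process.start()  # blocks here until both spiders have finished
```

For the every-6-hours part, the simplest route is to have cron (or any other scheduler) invoke this script, since Scrapy itself does not ship a scheduler.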

pyconfig.h missing during “pip install cryptography”

大城市里の小女人 posted on 2019-12-30 05:48:45
Question: I want to set up scrapy-cluster following this link: scrapy-cluster. Everything is OK until I run this command:

pip install -r requirements.txt

The requirements.txt looks like:

```
cffi==1.2.1
characteristic==14.3.0
ConcurrentLogHandler>=0.9.1
cryptography==0.9.1
...
```

I guess the above command installs the packages listed in requirements.txt, but I don't want it to pin the versions, so I changed it to this:

```
cat requirements.txt | while read line; do pip install ${line%%[>=]*} --user; done
```

When installing ...
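
A sketch of the same loop rewritten in Python, in case the shell quoting around ${line%%[>=]*} is part of the trouble; it strips any version specifier before installing. (The pyconfig.h error itself usually points at missing Python development headers on the system rather than at the loop.)

```python
import re
import subprocess
import sys

with open('requirements.txt') as fh:
    for line in fh:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Keep only the package name, dropping ==, >=, <=, ~= ... specifiers.
        name = re.split(r'[<>=!~]', line, maxsplit=1)[0].strip()
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--user', name])
```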

Splash lua script to do multiple clicks and visits

隐身守侯 posted on 2019-12-30 03:27:08
Question: I'm trying to crawl Google Scholar search results and get the BibTeX format of each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script which will click the "Cite" link and load the modal window before getting the href of the BibTeX format of the citation. But seeing that there are multiple search results, and hence multiple "Cite" links, I need to click them all and load the individual BibTeX pages. Here's what I have: import scrapy from ...
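
One way to avoid a single Lua script that has to click everything is to issue one Splash request per result and have the script click only the n-th "Cite" link. The sketch below leans on Splash's element API (splash:select_all, mouse_click); the a.gs_or_cit and a.gs_citi class names, the fixed result count, and the search URL are guesses about Google Scholar's markup and need checking against the live page.

```python
import scrapy
from scrapy_splash import SplashRequest

CLICK_NTH_CITE = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    -- Lua tables are 1-indexed; click only the requested "Cite" link.
    local links = splash:select_all('a.gs_or_cit')
    if links[args.index] then
        links[args.index]:mouse_click()
        assert(splash:wait(2))
    end
    return {html = splash:html()}
end
"""


class ScholarSpider(scrapy.Spider):
    name = 'scholar'

    def start_requests(self):
        url = 'https://scholar.google.com/scholar?q=scrapy'
        for index in range(1, 11):  # one request per result on the page (count assumed)
            yield SplashRequest(
                url,
                self.parse_modal,
                endpoint='execute',
                args={'lua_source': CLICK_NTH_CITE, 'index': index},
            )

    def parse_modal(self, response):
        # The citation modal is in the rendered HTML; pull the BibTeX link out of it.
        sel = scrapy.Selector(text=response.data['html'])
        for href in sel.css('a.gs_citi::attr(href)').getall():
            yield {'bibtex_url': response.urljoin(href)}
```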

Setting Scrapy proxy middleware to rotate on each request

不羁岁月 posted on 2019-12-30 03:10:14
Question: This question necessarily comes in two forms, because I don't know the better route to a solution. A site I'm crawling often kicks me to a redirected "User Blocked" page, but the frequency (by requests/time) seems random, and they appear to have a blacklist blocking many of the "open" proxies on the list I'm using through Proxymesh. So... when Scrapy receives a "Redirect" to its request (e.g. DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)), does ...
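
For the rotation half of the question, a small downloader middleware that picks a fresh proxy for every request, and swaps proxies when the block page shows up, is usually enough. A sketch; the proxy list and the you_got_blocked check are placeholders for the poster's Proxymesh endpoints and block URL.

```python
import random

# Placeholder endpoints; in practice these would be the Proxymesh gateways.
PROXY_LIST = [
    'http://user:pass@proxy-one.example.com:31280',
    'http://user:pass@proxy-two.example.com:31280',
]


class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # A new proxy for every outgoing request.
        request.meta['proxy'] = random.choice(PROXY_LIST)

    def process_response(self, request, response, spider):
        # If we landed on the block page, retry the original URL through another proxy.
        if 'you_got_blocked' in response.url:
            retry = request.replace(dont_filter=True)
            retry.meta['proxy'] = random.choice(PROXY_LIST)
            return retry
        return response
```

Enable it in settings.py with a priority below the built-in HttpProxyMiddleware (750), e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomProxyMiddleware': 350}, so meta['proxy'] is already set when the proxy is applied.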

Scrapy Splash Screenshots?

喜你入骨 posted on 2019-12-30 01:25:15
Question: I'm trying to scrape a site while taking a screenshot of every page. So far, I have managed to piece together the following code:

```python
import json
import base64
import scrapy
from scrapy_splash import SplashRequest


class ExtractSpider(scrapy.Spider):
    name = 'extract'

    def start_requests(self):
        url = 'https://stackoverflow.com/'
        splash_args = {
            'html': 1,
            'png': 1
        }
        yield SplashRequest(url, self.parse_result, endpoint='render.json', args=splash_args)

    def parse_result(self, response):
        png_bytes =
```
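
The part the snippet stops at is just base64-decoding the 'png' key of the render.json payload and writing it to disk. A sketch of a completed parse_result for the spider above; the filename scheme is made up.

```python
import base64

    def parse_result(self, response):
        # render.json responses are exposed by scrapy-splash as parsed JSON in response.data
        data = response.data
        png_bytes = base64.b64decode(data['png'])

        filename = 'screenshot-{}.png'.format(
            response.url.replace('://', '-').replace('/', '_'))
        with open(filename, 'wb') as fh:
            fh.write(png_bytes)

        # The rendered HTML is also there for normal parsing / link extraction.
        html = data['html']
        self.logger.info('Saved %s (%d bytes of PNG)', filename, len(png_bytes))
```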