scrapy

Database insertion fails without error with scrapy

孤街浪徒 Submitted on 2019-12-31 04:11:49
Question: I'm working with scrapy and dataset (https://dataset.readthedocs.io/en/latest/quickstart.html#storing-data), which is a layer on top of SQLAlchemy, trying to load data into an SQLite table as a follow-up to "Sqlalchemy: Dynamically create table from Scrapy item". Using the dataset package I have:

    class DynamicSQLlitePipeline(object):
        def __init__(self, table_name):
            db_path = "sqlite:///"+settings.SETTINGS_PATH+"\\data.db"
            db = dataset.connect(db_path)
            self.table = db[table_name].table

        def
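Below is a minimal sketch of how a dataset-backed pipeline can be written so inserts actually reach the database; the class name follows the question, but the insert/return details (and using the dataset table wrapper instead of the .table attribute) are assumptions, not the asker's exact code.

    import dataset

    class DynamicSQLlitePipeline(object):
        def __init__(self, table_name):
            # path shortened for the sketch; the question builds it from settings
            db = dataset.connect("sqlite:///data.db")
            # use the dataset table wrapper; db[table_name].table is the raw
            # SQLAlchemy Table object, whose insert() only builds a statement
            # instead of executing one
            self.table = db[table_name]

        def process_item(self, item, spider):
            self.table.insert(dict(item))
            return item  # pipelines must return the item (or raise DropItem)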

Scrapy: constructing non-duplicative list of absolute paths from relative paths

余生颓废 Submitted on 2019-12-31 04:07:13
Question: How do I use Scrapy to create a non-duplicative list of absolute paths from the relative paths found in img src attributes? Background: I am trying to use Scrapy to crawl a site, pull any links from img src attributes, convert relative paths to absolute paths, and then produce the absolute paths as CSV or a Python list. I plan on combining the above with actually downloading files using Scrapy while concurrently crawling for links, but I'll cross that bridge when I get to it. For
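A minimal sketch of one way to do this (the spider name and start URL are placeholders): response.urljoin() turns relative src values into absolute URLs, and a set keeps the output free of duplicates.

    import scrapy

    class ImageLinkSpider(scrapy.Spider):
        name = "image_links"
        start_urls = ["http://example.com"]  # placeholder

        def parse(self, response):
            seen = set()
            for src in response.xpath("//img/@src").extract():
                absolute = response.urljoin(src)  # resolves relative paths
                if absolute not in seen:
                    seen.add(absolute)
                    yield {"image_url": absolute}

Running it with "scrapy crawl image_links -o images.csv" writes the de-duplicated URLs straight to CSV via the feed exporter.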

Scrapy - Spider crawls duplicate urls

天涯浪子 Submitted on 2019-12-31 02:42:07
Question: I'm crawling a search results page and scraping title and link information from that same page. Since it is a search page, I also have links to the next pages, which I have allowed in the SgmlLinkExtractor. The problem is: on the 1st page it finds the links to Page 2 and Page 3 and crawls them perfectly. But when it crawls the 2nd page, that page again has links to Page 1 (the previous page) and Page 3 (the next page), so it crawls Page 1 again with Page 2 as the referrer, and it's going
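For reference, a minimal sketch (not the asker's spider) of how duplicate requests are normally avoided: Scrapy's scheduler already drops requests for URLs it has seen via its duplicate filter, as long as requests are not created with dont_filter=True, and a CrawlSpider rule controls which links get followed at all. SgmlLinkExtractor is deprecated in current Scrapy, so the sketch uses LinkExtractor.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class SearchSpider(CrawlSpider):
        name = "search"
        start_urls = ["http://example.com/search?page=1"]  # placeholder

        rules = (
            # unique=True de-duplicates links within one page; the scheduler's
            # dupefilter de-duplicates requests across pages
            Rule(LinkExtractor(allow=(r"search\?page=\d+",), unique=True),
                 callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"title": response.xpath("//title/text()").extract_first(),
                   "url": response.url}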

Is it possible to run another spider from Scrapy spider?

懵懂的女人 Submitted on 2019-12-30 18:47:11
Question: For now I have 2 spiders. What I would like is: Spider 1 goes to url1 and, if url2 appears, calls Spider 2 with url2, while also saving the content of url1 through a pipeline. Spider 2 then goes to url2 and does its own work. Due to the complexity of both spiders I would like to keep them separate. What I have tried using scrapy crawl:

    def parse(self, response):
        p = multiprocessing.Process( target=self.testfunc())
        p.join()
        p.start()

    def testfunc(self):
        settings = get_project_settings()
        crawler =
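Note that the attempt above calls p.join() before p.start(), and target=self.testfunc() invokes the function immediately instead of handing it to the process, so the second crawl never runs in a child process as intended. The more common pattern is to launch both spiders from a single script with CrawlerProcess; a minimal sketch, assuming hypothetical Spider1/Spider2 classes and import paths:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.spider1 import Spider1  # hypothetical import paths
    from myproject.spiders.spider2 import Spider2

    process = CrawlerProcess(get_project_settings())
    process.crawl(Spider1)
    process.crawl(Spider2)
    process.start()  # blocks until both crawls finish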

The Scrapy Framework

扶醉桌前 Submitted on 2019-12-30 14:30:04
1. Introduction
2. Installation
3. The command-line tool
4. Project structure and a quick look at spider applications
5. Spiders
6. Selectors
7. Items
8. Item Pipeline
9. Downloader Middleware
10. Spider Middleware
11. Custom extensions
12. settings.py
13. Scraping Amazon product information

1. Introduction

Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping); it lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used much more broadly, in areas such as data mining, monitoring, and automated testing; it can also be used to consume data returned by APIs (for example Amazon Associates Web Services) or as a general-purpose web crawler.

Scrapy is built on Twisted, a popular event-driven Python networking framework, and therefore uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows.

The data flow in Scrapy is controlled by the execution engine; the steps are as follows:

1. The spiders generate requests and hand them to the engine.
2. The engine (ENGINE) passes the requests it has just received to the scheduler, which stores them in a queue or stack and dispatches them back to the engine one at a time.
3. The scheduler (SCHEDULER) returns a URL to crawl to the engine.
4
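To make the request/response cycle concrete, here is a minimal spider sketch (the demo site and selectors are illustrative only, not from the article): the spider yields requests, the engine and scheduler queue them, the downloader fetches the pages, and the callback yields items or follow-up requests.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # each dict yielded here is an item handed to the item pipeline
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").extract_first(),
                       "author": quote.css("small.author::text").extract_first()}
            # a new Request goes back through the engine to the scheduler
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page:
                yield response.follow(next_page, callback=self.parse)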

Scrapy CrawlSpider retry scrape

我的未来我决定 Submitted on 2019-12-30 11:24:08
Question: For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some JavaScript which auto-reloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page. The logic I use in my CrawlSpider is something like:

    def parse_page(self, response):
        url = response.url
        # Check to make sure the page is loaded
        if 'var PageIsLoaded = false;' in response.body:
            self.logger.warning('parse_page
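One common way to retry from the callback, sketched below as an assumption rather than the asker's solution, is to re-yield the same request with dont_filter=True so the duplicate filter does not drop it; the retry counter kept in meta is likewise illustrative.

    def parse_page(self, response):
        if b'var PageIsLoaded = false;' in response.body:
            retries = response.meta.get('placeholder_retries', 0)
            if retries < 3:
                self.logger.warning('Placeholder page at %s, retrying', response.url)
                # same request, marked so the dupefilter lets it through again
                yield response.request.replace(
                    dont_filter=True,
                    meta={**response.meta, 'placeholder_retries': retries + 1},
                )
            return
        # ... normal extraction logic for the real page goes here ...

Extending Scrapy's built-in RetryMiddleware is another option, but the in-callback approach keeps the retry logic next to the detection code.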

How to extract raw html from a Scrapy selector?

你说的曾经没有我的故事 Submitted on 2019-12-30 09:40:54
Question: I'm extracting JS data using response.xpath('//*').re_first() and later converting it to Python native data. The problem is that the extract/re methods don't seem to provide a way to avoid unquoting the HTML, i.e. original html: {my_fields:['O&#39;Connor Park'], } extract output: {my_fields:['O'Connor Park'], } Turning this output into JSON won't work. What's the easiest way around it? Answer 1: Short answer: Scrapy/Parsel selectors' .re() and .re_first() methods replace HTML entities (except &lt;, &amp;); instead, use
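A sketch of the workaround implied by that answer (the XPath and regex below are hypothetical examples, not from the question): take the raw text with .extract_first(), which preserves entities such as &#39;, and apply the regex yourself with Python's re module.

    import re

    raw = response.xpath('//script/text()').extract_first()
    match = re.search(r'var config = (\{.*?\});', raw, re.DOTALL)
    if match:
        raw_js = match.group(1)  # entities like &#39; are still intact here

Recent parsel releases also accept a replace_entities=False argument on .re()/.re_first(), which achieves the same effect if your version supports it.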

Scrapy pipeline SQLAlchemy Check if item exists before entering to DB?

我与影子孤独终老i Submitted on 2019-12-30 07:55:56
Question: I'm writing a scrapy spider to crawl YouTube videos and capture the name, subscriber count, link, etc. I copied this SQLAlchemy code from a tutorial and got it working, but every time I run the crawler I get duplicated info in the DB. How do I check whether the scraped data is already in the DB and, if so, skip inserting it? Here is my pipeline.py code:

    from sqlalchemy.orm import sessionmaker
    from models import Channels, db_connect, create_channel_table
    # -*- coding: utf-8 -*-
    # Define your item
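A hedged sketch of the usual existence check inside process_item, assuming the Channels model from the question has a link column to match on (the column name and the pipeline class name are assumptions):

    from sqlalchemy.orm import sessionmaker
    from models import Channels, db_connect, create_channel_table

    class YoutubeChannelPipeline(object):  # hypothetical name
        def __init__(self):
            engine = db_connect()
            create_channel_table(engine)
            self.Session = sessionmaker(bind=engine)

        def process_item(self, item, spider):
            session = self.Session()
            try:
                # skip the insert if a row with the same link already exists
                exists = session.query(Channels).filter_by(link=item["link"]).first()
                if exists is None:
                    session.add(Channels(**dict(item)))
                    session.commit()
            except Exception:
                session.rollback()
                raise
            finally:
                session.close()
            return item

Adding a unique constraint on the matching column in the model is the more robust fix, since it lets the database itself reject duplicates.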
