scrapy

Database insertion fails without error with scrapy

孤街浪徒 Submitted on 2019-12-31 04:11:49
Question: I'm working with scrapy and dataset (https://dataset.readthedocs.io/en/latest/quickstart.html#storing-data), which is a layer on top of SQLAlchemy, trying to load data into an SQLite table as a follow-up to "Sqlalchemy: Dynamically create table from Scrapy item". Using the dataset package I have:

    class DynamicSQLlitePipeline(object):
        def __init__(self, table_name):
            db_path = "sqlite:///"+settings.SETTINGS_PATH+"\\data.db"
            db = dataset.connect(db_path)
            self.table = db[table_name].table

        def
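Below is a minimal sketch of how a dataset-backed pipeline can be written so inserts actually reach the database; the class name follows the question, but the insert/return details (and using the dataset table wrapper instead of the .table attribute) are assumptions, not the asker's exact code.

    import dataset

    class DynamicSQLlitePipeline(object):
        def __init__(self, table_name):
            # path shortened for the sketch; the question builds it from settings
            db = dataset.connect("sqlite:///data.db")
            # use the dataset table wrapper; db[table_name].table is the raw
            # SQLAlchemy Table object, whose insert() only builds a statement
            # instead of executing one
            self.table = db[table_name]

        def process_item(self, item, spider):
            self.table.insert(dict(item))
            return item  # pipelines must return the item (or raise DropItem)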

Scrapy: constructing non-duplicative list of absolute paths from relative paths

余生颓废 Submitted on 2019-12-31 04:07:13
Question: How do I use Scrapy to create a non-duplicative list of absolute paths from the relative paths found in img src attributes? Background: I am trying to use Scrapy to crawl a site, pull any links from img src attributes, convert relative paths to absolute paths, and then produce the absolute paths as CSV or a Python list. I plan on combining the above with actually downloading files using Scrapy while concurrently crawling for links, but I'll cross that bridge when I get to it. For
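A minimal sketch of one way to do this (the spider name and start URL are placeholders): response.urljoin() turns relative src values into absolute URLs, and a set keeps the output free of duplicates.

    import scrapy

    class ImageLinkSpider(scrapy.Spider):
        name = "image_links"
        start_urls = ["http://example.com"]  # placeholder

        def parse(self, response):
            seen = set()
            for src in response.xpath("//img/@src").extract():
                absolute = response.urljoin(src)  # resolves relative paths
                if absolute not in seen:
                    seen.add(absolute)
                    yield {"image_url": absolute}

Running it with "scrapy crawl image_links -o images.csv" writes the de-duplicated URLs straight to CSV via the feed exporter.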

Scrapy - Spider crawls duplicate urls

天涯浪子 Submitted on 2019-12-31 02:42:07
Question: I'm crawling a search results page and scraping title and link information from that same page. Since it is a search page, I also have links to the next pages, which I have allowed in the SgmlLinkExtractor. The problem is: on the 1st page it finds the links to Page 2 and Page 3 and crawls them perfectly. But when it crawls the 2nd page, that page again has links to Page 1 (the previous page) and Page 3 (the next page), so it crawls Page 1 again with Page 2 as the referrer, and it's going
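For reference, a minimal sketch (not the asker's spider) of how duplicate requests are normally avoided: Scrapy's scheduler already drops requests for URLs it has seen via its duplicate filter, as long as requests are not created with dont_filter=True, and a CrawlSpider rule controls which links get followed at all. SgmlLinkExtractor is deprecated in current Scrapy, so the sketch uses LinkExtractor.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class SearchSpider(CrawlSpider):
        name = "search"
        start_urls = ["http://example.com/search?page=1"]  # placeholder

        rules = (
            # unique=True de-duplicates links within one page; the scheduler's
            # dupefilter de-duplicates requests across pages
            Rule(LinkExtractor(allow=(r"search\?page=\d+",), unique=True),
                 callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"title": response.xpath("//title/text()").extract_first(),
                   "url": response.url}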

Is it possible to run another spider from Scrapy spider?

懵懂的女人 Submitted on 2019-12-30 18:47:11
Question: For now I have 2 spiders. What I would like is: Spider 1 goes to url1 and, if url2 appears, calls Spider 2 with url2, while also saving the content of url1 through a pipeline. Spider 2 then goes to url2 and does its own work. Due to the complexity of both spiders I would like to keep them separate. What I have tried using scrapy crawl:

    def parse(self, response):
        p = multiprocessing.Process( target=self.testfunc())
        p.join()
        p.start()

    def testfunc(self):
        settings = get_project_settings()
        crawler =
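Note that the attempt above calls p.join() before p.start(), and target=self.testfunc() invokes the function immediately instead of handing it to the process, so the second crawl never runs in a child process as intended. The more common pattern is to launch both spiders from a single script with CrawlerProcess; a minimal sketch, assuming hypothetical Spider1/Spider2 classes and import paths:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.spider1 import Spider1  # hypothetical import paths
    from myproject.spiders.spider2 import Spider2

    process = CrawlerProcess(get_project_settings())
    process.crawl(Spider1)
    process.crawl(Spider2)
    process.start()  # blocks until both crawls finish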

The Scrapy Framework

扶醉桌前 Submitted on 2019-12-30 14:30:04
1. Introduction
2. Installation
3. The command-line tool
4. Project structure and a quick look at spider applications
5. Spiders
6. Selectors
7. Items
8. Item Pipeline
9. Downloader Middleware
10. Spider Middleware
11. Custom extensions
12. settings.py
13. Scraping Amazon product information

1. Introduction

Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping); it lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used much more broadly, in areas such as data mining, monitoring, and automated testing; it can also be used to consume data returned by APIs (for example Amazon Associates Web Services) or as a general-purpose web crawler.

Scrapy is built on Twisted, a popular event-driven Python networking framework, and therefore uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows.

The data flow in Scrapy is controlled by the execution engine; the steps are as follows:

1. The spiders generate requests and hand them to the engine.
2. The engine (ENGINE) passes the requests it has just received to the scheduler, which stores them in a queue or stack and dispatches them back to the engine one at a time.
3. The scheduler (SCHEDULER) returns a URL to crawl to the engine.
4
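To make the request/response cycle concrete, here is a minimal spider sketch (the demo site and selectors are illustrative only, not from the article): the spider yields requests, the engine and scheduler queue them, the downloader fetches the pages, and the callback yields items or follow-up requests.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # each dict yielded here is an item handed to the item pipeline
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").extract_first(),
                       "author": quote.css("small.author::text").extract_first()}
            # a new Request goes back through the engine to the scheduler
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page:
                yield response.follow(next_page, callback=self.parse)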

Scrapy CrawlSpider retry scrape

我的未来我决定 Submitted on 2019-12-30 11:24:08
Question: For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some JavaScript which auto-reloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page. The logic I use in my CrawlSpider is something like:

    def parse_page(self, response):
        url = response.url
        # Check to make sure the page is loaded
        if 'var PageIsLoaded = false;' in response.body:
            self.logger.warning('parse_page
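One common way to retry from the callback, sketched below as an assumption rather than the asker's solution, is to re-yield the same request with dont_filter=True so the duplicate filter does not drop it; the retry counter kept in meta is likewise illustrative.

    def parse_page(self, response):
        if b'var PageIsLoaded = false;' in response.body:
            retries = response.meta.get('placeholder_retries', 0)
            if retries < 3:
                self.logger.warning('Placeholder page at %s, retrying', response.url)
                # same request, marked so the dupefilter lets it through again
                yield response.request.replace(
                    dont_filter=True,
                    meta={**response.meta, 'placeholder_retries': retries + 1},
                )
            return
        # ... normal extraction logic for the real page goes here ...

Extending Scrapy's built-in RetryMiddleware is another option, but the in-callback approach keeps the retry logic next to the detection code.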

How to extract raw html from a Scrapy selector?

你说的曾经没有我的故事 Submitted on 2019-12-30 09:40:54
Question: I'm extracting JS data using response.xpath('//*').re_first() and later converting it to Python native data. The problem is that the extract/re methods don't seem to provide a way to avoid unquoting the HTML, i.e. original html: {my_fields:['O&#39;Connor Park'], } extract output: {my_fields:['O'Connor Park'], } Turning this output into JSON won't work. What's the easiest way around it? Answer 1: Short answer: Scrapy/Parsel selectors' .re() and .re_first() methods replace HTML entities (except &lt;, &amp;); instead, use
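A sketch of the workaround implied by that answer (the XPath and regex below are hypothetical examples, not from the question): take the raw text with .extract_first(), which preserves entities such as &#39;, and apply the regex yourself with Python's re module.

    import re

    raw = response.xpath('//script/text()').extract_first()
    match = re.search(r'var config = (\{.*?\});', raw, re.DOTALL)
    if match:
        raw_js = match.group(1)  # entities like &#39; are still intact here

Recent parsel releases also accept a replace_entities=False argument on .re()/.re_first(), which achieves the same effect if your version supports it.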

Scrapy pipeline SQLAlchemy Check if item exists before entering to DB?

我与影子孤独终老i Submitted on 2019-12-30 07:55:56
Question: I'm writing a scrapy spider to crawl YouTube videos and capture the name, subscriber count, link, etc. I copied this SQLAlchemy code from a tutorial and got it working, but every time I run the crawler I get duplicated info in the DB. How do I check whether the scraped data is already in the DB and, if so, skip inserting it? Here is my pipeline.py code:

    from sqlalchemy.orm import sessionmaker
    from models import Channels, db_connect, create_channel_table
    # -*- coding: utf-8 -*-
    # Define your item
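A hedged sketch of the usual existence check inside process_item, assuming the Channels model from the question has a link column to match on (the column name and the pipeline class name are assumptions):

    from sqlalchemy.orm import sessionmaker
    from models import Channels, db_connect, create_channel_table

    class YoutubeChannelPipeline(object):  # hypothetical name
        def __init__(self):
            engine = db_connect()
            create_channel_table(engine)
            self.Session = sessionmaker(bind=engine)

        def process_item(self, item, spider):
            session = self.Session()
            try:
                # skip the insert if a row with the same link already exists
                exists = session.query(Channels).filter_by(link=item["link"]).first()
                if exists is None:
                    session.add(Channels(**dict(item)))
                    session.commit()
            except Exception:
                session.rollback()
                raise
            finally:
                session.close()
            return item

Adding a unique constraint on the matching column in the model is the more robust fix, since it lets the database itself reject duplicates.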
