scrapy-pipeline

Scrapy image pipeline does not download images

那年仲夏 submitted on 2020-01-07 06:28:20
Question: I'm trying to set up image downloading from web pages using the Scrapy framework and DjangoItem. I think I have done everything as in the docs, but after calling scrapy crawl I get a log looking like this: Scrapy log. I can't find any information there on what went wrong, but the Images field is empty and the directory does not contain any images. This is my model:

    class Event(models.Model):
        title = models.CharField(max_length=100, blank=False)
        description = models.TextField(blank=True, null=True)
        event_location =
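The question is truncated before the project settings are shown. A frequent cause of a silently empty images field is the images pipeline not being enabled (or Pillow missing, which disables it). A minimal sketch of the documented configuration, with an illustrative item using the default field names:

    # settings.py -- enable Scrapy's built-in images pipeline
    # (it requires Pillow; without it Scrapy disables the pipeline)
    ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
    IMAGES_STORE = "/path/to/images"  # must be a writable directory

    # the item must carry the URLs in "image_urls"; the pipeline
    # writes the download results back into "images"
    import scrapy

    class EventItem(scrapy.Item):
        image_urls = scrapy.Field()
        images = scrapy.Field()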

Use Django's models in a Scrapy project (in the pipeline)

半世苍凉 submitted on 2020-01-05 07:11:08
Question: This has been asked before, but the answer that always comes up is to use DjangoItem. However, its GitHub page states that it is "often not a good choice for write intensive applications (such as a web crawler) ... may not scale well". This is the crux of my problem: I'd like to use and interact with my Django models the same way I can when I run python manage.py shell and do from myapp.models import Model1, using queries like those seen here. I have tried relative imports and moving my whole scrapy
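A common alternative is to call the Django ORM directly from a Scrapy pipeline after bootstrapping Django. A minimal sketch; "myproject.settings", "myapp.models.Model1", and the item-to-field mapping are assumptions taken from the question:

    # pipelines.py -- bootstrap Django, then use models as in manage.py shell
    import os
    import django

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    django.setup()  # must run before importing any models

    from myapp.models import Model1

    class DjangoWriterPipeline:
        def process_item(self, item, spider):
            # plain ORM call, exactly as in an interactive shell
            Model1.objects.create(**item)
            return item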

Scrapy pipeline SQLAlchemy Check if item exists before entering to DB?

我与影子孤独终老i submitted on 2019-12-30 07:55:56
Question: I'm writing a Scrapy spider to crawl YouTube videos and capture the name, subscriber count, link, etc. I copied this SQLAlchemy code from a tutorial and got it working, but every time I run the crawler I get duplicated info in the DB. How do I check whether the scraped data is already in the DB and, if so, skip the insert? Here is my pipeline.py code:

    from sqlalchemy.orm import sessionmaker
    from models import Channels, db_connect, create_channel_table
    # -*- coding: utf-8 -*-
    # Define your item
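The usual fix is to query for the record before adding it. A minimal sketch reusing the question's models helpers; the "link" column as the uniqueness key and the item-to-column mapping are assumptions:

    from sqlalchemy.orm import sessionmaker
    from models import Channels, db_connect, create_channel_table

    class YoutubePipeline:
        def __init__(self):
            engine = db_connect()
            create_channel_table(engine)
            self.Session = sessionmaker(bind=engine)

        def process_item(self, item, spider):
            session = self.Session()
            try:
                # look the channel up by its link before inserting
                exists = session.query(Channels).filter_by(link=item["link"]).first()
                if exists is None:
                    session.add(Channels(**item))  # item keys must match columns
                    session.commit()
            except Exception:
                session.rollback()
                raise
            finally:
                session.close()
            return item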

How to enable overwriting a file everytime in scrapy item export?

最后都变了- submitted on 2019-12-23 01:12:15
Question: I am scraping a website which returns a list of URLs. Example: scrapy crawl xyz_spider -o urls.csv. It works absolutely fine; now what I want is to make a fresh urls.csv each run rather than appending data to the file. Is there any parameter I can pass to enable that? Answer 1: Unfortunately scrapy can't do this at the moment. There is a proposed enhancement on GitHub though: https://github.com/scrapy/scrapy/issues/547 However you can easily redirect the output to stdout and redirect that to a file:
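The command itself was cut off from the answer. A plausible form of the redirect, assuming the csv feed exporter and -o - for stdout, is:

    scrapy crawl xyz_spider -t csv -o - > urls.csv

Newer Scrapy releases (2.4+) also have an -O flag that overwrites the output file directly: scrapy crawl xyz_spider -O urls.csv.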

Scrapy store returned items in variables to use in main script

江枫思渺然 submitted on 2019-12-22 14:05:03
Question: I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed their tutorial and changed the code for my purposes:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/page/1/']
        custom_settings = {
            'LOG_ENABLED': 'False',
        }

        def parse(self, response):
            global title  # This would work, but there should
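Instead of a global, the scraped items can be collected in the main script by hooking Scrapy's item_scraped signal. A minimal sketch under the question's setup; the selector and the yielded dict are assumptions:

    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    collected = []  # filled while the crawl runs

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/page/1/"]
        custom_settings = {"LOG_ENABLED": False}

        def parse(self, response):
            yield {"title": response.css("title::text").get()}

    def on_item(item, response, spider):
        collected.append(item)  # receives every scraped item

    process = CrawlerProcess()
    crawler = process.create_crawler(QuotesSpider)
    crawler.signals.connect(on_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes
    print(collected)  # the values are now plain Python objects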

Scrapy: how to use items in spider and how to send items to pipelines?

自作多情 submitted on 2019-12-20 08:49:37
Question: I am new to Scrapy and my task is simple. For a given e-commerce website: crawl all website pages, look for product pages, and if a URL points to a product page, create an Item and process it to store it in a database. I created the spider, but products are just printed to a simple file. My question is about the project structure: how do I use items in the spider, and how do I send items to pipelines? I can't find a simple example of a project using items and pipelines. Answer 1: How to use items in my spider?
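The answer is truncated above; a minimal end-to-end sketch of the usual wiring (ProductItem, its fields, and the selectors are illustrative assumptions):

    # items.py -- a minimal item definition
    import scrapy

    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()

    # spiders/products.py -- yielding the item sends it to the pipelines
    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["https://example.com/products"]  # placeholder

        def parse(self, response):
            item = ProductItem()
            item["name"] = response.css("h1::text").get()
            item["price"] = response.css(".price::text").get()
            item["url"] = response.url
            yield item

    # pipelines.py -- receives every item the spider yields
    class StoreProductPipeline:
        def process_item(self, item, spider):
            # write the item to the database here
            return item

    # settings.py -- activate the pipeline (lower number runs earlier)
    ITEM_PIPELINES = {"myproject.pipelines.StoreProductPipeline": 300}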

Export scrapy items to different files

杀马特。学长 韩版系。学妹 submitted on 2019-12-19 04:07:04
Question: I'm scraping reviews from MOOCs like this one. From there I'm getting all the course details, 5 items, and another 6 items from each review itself. This is the code I have for the course details:

    def parse_reviews(self, response):
        l = ItemLoader(item=MoocsItem(), response=response)
        l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
        l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
        l.add_xpath('course_instructors', '/
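One way to export the two kinds of items to different files is a pipeline that keeps one exporter per item class. A minimal sketch; the per-class filenames and a second MoocsReviewItem class are assumptions:

    from scrapy.exporters import CsvItemExporter

    class MultiFileExportPipeline:
        def open_spider(self, spider):
            self.files = {}
            self.exporters = {}

        def process_item(self, item, spider):
            # route by item class, e.g. MoocsItem vs MoocsReviewItem
            name = type(item).__name__
            if name not in self.exporters:
                f = open("%s.csv" % name, "wb")
                exporter = CsvItemExporter(f)
                exporter.start_exporting()
                self.files[name] = f
                self.exporters[name] = exporter
            self.exporters[name].export_item(item)
            return item

        def close_spider(self, spider):
            for exporter in self.exporters.values():
                exporter.finish_exporting()
            for f in self.files.values():
                f.close()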

Crawl website from list of values using scrapy

北城以北 submitted on 2019-12-12 03:32:00
Question: I have a list of NPIs for which I want to scrape the provider names from npidb.org. The NPI values are stored in a CSV file. I am able to do it manually by pasting the URLs into the code; however, I am unable to figure out how to do it when I have a list of NPIs, for each of which I want the provider name. Here is my current code:

    import scrapy
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = "npidb"

        def start_requests(self):
            urls = [
                'https://npidb.org/npi-lookup/
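The list of URLs can be built from the CSV inside start_requests. A minimal sketch using the modern scrapy.Spider base class (BaseSpider is long deprecated); the "npis.csv" filename, the single-column layout, the lookup URL pattern, and the name selector are all assumptions:

    import csv
    import scrapy

    class NpiSpider(scrapy.Spider):
        name = "npidb"

        def start_requests(self):
            # one request per NPI value read from the CSV
            with open("npis.csv") as f:
                for row in csv.reader(f):
                    npi = row[0].strip()
                    if npi:
                        # URL pattern is a guess; adjust to the real
                        # npidb.org lookup path
                        url = "https://npidb.org/npi-lookup/?s=%s" % npi
                        yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # extract the provider name here (selector is illustrative)
            yield {"name": response.css("h1::text").get()}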