scrapy-pipeline

Scrapy image pipeline does not download images

那年仲夏 submitted on 2020-01-07 06:28:20
Question: I'm trying to set up image downloading from web pages using the Scrapy framework and DjangoItem. I think I have done everything as in the docs, but after calling scrapy crawl I get a log looking like this: Scrapy log. I can't find any information there on what went wrong, but the Images field is empty and the directory does not contain any images. This is my model:

    class Event(models.Model):
        title = models.CharField(max_length=100, blank=False)
        description = models.TextField(blank=True, null=True)
        event_location =
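The question is truncated before the project settings are shown. A frequent cause of a silently empty images field is the images pipeline not being enabled (or Pillow missing, which disables it). A minimal sketch of the documented configuration, with an illustrative item using the default field names:

    # settings.py -- enable Scrapy's built-in images pipeline
    # (it requires Pillow; without it Scrapy disables the pipeline)
    ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
    IMAGES_STORE = "/path/to/images"  # must be a writable directory

    # the item must carry the URLs in "image_urls"; the pipeline
    # writes the download results back into "images"
    import scrapy

    class EventItem(scrapy.Item):
        image_urls = scrapy.Field()
        images = scrapy.Field()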

Use Django's models in a Scrapy project (in the pipeline)

半世苍凉 submitted on 2020-01-05 07:11:08
Question: This has been asked before, but the answer that always comes up is to use DjangoItem. However, its GitHub page states that it is "often not a good choice for write intensive applications (such as a web crawler) ... may not scale well". This is the crux of my problem: I'd like to use and interact with my Django models the same way I can when I run python manage.py shell and do from myapp.models import Model1, using queries like those seen here. I have tried relative imports and moving my whole scrapy
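A common alternative is to call the Django ORM directly from a Scrapy pipeline after bootstrapping Django. A minimal sketch; "myproject.settings", "myapp.models.Model1", and the item-to-field mapping are assumptions taken from the question:

    # pipelines.py -- bootstrap Django, then use models as in manage.py shell
    import os
    import django

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    django.setup()  # must run before importing any models

    from myapp.models import Model1

    class DjangoWriterPipeline:
        def process_item(self, item, spider):
            # plain ORM call, exactly as in an interactive shell
            Model1.objects.create(**item)
            return item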

Scrapy pipeline SQLAlchemy Check if item exists before entering to DB?

我与影子孤独终老i submitted on 2019-12-30 07:55:56
Question: I'm writing a Scrapy spider to crawl YouTube videos and capture the name, subscriber count, link, etc. I copied this SQLAlchemy code from a tutorial and got it working, but every time I run the crawler I get duplicated info in the DB. How do I check whether the scraped data is already in the DB and, if so, skip the insert? Here is my pipeline.py code:

    from sqlalchemy.orm import sessionmaker
    from models import Channels, db_connect, create_channel_table
    # -*- coding: utf-8 -*-
    # Define your item
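The usual fix is to query for the record before adding it. A minimal sketch reusing the question's models helpers; the "link" column as the uniqueness key and the item-to-column mapping are assumptions:

    from sqlalchemy.orm import sessionmaker
    from models import Channels, db_connect, create_channel_table

    class YoutubePipeline:
        def __init__(self):
            engine = db_connect()
            create_channel_table(engine)
            self.Session = sessionmaker(bind=engine)

        def process_item(self, item, spider):
            session = self.Session()
            try:
                # look the channel up by its link before inserting
                exists = session.query(Channels).filter_by(link=item["link"]).first()
                if exists is None:
                    session.add(Channels(**item))  # item keys must match columns
                    session.commit()
            except Exception:
                session.rollback()
                raise
            finally:
                session.close()
            return item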

How to enable overwriting a file everytime in scrapy item export?

最后都变了- submitted on 2019-12-23 01:12:15
Question: I am scraping a website which returns a list of URLs. Example: scrapy crawl xyz_spider -o urls.csv. It works absolutely fine; now what I want is to make a fresh urls.csv each run rather than appending data to the file. Is there any parameter I can pass to enable that? Answer 1: Unfortunately scrapy can't do this at the moment. There is a proposed enhancement on GitHub though: https://github.com/scrapy/scrapy/issues/547 However you can easily redirect the output to stdout and redirect that to a file:
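The command itself was cut off from the answer. A plausible form of the redirect, assuming the csv feed exporter and -o - for stdout, is:

    scrapy crawl xyz_spider -t csv -o - > urls.csv

Newer Scrapy releases (2.4+) also have an -O flag that overwrites the output file directly: scrapy crawl xyz_spider -O urls.csv.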

Scrapy store returned items in variables to use in main script

江枫思渺然 submitted on 2019-12-22 14:05:03
Question: I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed their tutorial and changed the code for my purposes:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/page/1/']
        custom_settings = {
            'LOG_ENABLED': 'False',
        }

        def parse(self, response):
            global title  # This would work, but there should
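Instead of a global, the scraped items can be collected in the main script by hooking Scrapy's item_scraped signal. A minimal sketch under the question's setup; the selector and the yielded dict are assumptions:

    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    collected = []  # filled while the crawl runs

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/page/1/"]
        custom_settings = {"LOG_ENABLED": False}

        def parse(self, response):
            yield {"title": response.css("title::text").get()}

    def on_item(item, response, spider):
        collected.append(item)  # receives every scraped item

    process = CrawlerProcess()
    crawler = process.create_crawler(QuotesSpider)
    crawler.signals.connect(on_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes
    print(collected)  # the values are now plain Python objects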

Scrapy: how to use items in spider and how to send items to pipelines?

自作多情 submitted on 2019-12-20 08:49:37
Question: I am new to Scrapy and my task is simple. For a given e-commerce website: crawl all website pages, look for product pages, and if a URL points to a product page, create an Item and process it to store it in a database. I created the spider, but products are just printed to a simple file. My question is about the project structure: how do I use items in the spider, and how do I send items to pipelines? I can't find a simple example of a project using items and pipelines. Answer 1: How to use items in my spider?
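The answer is truncated above; a minimal end-to-end sketch of the usual wiring (ProductItem, its fields, and the selectors are illustrative assumptions):

    # items.py -- a minimal item definition
    import scrapy

    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()

    # spiders/products.py -- yielding the item sends it to the pipelines
    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["https://example.com/products"]  # placeholder

        def parse(self, response):
            item = ProductItem()
            item["name"] = response.css("h1::text").get()
            item["price"] = response.css(".price::text").get()
            item["url"] = response.url
            yield item

    # pipelines.py -- receives every item the spider yields
    class StoreProductPipeline:
        def process_item(self, item, spider):
            # write the item to the database here
            return item

    # settings.py -- activate the pipeline (lower number runs earlier)
    ITEM_PIPELINES = {"myproject.pipelines.StoreProductPipeline": 300}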

Export scrapy items to different files

杀马特。学长 韩版系。学妹 submitted on 2019-12-19 04:07:04
Question: I'm scraping reviews from MOOCs like this one. From there I'm getting all the course details, 5 items, and another 6 items from each review itself. This is the code I have for the course details:

    def parse_reviews(self, response):
        l = ItemLoader(item=MoocsItem(), response=response)
        l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
        l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
        l.add_xpath('course_instructors', '/
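One way to export the two kinds of items to different files is a pipeline that keeps one exporter per item class. A minimal sketch; the per-class filenames and a second MoocsReviewItem class are assumptions:

    from scrapy.exporters import CsvItemExporter

    class MultiFileExportPipeline:
        def open_spider(self, spider):
            self.files = {}
            self.exporters = {}

        def process_item(self, item, spider):
            # route by item class, e.g. MoocsItem vs MoocsReviewItem
            name = type(item).__name__
            if name not in self.exporters:
                f = open("%s.csv" % name, "wb")
                exporter = CsvItemExporter(f)
                exporter.start_exporting()
                self.files[name] = f
                self.exporters[name] = exporter
            self.exporters[name].export_item(item)
            return item

        def close_spider(self, spider):
            for exporter in self.exporters.values():
                exporter.finish_exporting()
            for f in self.files.values():
                f.close()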

Crawl website from list of values using scrapy

北城以北 submitted on 2019-12-12 03:32:00
Question: I have a list of NPIs for which I want to scrape the provider names from npidb.org. The NPI values are stored in a CSV file. I am able to do it manually by pasting the URLs into the code; however, I am unable to figure out how to do it when I have a list of NPIs, for each of which I want the provider name. Here is my current code:

    import scrapy
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = "npidb"

        def start_requests(self):
            urls = [
                'https://npidb.org/npi-lookup/
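The list of URLs can be built from the CSV inside start_requests. A minimal sketch using the modern scrapy.Spider base class (BaseSpider is long deprecated); the "npis.csv" filename, the single-column layout, the lookup URL pattern, and the name selector are all assumptions:

    import csv
    import scrapy

    class NpiSpider(scrapy.Spider):
        name = "npidb"

        def start_requests(self):
            # one request per NPI value read from the CSV
            with open("npis.csv") as f:
                for row in csv.reader(f):
                    npi = row[0].strip()
                    if npi:
                        # URL pattern is a guess; adjust to the real
                        # npidb.org lookup path
                        url = "https://npidb.org/npi-lookup/?s=%s" % npi
                        yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # extract the provider name here (selector is illustrative)
            yield {"name": response.css("h1::text").get()}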