scrapy-pipeline

Scrapy Pipeline to Parse

不想你离开。 Submitted on 2019-12-11 10:43:40
Question: I made a pipeline to put Scrapy data into my Parse backend:

PARSE = 'api.parse.com'
PORT = 443

However, I can't find the right way to post the data to Parse, because every time it creates undefined objects in my Parse DB.

class Newscrawlbotv01Pipeline(object):
    def process_item(self, item, spider):
        for data in item:
            if not data:
                raise DropItem("Missing data!")
        connection = httplib.HTTPSConnection(
            settings['PARSE'],
            settings['PORT']
        )
        connection.connect()
        connection.request('POST', '/1/classes
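For reference, the Parse REST API expects a JSON-encoded body plus the X-Parse-Application-Id and X-Parse-REST-API-Key headers; posting the raw item without them is what typically produces empty ("undefined") objects. Below is a minimal sketch of how the request could be completed, assuming the Parse class is called NewsItem and that PARSE_APP_ID and PARSE_API_KEY were added to the project settings (all three names are hypothetical):

import json
import httplib  # http.client on Python 3

from scrapy.exceptions import DropItem


class Newscrawlbotv01Pipeline(object):
    def process_item(self, item, spider):
        # Drop items that have empty field values.
        for data in item:
            if not item[data]:
                raise DropItem("Missing data!")
        connection = httplib.HTTPSConnection('api.parse.com', 443)
        connection.connect()
        connection.request(
            'POST',
            '/1/classes/NewsItem',   # hypothetical Parse class name
            json.dumps(dict(item)),  # Parse needs a JSON body, not a raw item
            {
                'X-Parse-Application-Id': spider.settings['PARSE_APP_ID'],
                'X-Parse-REST-API-Key': spider.settings['PARSE_API_KEY'],
                'Content-Type': 'application/json',
            },
        )
        result = json.loads(connection.getresponse().read())
        spider.logger.debug("Parse response: %s", result)
        return item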

Scrapy, make HTTP request in pipeline

浪尽此生 Submitted on 2019-12-10 09:55:13
Question: Assume I have a scraped item that looks like this:

{ name: "Foo", country: "US", url: "http://..." }

In a pipeline I want to make a GET request to the URL and check some headers like content_type and status. When the headers do not meet certain conditions I want to drop the item, like:

class MyPipeline(object):
    def process_item(self, item, spider):
        request(item['url'], function(response) {
            if (...) { raise DropItem() }
            return item
        }, function(error) { raise DropItem() })

Smells like this is not
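One way to do this without blocking the crawl is to return a Twisted Deferred from process_item (the built-in media pipelines rely on the same mechanism). A minimal sketch using the third-party treq client; the exact status and content-type conditions shown here are assumptions:

import treq  # Twisted-based HTTP client, assumed to be installed
from scrapy.exceptions import DropItem


class MyPipeline(object):
    def process_item(self, item, spider):
        # Returning a Deferred lets Scrapy wait for the check asynchronously.
        d = treq.get(item['url'], timeout=10)
        d.addCallback(self._check_response, item)
        d.addErrback(self._request_failed, item)
        return d

    def _check_response(self, response, item):
        content_type = response.headers.getRawHeaders(b'Content-Type', [b''])[0]
        if response.code != 200 or not content_type.startswith(b'text/html'):
            raise DropItem("Bad response for %s" % item['url'])
        return item

    def _request_failed(self, failure, item):
        raise DropItem("Request failed for %s: %s" % (item['url'], failure.value))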

Request had insufficient authentication scopes (403) when trying to write crawl data to BigQuery from a Scrapy pipeline

别说谁变了你拦得住时间么 Submitted on 2019-12-08 09:54:38
Question: I'm trying to build a Scrapy crawler: the spider crawls data, then in pipeline.py the data is saved to BigQuery. I built it with Docker, set up a crontab job, and pushed it to a Google Cloud server to run daily. The problem is that when crontab executes the Scrapy crawler, it gets "google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/bigquery/v2/projects/project_name/datasets/dataset_name/tables/table_name: Request had insufficient authentication scopes.". For more detail, when accessing to
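This error usually means the Compute Engine instance's default access scopes do not cover BigQuery. One hedged workaround (a sketch, not the asker's exact setup) is to authenticate with an explicit service-account key file baked into the Docker image instead of relying on the VM's default credentials; the key path below is hypothetical and the account needs a BigQuery write role:

from google.cloud import bigquery


class BigQueryPipeline(object):
    def open_spider(self, spider):
        # Explicit credentials sidestep the instance's limited default scopes,
        # which are what trigger "Request had insufficient authentication scopes".
        self.client = bigquery.Client.from_service_account_json(
            '/app/credentials/service-account.json'  # hypothetical path
        )
        self.table = self.client.get_table('project_name.dataset_name.table_name')

    def process_item(self, item, spider):
        errors = self.client.insert_rows(self.table, [dict(item)])
        if errors:
            spider.logger.error("BigQuery insert errors: %s", errors)
        return item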

Scrapy store returned items in variables to use in main script

筅森魡賤 Submitted on 2019-12-06 13:38:14
I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed the tutorial and changed the code for my purposes:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title  # This would work, but there should be a better way
        title = response.css('title::text').extract_first()

process = CrawlerProcess({
    'USER
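Instead of the global, the spider can yield the values and the main script can collect them through the item_scraped signal. A minimal sketch under that assumption (the USER_AGENT value is a placeholder):

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']
    custom_settings = {'LOG_ENABLED': False}

    def parse(self, response):
        # Yield the value as an item instead of assigning to a global.
        yield {'title': response.css('title::text').extract_first()}


results = []

def collect(item, response, spider):
    # Called once per scraped item via the item_scraped signal.
    results.append(item)


process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})  # placeholder value
crawler = process.create_crawler(QuotesSpider)
crawler.signals.connect(collect, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

print(results)  # the scraped values are now available in the main script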

Scrapy: how to use items in spider and how to send items to pipelines?

大憨熊 Submitted on 2019-12-02 17:41:43
I am new to Scrapy and my task is simple. For a given e-commerce website:

- crawl all website pages
- look for the product pages
- if the URL points to a product page, create an Item
- process the item to store it in a database

I created the spider, but products are just printed into a simple file. My question is about the project structure: how to use items in the spider, and how to send items to pipelines? I can't find a simple example of a project using items and pipelines.

Adrien Blanquer: How to use items in my spider? Well, the main purpose of items is to store the data you crawled. scrapy.Items are basically
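A minimal end-to-end sketch of the three pieces involved: the item definition, a spider that yields items instead of writing to a file, and a pipeline that stores them. The selectors, field names, and SQLite storage below are illustrative assumptions:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

# spiders/products.py: yield items, they are sent to the pipelines automatically
class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ['http://example.com/catalog']

    def parse(self, response):
        for href in response.css('a.product::attr(href)').getall():
            yield response.follow(href, self.parse_product)

    def parse_product(self, response):
        yield ProductItem(
            name=response.css('h1::text').get(),
            price=response.css('.price::text').get(),
            url=response.url,
        )

# pipelines.py: every yielded item passes through process_item
import sqlite3

class StoreProductPipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect('products.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, url TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO products VALUES (?, ?, ?)',
                          (item['name'], item['price'], item['url']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

# settings.py: activate the pipeline
# ITEM_PIPELINES = {'myproject.pipelines.StoreProductPipeline': 300}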

How to import Scrapy item keys in the correct order?

丶灬走出姿态 Submitted on 2019-12-02 06:56:24
I am importing the Scrapy item keys from items.py into pipelines.py. The problem is that the order of the imported keys is different from how they were defined in items.py. My items.py file:

class NewAdsItem(Item):
    AdId = Field()
    DateR = Field()
    AdURL = Field()

In my pipelines.py:

from adbot.items import NewAdsItem
...
def open_spider(self, spider):
    self.ikeys = NewAdsItem.fields.keys()
    print("Keys in pipelines: \t%s" % ",".join(self.ikeys))
    #self.createDbTable(ikeys)

The output is "Keys in pipelines: AdId,AdURL,DateR" instead of the expected "AdId,DateR,AdURL". How can I ensure
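Scrapy's item metaclass collects fields by inspecting the class's attributes, which is why they come back alphabetically sorted (AdId, AdURL, DateR) rather than in declaration order. A hedged workaround is to keep the desired order as an explicit list next to the item definition and import that; the helper name AD_FIELD_ORDER is hypothetical:

# items.py
from scrapy import Item, Field

class NewAdsItem(Item):
    AdId = Field()
    DateR = Field()
    AdURL = Field()

# Explicit, declaration-order list of column names (hypothetical helper).
AD_FIELD_ORDER = ['AdId', 'DateR', 'AdURL']

# pipelines.py
from adbot.items import NewAdsItem, AD_FIELD_ORDER

class DbPipeline(object):
    def open_spider(self, spider):
        # NewAdsItem.fields.keys() comes back alphabetically sorted,
        # so use the explicit list for column ordering instead.
        self.ikeys = [k for k in AD_FIELD_ORDER if k in NewAdsItem.fields]
        print("Keys in pipelines: \t%s" % ",".join(self.ikeys))
        # self.createDbTable(self.ikeys)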

Scrapy file download: how to use a custom filename

纵然是瞬间 Submitted on 2019-11-30 18:29:03
Question: For my Scrapy project I'm currently using the FilesPipeline. The downloaded files are stored with a SHA1 hash of their URLs as the file names:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False, Failure(...))]

How can I store the files using my custom file names instead? In the example above, I would want the file name to be "product1
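The usual approach is to subclass FilesPipeline and override file_path, which decides where each download is stored. A minimal sketch that keeps the basename of the URL (so product1.pdf instead of the SHA1 hash); the class name is a placeholder and the exact file_path signature varies slightly between Scrapy versions:

import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class CustomNameFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # request.url is the file URL taken from the item's file_urls field.
        return 'full/' + os.path.basename(urlparse(request.url).path)

# settings.py
# ITEM_PIPELINES = {'myproject.pipelines.CustomNameFilesPipeline': 1}
# FILES_STORE = '/path/to/store'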