scrapy-pipeline

Scrapy Pipeline to Parse

不想你离开。 Submitted on 2019-12-11 10:43:40
Question: I made a pipeline to put Scrapy data into my Parse backend:

PARSE = 'api.parse.com'
PORT = 443

However, I can't find the right way to post the data to Parse, because every time it creates undefined objects in my Parse DB.

class Newscrawlbotv01Pipeline(object):
    def process_item(self, item, spider):
        for data in item:
            if not data:
                raise DropItem("Missing data!")
        connection = httplib.HTTPSConnection(
            settings['PARSE'],
            settings['PORT']
        )
        connection.connect()
        connection.request('POST', '/1/classes
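For reference, the Parse REST API expects a JSON-encoded body plus the X-Parse-Application-Id and X-Parse-REST-API-Key headers; posting the raw item without them is what typically produces empty ("undefined") objects. Below is a minimal sketch of how the request could be completed, assuming the Parse class is called NewsItem and that PARSE_APP_ID and PARSE_API_KEY were added to the project settings (all three names are hypothetical):

import json
import httplib  # http.client on Python 3

from scrapy.exceptions import DropItem


class Newscrawlbotv01Pipeline(object):
    def process_item(self, item, spider):
        # Drop items that have empty field values.
        for data in item:
            if not item[data]:
                raise DropItem("Missing data!")
        connection = httplib.HTTPSConnection('api.parse.com', 443)
        connection.connect()
        connection.request(
            'POST',
            '/1/classes/NewsItem',   # hypothetical Parse class name
            json.dumps(dict(item)),  # Parse needs a JSON body, not a raw item
            {
                'X-Parse-Application-Id': spider.settings['PARSE_APP_ID'],
                'X-Parse-REST-API-Key': spider.settings['PARSE_API_KEY'],
                'Content-Type': 'application/json',
            },
        )
        result = json.loads(connection.getresponse().read())
        spider.logger.debug("Parse response: %s", result)
        return item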

Scrapy, make HTTP request in pipeline

浪尽此生 Submitted on 2019-12-10 09:55:13
Question: Assume I have a scraped item that looks like this:

{ name: "Foo", country: "US", url: "http://..." }

In a pipeline I want to make a GET request to the URL and check some headers like content_type and status. When the headers do not meet certain conditions I want to drop the item, like:

class MyPipeline(object):
    def process_item(self, item, spider):
        request(item['url'], function(response) {
            if (...) { raise DropItem() }
            return item
        }, function(error) { raise DropItem() })

Smells like this is not
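One way to do this without blocking the crawl is to return a Twisted Deferred from process_item (the built-in media pipelines rely on the same mechanism). A minimal sketch using the third-party treq client; the exact status and content-type conditions shown here are assumptions:

import treq  # Twisted-based HTTP client, assumed to be installed
from scrapy.exceptions import DropItem


class MyPipeline(object):
    def process_item(self, item, spider):
        # Returning a Deferred lets Scrapy wait for the check asynchronously.
        d = treq.get(item['url'], timeout=10)
        d.addCallback(self._check_response, item)
        d.addErrback(self._request_failed, item)
        return d

    def _check_response(self, response, item):
        content_type = response.headers.getRawHeaders(b'Content-Type', [b''])[0]
        if response.code != 200 or not content_type.startswith(b'text/html'):
            raise DropItem("Bad response for %s" % item['url'])
        return item

    def _request_failed(self, failure, item):
        raise DropItem("Request failed for %s: %s" % (item['url'], failure.value))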

Request had insufficient authentication scopes (403) when trying to write crawl data to BigQuery from a Scrapy pipeline

别说谁变了你拦得住时间么 Submitted on 2019-12-08 09:54:38
Question: I'm trying to build a Scrapy crawler: the spider crawls data, then in pipeline.py the data is saved to BigQuery. I built it with Docker, set up a crontab job, and pushed it to a Google Cloud server to run daily. The problem is that when crontab executes the Scrapy crawler, it gets "google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/bigquery/v2/projects/project_name/datasets/dataset_name/tables/table_name: Request had insufficient authentication scopes.". For more detail, when accessing to
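This error usually means the Compute Engine instance's default access scopes do not cover BigQuery. One hedged workaround (a sketch, not the asker's exact setup) is to authenticate with an explicit service-account key file baked into the Docker image instead of relying on the VM's default credentials; the key path below is hypothetical and the account needs a BigQuery write role:

from google.cloud import bigquery


class BigQueryPipeline(object):
    def open_spider(self, spider):
        # Explicit credentials sidestep the instance's limited default scopes,
        # which are what trigger "Request had insufficient authentication scopes".
        self.client = bigquery.Client.from_service_account_json(
            '/app/credentials/service-account.json'  # hypothetical path
        )
        self.table = self.client.get_table('project_name.dataset_name.table_name')

    def process_item(self, item, spider):
        errors = self.client.insert_rows(self.table, [dict(item)])
        if errors:
            spider.logger.error("BigQuery insert errors: %s", errors)
        return item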

Scrapy store returned items in variables to use in main script

筅森魡賤 Submitted on 2019-12-06 13:38:14
I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed the tutorial and changed the code for my purposes:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title  # This would work, but there should be a better way
        title = response.css('title::text').extract_first()

process = CrawlerProcess({
    'USER
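Instead of the global, the spider can yield the values and the main script can collect them through the item_scraped signal. A minimal sketch under that assumption (the USER_AGENT value is a placeholder):

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']
    custom_settings = {'LOG_ENABLED': False}

    def parse(self, response):
        # Yield the value as an item instead of assigning to a global.
        yield {'title': response.css('title::text').extract_first()}


results = []

def collect(item, response, spider):
    # Called once per scraped item via the item_scraped signal.
    results.append(item)


process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})  # placeholder value
crawler = process.create_crawler(QuotesSpider)
crawler.signals.connect(collect, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

print(results)  # the scraped values are now available in the main script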

Scrapy: how to use items in spider and how to send items to pipelines?

大憨熊 Submitted on 2019-12-02 17:41:43
I am new to Scrapy and my task is simple. For a given e-commerce website:

- crawl all website pages
- look for the product pages
- if the URL points to a product page, create an Item
- process the item to store it in a database

I created the spider, but products are just printed into a simple file. My question is about the project structure: how to use items in the spider, and how to send items to pipelines? I can't find a simple example of a project using items and pipelines.

Adrien Blanquer: How to use items in my spider? Well, the main purpose of items is to store the data you crawled. scrapy.Items are basically
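A minimal end-to-end sketch of the three pieces involved: the item definition, a spider that yields items instead of writing to a file, and a pipeline that stores them. The selectors, field names, and SQLite storage below are illustrative assumptions:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

# spiders/products.py: yield items, they are sent to the pipelines automatically
class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ['http://example.com/catalog']

    def parse(self, response):
        for href in response.css('a.product::attr(href)').getall():
            yield response.follow(href, self.parse_product)

    def parse_product(self, response):
        yield ProductItem(
            name=response.css('h1::text').get(),
            price=response.css('.price::text').get(),
            url=response.url,
        )

# pipelines.py: every yielded item passes through process_item
import sqlite3

class StoreProductPipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect('products.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, url TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO products VALUES (?, ?, ?)',
                          (item['name'], item['price'], item['url']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

# settings.py: activate the pipeline
# ITEM_PIPELINES = {'myproject.pipelines.StoreProductPipeline': 300}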

How to import Scrapy item keys in the correct order?

丶灬走出姿态 Submitted on 2019-12-02 06:56:24
I am importing the Scrapy item keys from items.py into pipelines.py. The problem is that the order of the imported keys is different from how they were defined in items.py. My items.py file:

class NewAdsItem(Item):
    AdId = Field()
    DateR = Field()
    AdURL = Field()

In my pipelines.py:

from adbot.items import NewAdsItem
...
def open_spider(self, spider):
    self.ikeys = NewAdsItem.fields.keys()
    print("Keys in pipelines: \t%s" % ",".join(self.ikeys))
    #self.createDbTable(ikeys)

The output is "Keys in pipelines: AdId,AdURL,DateR" instead of the expected "AdId,DateR,AdURL". How can I ensure
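Scrapy's item metaclass collects fields by inspecting the class's attributes, which is why they come back alphabetically sorted (AdId, AdURL, DateR) rather than in declaration order. A hedged workaround is to keep the desired order as an explicit list next to the item definition and import that; the helper name AD_FIELD_ORDER is hypothetical:

# items.py
from scrapy import Item, Field

class NewAdsItem(Item):
    AdId = Field()
    DateR = Field()
    AdURL = Field()

# Explicit, declaration-order list of column names (hypothetical helper).
AD_FIELD_ORDER = ['AdId', 'DateR', 'AdURL']

# pipelines.py
from adbot.items import NewAdsItem, AD_FIELD_ORDER

class DbPipeline(object):
    def open_spider(self, spider):
        # NewAdsItem.fields.keys() comes back alphabetically sorted,
        # so use the explicit list for column ordering instead.
        self.ikeys = [k for k in AD_FIELD_ORDER if k in NewAdsItem.fields]
        print("Keys in pipelines: \t%s" % ",".join(self.ikeys))
        # self.createDbTable(self.ikeys)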

Scrapy file download: how to use a custom filename

纵然是瞬间 Submitted on 2019-11-30 18:29:03
Question: For my Scrapy project I'm currently using the FilesPipeline. The downloaded files are stored with a SHA1 hash of their URLs as the file names:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False, Failure(...))]

How can I store the files using my custom file names instead? In the example above, I would want the file name to be "product1
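The usual approach is to subclass FilesPipeline and override file_path, which decides where each download is stored. A minimal sketch that keeps the basename of the URL (so product1.pdf instead of the SHA1 hash); the class name is a placeholder and the exact file_path signature varies slightly between Scrapy versions:

import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class CustomNameFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # request.url is the file URL taken from the item's file_urls field.
        return 'full/' + os.path.basename(urlparse(request.url).path)

# settings.py
# ITEM_PIPELINES = {'myproject.pipelines.CustomNameFilesPipeline': 1}
# FILES_STORE = '/path/to/store'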