scrapy-spider

Scrapy: how to use items in spider and how to send items to pipelines?

大憨熊 submitted on 2019-12-02 17:41:43
I am new to Scrapy and my task is simple. For a given e-commerce website:

- crawl all website pages
- look for product pages
- if the URL points to a product page, create an Item
- process the Item to store it in a database

I created the spider, but products are just printed to a simple file. My question is about the project structure: how do I use Items in the spider, and how do I send Items to pipelines? I can't find a simple example of a project using Items and pipelines.

Answer (Adrien Blanquer): How to use items in my spider? Well, the main purpose of Items is to store the data you crawled. scrapy.Items are basically…
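The rest of the answer is cut off above, but the pattern it describes is standard Scrapy. A minimal sketch, assuming a hypothetical ProductItem with name and price fields and placeholder CSS selectors:

    # items.py -- hypothetical fields for illustration
    import scrapy

    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()

    # in the spider callback: yield the item instead of writing to a file
    def parse_product(self, response):
        item = ProductItem()
        item['name'] = response.css('h1::text').get()
        item['price'] = response.css('.price::text').get()
        yield item  # every yielded item is passed to the enabled pipelines

    # pipelines.py -- runs once for each yielded item
    class StoreProductPipeline(object):
        def process_item(self, item, spider):
            # insert into your database here, then return the item
            return item

The pipeline only runs once it is enabled in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.StoreProductPipeline': 300}.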

Speed up web scraper

点点圈 submitted on 2019-12-02 14:57:41
I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but I managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages). I have looked at the Scrapy webpage, the mailing lists and Stack Overflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself, but the way I run it. All suggestions welcome! I have listed my code below, if it's needed.

    from scrapy.spider import …
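A crawl this slow (over four seconds per page) usually means requests are effectively running serially. The first things to check are Scrapy's concurrency settings; a sketch of settings.py with illustrative values, not taken from the question:

    # settings.py -- the knobs that usually dominate crawl speed
    CONCURRENT_REQUESTS = 100            # total parallel requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 16  # parallel requests per site
    DOWNLOAD_DELAY = 0                   # make sure no per-request delay is set
    AUTOTHROTTLE_ENABLED = False         # autothrottle deliberately slows crawls
    LOG_LEVEL = 'INFO'                   # DEBUG logging adds overhead on big crawls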

Order a json by field using scrapy

牧云@^-^@ submitted on 2019-12-02 12:41:27
I have created a spider to scrape problems from projecteuler.net (this concludes my answer to a related question). I launch it with the command scrapy crawl euler -o euler.json and it outputs an array of unordered JSON objects, each corresponding to a single problem. This is fine for me because I'm going to process it with JavaScript, even though I think resolving the ordering problem via Scrapy should be very simple. But unfortunately, ordering the items Scrapy writes to the JSON (I need ascending order by the id field) seems not to be so simple. I've studied every single component…
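Since Scrapy writes items in whatever order the responses complete, the simplest fix is to sort after the crawl rather than inside it. A sketch, assuming each object carries a numeric id field as the question says:

    # sort_euler.py -- post-process the exported file
    import json

    with open('euler.json') as f:
        problems = json.load(f)

    problems.sort(key=lambda p: int(p['id']))

    with open('euler_sorted.json', 'w') as f:
        json.dump(problems, f, indent=2, ensure_ascii=False)

Sorting inside Scrapy is also possible (e.g. collecting items in a pipeline and writing them out in close_spider), but a post-processing step keeps the spider simple.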

Scrapy spider_idle signal - need to add requests with parse item callback

ぃ、小莉子 submitted on 2019-12-02 09:57:04
In my Scrapy spider I have overridden the start_requests() method in order to retrieve some additional URLs from a database, representing items potentially missed in the crawl (orphaned items). This should happen at the end of the crawling process. Something like (pseudo code):

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
        # attempt to crawl orphaned items
        db = MySQLdb.connect(host=self.settings['AWS_RDS_HOST'],
                             port=self.settings['AWS_RDS_PORT'],
                             user=self.settings['AWS_RDS_USER'],
                             passwd=self.settings['AWS_RDS_PASSWD'],
                             db=self.settings['AWS…
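The title points at the usual answer: instead of putting the orphan pass into start_requests(), connect a handler to the spider_idle signal and schedule the extra requests there. A sketch; get_orphaned_urls() stands in for the database query, and note that the engine.crawl() signature has changed across Scrapy versions:

    import scrapy
    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class MySpider(scrapy.Spider):
        name = 'myspider'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
            return spider

        def on_idle(self, spider):
            # fires when the scheduler is empty, i.e. the normal crawl is done
            requests = [scrapy.Request(url, callback=self.parse_item,
                                       dont_filter=True)
                        for url in self.get_orphaned_urls()]  # placeholder
            if requests:
                for request in requests:
                    self.crawler.engine.crawl(request, spider)
                raise DontCloseSpider  # keep the spider alive for these requests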

Python Scrapy Get HTML <script> tag

耗尽温柔 submitted on 2019-12-02 08:02:22
I have a project and I need to get a value out of an inline script in the HTML code:

    <script>
    (function() {
    ... / More Code
    Level.grade = "2";
    Level.level = "1";
    Level.max_line = "5";
    Level.cozum = 'adım 12\ndön sağ\nadım 13\ndön sol\nadım 11';
    ... / More Code
    </script>

How do I get only "adım 12\ndön sağ\nadım 13\ndön sol\nadım 11"? Thanks for the help.

Answer: Use a regex for that. First grab the content of the script tag with response.css("script").extract_first(), then apply the regex (Level\.cozum = )(.*?)(\;) — see the demo at https://regex101.com/r/YxHRmR/1. This is the code:

    import re
    regex = r"(Level\.cozum = )(.*?)(\;)"
    …
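A runnable sketch of that answer; the capture group is narrowed to the text between the quotes, and picking the right script tag may need a more specific selector on the real page:

    import re

    script_text = response.css("script").extract_first()  # first <script> block
    match = re.search(r"Level\.cozum = '(.*?)';", script_text)
    if match:
        cozum = match.group(1)  # adım 12\ndön sağ\nadım 13\ndön sol\nadım 11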

Force Python Scrapy not to encode URL

五迷三道 submitted on 2019-12-01 17:05:04
There are some URLs with [] in them, like

    http://www.website.com/CN.html?value_ids[]=33&value_ids[]=5007

but when I try scraping this URL with Scrapy, it makes the request to

    http://www.website.com/CN.html?value_ids%5B%5D=33&value_ids%5B%5D=5007

How can I force Scrapy not to url-encode my URLs?

Answer: When creating a Request object, Scrapy applies some URL-encoding methods. To revert these you can utilize a custom middleware and change the URL to your needs. You could use a Downloader Middleware like this:

    class MyCustomDownloaderMiddleware(object):
        def process_request(self, request, spider):
            …
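A sketch of one way to fill in that middleware. It leans on Request's private _url attribute, because assigning through the public API would just re-escape the brackets; treat it as a fragile workaround, not a stable Scrapy API:

    class MyCustomDownloaderMiddleware(object):
        def process_request(self, request, spider):
            fixed = request.url.replace('%5B', '[').replace('%5D', ']')
            if fixed != request.url:
                request._url = fixed  # private attribute: bypasses re-encoding
            return None  # let the request continue through the chain

Enable it via DOWNLOADER_MIDDLEWARES in settings.py, and first check whether the target server actually requires the unescaped form — to a standards-compliant server, %5B/%5D and [] are equivalent.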

Scrapy: scraping a list of links

梦想与她 submitted on 2019-12-01 11:34:46
This question is somewhat a follow-up to a question I asked previously. I am trying to scrape a website which contains some links on the first page. Something similar to this. Now, since I want to scrape the details of the items present on those pages, I have extracted their individual URLs and saved them in a list. How do I launch spiders to scrape the pages individually? For better understanding:

    [urlA, urlB, urlC, urlD, ...]

This is the list of URLs that I have scraped. Now I want to launch a spider to scrape the links individually. How do I go about this? I'm assuming that the…
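There is no need for one spider per URL: a single spider can yield one Request per saved link, all routed to the same parsing callback. A sketch with placeholder URLs and selectors:

    import scrapy

    class DetailSpider(scrapy.Spider):
        name = 'details'

        def start_requests(self):
            urls = ['urlA', 'urlB', 'urlC', 'urlD']  # the list scraped earlier
            for url in urls:
                yield scrapy.Request(url, callback=self.parse_item)

        def parse_item(self, response):
            # one detail page per call; extract whatever fields you need
            yield {'url': response.url,
                   'title': response.css('h1::text').get()}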

Scrapy CrawlSpider retry scrape

夙愿已清 submitted on 2019-12-01 11:26:29
For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some JavaScript that auto-reloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:

    def parse_page(self, response):
        url = response.url
        # Check to make sure the page is loaded
        if 'var PageIsLoaded = false;' in response.body:
            self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
            yield Request(url, self.parse, dont_filter=True)
            return
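One weakness of that logic is that a page which never finishes rendering retries forever. A sketch that caps the attempts with a hand-rolled counter in request.meta (this is separate from Scrapy's built-in RetryMiddleware, which keys off HTTP errors, not page content):

    def parse_page(self, response):
        if 'var PageIsLoaded = false;' in response.text:
            retries = response.meta.get('placeholder_retries', 0)
            if retries < 3:
                # reissue the same request, carrying the incremented counter
                yield response.request.replace(
                    dont_filter=True,
                    meta={'placeholder_retries': retries + 1})
            else:
                self.logger.error('giving up on %s', response.url)
            return
        # ... normal extraction for fully rendered pages ...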

Multiple nested request with scrapy

谁说胖子不能爱 submitted on 2019-12-01 10:41:05
I am trying to scrape some airplane schedule information from the www.flightradar24.com website for a research project. The hierarchy of the JSON file I want to obtain is something like this:

    Object ID
    - country
      - link
      - name
      - airports
        - airport0
          - code_total
          - link
          - lat
          - lon
          - name
          - schedule
            - ...
            - ...
        - airport1
          - code_total
          - link
          - lat
          - lon
          - name
          - schedule
            - ...
            - ...

Country and Airport are stored using Items, and as you can see in the JSON file, the CountryItem (link, name attributes) finally stores multiple AirportItems (code_total, link, lat, lon, name, schedule):

    class CountryItem(scrapy.Item):
        name = …
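The usual pattern for this is to carry the partially built CountryItem through the nested requests in request.meta and only yield it once the last airport response has been merged in. A sketch with placeholder selectors, ignoring edge cases such as filtered or failed requests:

    def parse_country(self, response):
        country = CountryItem()
        country['name'] = response.css('h1::text').get()  # placeholder selector
        country['airports'] = []
        urls = response.css('a.airport::attr(href)').getall()
        for url in urls:
            yield response.follow(url, callback=self.parse_airport,
                                  meta={'country': country,
                                        'expected': len(urls)})

    def parse_airport(self, response):
        country = response.meta['country']  # same object across all requests
        country['airports'].append({
            'name': response.css('h1::text').get(),  # placeholder selector
            'link': response.url,
        })
        # yield the country only once every airport page has been folded in
        if len(country['airports']) == response.meta['expected']:
            yield country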

scrapy: Populate nested items with itemLoader

﹥>﹥吖頭↗ submitted on 2019-12-01 10:37:40
I have this object I'm trying to populate with an ItemLoader:

    {
      "domains": "string",
      "date_insert": "2016-12-23T11:25:00.213Z",
      "title": "string",
      "url": "string",
      "body": "string",
      "date": "2016-12-23T11:25:00.213Z",
      "authors": ["string"],
      "categories": ["string"],
      "tags": ["string"],
      "stats": {
        "views_count": 0,
        "comments_count": 0
      }
    }

Here's my items.py:

    class StatsItem(scrapy.Item):
        views_count = scrapy.Field()
        comments_count = scrapy.Field()

    class ArticleItem(scrapy.Item):
        domain = scrapy.Field()
        date_insert = scrapy.Field()
        date_update = scrapy.Field()
        date = scrapy.Field()
        title = scrapy.Field…
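One way to handle the nested stats block is to run a second ItemLoader over StatsItem and assign its output to the parent item's field. A sketch, assuming ArticleItem also declares stats = scrapy.Field() and using placeholder selectors (the processors import path moved to the itemloaders package in later Scrapy versions):

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst

    def parse_article(self, response):
        loader = ItemLoader(item=ArticleItem(), response=response)
        loader.default_output_processor = TakeFirst()
        loader.add_value('url', response.url)
        loader.add_css('title', 'h1::text')  # placeholder selector

        # load the nested item with its own loader
        stats_loader = ItemLoader(item=StatsItem(), response=response)
        stats_loader.default_output_processor = TakeFirst()
        stats_loader.add_css('views_count', '.views::text')        # placeholder
        stats_loader.add_css('comments_count', '.comments::text')  # placeholder

        # an Item counts as a single value, so TakeFirst passes it through intact
        loader.add_value('stats', stats_loader.load_item())
        yield loader.load_item()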