scrapy-spider

Multiple nested requests with Scrapy

北战南征 submitted on 2019-12-01 07:54:34
Question: I am trying to scrape some airplane schedule information from the www.flightradar24.com website for a research project. The hierarchy of the JSON file I want to obtain looks like this:

Object ID
- country
  - link
  - name
  - airports
    - airport0
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...
    - airport1
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...

Country and Airport are stored using items, and as you can see in the JSON file the CountryItem (link, name attributes) finally store
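One common way to build this kind of nested output is to carry the partially filled country record along in request.meta and only yield it after the last airport page has been parsed. The sketch below is a minimal illustration of that pattern, not the asker's code; the spider name, CSS selectors, and field names are assumptions.

    import scrapy

    class CountrySpider(scrapy.Spider):
        name = "flight_countries"
        start_urls = ["https://www.flightradar24.com/data/airports"]

        def parse(self, response):
            # One request per country; the partly built record travels in meta.
            for href in response.css("a.country::attr(href)").getall():
                country = {"link": response.urljoin(href),
                           "name": href.rstrip("/").split("/")[-1],
                           "airports": []}
                yield scrapy.Request(country["link"], callback=self.parse_country,
                                     meta={"country": country})

        def parse_country(self, response):
            country = response.meta["country"]
            airport_links = [response.urljoin(h)
                             for h in response.css("a.airport::attr(href)").getall()]
            if not airport_links:
                yield country
                return
            # Chain the airport requests so the country is yielded exactly once,
            # after its last airport has been processed.
            yield scrapy.Request(airport_links[0], callback=self.parse_airport,
                                 meta={"country": country, "pending": airport_links[1:]})

        def parse_airport(self, response):
            country = response.meta["country"]
            country["airports"].append({"link": response.url,
                                        "name": response.css("h1::text").get()})
            pending = response.meta["pending"]
            if pending:
                yield scrapy.Request(pending[0], callback=self.parse_airport,
                                     meta={"country": country, "pending": pending[1:]})
            else:
                yield country

Chaining the requests this way avoids yielding the country item before its airports list is complete.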

Scrapy: scraping a list of links

↘锁芯ラ submitted on 2019-12-01 07:28:19
Question: This question is somewhat a follow-up to a question I asked previously. I am trying to scrape a website which contains some links on the first page, something similar to this. Now, since I want to scrape the details of the items present on the page, I have extracted their individual URLs. I have saved these URLs in a list. How do I launch spiders to scrape the pages individually? For better understanding: [urlA, urlB, urlC, urlD...] This is the list of URLs that I have scraped. Now I
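You do not launch a separate spider per URL; a single spider can yield one Request per URL and Scrapy schedules them all. A minimal sketch of that idea (the listing URL, selectors, and field names are placeholders, not the asker's actual site):

    import scrapy

    class DetailsSpider(scrapy.Spider):
        name = "details"
        start_urls = ["http://example.com/listing"]  # placeholder listing page

        def parse(self, response):
            # Collect the item links from the first page and follow each one;
            # every followed URL is handled by parse_item in its own response.
            for href in response.css("a.item::attr(href)").getall():
                yield response.follow(href, callback=self.parse_item)

        def parse_item(self, response):
            yield {
                "url": response.url,
                "title": response.css("h1::text").get(),
            }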

Dynamic rules based on start_urls for Scrapy CrawlSpider?

烈酒焚心 submitted on 2019-12-01 01:43:39
I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links whose domain differs from the original domain). I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites I run into a problem, because I don't know which "start_url" I'm currently on, so I can't adjust the rule appropriately. Here's what I came up with so far; it works for one website, and I'm not sure how to apply it to a list of websites: class HomepagesSpider
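One way around not knowing the current start_url is to tag every request generated by the rules with the domain it originated from and compare against it in the callback. The sketch below is not the asker's code; it assumes Scrapy 2.0+, where a Rule's process_request hook receives both the request and the response, and the site URLs are placeholders. It also stops crawling one hop after leaving the starting site.

    from urllib.parse import urlparse

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class HomepagesSpider(CrawlSpider):
        name = "homepages"
        start_urls = ["https://site-a.example/", "https://site-b.example/"]  # placeholders

        rules = (
            # Follow every link; the callback decides whether it is external.
            Rule(LinkExtractor(), callback="parse_link", follow=True,
                 process_request="tag_start_domain"),
        )

        def tag_start_domain(self, request, response):
            # Propagate the domain of the original start_url down the crawl chain.
            start = response.meta.get("start_domain", urlparse(response.url).netloc)
            request.meta["start_domain"] = start
            # Follow internal links freely, but do not crawl onward from external pages.
            if urlparse(response.url).netloc != start:
                return None
            return request

        def parse_link(self, response):
            start = response.meta.get("start_domain")
            current = urlparse(response.url).netloc
            if start and current != start:
                # External page relative to the site we started from: scrape it.
                yield {"start_domain": start,
                       "external_url": response.url,
                       "title": response.css("title::text").get()}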

Scrapy Spider not Following Links

断了今生、忘了曾经 submitted on 2019-11-30 23:49:46
I'm writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.

    from datetime import date
    import scrapy
    from scrapy.contrib
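For comparison, a minimal CrawlSpider that follows only links containing today's date (a common URL pattern for NYT articles) might look like the sketch below. This is not the asker's code, and it uses the current scrapy.linkextractors / scrapy.spiders import paths rather than the deprecated scrapy.contrib ones; note that with a plain Spider, links are only followed if parse() itself yields Requests for them, whereas a CrawlSpider follows them through its rules.

    from datetime import date

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class NytSpider(CrawlSpider):
        name = "nyt"
        allowed_domains = ["nytimes.com"]
        start_urls = ["http://www.nytimes.com/"]

        rules = (
            # Follow only URLs that embed today's date, e.g. /2019/11/30/...
            Rule(LinkExtractor(allow=date.today().strftime(r"%Y/%m/%d")),
                 callback="parse_article", follow=True),
        )

        def parse_article(self, response):
            yield {
                "url": response.url,
                "headline": response.css("h1::text").get(),
            }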

python scrapy parse() function, where is the return value returned to?

只愿长相守 submitted on 2019-11-30 21:04:34
I am new to Scrapy, and I am sorry if this question is trivial. I have read the Scrapy documentation on the official webpage, and while looking through it I came across this example:

    import scrapy
    from myproject.items import MyItem

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            'http://www.example.com/2.html',
            'http://www.example.com/3.html',
        ]

        def parse(self, response):
            for h3 in response.xpath('//h3').extract():
                yield MyItem(title=h3)
            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url
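As for where the yielded values go: the Scrapy engine consumes whatever parse() yields; Request objects go back to the scheduler to be downloaded, while items are passed to the item pipelines (and to any feed exporter configured with -o). A minimal, hypothetical pipeline illustrating that hand-off:

    # Each item yielded by parse() ends up in process_item(); Requests do not.
    class PrintTitlePipeline:
        def process_item(self, item, spider):
            print(item["title"])
            return item

Such a pipeline would be enabled through the ITEM_PIPELINES setting, e.g. ITEM_PIPELINES = {'myproject.pipelines.PrintTitlePipeline': 300}.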

How to send post data in start_urls of the scrapy spider

你说的曾经没有我的故事 submitted on 2019-11-30 19:33:33
I want to crawl a website which supports only POST data. I want to send the query params as POST data in all the requests. How do I achieve this?

alecxe: POST requests can be made using Scrapy's Request or FormRequest classes. Also, consider using the start_requests() method instead of the start_urls property. Example:

    from scrapy import Spider
    from scrapy.http import FormRequest

    class myspiderSpider(Spider):
        name = "myspider"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            return [
                FormRequest("http://www.example.com/login",
                            formdata={'someparam': 'foo', 'otherparam': 'bar'},
                            callback=self.parse)
            ]

Hope that

Twisted Python Failure - Scrapy Issues

时间秒杀一切 submitted on 2019-11-30 18:33:47
Question: I am trying to use Scrapy to scrape this website's search results for any search query - http://www.bewakoof.com. The website uses AJAX (in the form of XHR) to display the search results. I managed to track the XHR, and you can see it in my code below (inside the for loop, where I store the URL in temp and increment 'i' in the loop):

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerProcess, CrawlerRunner
    import scrapy
    from scrapy.utils.log import
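Not the asker's code, but a minimal sketch of driving a spider from a script with CrawlerProcess, which starts and stops the Twisted reactor itself; mixing CrawlerRunner with a manually managed reactor is a common source of twisted.python.failure.Failure tracebacks. The search-endpoint URL pattern here is a placeholder, not the site's real XHR address.

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class SearchSpider(scrapy.Spider):
        name = "bewakoof_search"

        def start_requests(self):
            # The site loads results through an XHR endpoint; request it directly,
            # page by page (placeholder URL pattern).
            for i in range(1, 4):
                url = "http://www.bewakoof.com/search?page=%d" % i
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            yield {"url": response.url, "status": response.status}

    if __name__ == "__main__":
        process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
        process.crawl(SearchSpider)
        process.start()  # blocks until the crawl finishes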

Scrapy file download how to use custom filename

纵然是瞬间 submitted on 2019-11-30 18:29:03
Question: For my Scrapy project I'm currently using the FilesPipeline. The downloaded files are stored with a SHA1 hash of their URLs as the file names:

    [(True,
      {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
       'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
       'url': 'http://www.example.com/files/product1.pdf'}),
     (False, Failure(...))]

How can I store the files using my custom file names instead? In the example above, I would want the file name to be "product1
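A common approach is to subclass FilesPipeline and override file_path(), the method that produces the hashed name. A minimal sketch, assuming Scrapy 2.4+ where file_path() also receives the item as a keyword argument:

    import os
    from urllib.parse import urlparse

    from scrapy.pipelines.files import FilesPipeline

    class CustomNameFilesPipeline(FilesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # Keep the original filename from the download URL, e.g. product1.pdf.
            return "full/" + os.path.basename(urlparse(request.url).path)

The subclass would then replace scrapy.pipelines.files.FilesPipeline in the ITEM_PIPELINES setting. Note that identically named files from different URLs would overwrite each other with this scheme.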
