scrapy-spider

Multiple nested requests with Scrapy

北战南征 submitted on 2019-12-01 07:54:34
Question: I am trying to scrape some airplane schedule information from the www.flightradar24.com website for a research project. The hierarchy of the JSON file I want to obtain looks like this:

Object ID
- country
  - link
  - name
  - airports
    - airport0
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...
    - airport1
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...

Country and Airport are stored using items, and as you can see in the JSON file the CountryItem (link, name attributes) finally store
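One common way to build this kind of nested output is to carry the partially filled country record along in request.meta and only yield it after the last airport page has been parsed. The sketch below is a minimal illustration of that pattern, not the asker's code; the spider name, CSS selectors, and field names are assumptions.

    import scrapy

    class CountrySpider(scrapy.Spider):
        name = "flight_countries"
        start_urls = ["https://www.flightradar24.com/data/airports"]

        def parse(self, response):
            # One request per country; the partly built record travels in meta.
            for href in response.css("a.country::attr(href)").getall():
                country = {"link": response.urljoin(href),
                           "name": href.rstrip("/").split("/")[-1],
                           "airports": []}
                yield scrapy.Request(country["link"], callback=self.parse_country,
                                     meta={"country": country})

        def parse_country(self, response):
            country = response.meta["country"]
            airport_links = [response.urljoin(h)
                             for h in response.css("a.airport::attr(href)").getall()]
            if not airport_links:
                yield country
                return
            # Chain the airport requests so the country is yielded exactly once,
            # after its last airport has been processed.
            yield scrapy.Request(airport_links[0], callback=self.parse_airport,
                                 meta={"country": country, "pending": airport_links[1:]})

        def parse_airport(self, response):
            country = response.meta["country"]
            country["airports"].append({"link": response.url,
                                        "name": response.css("h1::text").get()})
            pending = response.meta["pending"]
            if pending:
                yield scrapy.Request(pending[0], callback=self.parse_airport,
                                     meta={"country": country, "pending": pending[1:]})
            else:
                yield country

Chaining the requests this way avoids yielding the country item before its airports list is complete.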

Scrapy: scraping a list of links

↘锁芯ラ submitted on 2019-12-01 07:28:19
Question: This question is somewhat a follow-up to a question I asked previously. I am trying to scrape a website which contains some links on the first page, something similar to this. Now, since I want to scrape the details of the items present on the page, I have extracted their individual URLs. I have saved these URLs in a list. How do I launch spiders to scrape the pages individually? For better understanding: [urlA, urlB, urlC, urlD...] This is the list of URLs that I have scraped. Now I
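You do not launch a separate spider per URL; a single spider can yield one Request per URL and Scrapy schedules them all. A minimal sketch of that idea (the listing URL, selectors, and field names are placeholders, not the asker's actual site):

    import scrapy

    class DetailsSpider(scrapy.Spider):
        name = "details"
        start_urls = ["http://example.com/listing"]  # placeholder listing page

        def parse(self, response):
            # Collect the item links from the first page and follow each one;
            # every followed URL is handled by parse_item in its own response.
            for href in response.css("a.item::attr(href)").getall():
                yield response.follow(href, callback=self.parse_item)

        def parse_item(self, response):
            yield {
                "url": response.url,
                "title": response.css("h1::text").get(),
            }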

Dynamic rules based on start_urls for Scrapy CrawlSpider?

烈酒焚心 submitted on 2019-12-01 01:43:39
I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links whose domain differs from the original domain). I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites I run into a problem, because I don't know which "start_url" I'm currently on, so I can't adjust the rule appropriately. Here's what I came up with so far; it works for one website, and I'm not sure how to apply it to a list of websites: class HomepagesSpider
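One way around not knowing the current start_url is to tag every request generated by the rules with the domain it originated from and compare against it in the callback. The sketch below is not the asker's code; it assumes Scrapy 2.0+, where a Rule's process_request hook receives both the request and the response, and the site URLs are placeholders. It also stops crawling one hop after leaving the starting site.

    from urllib.parse import urlparse

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class HomepagesSpider(CrawlSpider):
        name = "homepages"
        start_urls = ["https://site-a.example/", "https://site-b.example/"]  # placeholders

        rules = (
            # Follow every link; the callback decides whether it is external.
            Rule(LinkExtractor(), callback="parse_link", follow=True,
                 process_request="tag_start_domain"),
        )

        def tag_start_domain(self, request, response):
            # Propagate the domain of the original start_url down the crawl chain.
            start = response.meta.get("start_domain", urlparse(response.url).netloc)
            request.meta["start_domain"] = start
            # Follow internal links freely, but do not crawl onward from external pages.
            if urlparse(response.url).netloc != start:
                return None
            return request

        def parse_link(self, response):
            start = response.meta.get("start_domain")
            current = urlparse(response.url).netloc
            if start and current != start:
                # External page relative to the site we started from: scrape it.
                yield {"start_domain": start,
                       "external_url": response.url,
                       "title": response.css("title::text").get()}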

Scrapy Spider not Following Links

断了今生、忘了曾经 submitted on 2019-11-30 23:49:46
I'm writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.

    from datetime import date
    import scrapy
    from scrapy.contrib
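For comparison, a minimal CrawlSpider that follows only links containing today's date (a common URL pattern for NYT articles) might look like the sketch below. This is not the asker's code, and it uses the current scrapy.linkextractors / scrapy.spiders import paths rather than the deprecated scrapy.contrib ones; note that with a plain Spider, links are only followed if parse() itself yields Requests for them, whereas a CrawlSpider follows them through its rules.

    from datetime import date

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class NytSpider(CrawlSpider):
        name = "nyt"
        allowed_domains = ["nytimes.com"]
        start_urls = ["http://www.nytimes.com/"]

        rules = (
            # Follow only URLs that embed today's date, e.g. /2019/11/30/...
            Rule(LinkExtractor(allow=date.today().strftime(r"%Y/%m/%d")),
                 callback="parse_article", follow=True),
        )

        def parse_article(self, response):
            yield {
                "url": response.url,
                "headline": response.css("h1::text").get(),
            }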

python scrapy parse() function, where is the return value returned to?

只愿长相守 submitted on 2019-11-30 21:04:34
I am new to Scrapy, and I am sorry if this question is trivial. I have read the Scrapy documentation on the official webpage, and while looking through it I came across this example:

    import scrapy
    from myproject.items import MyItem

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            'http://www.example.com/2.html',
            'http://www.example.com/3.html',
        ]

        def parse(self, response):
            for h3 in response.xpath('//h3').extract():
                yield MyItem(title=h3)
            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url
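As for where the yielded values go: the Scrapy engine consumes whatever parse() yields; Request objects go back to the scheduler to be downloaded, while items are passed to the item pipelines (and to any feed exporter configured with -o). A minimal, hypothetical pipeline illustrating that hand-off:

    # Each item yielded by parse() ends up in process_item(); Requests do not.
    class PrintTitlePipeline:
        def process_item(self, item, spider):
            print(item["title"])
            return item

Such a pipeline would be enabled through the ITEM_PIPELINES setting, e.g. ITEM_PIPELINES = {'myproject.pipelines.PrintTitlePipeline': 300}.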

How to send post data in start_urls of the scrapy spider

你说的曾经没有我的故事 submitted on 2019-11-30 19:33:33
I want to crawl a website which supports only POST data. I want to send the query params as POST data in all the requests. How do I achieve this?

alecxe: POST requests can be made using Scrapy's Request or FormRequest classes. Also, consider using the start_requests() method instead of the start_urls property. Example:

    from scrapy import Spider
    from scrapy.http import FormRequest

    class myspiderSpider(Spider):
        name = "myspider"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            return [
                FormRequest("http://www.example.com/login",
                            formdata={'someparam': 'foo', 'otherparam': 'bar'},
                            callback=self.parse)
            ]

Hope that

Twisted Python Failure - Scrapy Issues

时间秒杀一切 submitted on 2019-11-30 18:33:47
Question: I am trying to use Scrapy to scrape this website's search results for any search query - http://www.bewakoof.com. The website uses AJAX (in the form of XHR) to display the search results. I managed to track the XHR, and you can see it in my code below (inside the for loop, where I store the URL in temp and increment 'i' in the loop):

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerProcess, CrawlerRunner
    import scrapy
    from scrapy.utils.log import
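Not the asker's code, but a minimal sketch of driving a spider from a script with CrawlerProcess, which starts and stops the Twisted reactor itself; mixing CrawlerRunner with a manually managed reactor is a common source of twisted.python.failure.Failure tracebacks. The search-endpoint URL pattern here is a placeholder, not the site's real XHR address.

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class SearchSpider(scrapy.Spider):
        name = "bewakoof_search"

        def start_requests(self):
            # The site loads results through an XHR endpoint; request it directly,
            # page by page (placeholder URL pattern).
            for i in range(1, 4):
                url = "http://www.bewakoof.com/search?page=%d" % i
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            yield {"url": response.url, "status": response.status}

    if __name__ == "__main__":
        process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
        process.crawl(SearchSpider)
        process.start()  # blocks until the crawl finishes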

Scrapy file download how to use custom filename

纵然是瞬间 submitted on 2019-11-30 18:29:03
Question: For my Scrapy project I'm currently using the FilesPipeline. The downloaded files are stored with a SHA1 hash of their URLs as the file names:

    [(True,
      {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
       'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
       'url': 'http://www.example.com/files/product1.pdf'}),
     (False, Failure(...))]

How can I store the files using my custom file names instead? In the example above, I would want the file name to be "product1
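A common approach is to subclass FilesPipeline and override file_path(), the method that produces the hashed name. A minimal sketch, assuming Scrapy 2.4+ where file_path() also receives the item as a keyword argument:

    import os
    from urllib.parse import urlparse

    from scrapy.pipelines.files import FilesPipeline

    class CustomNameFilesPipeline(FilesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # Keep the original filename from the download URL, e.g. product1.pdf.
            return "full/" + os.path.basename(urlparse(request.url).path)

The subclass would then replace scrapy.pipelines.files.FilesPipeline in the ITEM_PIPELINES setting. Note that identically named files from different URLs would overwrite each other with this scheme.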
