scrapy

Preferred way to run Scrapyd in the background / as a service

允我心安 submitted on 2019-12-22 09:47:02
Question: I am trying to run Scrapyd on a virtual Ubuntu 16.04 server, which I connect to via SSH. When I start Scrapyd by simply running $ scrapyd, I can reach the web interface at http://82.165.102.18:6800. However, once I close the SSH connection the web interface is no longer available, so I think I need to run Scrapyd in the background as a service somehow. After some research I came across a few proposed solutions: daemon (sudo apt install daemon) screen (sudo apt install
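
One common way to keep Scrapyd alive after the SSH session ends is a systemd unit. The following is a minimal sketch only; the user, working directory, and ExecStart path are assumptions that depend on how and where scrapyd is installed on the server (check with `which scrapyd`).

    [Unit]
    Description=Scrapyd service
    After=network.target

    [Service]
    # Assumed account and paths; adjust to your setup.
    User=ubuntu
    WorkingDirectory=/home/ubuntu
    ExecStart=/usr/local/bin/scrapyd
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Saved for example as /etc/systemd/system/scrapyd.service, it can be activated with sudo systemctl daemon-reload followed by sudo systemctl enable --now scrapyd, after which Scrapyd keeps running when the SSH session closes and restarts on reboot.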

How to crawl links on all pages of a web site with Scrapy

霸气de小男生 submitted on 2019-12-22 09:25:30
Question: I'm learning Scrapy and I'm trying to extract all links of the form "http://lattes.cnpq.br/" followed by a sequence of numbers, for example: http://lattes.cnpq.br/0281123427918302. But I don't know which page on the web site contains this information. For example, on this web site: http://www.ppgcc.ufv.br/ the links that I want are on this page: http://www.ppgcc.ufv.br/?page_id=697. What could I do? I'm trying to use rules but I don't know how to use regular expressions correctly. Thank you
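
When you don't know which page holds the links, one approach is a CrawlSpider that follows every internal page and, in its callback, pulls out any href matching the Lattes pattern with a regular expression. A minimal sketch, assuming the whole site can be crawled from the front page:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class LattesLinkSpider(CrawlSpider):
        name = "lattes_links"
        allowed_domains = ["ppgcc.ufv.br"]
        start_urls = ["http://www.ppgcc.ufv.br/"]

        # Follow every internal page and hand each response to parse_page.
        rules = (
            Rule(LinkExtractor(allow_domains=["ppgcc.ufv.br"]),
                 callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            # Collect hrefs matching "http://lattes.cnpq.br/<digits>" without visiting them.
            for href in response.xpath("//a/@href").re(r"https?://lattes\.cnpq\.br/\d+"):
                yield {"lattes_url": href, "found_on": response.url}

Note that the regex lives in the callback rather than in a Rule, so the spider records the Lattes URLs instead of trying to crawl them.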

How many items have been scraped per start_url

ぃ、小莉子 submitted on 2019-12-22 08:50:19
Question: I use Scrapy to crawl 1000 urls and store the scraped items in MongoDB. I'd like to know how many items have been found for each url. From the Scrapy stats I can see 'item_scraped_count': 3500. However, I need this count for each start_url separately. There is also a referer field on each item that I might use to count each url's items manually: 2016-05-24 15:15:10 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=6w-_ucPV674> (referer: https://www.youtube.com/results?q=billys&sp=EgQIAhAB)
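
One hedged way to get a per-start_url count, instead of parsing referers out of the log, is to tag every request with its originating start_url via meta and bump a per-URL stats counter whenever an item is produced. A minimal sketch with placeholder URLs and selectors, assuming a reasonably recent Scrapy:

    import scrapy


    class PerStartUrlSpider(scrapy.Spider):
        name = "per_start_url"
        start_urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

        def start_requests(self):
            for url in self.start_urls:
                # Remember which start_url this request (and its descendants) came from.
                yield scrapy.Request(url, meta={"start_url": url})

        def parse(self, response):
            start_url = response.meta["start_url"]
            for link in response.css("a::attr(href)").getall():  # placeholder selector
                # Propagate the tag to follow-up requests.
                yield response.follow(link, callback=self.parse_item,
                                      meta={"start_url": start_url})

        def parse_item(self, response):
            start_url = response.meta["start_url"]
            # One stats counter per start_url, visible in the final stats dump.
            self.crawler.stats.inc_value("item_scraped_count/" + start_url)
            yield {"url": response.url, "start_url": start_url}

The same counting could instead live in an item pipeline if the start_url is copied onto the item itself.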

Scrapy & captcha

99封情书 submitted on 2019-12-22 08:34:49
Question: I use Scrapy to submit a form on the site https://www.barefootstudent.com/jobs (on any of its listing pages, e.g. http://www.barefootstudent.com/los_angeles/jobs/full_time/full_time_nanny_needed_in_venice_217021). My Scrapy bot successfully logs in, but I cannot get past the captcha. To submit the form I use scrapy.FormRequest.from_response: frq = scrapy.FormRequest.from_response(response, formdata={'message': 'itttttttt', 'security': captcha, 'name': 'fx', 'category_id': '2', 'email': 'ololo%40gmail.com', 'item_id':
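
Scrapy itself cannot solve a captcha; the usual pattern is to download the captcha image, solve it (manually or through an external solver), and only then build the FormRequest. Below is a minimal sketch assuming the session is already logged in; the img selector, file name, and form field names other than 'security' and 'message' are assumptions about the page markup, and the manual input() call is just a placeholder for a real solver:

    import scrapy


    class BfsFormSpider(scrapy.Spider):
        name = "bfs_form"
        start_urls = [
            "http://www.barefootstudent.com/los_angeles/jobs/full_time/full_time_nanny_needed_in_venice_217021"
        ]

        def parse(self, response):
            # Assumed selector for the captcha image; adjust to the real markup.
            captcha_url = response.css("img.captcha::attr(src)").get()
            yield scrapy.Request(
                response.urljoin(captcha_url),
                callback=self.solve_captcha,
                meta={"form_response": response},  # keep the page that holds the form
            )

        def solve_captcha(self, response):
            with open("captcha.jpg", "wb") as f:
                f.write(response.body)
            # Placeholder: type what captcha.jpg shows, or call an OCR/solver service here.
            captcha_text = input("Captcha text: ")
            form_response = response.meta["form_response"]
            yield scrapy.FormRequest.from_response(
                form_response,
                formdata={"message": "itttttttt", "security": captcha_text},
                callback=self.after_submit,
            )

        def after_submit(self, response):
            self.logger.info("Form submitted, got %s", response.url)

The key point is that from_response is called on the stored form page, not on the image response, so the hidden form fields are still filled in automatically.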

Python Scrapy & Yield

折月煮酒 submitted on 2019-12-22 08:05:37
Question: I am currently developing a scraper using Scrapy for the first time, and I am using yield for the first time as well. I am still trying to wrap my head around yield. The scraper: scrapes one page to get a list of dates (parse); uses these dates to format URLs which it then scrapes (parse_page_contents); on each of those pages, it finds the URLs of the individual listings and scrapes them (parse_page_listings). On the individual listings I want to extract all the data. There are also 4 links on each
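
The usual shape for this kind of chain is that each callback yields the Requests for the next level instead of returning them, so Scrapy can schedule them as they are produced. A minimal sketch using the callback names from the question; the URLs, selectors, and item fields are placeholders:

    import scrapy


    class ListingSpider(scrapy.Spider):
        name = "listings"
        start_urls = ["https://example.com/dates"]  # placeholder

        def parse(self, response):
            # 1. Collect the dates and yield one Request per formatted URL.
            for date in response.css("li.date::text").getall():  # placeholder selector
                url = "https://example.com/listings/{}".format(date)
                yield scrapy.Request(url, callback=self.parse_page_contents)

        def parse_page_contents(self, response):
            # 2. Yield a Request for every individual listing on the date page.
            for href in response.css("a.listing::attr(href)").getall():  # placeholder
                yield response.follow(href, callback=self.parse_page_listings)

        def parse_page_listings(self, response):
            # 3. Yield the scraped item. Unlike return, yield lets one callback
            #    emit many requests/items, which Scrapy consumes lazily.
            yield {
                "url": response.url,
                "title": response.css("h1::text").get(),  # placeholder field
            }
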

Scrapy - Scraping different web pages in one scrapy script

岁酱吖の submitted on 2019-12-22 07:59:20
Question: I'm creating a web app that scrapes a long list of shoes from different websites. Here are my two individual Scrapy scripts: http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3

from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3']

    def parse(self, response):
        shoes = response.xpath('//*[@class=
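
One hedged way to merge two such scripts is a single spider that lists both start URLs and dispatches to per-site parsing based on the response URL. The sketch below keeps the Nike start URL from the question; the second site's domain, URL, and all selectors are placeholders:

    from scrapy import Spider


    class ShoesSpider(Spider):
        name = "shoes"
        allowed_domains = ["store.nike.com", "www.example-shoes.com"]  # second domain is a placeholder
        start_urls = [
            "http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3",
            "https://www.example-shoes.com/soccer",  # placeholder URL
        ]

        def parse(self, response):
            # Route each response to the site-specific parser.
            if "nike.com" in response.url:
                yield from self.parse_nike(response)
            else:
                yield from self.parse_other(response)

        def parse_nike(self, response):
            for shoe in response.xpath('//*[@class="grid-item-info"]'):  # placeholder class
                yield {"site": "nike", "name": shoe.xpath(".//p/text()").get()}

        def parse_other(self, response):
            for shoe in response.css("div.product-card"):  # placeholder selector
                yield {"site": "other", "name": shoe.css("span.name::text").get()}

Keeping the two spiders separate and running them from a single CrawlerProcess script is an equally valid design; merging only makes sense when the items share one schema.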

How do I use Scrapy to crawl within pages?

匆匆过客 submitted on 2019-12-22 06:56:55
Question: I am using Python and Scrapy for this question. I am attempting to crawl webpage A, which contains a list of links to webpages B1, B2, B3, ... Each B page contains a link to another page, C1, C2, C3, ..., which contains an image. So, using Scrapy, the idea in pseudo-code is: links = getlinks(A); for link in links: B = getpage(link); C = getpage(B); image = getimage(C). However, I am running into a problem when trying to parse more than one page in Scrapy. Here is my code: def parse(self,
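
In Scrapy the "get the next page" step of that pseudo-code becomes yielding a Request with a callback for the next level, rather than calling a function that returns the page. A minimal sketch of the A -> B -> C chain; the selectors and URLs are placeholders:

    import scrapy


    class ImageChainSpider(scrapy.Spider):
        name = "image_chain"
        start_urls = ["https://example.com/A"]  # page A, placeholder

        def parse(self, response):
            # Page A: one Request per link to a B page.
            for href in response.css("a.b-link::attr(href)").getall():  # placeholder
                yield response.follow(href, callback=self.parse_b)

        def parse_b(self, response):
            # Page B: follow its single link to the corresponding C page.
            c_href = response.css("a.c-link::attr(href)").get()  # placeholder
            if c_href:
                yield response.follow(c_href, callback=self.parse_c)

        def parse_c(self, response):
            # Page C: extract the image URL as the final item.
            yield {"image_url": response.css("img::attr(src)").get()}
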

Scrapy ITEM_PIPELINES warning

依然范特西╮ submitted on 2019-12-22 06:48:45
Question: I have the following in my settings.py: ITEM_PIPELINES = ['mybot.pipelines.custompipeline'] But when I start Scrapy, I get the following warning: /lib/python2.7/site-packages/scrapy/contrib/pipeline/__init__.py:21: ScrapyDeprecationWarning: ITEM_PIPELINES defined as a list or a set is deprecated, switch to a dict category=ScrapyDeprecationWarning, stacklevel=1) It still seems to be working properly, but what do I need to do in order to remove this warning? Answer 1: see the Scrapy documentation for
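
The warning disappears once ITEM_PIPELINES is declared as a dict mapping each pipeline path to an integer order (0-1000, lower runs first); 300 below is just a conventional mid-range value, not something required by this project.

    # settings.py
    ITEM_PIPELINES = {
        'mybot.pipelines.custompipeline': 300,  # integer sets the pipeline order
    }
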

Python Scrapy - mimetype based filter to avoid non-text file downloads

杀马特。学长 韩版系。学妹 submitted on 2019-12-22 06:47:24
Question: I have a running Scrapy project, but it is bandwidth intensive because it tries to download a lot of binary files (zip, tar, mp3, etc.). I think the best solution is to filter requests based on the mimetype (Content-Type:) HTTP header. I looked at the Scrapy code and found this setting: DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory' I changed it to: DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory' And played a
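
A simpler alternative to swapping the HTTP client factory is a downloader middleware that inspects the Content-Type of each response and discards anything that is not text-like. Note the honest limitation: by the time process_response runs, the body has already been fetched, so this saves parsing and pipeline work rather than bandwidth; truly avoiding the download needs the client-factory route the question is attempting (or filtering by URL extension before the request is made). A minimal sketch:

    from scrapy.exceptions import IgnoreRequest


    class ContentTypeFilterMiddleware(object):
        """Drop responses whose Content-Type is not text-like."""

        ALLOWED_PREFIXES = (b"text/", b"application/xhtml", b"application/xml")

        def process_response(self, request, response, spider):
            content_type = response.headers.get("Content-Type", b"")
            if content_type.startswith(self.ALLOWED_PREFIXES):
                return response
            spider.logger.debug("Dropping %s (%s)", response.url, content_type)
            raise IgnoreRequest("Non-text Content-Type: %r" % content_type)

The class still has to be enabled in DOWNLOADER_MIDDLEWARES in settings.py with an order value of your choosing.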

Scrape ajax web page with python and/or scrapy

青春壹個敷衍的年華 submitted on 2019-12-22 05:13:40
Question: What I want to do is scrape petition data - name, city, state, date, signature number - from one or more petitions at petitions.whitehouse.gov. I assume at this point that Python is the way to go - probably the Scrapy library - along with some functions to deal with the ajax aspects of the site. The reason for this scraper is that this petition data is not available to the public. I am a freelance tech journalist and I want to be able to dump each petition's data into a CSV file in order to analyze
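
For ajax-loaded signature lists, one hedged approach is to open the browser's network tab, find the XHR request that returns the signature data as JSON, and have Scrapy call that endpoint directly instead of rendering the page. The endpoint URL, pagination key, and field names below are placeholders, not the real petitions.whitehouse.gov API:

    import json

    import scrapy


    class PetitionSpider(scrapy.Spider):
        name = "petitions"
        # Placeholder endpoint: copy the real XHR URL from the browser's network tab.
        start_urls = ["https://petitions.whitehouse.gov/placeholder/signatures?offset=0"]

        def parse(self, response):
            data = json.loads(response.text)
            # Field names are assumptions about the JSON shape.
            for sig in data.get("results", []):
                yield {
                    "name": sig.get("name"),
                    "city": sig.get("city"),
                    "state": sig.get("state"),
                    "created": sig.get("created"),
                }
            # Follow ajax pagination if the payload points to more data.
            next_url = data.get("next")
            if next_url:
                yield response.follow(next_url, callback=self.parse)

Running scrapy crawl petitions -o petitions.csv then writes the yielded dicts straight to a CSV file via Scrapy's feed exporter, which covers the journalist's analysis use case without any extra export code.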