scrapy

Scrapy not downloading images and getting pipeline error

雨燕双飞 submitted on 2019-12-24 07:26:44
Question: I have this pipeline code:

    class MyImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield Request(image_url)

and this is the spider, subclassed from BaseSpider. This spider is giving me nightmares:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//strong[@class="genmed"]')
        items = []
        for site in sites[:5]:
            item = PanduItem()
            item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
            item[
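For the images pipeline to download anything, the pipeline also has to be enabled and given a storage directory, and the item must expose the standard image_urls/images fields. A minimal sketch, assuming a project named myproject and an arbitrary storage path (both are assumptions, not from the original post):

    # settings.py (assumed project settings module)
    ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
    IMAGES_STORE = '/tmp/images'  # must be a writable directory

    # pipelines.py
    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline

    class MyImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # One download request per URL stored in the item's image_urls field
            for image_url in item['image_urls']:
                yield Request(image_url)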

How to get all pages from the whole website using python?

只谈情不闲聊 submitted on 2019-12-24 07:10:09
Question: I am trying to build a tool that should get every link from a website. For example, I need to get all question pages from Stack Overflow. I tried using Scrapy:

    class MySpider(CrawlSpider):
        name = 'myspider'
        start_urls = ['https://stackoverflow.com/questions/']

        def parse(self, response):
            le = LinkExtractor()
            for link in le.extract_links(response):
                url_lnk = link.url
                print(url_lnk)

Here I got only the questions from the start page. What do I need to do to get all 'question' links? Time doesn't matter, I just
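One way to crawl beyond the start page is to let CrawlSpider's rules follow the pagination and hand question pages to a callback, instead of overriding parse (which CrawlSpider uses internally). A minimal sketch; the allow patterns assume Stack Overflow's /questions?page=N listing URLs and /questions/<id>/ question URLs, and are not taken from the original post:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class QuestionsSpider(CrawlSpider):
        name = 'questions'
        allowed_domains = ['stackoverflow.com']
        start_urls = ['https://stackoverflow.com/questions/']

        rules = (
            # Keep following the paginated listing pages
            Rule(LinkExtractor(allow=r'/questions\?page=\d+'), follow=True),
            # Hand individual question pages to the callback
            Rule(LinkExtractor(allow=r'/questions/\d+/'), callback='parse_question'),
        )

        def parse_question(self, response):
            yield {'url': response.url}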

Callback functions are not fired

点点圈 submitted on 2019-12-24 06:37:16
Question: I am trying to scrape MichaelKors.com. I had been having success until now; my script just stopped working. The callback functions are not being fired. I have removed everything from my functions, and even then they are not being called. Here is my code:

    class MichaelKorsClass(CrawlSpider):
        name = 'michaelkors'
        allowed_domains = ['www.michaelkors.com']
        start_urls = ['https://www.michaelkors.com/women/clothing/dresses/_/N-28ei']
        rules = (
            # Rule(LinkExtractor(allow=('(.*\/_\/R-\w\w_)([\-a-zA-Z0-9]*)$',
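Two things worth ruling out with a CrawlSpider: overriding parse (which CrawlSpider needs for its own link handling), and the site starting to block requests so that no page ever reaches the callbacks. A minimal sketch for checking both; the parse_product name and the allow pattern are assumptions based on the /_/R- fragment visible in the commented rule above:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MichaelKorsClass(CrawlSpider):
        name = 'michaelkors'
        allowed_domains = ['www.michaelkors.com']
        start_urls = ['https://www.michaelkors.com/women/clothing/dresses/_/N-28ei']

        rules = (
            # Callback goes on the rule; do not define a parse() method on a CrawlSpider
            Rule(LinkExtractor(allow=r'/_/R-'), callback='parse_product', follow=True),
        )

        def parse_product(self, response):
            # If this never logs, check the crawl log for 403s, redirects, or filtered requests
            self.logger.info('Product page: %s (status %s)', response.url, response.status)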

Close a scrapy spider when a condition is met and return the output object

半腔热情 submitted on 2019-12-24 06:33:43
Question: I have made a spider with Scrapy to get reviews from a page like this one. I want product reviews only up to a certain date (2nd July 2016 in this case). I want to close my spider as soon as the review date goes earlier than the given date, and return the list of items. The spider is working well, but my problem is that I am not able to close it when the condition is met; if I raise an exception, the spider closes without returning anything. Please suggest the best way to close the spider manually.
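One common approach is scrapy.exceptions.CloseSpider: yield every item that still satisfies the date condition, then raise CloseSpider, so everything yielded before the exception is still collected by the feed exporter or item pipelines. A minimal sketch; the URL, CSS selectors, and date format are placeholders, not taken from the original spider:

    from datetime import datetime
    import scrapy
    from scrapy.exceptions import CloseSpider

    CUTOFF = datetime(2016, 7, 2)

    class ReviewSpider(scrapy.Spider):
        name = 'reviews'
        start_urls = ['https://example.com/product/reviews']  # placeholder URL

        def parse(self, response):
            for review in response.css('.review'):  # assumed selector
                review_date = datetime.strptime(
                    review.css('.date::text').get(), '%d %B %Y')  # assumed date format
                if review_date < CUTOFF:
                    # Items yielded before this point have already been exported
                    raise CloseSpider('reached reviews older than the cutoff date')
                yield {'date': review_date.isoformat(),
                       'text': review.css('.text::text').get()}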

Setting start_urls for Scrapy outside of the class

那年仲夏 submitted on 2019-12-24 06:07:38
Question: I am new to Scrapy. How can I pass start_urls from outside of the class? I tried to define start_urls outside the class, but it didn't work. What I am trying to do is create a file whose name comes from a dictionary (search_dict) and use the corresponding value as a start URL for Scrapy:

    search_dict = {'hello world': 'https://www.google.com/search?q=hello+world',
                   'my code': 'https://www.google.com/search?q=stackoverflow+questions',
                   'test': 'https://www.google.com/search?q="test"'}

    class googlescraper(scrapy.Spider):
        name
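One standard way is to pass the URL in as a spider argument and build start_urls in __init__, so nothing has to live at module scope. A minimal sketch; the spider name and the start_url argument name are assumptions:

    import scrapy

    class GoogleScraper(scrapy.Spider):
        name = 'googlescraper'

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Whatever is passed with `-a start_url=...` becomes the only start URL
            self.start_urls = [start_url] if start_url else []

        def parse(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}

    # Run from the command line, e.g.:
    #   scrapy crawl googlescraper -a start_url="https://www.google.com/search?q=hello+world"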

Scrapy FakeUserAgentError: Error occurred during getting browser

戏子无情 submitted on 2019-12-24 05:09:09
Question: I use Scrapy FakeUserAgent and keep getting this error on my Linux server:

    Traceback (most recent call last):
      File "/usr/local/lib64/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
        result = g.send(result)
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
        response = yield method(request=request, spider=spider)
      File "/usr/local/lib/python2.7/site-packages/scrapy_fake_useragent/middleware.py", line
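FakeUserAgentError usually means the underlying fake-useragent package failed to fetch or parse its online browser data, which happens intermittently on servers. scrapy-fake-useragent supports a fallback user-agent string for exactly this case; a minimal settings sketch (the fallback string is an arbitrary example, and FAKEUSERAGENT_FALLBACK applies to the versions of the middleware from that era):

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    }

    # Used when fake-useragent cannot retrieve its browser data
    FAKEUSERAGENT_FALLBACK = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'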

InterfaceError: (sqlite3.InterfaceError) Error binding parameter 0

让人想犯罪 __ submitted on 2019-12-24 05:03:03
Question: Recently, I used Python and Scrapy to crawl article information, such as the 'title', from a blog. Without a database, the results are fine / as expected. However, when I use SQLAlchemy, I receive the following error:

    InterfaceError: (sqlite3.InterfaceError) Error binding parameter 0 - probably unsupported type.
    [SQL: u'INSERT INTO myblog (title) VALUES (?)']
    [parameters: ([u'\r\n Accelerated c++\u5b66\u4e60 chapter3 -----\u4f7f\u7528\u6279\u636e \r\n '],)]

My XPath expression is:

    item['title'] = sel
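The bound parameter shown in the error is a Python list (note the square brackets): .extract() returns a list of strings, and sqlite3 cannot bind a list to a single column. A minimal sketch of the usual fix, with an assumed XPath since the original expression is cut off:

    # Inside the spider's parse callback; the XPath itself is an assumption
    title = sel.xpath('.//h2/a/text()').extract_first(default='')
    item['title'] = title.strip()  # bind a plain, trimmed string instead of a list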

How to install Scrapy on Ubuntu 16.04?

岁酱吖の submitted on 2019-12-24 04:57:35
Question: I followed the official guide, but got this error message:

    The following packages have unmet dependencies:
     scrapy : Depends: python-support (>= 0.90.0) but it is not installable
              Recommends: python-setuptools but it is not going to be installed
    E: Unable to correct problems, you have held broken packages.

I then tried sudo apt-get python-support, but found that Ubuntu 16.04 has removed python-support. Lastly, I tried to install python-setuptools, but it seems it would only install the Python 2 version instead. The
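The usual way around the outdated apt package on Ubuntu 16.04 is to install only the build dependencies with apt and install Scrapy itself with pip inside a virtual environment, as the Scrapy installation docs suggest. A minimal sketch of that route (the virtualenv path is arbitrary):

    sudo apt-get install python3 python3-dev python3-pip python3-venv \
        libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
    python3 -m venv ~/scrapy-venv
    source ~/scrapy-venv/bin/activate
    pip install scrapy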

Scrapy Follow & Scrape next Pages

自闭症网瘾萝莉.ら submitted on 2019-12-24 04:27:10
Question: I am having a problem where none of my Scrapy spiders will crawl a website; they just scrape one page and cease. I was under the impression that the rules member variable was responsible for this, but I can't get it to follow any links. I have been following the documentation here: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider. What could I be missing that is keeping my bots from crawling?

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors
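Beyond not overriding parse, one detail that often stops a CrawlSpider after the first page is that when a rule has a callback, follow defaults to False, so the crawl goes no further unless follow=True is set explicitly. A minimal rules sketch using the current (non-contrib) import paths; the allow pattern, callback name, and URL are placeholders:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FollowingSpider(CrawlSpider):
        name = 'following'
        start_urls = ['https://example.com/']  # placeholder

        rules = (
            # follow=True keeps the crawl going past the pages that match the callback
            Rule(LinkExtractor(allow=r'/category/'), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            yield {'url': response.url}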

How to use Scrapy for Amazon.com links after the "Next" button?

不问归期 submitted on 2019-12-24 02:23:53
Question: I am relatively new to Python and Scrapy. I'm trying to scrape the links in "Customers who bought this item also bought". For example: http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/. There are 17 pages of "Customers who bought this item also bought". If I ask Scrapy to scrape that URL, it only scrapes the first page (6 items). How do I ask Scrapy to press the "Next" button to scrape all the items across the 17 pages? A sample code (just the part that matters in the
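The general Scrapy pattern is to extract whatever link or URL the "Next" control points at and yield a new request back into the same callback; on Amazon that carousel is loaded via AJAX, so the real next-page URL has to be found in the browser's network tab. A generic pagination sketch only; the CSS selectors are assumptions and do not reflect Amazon's actual markup:

    import scrapy

    class AlsoBoughtSpider(scrapy.Spider):
        name = 'also_bought'
        start_urls = ['http://www.amazon.com/dp/B001AFF266/']

        def parse(self, response):
            # Collect the item links visible on the current carousel page
            for href in response.css('a.also-bought::attr(href)').getall():  # assumed selector
                yield {'link': response.urljoin(href)}

            # Queue the next carousel page, if there is one, with the same callback
            next_href = response.css('a.next-page::attr(href)').get()  # assumed selector
            if next_href:
                yield response.follow(next_href, callback=self.parse)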