scrapy

Scrapy not downloading images and getting pipeline error

雨燕双飞 submitted on 2019-12-24 07:26:44
Question: I have this pipeline code:

    class MyImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield Request(image_url)

and this is the spider, subclassed from BaseSpider. This spider is giving me nightmares:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//strong[@class="genmed"]')
        items = []
        for site in sites[:5]:
            item = PanduItem()
            item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
            item[
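For the images pipeline to download anything, the pipeline also has to be enabled and given a storage directory, and the item must expose the standard image_urls/images fields. A minimal sketch, assuming a project named myproject and an arbitrary storage path (both are assumptions, not from the original post):

    # settings.py (assumed project settings module)
    ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
    IMAGES_STORE = '/tmp/images'  # must be a writable directory

    # pipelines.py
    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline

    class MyImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # One download request per URL stored in the item's image_urls field
            for image_url in item['image_urls']:
                yield Request(image_url)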

How to get all pages from the whole website using python?

只谈情不闲聊 submitted on 2019-12-24 07:10:09
Question: I am trying to build a tool that should get every link from a website. For example, I need to get all question pages from Stack Overflow. I tried using Scrapy:

    class MySpider(CrawlSpider):
        name = 'myspider'
        start_urls = ['https://stackoverflow.com/questions/']

        def parse(self, response):
            le = LinkExtractor()
            for link in le.extract_links(response):
                url_lnk = link.url
                print(url_lnk)

Here I got only the questions from the start page. What do I need to do to get all 'question' links? Time doesn't matter, I just
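One way to crawl beyond the start page is to let CrawlSpider's rules follow the pagination and hand question pages to a callback, instead of overriding parse (which CrawlSpider uses internally). A minimal sketch; the allow patterns assume Stack Overflow's /questions?page=N listing URLs and /questions/<id>/ question URLs, and are not taken from the original post:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class QuestionsSpider(CrawlSpider):
        name = 'questions'
        allowed_domains = ['stackoverflow.com']
        start_urls = ['https://stackoverflow.com/questions/']

        rules = (
            # Keep following the paginated listing pages
            Rule(LinkExtractor(allow=r'/questions\?page=\d+'), follow=True),
            # Hand individual question pages to the callback
            Rule(LinkExtractor(allow=r'/questions/\d+/'), callback='parse_question'),
        )

        def parse_question(self, response):
            yield {'url': response.url}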

Callback functions are not fired

点点圈 submitted on 2019-12-24 06:37:16
Question: I am trying to scrape MichaelKors.com. I had been having success until now; my script just stopped working. The callback functions are not being fired. I have removed everything from my functions, and even then they are not being called. Here is my code:

    class MichaelKorsClass(CrawlSpider):
        name = 'michaelkors'
        allowed_domains = ['www.michaelkors.com']
        start_urls = ['https://www.michaelkors.com/women/clothing/dresses/_/N-28ei']
        rules = (
            # Rule(LinkExtractor(allow=('(.*\/_\/R-\w\w_)([\-a-zA-Z0-9]*)$',
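Two things worth ruling out with a CrawlSpider: overriding parse (which CrawlSpider needs for its own link handling), and the site starting to block requests so that no page ever reaches the callbacks. A minimal sketch for checking both; the parse_product name and the allow pattern are assumptions based on the /_/R- fragment visible in the commented rule above:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MichaelKorsClass(CrawlSpider):
        name = 'michaelkors'
        allowed_domains = ['www.michaelkors.com']
        start_urls = ['https://www.michaelkors.com/women/clothing/dresses/_/N-28ei']

        rules = (
            # Callback goes on the rule; do not define a parse() method on a CrawlSpider
            Rule(LinkExtractor(allow=r'/_/R-'), callback='parse_product', follow=True),
        )

        def parse_product(self, response):
            # If this never logs, check the crawl log for 403s, redirects, or filtered requests
            self.logger.info('Product page: %s (status %s)', response.url, response.status)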

Close a scrapy spider when a condition is met and return the output object

半腔热情 submitted on 2019-12-24 06:33:43
Question: I have made a spider with Scrapy to get reviews from a page like this one. I want product reviews only up to a certain date (2nd July 2016 in this case). I want to close my spider as soon as the review date goes earlier than the given date, and return the list of items. The spider is working well, but my problem is that I am not able to close it when the condition is met; if I raise an exception, the spider closes without returning anything. Please suggest the best way to close the spider manually.
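One common approach is scrapy.exceptions.CloseSpider: yield every item that still satisfies the date condition, then raise CloseSpider, so everything yielded before the exception is still collected by the feed exporter or item pipelines. A minimal sketch; the URL, CSS selectors, and date format are placeholders, not taken from the original spider:

    from datetime import datetime
    import scrapy
    from scrapy.exceptions import CloseSpider

    CUTOFF = datetime(2016, 7, 2)

    class ReviewSpider(scrapy.Spider):
        name = 'reviews'
        start_urls = ['https://example.com/product/reviews']  # placeholder URL

        def parse(self, response):
            for review in response.css('.review'):  # assumed selector
                review_date = datetime.strptime(
                    review.css('.date::text').get(), '%d %B %Y')  # assumed date format
                if review_date < CUTOFF:
                    # Items yielded before this point have already been exported
                    raise CloseSpider('reached reviews older than the cutoff date')
                yield {'date': review_date.isoformat(),
                       'text': review.css('.text::text').get()}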

Setting start_urls for Scrapy outside of the class

那年仲夏 submitted on 2019-12-24 06:07:38
Question: I am new to Scrapy. How can I pass start_urls from outside of the class? I tried to define start_urls outside the class, but it didn't work. What I am trying to do is create a file whose name comes from a dictionary (search_dict) and use the corresponding value as a start URL for Scrapy:

    search_dict = {'hello world': 'https://www.google.com/search?q=hello+world',
                   'my code': 'https://www.google.com/search?q=stackoverflow+questions',
                   'test': 'https://www.google.com/search?q="test"'}

    class googlescraper(scrapy.Spider):
        name
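One standard way is to pass the URL in as a spider argument and build start_urls in __init__, so nothing has to live at module scope. A minimal sketch; the spider name and the start_url argument name are assumptions:

    import scrapy

    class GoogleScraper(scrapy.Spider):
        name = 'googlescraper'

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Whatever is passed with `-a start_url=...` becomes the only start URL
            self.start_urls = [start_url] if start_url else []

        def parse(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}

    # Run from the command line, e.g.:
    #   scrapy crawl googlescraper -a start_url="https://www.google.com/search?q=hello+world"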

Scrapy FakeUserAgentError: Error occurred during getting browser

戏子无情 submitted on 2019-12-24 05:09:09
Question: I use Scrapy FakeUserAgent and keep getting this error on my Linux server:

    Traceback (most recent call last):
      File "/usr/local/lib64/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
        result = g.send(result)
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
        response = yield method(request=request, spider=spider)
      File "/usr/local/lib/python2.7/site-packages/scrapy_fake_useragent/middleware.py", line
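FakeUserAgentError usually means the underlying fake-useragent package failed to fetch or parse its online browser data, which happens intermittently on servers. scrapy-fake-useragent supports a fallback user-agent string for exactly this case; a minimal settings sketch (the fallback string is an arbitrary example, and FAKEUSERAGENT_FALLBACK applies to the versions of the middleware from that era):

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    }

    # Used when fake-useragent cannot retrieve its browser data
    FAKEUSERAGENT_FALLBACK = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'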

InterfaceError: (sqlite3.InterfaceError) Error binding parameter 0

让人想犯罪 __ submitted on 2019-12-24 05:03:03
Question: Recently, I used Python and Scrapy to crawl article information, such as the 'title', from a blog. Without a database, the results are fine / as expected. However, when I use SQLAlchemy, I receive the following error:

    InterfaceError: (sqlite3.InterfaceError) Error binding parameter 0 - probably unsupported type.
    [SQL: u'INSERT INTO myblog (title) VALUES (?)']
    [parameters: ([u'\r\n Accelerated c++\u5b66\u4e60 chapter3 -----\u4f7f\u7528\u6279\u636e \r\n '],)]

My XPath expression is:

    item['title'] = sel
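The bound parameter shown in the error is a Python list (note the square brackets): .extract() returns a list of strings, and sqlite3 cannot bind a list to a single column. A minimal sketch of the usual fix, with an assumed XPath since the original expression is cut off:

    # Inside the spider's parse callback; the XPath itself is an assumption
    title = sel.xpath('.//h2/a/text()').extract_first(default='')
    item['title'] = title.strip()  # bind a plain, trimmed string instead of a list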

How to install Scrapy on Ubuntu 16.04?

岁酱吖の submitted on 2019-12-24 04:57:35
Question: I followed the official guide, but got this error message:

    The following packages have unmet dependencies:
     scrapy : Depends: python-support (>= 0.90.0) but it is not installable
              Recommends: python-setuptools but it is not going to be installed
    E: Unable to correct problems, you have held broken packages.

I then tried sudo apt-get python-support, but found that Ubuntu 16.04 has removed python-support. Lastly, I tried to install python-setuptools, but it seems it would only install the Python 2 version instead. The
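The usual way around the outdated apt package on Ubuntu 16.04 is to install only the build dependencies with apt and install Scrapy itself with pip inside a virtual environment, as the Scrapy installation docs suggest. A minimal sketch of that route (the virtualenv path is arbitrary):

    sudo apt-get install python3 python3-dev python3-pip python3-venv \
        libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
    python3 -m venv ~/scrapy-venv
    source ~/scrapy-venv/bin/activate
    pip install scrapy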

Scrapy Follow & Scrape next Pages

自闭症网瘾萝莉.ら submitted on 2019-12-24 04:27:10
Question: I am having a problem where none of my Scrapy spiders will crawl a website; they just scrape one page and cease. I was under the impression that the rules member variable was responsible for this, but I can't get it to follow any links. I have been following the documentation here: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider. What could I be missing that is keeping my bots from crawling?

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors
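Beyond not overriding parse, one detail that often stops a CrawlSpider after the first page is that when a rule has a callback, follow defaults to False, so the crawl goes no further unless follow=True is set explicitly. A minimal rules sketch using the current (non-contrib) import paths; the allow pattern, callback name, and URL are placeholders:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FollowingSpider(CrawlSpider):
        name = 'following'
        start_urls = ['https://example.com/']  # placeholder

        rules = (
            # follow=True keeps the crawl going past the pages that match the callback
            Rule(LinkExtractor(allow=r'/category/'), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            yield {'url': response.url}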

How to use Scrapy for Amazon.com links after the "Next" button?

不问归期 submitted on 2019-12-24 02:23:53
Question: I am relatively new to Python and Scrapy. I'm trying to scrape the links in "Customers who bought this item also bought". For example: http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/. There are 17 pages of "Customers who bought this item also bought". If I ask Scrapy to scrape that URL, it only scrapes the first page (6 items). How do I ask Scrapy to press the "Next" button to scrape all the items across the 17 pages? A sample code (just the part that matters in the
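The general Scrapy pattern is to extract whatever link or URL the "Next" control points at and yield a new request back into the same callback; on Amazon that carousel is loaded via AJAX, so the real next-page URL has to be found in the browser's network tab. A generic pagination sketch only; the CSS selectors are assumptions and do not reflect Amazon's actual markup:

    import scrapy

    class AlsoBoughtSpider(scrapy.Spider):
        name = 'also_bought'
        start_urls = ['http://www.amazon.com/dp/B001AFF266/']

        def parse(self, response):
            # Collect the item links visible on the current carousel page
            for href in response.css('a.also-bought::attr(href)').getall():  # assumed selector
                yield {'link': response.urljoin(href)}

            # Queue the next carousel page, if there is one, with the same callback
            next_href = response.css('a.next-page::attr(href)').get()  # assumed selector
            if next_href:
                yield response.follow(next_href, callback=self.parse)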