scrapy

python exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in

Submitted by 雨燕双飞 on 2019-12-22 04:56:08
Question: I am using Scrapy with Python and I have this code in an item pipeline:

    def process_item(self, item, spider):
        import pdb; pdb.set_trace()
        ID = str(uuid.uuid5(uuid.NAMESPACE_DNS, item['link']))

I got this error:

    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\utils
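
The usual cause on Python 2 (which the traceback paths indicate) is that uuid.uuid5() concatenates its byte-string namespace with the name, so a unicode name containing non-ASCII bytes triggers an implicit ascii decode. A minimal sketch of the common fix, assuming item['link'] may arrive as a unicode string:

    # Python 2: uuid5 mixes bytes and unicode internally, so encode first
    link = item['link']
    if isinstance(link, unicode):
        link = link.encode('utf-8')
    ID = str(uuid.uuid5(uuid.NAMESPACE_DNS, link))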

Running Scrapy from a script not including pipeline

Submitted by 喜夏-厌秋 on 2019-12-22 04:43:45
Question: I'm running Scrapy from a script but all it does is activate the spider. It doesn't go through my item pipeline. I've read http://scrapy.readthedocs.org/en/latest/topics/practices.html but it doesn't say anything about including pipelines. My setup:

    Scraper/
        scrapy.cfg
        ScrapyScript.py
        Scraper/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                my_spider.py

My script:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
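
A likely explanation: constructing a bare Settings() object (as the old Crawler API shown here encourages) never reads the project's settings.py, so ITEM_PIPELINES stays empty. A minimal sketch of the usual fix with the current API, assuming the script runs from the project root so scrapy.cfg is found; MySpider stands in for the spider class defined in my_spider.py:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # get_project_settings() loads settings.py, including ITEM_PIPELINES
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes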

Running Scrapy from a script with file output

Submitted by 淺唱寂寞╮ on 2019-12-22 04:42:53
Question: I'm currently using Scrapy with the following command-line arguments:

    scrapy crawl my_spider -o data.json

However, I'd prefer to 'save' this command in a Python script. Following https://doc.scrapy.org/en/latest/topics/practices.html, I have the following script:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from apkmirror_scraper.spiders.sitemap_spider import ApkmirrorSitemapSpider

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
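
One way to reproduce the effect of -o data.json inside the script is via the feed-export settings; a minimal sketch, assuming the pre-2.1 FEED_FORMAT/FEED_URI setting names that match the Scrapy versions of this era:

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',    # what the -o extension would select
        'FEED_URI': 'data.json',  # what the -o path would set
    })
    process.crawl(ApkmirrorSitemapSpider)
    process.start()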

Microsoft Visual C++ 14.0 is required.

Submitted by 不打扰是莪最后的温柔 on 2019-12-22 04:37:31
Question: When I install the scrapy package, the following error occurs:

    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

But that website is not found, so how do I solve the problem?

Answer 1: The package is asking for the VS2015 build tools, which are now available as part of the VS2017 build tools. Download them here, or more specifically, here.

Answer 2: You need to install the latest version of Visual Studio.

Is scrapy supported on google app engine?

Submitted by 孤人 on 2019-12-22 04:32:25
Question: It has the following dependencies:

- Twisted 2.5.0, 8.0 or above
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson
- pyopenssl

Answer 1: You cannot use C extensions on App Engine, which rules out lxml and (I believe) libxml2 and pyopenssl. I doubt most of what Twisted does is possible in the App Engine sandbox either; you can't directly open sockets or spawn threads. EDIT (January 2013): The Python 2.7 runtime does include some C extensions, including

Async query database for keys to use in multiple requests

Submitted by 那年仲夏 on 2019-12-22 01:22:20
Question: I want to asynchronously query a database for keys, then make requests to several URLs for each key. I have a function that returns a Deferred from the database whose value is the key for several requests. Ideally, I would call this function and return a generator of Deferreds from start_requests.

    @inlineCallbacks
    def get_request_deferred(self):
        d = yield engine.execute(select([table]))  # async
        d.addCallback(make_url)
        d.addCallback(Request)
        return d

    def start_requests(self):
        ????

But
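
For reference, inside an @inlineCallbacks function, `yield` already resolves the Deferred, so the snippet above chains callbacks onto a result rather than a Deferred. A sketch of one common workaround: resolve the keys first, then hand requests straight to the running engine (engine, select, table, and make_url come from the question; engine.crawl's exact signature varies across Scrapy versions, so treat this as an assumption):

    from twisted.internet.defer import inlineCallbacks
    from scrapy import Request

    @inlineCallbacks
    def schedule_requests(self):
        # `rows` is the resolved query result here, not a Deferred
        rows = yield engine.execute(select([table]))
        for row in rows:
            # feed each request to the engine that is already running
            self.crawler.engine.crawl(Request(make_url(row)), self)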

How to password protect Scrapyd UI?

Submitted by 和自甴很熟 on 2019-12-22 00:49:47
Question: I have my website available to the public, and Scrapyd is running at port 6800, like http://website.com:6800/. I do not want anyone to see the list of my crawlers. I know anyone can easily guess, type in port 6800, and see what's going on. I have a few questions; answering any of them will help me.

- Is there a way to password protect the Scrapyd UI?
- Can I password protect a specific port on Linux? I know it can be done with iptables to ONLY ALLOW PARTICULAR IPs, but that's not a good solution.
- Should I make
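
Scrapyd versions of this era have no built-in authentication, so the standard answer to the first question is to bind Scrapyd to localhost only and put a reverse proxy with HTTP basic auth in front of it. A minimal nginx sketch, assuming an /etc/nginx/.htpasswd file created with htpasswd and 6801 chosen arbitrarily as the public port:

    server {
        listen 6801;
        location / {
            auth_basic "Scrapyd";
            auth_basic_user_file /etc/nginx/.htpasswd;
            proxy_pass http://127.0.0.1:6800/;
        }
    }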

Unable to use proxies in Scrapy project

Submitted by 我的梦境 on 2019-12-22 00:34:33
Question: I have been trying to crawl a website that has seemingly identified and blocked my IP and is throwing a 429 Too Many Requests response. I installed scrapy-proxies from this link: https://github.com/aivarsk/scrapy-proxies and followed the given instructions. I got a list of proxies from here: http://www.gatherproxy.com/ and here is how my settings.py and proxylist.txt look:

settings.py:

    BOT_NAME = 'project'
    SPIDER_MODULES = ['project.spiders']
    NEWSPIDER_MODULE = 'project.spiders'
    #
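
For comparison, the configuration the scrapy-proxies README describes looks roughly like this; adding 429 to RETRY_HTTP_CODES is an assumption for this case, since the question's settings block is truncated before the middleware section:

    RETRY_TIMES = 10
    # retry (and rotate proxies) on these responses; 429 added for this case
    RETRY_HTTP_CODES = [429, 500, 503, 504, 400, 403, 404, 408]

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'scrapy_proxies.RandomProxy': 100,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    }

    PROXY_LIST = '/path/to/proxylist.txt'  # placeholder path
    PROXY_MODE = 0  # 0 = pick a random proxy per request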

Scrapy: populate items with item loaders over multiple pages

Submitted by 左心房为你撑大大i on 2019-12-22 00:27:59
Question: I'm trying to crawl and scrape multiple pages, given multiple URLs. I am testing with Wikipedia, and to make it easier I just used the same XPath selector for each page, but I eventually want to use many different XPath selectors unique to each page, so each page has its own separate parsePage method. This code works perfectly when I don't use item loaders and just populate items directly. When I use item loaders, the items are populated strangely, and it seems to be completely ignoring the
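
A common pattern for loaders across pages is to load the partial item on each page and re-wrap it in a fresh ItemLoader bound to the next response. A minimal sketch, assuming hypothetical names (MyItem, next_url, the field names) and a two-page chain of spider callbacks:

    from scrapy import Request
    from scrapy.loader import ItemLoader

    def parse(self, response):
        loader = ItemLoader(item=MyItem(), response=response)
        loader.add_xpath('title', '//h1/text()')
        # carry the partially filled item forward, not the loader itself
        yield Request(next_url, callback=self.parse_page2,
                      meta={'item': loader.load_item()})

    def parse_page2(self, response):
        # new loader bound to THIS response, wrapping the carried item
        loader = ItemLoader(item=response.meta['item'], response=response)
        loader.add_xpath('body', '//p/text()')
        yield loader.load_item()

Passing the loader object itself through meta tends to misbehave because it stays bound to the earlier response; carrying the item and re-wrapping it, as above, is the commonly cited workaround.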

How to get the followers of a person as well as comments under the photos on Instagram using Scrapy?

Submitted by 牧云@^-^@ on 2019-12-21 22:40:27
Question: As you can see, the following JSON has the number of followers as well as the number of comments, but how can I access the data within each comment, as well as the IDs of followers, so I could crawl into them?

    {
        "logging_page_id": "profilePage_20327023",
        "user": {
            "biography": null,
            "blocked_by_viewer": false,
            "connected_fb_page": null,
            "country_block": false,
            "external_url": null,
            "external_url_linkshimmed": null,
            "followed_by": {
                "count": 2585
            },
            "followed_by_viewer": false,
            "follows": {
                "count": 561
            },
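
For the part visible in this snippet, reading the counts is plain dictionary navigation once the JSON is parsed; a minimal sketch, assuming `response` holds this document (note the per-follower IDs and per-comment text are not present in this payload and would require Instagram's paginated endpoints):

    import json

    data = json.loads(response.text)
    user = data['user']
    followers = user['followed_by']['count']  # 2585 in the sample
    following = user['follows']['count']      # 561 in the sample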