scrapy

python exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in

Submitted by 雨燕双飞 on 2019-12-22 04:56:08
Question: I am using Scrapy with Python and I have this code in an item pipeline:

    def process_item(self, item, spider):
        import pdb; pdb.set_trace()
        ID = str(uuid.uuid5(uuid.NAMESPACE_DNS, item['link']))

I got this error:

    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\utils
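
The usual cause on Python 2 (which the traceback paths indicate) is that uuid.uuid5() concatenates its byte-string namespace with the name, so a unicode name containing non-ASCII bytes triggers an implicit ascii decode. A minimal sketch of the common fix, assuming item['link'] may arrive as a unicode string:

    # Python 2: uuid5 mixes bytes and unicode internally, so encode first
    link = item['link']
    if isinstance(link, unicode):
        link = link.encode('utf-8')
    ID = str(uuid.uuid5(uuid.NAMESPACE_DNS, link))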

Running Scrapy from a script not including pipeline

Submitted by 喜夏-厌秋 on 2019-12-22 04:43:45
Question: I'm running Scrapy from a script but all it does is activate the spider. It doesn't go through my item pipeline. I've read http://scrapy.readthedocs.org/en/latest/topics/practices.html but it doesn't say anything about including pipelines. My setup:

    Scraper/
        scrapy.cfg
        ScrapyScript.py
        Scraper/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                my_spider.py

My script:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
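
A likely explanation: constructing a bare Settings() object (as the old Crawler API shown here encourages) never reads the project's settings.py, so ITEM_PIPELINES stays empty. A minimal sketch of the usual fix with the current API, assuming the script runs from the project root so scrapy.cfg is found; MySpider stands in for the spider class defined in my_spider.py:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # get_project_settings() loads settings.py, including ITEM_PIPELINES
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes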

Running Scrapy from a script with file output

Submitted by 淺唱寂寞╮ on 2019-12-22 04:42:53
Question: I'm currently using Scrapy with the following command-line arguments:

    scrapy crawl my_spider -o data.json

However, I'd prefer to 'save' this command in a Python script. Following https://doc.scrapy.org/en/latest/topics/practices.html, I have the following script:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from apkmirror_scraper.spiders.sitemap_spider import ApkmirrorSitemapSpider

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
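
One way to reproduce the effect of -o data.json inside the script is via the feed-export settings; a minimal sketch, assuming the pre-2.1 FEED_FORMAT/FEED_URI setting names that match the Scrapy versions of this era:

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',    # what the -o extension would select
        'FEED_URI': 'data.json',  # what the -o path would set
    })
    process.crawl(ApkmirrorSitemapSpider)
    process.start()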

Microsoft Visual C++ 14.0 is required.

Submitted by 不打扰是莪最后的温柔 on 2019-12-22 04:37:31
Question: When I install the scrapy package, the following error occurs:

    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

But that website is not found, so how do I solve the problem?

Answer 1: The package is asking for the VS2015 build tools, which are now available as part of the VS2017 build tools. Download them here, or more specifically, here.

Answer 2: You need to install the latest version of Visual Studio.

Is scrapy supported on google app engine?

Submitted by 孤人 on 2019-12-22 04:32:25
Question: It has the following dependencies:

- Twisted 2.5.0, 8.0 or above
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson
- pyopenssl

Answer 1: You cannot use C extensions on App Engine, which rules out lxml and (I believe) libxml2 and pyopenssl. I doubt most of what Twisted does is possible in the App Engine sandbox either; you can't directly open sockets or spawn threads. EDIT (January 2013): The Python 2.7 runtime does include some C extensions, including

Async query database for keys to use in multiple requests

Submitted by 那年仲夏 on 2019-12-22 01:22:20
Question: I want to asynchronously query a database for keys, then make requests to several URLs for each key. I have a function that returns a Deferred from the database whose value is the key for several requests. Ideally, I would call this function and return a generator of Deferreds from start_requests.

    @inlineCallbacks
    def get_request_deferred(self):
        d = yield engine.execute(select([table]))  # async
        d.addCallback(make_url)
        d.addCallback(Request)
        return d

    def start_requests(self):
        ????

But
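
For reference, inside an @inlineCallbacks function, `yield` already resolves the Deferred, so the snippet above chains callbacks onto a result rather than a Deferred. A sketch of one common workaround: resolve the keys first, then hand requests straight to the running engine (engine, select, table, and make_url come from the question; engine.crawl's exact signature varies across Scrapy versions, so treat this as an assumption):

    from twisted.internet.defer import inlineCallbacks
    from scrapy import Request

    @inlineCallbacks
    def schedule_requests(self):
        # `rows` is the resolved query result here, not a Deferred
        rows = yield engine.execute(select([table]))
        for row in rows:
            # feed each request to the engine that is already running
            self.crawler.engine.crawl(Request(make_url(row)), self)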

How to password protect Scrapyd UI?

Submitted by 和自甴很熟 on 2019-12-22 00:49:47
Question: I have my website available to the public, and Scrapyd is running at port 6800, like http://website.com:6800/. I do not want anyone to see the list of my crawlers. I know anyone can easily guess, type in port 6800, and see what's going on. I have a few questions; answering any of them will help me.

- Is there a way to password protect the Scrapyd UI?
- Can I password protect a specific port on Linux? I know it can be done with iptables to ONLY ALLOW PARTICULAR IPs, but that's not a good solution.
- Should I make
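
Scrapyd versions of this era have no built-in authentication, so the standard answer to the first question is to bind Scrapyd to localhost only and put a reverse proxy with HTTP basic auth in front of it. A minimal nginx sketch, assuming an /etc/nginx/.htpasswd file created with htpasswd and 6801 chosen arbitrarily as the public port:

    server {
        listen 6801;
        location / {
            auth_basic "Scrapyd";
            auth_basic_user_file /etc/nginx/.htpasswd;
            proxy_pass http://127.0.0.1:6800/;
        }
    }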

Unable to use proxies in Scrapy project

Submitted by 我的梦境 on 2019-12-22 00:34:33
Question: I have been trying to crawl a website that has seemingly identified and blocked my IP and is throwing a 429 Too Many Requests response. I installed scrapy-proxies from this link: https://github.com/aivarsk/scrapy-proxies and followed the given instructions. I got a list of proxies from here: http://www.gatherproxy.com/ and here is how my settings.py and proxylist.txt look:

settings.py:

    BOT_NAME = 'project'
    SPIDER_MODULES = ['project.spiders']
    NEWSPIDER_MODULE = 'project.spiders'
    #
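
For comparison, the configuration the scrapy-proxies README describes looks roughly like this; adding 429 to RETRY_HTTP_CODES is an assumption for this case, since the question's settings block is truncated before the middleware section:

    RETRY_TIMES = 10
    # retry (and rotate proxies) on these responses; 429 added for this case
    RETRY_HTTP_CODES = [429, 500, 503, 504, 400, 403, 404, 408]

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'scrapy_proxies.RandomProxy': 100,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    }

    PROXY_LIST = '/path/to/proxylist.txt'  # placeholder path
    PROXY_MODE = 0  # 0 = pick a random proxy per request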

Scrapy: populate items with item loaders over multiple pages

Submitted by 左心房为你撑大大i on 2019-12-22 00:27:59
Question: I'm trying to crawl and scrape multiple pages, given multiple URLs. I am testing with Wikipedia, and to make it easier I just used the same XPath selector for each page, but I eventually want to use many different XPath selectors unique to each page, so each page has its own separate parsePage method. This code works perfectly when I don't use item loaders and just populate items directly. When I use item loaders, the items are populated strangely, and it seems to be completely ignoring the
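
A common pattern for loaders across pages is to load the partial item on each page and re-wrap it in a fresh ItemLoader bound to the next response. A minimal sketch, assuming hypothetical names (MyItem, next_url, the field names) and a two-page chain of spider callbacks:

    from scrapy import Request
    from scrapy.loader import ItemLoader

    def parse(self, response):
        loader = ItemLoader(item=MyItem(), response=response)
        loader.add_xpath('title', '//h1/text()')
        # carry the partially filled item forward, not the loader itself
        yield Request(next_url, callback=self.parse_page2,
                      meta={'item': loader.load_item()})

    def parse_page2(self, response):
        # new loader bound to THIS response, wrapping the carried item
        loader = ItemLoader(item=response.meta['item'], response=response)
        loader.add_xpath('body', '//p/text()')
        yield loader.load_item()

Passing the loader object itself through meta tends to misbehave because it stays bound to the earlier response; carrying the item and re-wrapping it, as above, is the commonly cited workaround.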

How to get the followers of a person as well as comments under the photos on Instagram using Scrapy?

Submitted by 牧云@^-^@ on 2019-12-21 22:40:27
Question: As you can see, the following JSON has the number of followers as well as the number of comments, but how can I access the data within each comment, as well as the IDs of followers, so I could crawl into them?

    {
        "logging_page_id": "profilePage_20327023",
        "user": {
            "biography": null,
            "blocked_by_viewer": false,
            "connected_fb_page": null,
            "country_block": false,
            "external_url": null,
            "external_url_linkshimmed": null,
            "followed_by": {
                "count": 2585
            },
            "followed_by_viewer": false,
            "follows": {
                "count": 561
            },
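
For the part visible in this snippet, reading the counts is plain dictionary navigation once the JSON is parsed; a minimal sketch, assuming `response` holds this document (note the per-follower IDs and per-comment text are not present in this payload and would require Instagram's paginated endpoints):

    import json

    data = json.loads(response.text)
    user = data['user']
    followers = user['followed_by']['count']  # 2585 in the sample
    following = user['follows']['count']      # 561 in the sample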