scrapy

Scrapy - doesn't crawl

Submitted by 孤者浪人 on 2020-01-04 14:07:08
Question: I'm trying to get a recursive crawl running, and since the one I wrote wasn't working, I pulled an example from the web and tried that instead. I really don't know where the problem is, but the crawl doesn't display any ERRORS. Can anyone help me with this? Also, is there any step-by-step debugging tool to help understand the crawl flow of a spider? Any help regarding this is greatly appreciated. MacBook:spiders hadoop$ scrapy crawl craigs -o items.csv -t csv /System/Library/Frameworks/Python.framework
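As for step-by-step debugging, Scrapy ships two built-in tools that cover most of this (both commands are stock Scrapy; the URL is whatever page you are investigating):

    # Open an interactive shell with the downloaded response preloaded,
    # so selectors can be tested by hand:
    scrapy shell "http://example.com/some-page"

    # Run a single spider callback against one URL and print the items
    # and follow-up requests it returns:
    scrapy parse --spider=craigs --callback=parse "http://example.com/some-page"

If scrapy parse prints no items and no requests, the callback is running but its selectors match nothing, which is the usual reason a crawl "does nothing" without errors.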

Scrapy spider finishing scraping process without scraping anything

Submitted by 房东的猫 on 2020-01-04 13:58:26
Question: I have a spider that scrapes Amazon for information. The spider reads a .txt file in which I write which product it must search for, and then it opens the Amazon page for that product, for example: https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=laptop I use keyword=laptop to change which product to search for. The issue I'm having is that the spider just does not work, which is strange because a week ago it did its job just fine. Also, no errors appear
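A minimal sketch of the keyword-file pattern described above (the spider name, file name, and logging are assumptions, not the asker's code). When a spider like this suddenly stops yielding items with no errors, the usual suspect is Amazon serving a robot-check page instead of results, which is why the sketch logs the page title:

    import scrapy

    class AmazonSearchSpider(scrapy.Spider):
        name = "amazon_search"  # hypothetical name

        def start_requests(self):
            # keywords.txt is a hypothetical file: one search term per line.
            with open("keywords.txt") as f:
                for keyword in (line.strip() for line in f if line.strip()):
                    url = ("https://www.amazon.com/s/ref=nb_sb_noss_2"
                           "?url=search-alias%3Daps&field-keywords=" + keyword)
                    yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # If this prints "Robot Check", the spider is being blocked,
            # not broken: the selectors simply have nothing to match.
            self.logger.info("Title: %s", response.xpath("//title/text()").get())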

XPath syntax: How to get the child div information based on parent div

Submitted by 落花浮王杯 on 2020-01-04 06:01:10
Question: The result from my Scrapy project looks like this:

    <div class="news_li">...</div>
    <div class="news_li">...</div>
    <div class="news_li">...</div>
    ...
    <div class="news_li">...</div>

And each "news_li" element looks like this:

    <div class="news_li">
        <div class="a">
            <a href="aaa">
                <div class="a1"></div>
            </a>
        </div>
        <a href="xxx">
            <div class="b">
                <div class="b1"></div>
                <div class="b2"></div>
                <div class="b3"></div>
            </div>
        </a>
    </div>

I am trying to extract information one at a time in the scrapy shell
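For markup of this shape, the usual pattern is to iterate the parent divs and use relative XPaths (./...) so each query stays scoped to its own news_li. The queries below use the class names from the question:

    # inside `scrapy shell <url>`
    for li in response.xpath('//div[@class="news_li"]'):
        inner_href = li.xpath('./div[@class="a"]/a/@href').get()   # "aaa"
        outer_href = li.xpath('./a/@href').get()                   # "xxx"
        b_children = li.xpath('./a/div[@class="b"]/div')           # b1, b2, b3

Starting a nested XPath with // instead of ./ would search the whole document and return the same values for every parent, which is a common trip-up here.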

An analysis of Python data collection and multithreading efficiency

Submitted by 好久不见. on 2020-01-04 04:54:00
I used to write crawlers in PHP, using Snoopy together with simple_html_dom; that combination worked well enough, at least for solving the problems at hand. PHP, however, has never had a good multithreading mechanism. You can use various tricks to get a parallel effect (for example, leaning on an Apache or nginx server, forking a child process, or dynamically generating multiple PHP scripts and running them as separate processes), but none of these are comfortable to use, either in terms of code structure or in the complexity involved. I have also heard of pthreads, a PHP extension that provides real multithreading; its GitHub page describes it like this: "Absolutely, this is not a hack, we don't use forking or any other such nonsense, what you create are honest to goodness posix threads that are completely compatible with PHP and safe ... this is true multi-threading :)". But I digress; PHP is not the subject of this article. Since I had decided to try data collection with Python, I definitely also wanted to learn about Python's multithreading. I had long heard various experts say how good Python is, but without actually trying it once I had no way of telling for myself where Python's concrete advantages lie, or which problems Python is the right tool for.
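As a baseline for the efficiency comparison the title promises, here is a minimal sketch of a multi-threaded fetcher in Python (the URL list and the use of urllib are illustrative, not from the original post):

    import threading
    import urllib.request

    urls = ["https://example.com/page%d" % i for i in range(1, 9)]  # illustrative

    def fetch(url):
        # Download one page and report its size; errors are printed rather
        # than raised, so one bad URL does not kill the whole run.
        try:
            body = urllib.request.urlopen(url, timeout=10).read()
            print(url, len(body), "bytes")
        except OSError as exc:
            print(url, "failed:", exc)

    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Because page downloads are I/O-bound, threads like these overlap network waits even under the GIL, which is exactly where Python's threading pays off for collection work.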

How to get double quotes in Scrapy .csv results

Submitted by 廉价感情. on 2020-01-04 03:18:09
Question: I have a problem with quoting in the output I get from Scrapy. I am trying to scrape data that contains commas, and this results in double quotes around some columns, like so:

    TEST,TEST,TEST,ON,TEST,TEST,"$2,449,000, 4,735 Sq Ft, 6 Bed, 5.1 Bath, Listed 03/01/2016"
    TEST,TEST,TEST,ON,TEST,TEST,"$2,895,000, 4,975 Sq Ft, 5 Bed, 4.1 Bath, Listed 01/03/2016"

Only the columns that contain commas get double-quoted. How can I double-quote all my data columns? I want Scrapy to output: "TEST","TEST","TEST","ON","TEST"
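One plausible fix (a sketch, not necessarily the accepted answer here) is a custom exporter that forces csv.QUOTE_ALL; CsvItemExporter passes extra keyword arguments through to csv.writer:

    # myproject/exporters.py  (module path is illustrative)
    import csv
    from scrapy.exporters import CsvItemExporter

    class QuoteAllCsvItemExporter(CsvItemExporter):
        def __init__(self, *args, **kwargs):
            # Extra kwargs reach csv.writer, so this wraps every
            # field in double quotes, not just those with commas.
            kwargs["quoting"] = csv.QUOTE_ALL
            super().__init__(*args, **kwargs)

    # settings.py: route the csv feed format through the new exporter
    FEED_EXPORTERS = {
        "csv": "myproject.exporters.QuoteAllCsvItemExporter",
    }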

Using scrapy to extract XHR request?

Submitted by 淺唱寂寞╮ on 2020-01-04 02:38:13
Question: I'm trying to scrape social like counts that are generated with JavaScript. I can scrape the desired data if I reference the XHR URL directly, but the site I am trying to scrape dynamically generates these XMLHttpRequests with query-string parameters that I don't know how to extract. For example, you can see that the m, p, i, and g parameters, unique to each page, are used to construct the request URL. Here is the assembled URL: http://aeon.co/magazine/social/social.php
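The standard approach is to pull those parameters out of the article page yourself and assemble the XHR URL in the spider. A sketch under the assumption that the values appear in an inline script (the start URL and the regex are guesses at the page's markup, to be adjusted after inspecting the real HTML):

    import re
    import scrapy
    from urllib.parse import urlencode

    class SocialCountSpider(scrapy.Spider):
        name = "social_counts"  # hypothetical name
        start_urls = ["http://aeon.co/magazine/"]  # placeholder article page

        def parse(self, response):
            # Assumption: the page embeds social.php?m=..&p=..&i=..&g=..
            # in a script tag; adapt the pattern to the actual source.
            params = dict(re.findall(r"[?&](m|p|i|g)=([^&'\"]+)", response.text))
            url = "http://aeon.co/magazine/social/social.php?" + urlencode(params)
            yield scrapy.Request(url, callback=self.parse_counts)

        def parse_counts(self, response):
            # The XHR response carries the like counts.
            self.logger.info("XHR body: %s", response.text[:200])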

scrapy unable to make Request() callback

Submitted by 孤街醉人 on 2020-01-04 01:50:50
Question: I am trying to write a recursive parsing script with Scrapy, but the Request() function doesn't call the callback function suppose_to_parse(), nor any other function provided as the callback value. I tried different variations, but none of them work. Where should I dig?

    from scrapy.http import Request
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class joomler(BaseSpider):
        name = "scrapy"
        allowed_domains = ["scrapy.org"]
        start_urls = ["http://blog.scrapy.org/"]
        def parse(self,
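The excerpt cuts off inside parse(), but the two classic causes of this symptom are building a Request without yielding it and having the offsite middleware silently drop requests that fall outside allowed_domains. A sketch of what a working parse() looks like with this old BaseSpider API (the XPath is illustrative):

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # Constructing a Request does nothing by itself: it must be
            # yielded (or returned in a list) for Scrapy to schedule it.
            # Relative hrefs also need urljoin(response.url, href) first,
            # and off-domain links are dropped without any error message.
            yield Request(href, callback=self.suppose_to_parse)

    def suppose_to_parse(self, response):
        print(response.url)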

Force scrapy to crawl link in order they appear

Submitted by 守給你的承諾、 on 2020-01-04 01:50:28
Question: I'm writing a spider with Scrapy to crawl a website whose index page is a list of links like www.link1.com, www.link2.com, www.link3.com. That site is updated very often, so my crawler is part of a process that runs every hour, but I would like to crawl only the new links that I haven't crawled yet. My problem is that Scrapy randomises the order in which it treats each link when going deep. Is it possible to force Scrapy to crawl in order, like 1, then 2, then 3, so that I can save the last link
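Scrapy's default scheduler is actually LIFO (depth-first) rather than random, but from the outside the effect looks the same. Two standard knobs, sketched below with setting names from current Scrapy (verify against your version): assign decreasing priorities per link, or switch the scheduler queues to FIFO for breadth-first order:

    # Option 1, in the spider: earlier links get higher priority.
    def parse(self, response):
        links = response.xpath('//a/@href').getall()
        for i, href in enumerate(links):
            yield response.follow(href, callback=self.parse_item,
                                  priority=len(links) - i)

    # Option 2, in settings.py: breadth-first, FIFO scheduling.
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
    SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

parse_item here is a hypothetical callback; the point is only the priority argument and the queue settings.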

Scrapy on Windows XP ImportError: No module named w3lib.html

Submitted by 微笑、不失礼 on 2020-01-03 20:07:29
Question: I just tried installing and running Scrapy on my PC at work, which runs Windows XP. If I run scrapy startproject myproject I get the following error: ImportError: No module named w3lib.html Whining: it's really troublesome running Python / Scrapy on Windows XP. On Linux I just run pip install Scrapy and it's fine, lol. Answer 1: It appears they forgot to list w3lib and simplejson. The latter is only required for Python versions before 2.6. Here's an installer for Distribute, in case you don't
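Assuming the missing dependencies are the whole problem, installing them directly clears the ImportError (simplejson only matters on Python versions before 2.6, per the answer):

    pip install w3lib
    pip install simplejson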
