scrapy

How can I reuse the parse method of my scrapy Spider-based spider in an inheriting CrawlSpider?

Submitted by 久未见 on 2020-01-03 17:22:46
Question: I currently have a Spider-based spider that I wrote for crawling an input JSON array of start_urls:

    from scrapy.spider import Spider
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from foo.items import AtlanticFirearmsItem
    from scrapy.contrib.loader import ItemLoader
    import json
    import datetime
    import re

    class AtlanticFirearmsSpider(Spider):
        name = "atlantic_firearms"
        allowed_domains = ["atlanticfirearms.com"]

        def __init_
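A sketch of the usual answer, for reference: CrawlSpider reserves parse() for its own rule dispatching, so the shared extraction logic has to live under another name that both spiders can call. The spider name, start URL, and link-extractor pattern below are placeholders, and the imports use the modern paths rather than the question's pre-1.0 scrapy.contrib ones.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class AtlanticFirearmsCrawlSpider(CrawlSpider):
        name = "atlantic_firearms_crawl"
        allowed_domains = ["atlanticfirearms.com"]
        start_urls = ["http://www.atlanticfirearms.com/"]

        # the allow pattern is a hypothetical placeholder
        rules = (
            Rule(LinkExtractor(allow=r"/catalogue/"),
                 callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            # shared logic; the Spider-based class can call this method
            # directly as its own callback, since it is not named parse()
            yield {"url": response.url}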

How do I use BeautifulSoup4 to get ALL text before <br> tag

Submitted by 不打扰是莪最后的温柔 on 2020-01-03 17:12:31
Question: I'm trying to scrape some data for my app. Here is the HTML code:

    <tr>
      <td>
        This <a class="tip info" href="blablablablabla">is a first</a> sentence.
        <br>
        This <a class="tip info" href="blablablablabla">is a second</a> sentence.
        <br>This <a class="tip info" href="blablablablabla">is a third</a> sentence.
        <br>
      </td>
    </tr>

I want the output to look like this:

    This is a first sentence.
    This is a second sentence.
    This is a third sentence.

Is it possible to do that?

Answer 1: Try this.
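One way to get there, as a sketch assuming BeautifulSoup 4 and the markup quoted above: convert each <br> into a newline, then split the cell's text on newlines.

    from bs4 import BeautifulSoup

    html = """<tr><td> This <a class="tip info" href="#">is a first</a> sentence.
    <br> This <a class="tip info" href="#">is a second</a> sentence.
    <br>This <a class="tip info" href="#">is a third</a> sentence. <br></td></tr>"""

    soup = BeautifulSoup(html, "html.parser")
    # turn each <br> into a newline, then split the cell text on newlines
    for br in soup.find_all("br"):
        br.replace_with("\n")
    lines = [line.strip() for line in soup.td.get_text().split("\n")]
    sentences = [line for line in lines if line]
    print(sentences)
    # ['This is a first sentence.', 'This is a second sentence.',
    #  'This is a third sentence.']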

Normalize whitespace with Python

Submitted by 不问归期 on 2020-01-03 17:07:41
Question: I'm building a data extract using Scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:

      Sapphire RX460 OC  2/4GB

Notice the two groups of two spaces: one preceding the string literal and one between "OC" and "2". Python provides strip, as described in "How do I trim whitespace with Python?", but that won't handle the two spaces between "OC" and "2", which I need collapsed into a single space. I've tried using normalize-space() from XPath while extracting data with
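For reference, the standard Python idiom for this (independent of Scrapy) is str.split() with no argument, which splits on any run of whitespace, followed by a single-space join:

    raw = "  Sapphire RX460 OC  2/4GB"
    clean = " ".join(raw.split())   # collapses all whitespace runs and trims
    print(repr(clean))              # 'Sapphire RX460 OC 2/4GB'

    # equivalent with a regular expression:
    import re
    clean = re.sub(r"\s+", " ", raw).strip()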

Scrapy ERROR: Error downloading - Could not open CONNECT tunnel

Submitted by ε祈祈猫儿з on 2020-01-03 16:54:56
Question: I have written a spider to crawl https://tecnoblog.net/categoria/review/, but when I run it, there is one error:

    2015-05-19 15:13:20+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
    2015-05-19 15:13:20+0100 [scrapy] INFO: Optional features available: ssl, http11
    2015-05-19 15:13:20+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'reviews'}
    2015-05-19 15:13:20+0100
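Since the log is cut off, only a hedged pointer is possible: a common cause of "Could not open CONNECT tunnel" is stray http_proxy/https_proxy environment variables, which Scrapy's built-in HttpProxyMiddleware picks up automatically, tunneling HTTPS requests through a proxy that rejects the CONNECT. A quick diagnostic sketch:

    import os

    # if any of these are set, Scrapy's HttpProxyMiddleware routes
    # requests through them without any setting in the project itself
    for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
        value = os.environ.get(var)
        if value:
            print("%s=%s" % (var, value))

If a stray proxy turns up, unsetting the variable, or disabling the middleware for this Scrapy version ('scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None in DOWNLOADER_MIDDLEWARES), is worth trying.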

Scrapy rules not working when process_request and callback parameter are set

Submitted by 人走茶凉 on 2020-01-03 15:54:52
Question: I have this rule for a Scrapy CrawlSpider:

    rules = [
        Rule(LinkExtractor(
                 allow='/topic/\d+/organize$',
                 restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'),
             process_request='request_tagPage',
             callback='parse_tagPage',
             follow=True)
    ]

request_tagPage() refers to a function that adds a cookie to requests, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make the requests, and once responses are returned,
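A hedged guess at what usually breaks this setup: CrawlSpider attaches its own callback and metadata to the request before process_request runs, so building a brand-new Request inside request_tagPage throws that away and responses fall back to the spider's default parse(). Returning a modified copy of the incoming request keeps everything intact; the cookie values below are placeholders:

    def request_tagPage(self, request):
        # request.replace() keeps the callback/meta the Rule set up and
        # only swaps in the cookies; a fresh Request() would lose them
        return request.replace(cookies={"example_cookie": "placeholder"})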

Retrying a Scrapy Request even when receiving a 200 status code

Submitted by 浪子不回头ぞ on 2020-01-03 09:24:54
Question: There is a website I'm scraping that will sometimes return a 200 but not have any text in response.body (which raises an AttributeError when I try to parse it with Selector). Is there a simple way to check that the body includes text, and if not, retry the request until it does? Here is some pseudocode outlining what I'm trying to do:

    def check_response(response):
        if response.body != '':
            return response
        else:
            return Request(copy_of_response.request, callback=check_response)

Basically, is
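A runnable variant of that pseudocode, as a sketch (parse_page is a hypothetical stand-in for the real parsing callback). The two key details are reusing response.request rather than building a new one, and setting dont_filter so the duplicate filter doesn't drop the retry:

    def check_response(self, response):
        if response.body:
            return self.parse_page(response)  # hypothetical real parser
        # re-issue the same request; dont_filter stops the dupe filter
        # from discarding it, and the callback stays check_response
        return response.request.replace(dont_filter=True)

In practice a retry counter kept in request.meta keeps this from looping forever on a page that is permanently empty.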

install scrapy on win 7 (64-bit)

Submitted by 早过忘川 on 2020-01-03 09:10:14
Question: I'm trying to install Scrapy for Python 2.6, but it does not seem to be going well. Here are the packages installed:

    G:\Python26\Scripts>pip freeze
    Scrapy==0.16.4
    Twisted==12.3.0
    libxml2-python==2.7.7
    lxml==2.3.6
    pyopenssl==0.13
    w3lib==1.2
    zope.interface==3.8.0

I also have iconv and zlib. And this is the log from installing Scrapy with pip. I don't know what I should do next; am I missing something? I need instructions, thank you. Windows 7 64-bit, Visual C++ installed.

    C:\Users\d>pip install scrapy
    Downloading
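Since the pip log is cut off, this is only an assumption: one Windows-specific dependency is absent from the list above, as Scrapy of the 0.16 vintage also required pywin32 on Windows. A quick check:

    # if this import fails, install pywin32 for the matching Python
    # version before running Scrapy on Windows
    try:
        import win32api  # provided by pywin32
    except ImportError:
        print("pywin32 is missing")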

Issue packaging scrapy spider with cx_Freeze or py2exe

Submitted by 拈花ヽ惹草 on 2020-01-03 05:34:08
Question: I've created a scraper with Scrapy and wxPython which works as expected, exporting a file with results to the desktop in CSV format. I'm attempting to package this into an executable with cx_Freeze using the command-prompt line below:

    cxfreeze ItemStatusChecker.py --target-dir dist

This seems to work fine, building the dist directory with ItemStatusChecker.exe. However, when I open ItemStatusChecker.exe, I get the error below in the command prompt and my GUI does not launch:

    Traceback (most
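With the traceback cut off, only a general sketch is possible: cx_Freeze frequently misses packages that Scrapy and Twisted import dynamically, and listing them explicitly in a setup script (instead of the bare cxfreeze command) is the usual first step. Everything below beyond the script name from the question is an assumption:

    # setup.py -- build with: python setup.py build
    from cx_Freeze import setup, Executable

    setup(
        name="ItemStatusChecker",
        version="0.1",
        options={
            "build_exe": {
                # packages cx_Freeze often fails to detect on its own
                "packages": ["scrapy", "twisted", "lxml", "wx"],
            }
        },
        executables=[Executable("ItemStatusChecker.py")],
    )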

Portia/Scrapy - how to replace or add values to output JSON

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-03 05:28:04
Question: Just two quick questions:

1. I want my final JSON file to replace the extracted text (for example, the extracted text is "ADD TO CART" but I want to change it to "IN STOCK" in my final JSON). Is that possible?

2. I would also like to add some custom data to my final JSON file that is not on the website, for example a "Store name", so every product that I scrape will have the store name after it. Is that possible?

I am using both Portia and Scrapy, so your suggestions are welcome for both platforms. My Scrapy spider
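Both points are standard item-pipeline territory in Scrapy, and Portia-generated projects are Scrapy projects underneath, so the same pipeline applies to both. A sketch, where the field name "availability" and the store-name value are assumptions:

    class NormalizeItemPipeline(object):
        """Rewrites scraped values and injects constant fields."""

        def process_item(self, item, spider):
            # 1) replace extracted text in the final output
            if item.get("availability") == "ADD TO CART":
                item["availability"] = "IN STOCK"
            # 2) add data that never appears on the website
            item["store_name"] = "My Store"
            return item

Enabling the pipeline via ITEM_PIPELINES in settings.py makes it run on every scraped item before the JSON exporter sees it.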