scrapy

How can I reuse the parse method of my scrapy Spider-based spider in an inheriting CrawlSpider?

Submitted by 久未见 on 2020-01-03 17:22:46
Question: I currently have a Spider-based spider that I wrote for crawling an input JSON array of start_urls:

    from scrapy.spider import Spider
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from foo.items import AtlanticFirearmsItem
    from scrapy.contrib.loader import ItemLoader
    import json
    import datetime
    import re

    class AtlanticFirearmsSpider(Spider):
        name = "atlantic_firearms"
        allowed_domains = ["atlanticfirearms.com"]

        def __init_
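A sketch of the usual answer, for reference: CrawlSpider reserves parse() for its own rule dispatching, so the shared extraction logic has to live under another name that both spiders can call. The spider name, start URL, and link-extractor pattern below are placeholders, and the imports use the modern paths rather than the question's pre-1.0 scrapy.contrib ones.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class AtlanticFirearmsCrawlSpider(CrawlSpider):
        name = "atlantic_firearms_crawl"
        allowed_domains = ["atlanticfirearms.com"]
        start_urls = ["http://www.atlanticfirearms.com/"]

        # the allow pattern is a hypothetical placeholder
        rules = (
            Rule(LinkExtractor(allow=r"/catalogue/"),
                 callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            # shared logic; the Spider-based class can call this method
            # directly as its own callback, since it is not named parse()
            yield {"url": response.url}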

How do I use BeautifulSoup4 to get ALL text before <br> tag

Submitted by 不打扰是莪最后的温柔 on 2020-01-03 17:12:31
Question: I'm trying to scrape some data for my app. Here is the HTML code:

    <tr>
      <td>
        This <a class="tip info" href="blablablablabla">is a first</a> sentence.
        <br>
        This <a class="tip info" href="blablablablabla">is a second</a> sentence.
        <br>This <a class="tip info" href="blablablablabla">is a third</a> sentence.
        <br>
      </td>
    </tr>

I want the output to look like this:

    This is a first sentence.
    This is a second sentence.
    This is a third sentence.

Is it possible to do that?

Answer 1: Try this.
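One way to get there, as a sketch assuming BeautifulSoup 4 and the markup quoted above: convert each <br> into a newline, then split the cell's text on newlines.

    from bs4 import BeautifulSoup

    html = """<tr><td> This <a class="tip info" href="#">is a first</a> sentence.
    <br> This <a class="tip info" href="#">is a second</a> sentence.
    <br>This <a class="tip info" href="#">is a third</a> sentence. <br></td></tr>"""

    soup = BeautifulSoup(html, "html.parser")
    # turn each <br> into a newline, then split the cell text on newlines
    for br in soup.find_all("br"):
        br.replace_with("\n")
    lines = [line.strip() for line in soup.td.get_text().split("\n")]
    sentences = [line for line in lines if line]
    print(sentences)
    # ['This is a first sentence.', 'This is a second sentence.',
    #  'This is a third sentence.']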

Normalize whitespace with Python

Submitted by 不问归期 on 2020-01-03 17:07:41
Question: I'm building a data extract using Scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:

      Sapphire RX460 OC  2/4GB

Notice the two groups of two spaces: one preceding the string literal and one between "OC" and "2". Python provides strip, as described in "How do I trim whitespace with Python?", but that won't handle the two spaces between "OC" and "2", which I need collapsed into a single space. I've tried using normalize-space() from XPath while extracting data with
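For reference, the standard Python idiom for this (independent of Scrapy) is str.split() with no argument, which splits on any run of whitespace, followed by a single-space join:

    raw = "  Sapphire RX460 OC  2/4GB"
    clean = " ".join(raw.split())   # collapses all whitespace runs and trims
    print(repr(clean))              # 'Sapphire RX460 OC 2/4GB'

    # equivalent with a regular expression:
    import re
    clean = re.sub(r"\s+", " ", raw).strip()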

Scrapy ERROR: Error downloading - Could not open CONNECT tunnel

Submitted by ε祈祈猫儿з on 2020-01-03 16:54:56
Question: I have written a spider to crawl https://tecnoblog.net/categoria/review/, but when I run it, there is one error:

    2015-05-19 15:13:20+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
    2015-05-19 15:13:20+0100 [scrapy] INFO: Optional features available: ssl, http11
    2015-05-19 15:13:20+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'reviews'}
    2015-05-19 15:13:20+0100
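Since the log is cut off, only a hedged pointer is possible: a common cause of "Could not open CONNECT tunnel" is stray http_proxy/https_proxy environment variables, which Scrapy's built-in HttpProxyMiddleware picks up automatically, tunneling HTTPS requests through a proxy that rejects the CONNECT. A quick diagnostic sketch:

    import os

    # if any of these are set, Scrapy's HttpProxyMiddleware routes
    # requests through them without any setting in the project itself
    for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
        value = os.environ.get(var)
        if value:
            print("%s=%s" % (var, value))

If a stray proxy turns up, unsetting the variable, or disabling the middleware for this Scrapy version ('scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None in DOWNLOADER_MIDDLEWARES), is worth trying.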

Scrapy rules not working when process_request and callback parameter are set

Submitted by 人走茶凉 on 2020-01-03 15:54:52
Question: I have this rule for a Scrapy CrawlSpider:

    rules = [
        Rule(LinkExtractor(
                 allow='/topic/\d+/organize$',
                 restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'),
             process_request='request_tagPage',
             callback='parse_tagPage',
             follow=True)
    ]

request_tagPage() refers to a function that adds a cookie to requests, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make the requests, and once responses are returned,
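A hedged guess at what usually breaks this setup: CrawlSpider attaches its own callback and metadata to the request before process_request runs, so building a brand-new Request inside request_tagPage throws that away and responses fall back to the spider's default parse(). Returning a modified copy of the incoming request keeps everything intact; the cookie values below are placeholders:

    def request_tagPage(self, request):
        # request.replace() keeps the callback/meta the Rule set up and
        # only swaps in the cookies; a fresh Request() would lose them
        return request.replace(cookies={"example_cookie": "placeholder"})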

Retrying a Scrapy Request even when receiving a 200 status code

Submitted by 浪子不回头ぞ on 2020-01-03 09:24:54
Question: There is a website I'm scraping that will sometimes return a 200 but not have any text in response.body (which raises an AttributeError when I try to parse it with Selector). Is there a simple way to check that the body includes text, and if not, retry the request until it does? Here is some pseudocode outlining what I'm trying to do:

    def check_response(response):
        if response.body != '':
            return response
        else:
            return Request(copy_of_response.request, callback=check_response)

Basically, is
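A runnable variant of that pseudocode, as a sketch (parse_page is a hypothetical stand-in for the real parsing callback). The two key details are reusing response.request rather than building a new one, and setting dont_filter so the duplicate filter doesn't drop the retry:

    def check_response(self, response):
        if response.body:
            return self.parse_page(response)  # hypothetical real parser
        # re-issue the same request; dont_filter stops the dupe filter
        # from discarding it, and the callback stays check_response
        return response.request.replace(dont_filter=True)

In practice a retry counter kept in request.meta keeps this from looping forever on a page that is permanently empty.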

install scrapy on win 7 (64-bit)

Submitted by 早过忘川 on 2020-01-03 09:10:14
Question: I'm trying to install Scrapy for Python 2.6, but it does not seem to be going well. Here are the packages installed:

    G:\Python26\Scripts>pip freeze
    Scrapy==0.16.4
    Twisted==12.3.0
    libxml2-python==2.7.7
    lxml==2.3.6
    pyopenssl==0.13
    w3lib==1.2
    zope.interface==3.8.0

I also have iconv and zlib. And this is the log from installing Scrapy with pip. I don't know what I should do next; am I missing something? I need instructions, thank you. Windows 7 64-bit, Visual C++ installed.

    C:\Users\d>pip install scrapy
    Downloading
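Since the pip log is cut off, this is only an assumption: one Windows-specific dependency is absent from the list above, as Scrapy of the 0.16 vintage also required pywin32 on Windows. A quick check:

    # if this import fails, install pywin32 for the matching Python
    # version before running Scrapy on Windows
    try:
        import win32api  # provided by pywin32
    except ImportError:
        print("pywin32 is missing")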

Issue packaging scrapy spider with cx_Freeze or py2exe

Submitted by 拈花ヽ惹草 on 2020-01-03 05:34:08
Question: I've created a scraper with Scrapy and wxPython which works as expected, exporting a file with results to the desktop in CSV format. I'm attempting to package this into an executable with cx_Freeze using the command-prompt line below:

    cxfreeze ItemStatusChecker.py --target-dir dist

This seems to work fine, building the dist directory with ItemStatusChecker.exe. However, when I open ItemStatusChecker.exe, I get the error below in the command prompt and my GUI does not launch:

    Traceback (most
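With the traceback cut off, only a general sketch is possible: cx_Freeze frequently misses packages that Scrapy and Twisted import dynamically, and listing them explicitly in a setup script (instead of the bare cxfreeze command) is the usual first step. Everything below beyond the script name from the question is an assumption:

    # setup.py -- build with: python setup.py build
    from cx_Freeze import setup, Executable

    setup(
        name="ItemStatusChecker",
        version="0.1",
        options={
            "build_exe": {
                # packages cx_Freeze often fails to detect on its own
                "packages": ["scrapy", "twisted", "lxml", "wx"],
            }
        },
        executables=[Executable("ItemStatusChecker.py")],
    )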

Portia/Scrapy - how to replace or add values to output JSON

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-03 05:28:04
Question: Just two quick questions:

1. I want my final JSON file to replace the extracted text (for example, the extracted text is "ADD TO CART" but I want to change it to "IN STOCK" in my final JSON). Is that possible?

2. I would also like to add some custom data to my final JSON file that is not on the website, for example a "Store name", so every product that I scrape will have the store name after it. Is that possible?

I am using both Portia and Scrapy, so your suggestions are welcome for both platforms. My Scrapy spider
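Both points are standard item-pipeline territory in Scrapy, and Portia-generated projects are Scrapy projects underneath, so the same pipeline applies to both. A sketch, where the field name "availability" and the store-name value are assumptions:

    class NormalizeItemPipeline(object):
        """Rewrites scraped values and injects constant fields."""

        def process_item(self, item, spider):
            # 1) replace extracted text in the final output
            if item.get("availability") == "ADD TO CART":
                item["availability"] = "IN STOCK"
            # 2) add data that never appears on the website
            item["store_name"] = "My Store"
            return item

Enabling the pipeline via ITEM_PIPELINES in settings.py makes it run on every scraped item before the JSON exporter sees it.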