scrapy

Using scrapy to find specific text from multiple websites

只愿长相守 submitted on 2019-12-22 18:08:32
Question: I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't figure out how to add the specific keyword to search for. What the script needs to do is find the keyword and report in which link it was found. Could anyone point me to where I could read more about this? I have been reading Scrapy's documentation, but I can't seem to find this. Thank you.

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = [
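A minimal sketch of one way to do the keyword check (the spider name, domain, URLs and keyword below are placeholders, not taken from the question): test the page text in parse() and yield the URL whenever it matches.

import scrapy


class KeywordSpider(scrapy.Spider):
    # Hypothetical spider: names, domains and the keyword are placeholders.
    name = "keyword_check"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/page1", "https://example.com/page2"]
    keyword = "my keyword"

    def parse(self, response):
        # Join the visible text of the page and search it for the keyword.
        page_text = " ".join(response.xpath("//body//text()").getall())
        if self.keyword.lower() in page_text.lower():
            # Report the URL where the keyword was found.
            yield {"url": response.url, "keyword": self.keyword}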

TypeError when putting scraped data from scrapy into elasticsearch

寵の児 submitted on 2019-12-22 17:33:37
Question: I've been following this tutorial (http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html) and using this Scrapy Elasticsearch pipeline (https://github.com/knockrentals/scrapy-elasticsearch). I am able to export data from Scrapy to a JSON file and have an Elasticsearch server up and running on localhost. However, when I attempt to send scraped data into Elasticsearch using the pipeline, I get the following error:

2015-08-05 21:21:53 [scrapy] ERROR: Error processing {'link': [u
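The truncated item dump shows the fields arriving as lists rather than plain strings, and scrapy-elasticsearch can choke on that (for example when hashing a configured unique key). One possible fix, offered as an assumption rather than a confirmed diagnosis, is to flatten single-element list fields before the Elasticsearch pipeline runs:

# Hypothetical helper pipeline: flattens single-element list fields so that
# downstream pipelines such as scrapy-elasticsearch receive plain strings.
class FlattenSingleValuesPipeline(object):
    def process_item(self, item, spider):
        for field in list(item.keys()):
            value = item[field]
            if isinstance(value, list) and len(value) == 1:
                item[field] = value[0]
        return item

It would be registered in ITEM_PIPELINES with a lower order number than the Elasticsearch pipeline so that it runs first.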

Scrapy store returned items in variables to use in main script

江枫思渺然 submitted on 2019-12-22 14:05:03
Question: I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed their tutorial and changed the code for my purposes:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title  # This would work, but there should
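A sketch of one way to get scraped values back into the calling script without globals (the CSS selectors are assumptions): collect items through the item_scraped signal and read them after process.start() returns.

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"title": quote.css("span.text::text").get()}


results = []


def collect_item(item, response, spider):
    # Called once for every item the spider yields.
    results.append(item)


process = CrawlerProcess(settings={"LOG_ENABLED": False})
crawler = process.create_crawler(QuotesSpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl is finished

print(results)   # the scraped values are now ordinary Python objects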

How to retrieve data … the page is loaded using AJAX

匆匆过客 submitted on 2019-12-22 11:37:26
Question: I want to get the prices of mobile phones from this site: http://www.univercell.in/buy/SMART. Since the page is loaded using AJAX, I found the real start_url with Firebug and tried to test it with:

scrapy shell http://www.univercell.in/control/AjaxCategoryDetail?productCategoryId=PRO-SMART&category_id=PRO-SMART&attrName=&min=&max=&sortSearchPrice=&VIEW_INDEX=2&VIEW_SIZE=15&serachupload=&sortupload=

But I am not able to connect to this site. Can anyone suggest where I am going wrong?

Answer 1: How
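Two things are worth checking here. First, when that URL is passed to scrapy shell unquoted, the & characters are interpreted by the shell and the URL gets cut short, so it needs to be wrapped in quotes. Second, the AJAX endpoint can be crawled directly; a minimal sketch follows (the CSS selectors are assumptions and would have to be adapted to the HTML fragment the endpoint actually returns):

import scrapy


class UnivercellSpider(scrapy.Spider):
    name = "univercell_smart"
    # The AJAX endpoint from the question, used directly as the start URL.
    start_urls = [
        "http://www.univercell.in/control/AjaxCategoryDetail?productCategoryId=PRO-SMART"
        "&category_id=PRO-SMART&attrName=&min=&max=&sortSearchPrice="
        "&VIEW_INDEX=2&VIEW_SIZE=15&serachupload=&sortupload="
    ]

    def parse(self, response):
        # Placeholder selectors: inspect the returned fragment to find the
        # real product and price markup.
        for product in response.css("div.product"):
            yield {
                "name": product.css("a::text").get(),
                "price": product.css("span.price::text").get(),
            }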

Scrapy spider does not store state (persistent state)

两盒软妹~` submitted on 2019-12-22 11:12:52
Question: I have a basic spider that runs to fetch all links on a given domain. I want to make sure it persists its state so that it can resume from where it left off. I have followed the documentation at http://doc.scrapy.org/en/latest/topics/jobs.html. The first time I try it, it runs fine and I end it with Ctrl+C, but when I try to resume it, the crawl stops on the first URL itself. Below is the log when it ends:

2016-08-29 16:51:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 896,
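For reference, persistence hinges on two things: running the spider with the same JOBDIR every time, and stopping it gracefully (press Ctrl+C exactly once so the scheduler queue and dupefilter state are flushed to disk; pressing it twice forces an unclean shutdown). A sketch of setting it per spider, with placeholder names; the directory is arbitrary but must stay identical across runs:

import scrapy


class LinksSpider(scrapy.Spider):
    # Placeholder spider; the point is the JOBDIR setting, equivalent to
    # running "scrapy crawl links -s JOBDIR=crawls/links-1".
    name = "links"
    start_urls = ["https://example.com/"]
    custom_settings = {
        "JOBDIR": "crawls/links-1",  # keep this identical on every run to resume
    }

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)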

Unable to import items in scrapy

血红的双手。 submitted on 2019-12-22 10:55:22
Question: I have a very basic spider, following the instructions in the getting started guide, but for some reason trying to import my items into my spider returns an error. The spider and items code is shown below:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from myProject.items import item

class MyProject(BaseSpider):
    name = "spider"
    allowed_domains = ["website.com"]
    start_urls = [
        "website.com/start"
    ]

    def parse(self, response):
        print response.body

from scrapy
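An import like this only works if the name matches what items.py actually defines and the crawl is started from the project root (so the myProject package is importable). A sketch of how the two files usually line up, with placeholder names:

# myProject/items.py (sketch; the class name here is a placeholder)
from scrapy.item import Item, Field


class MyProjectItem(Item):
    title = Field()


# In the spider, the import must then use that exact class name:
# from myProject.items import MyProjectItem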

CrawlSpider with Splash getting stuck after first URL

妖精的绣舞 submitted on 2019-12-22 10:55:04
Question: I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider, and I need to render my start_url responses to feed it. Unfortunately, my crawl spider stops after rendering the first response. Any idea what is going wrong?

class VideoSpider(CrawlSpider):
    start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_items', process_request="use_splash",),
    )

    def use
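A hedged sketch of the process_request side, assuming scrapy-splash is installed and configured (SPLASH_URL plus its downloader middlewares): wrap the request the Rule built instead of creating a bare new one, so its callback and meta survive. A likely reason the crawl still stalls after the first page is that CrawlSpider only extracts follow-up links from HtmlResponse objects, and Splash-rendered responses are a different type, so _requests_to_follow usually needs adjusting as well; that part is not shown here.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class VideoSpider(CrawlSpider):
    name = "video"
    start_urls = ["https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2"]
    rules = (
        Rule(LinkExtractor(allow=()), callback="parse_items",
             process_request="use_splash", follow=True),
    )

    def use_splash(self, request):
        # The Rule already attached the right callback to `request`;
        # keep it (and the meta) when handing the URL to Splash.
        return SplashRequest(request.url, callback=request.callback,
                             meta=request.meta, args={"wait": 1.0})

    def parse_items(self, response):
        yield {"url": response.url}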

Scrapy: recursively scrape webpages and save content as HTML files

我与影子孤独终老i submitted on 2019-12-22 10:53:57
Question: I am using Scrapy to extract information from the tags of web pages and then save those webpages as HTML files. For example, http://www.austlii.edu.au/au/cases/cth/HCA/1945/ has some webpages related to judicial cases. I want to go to each link and save only the content related to the particular judicial case as an HTML page, e.g. go to http://www.austlii.edu.au/au/cases/cth/HCA/1945/1.html and then save the information related to that case. Is there a way to do this recursively in scrapy and save
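A minimal sketch of one way to do it (the allow pattern, output directory and file naming are assumptions): let a CrawlSpider follow the case links from the 1945 index and write each response body to disk. Trimming the page down to just the case content would then be a matter of selecting the relevant element before writing.

import os

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AustliiSpider(CrawlSpider):
    name = "austlii_hca_1945"
    allowed_domains = ["austlii.edu.au"]
    start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
    rules = (
        # Only follow the individual case pages under the 1945 index.
        Rule(LinkExtractor(allow=r"/au/cases/cth/HCA/1945/\d+\.html"),
             callback="save_case", follow=False),
    )

    def save_case(self, response):
        os.makedirs("cases", exist_ok=True)
        # e.g. http://.../1945/1.html -> cases/1.html
        filename = os.path.join("cases", response.url.rstrip("/").split("/")[-1])
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log("Saved %s" % filename)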

Custom scrapy launcher

走远了吗. submitted on 2019-12-22 10:32:58
A template I happened to write:

# This file is used to launch a specified spider
import configparser as cps
import os, time, sLogin, sys, base64
from scrapy import cmdline

# Path to the configuration file
ini_path = "E:\Code\Zhihu3.0\huxijun.ini"

class Scp():
    def makedir(self, conf, DirName):
        """Check whether the directory exists; create it if it does not, then return the path."""
        thePath = os.path.join(self.Root_path, conf["path"][DirName])
        if not os.path.exists(thePath):
            os.makedirs(thePath)
        return thePath

    def __init__(self):
        """Check whether login works, and return the logged-in user's id, name and cookies."""
        # Load the configuration file
        conf = cps.ConfigParser()
        conf.read(ini_path, encoding='utf-8')
        if str.isdigit(conf.get("path"
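For comparison, the core of such a launcher can be much smaller; a stripped-down sketch (the spider name is a placeholder, and the script must be run from inside the Scrapy project directory). Note that cmdline.execute never returns, so it should be the last call in the script.

from scrapy import cmdline

# Launch the spider exactly as "scrapy crawl myspider" would from the shell.
cmdline.execute("scrapy crawl myspider".split())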

Scrapy I/O block when downloading files

喜你入骨 submitted on 2019-12-22 10:32:05
Question: I am using Scrapy to scrape a website and download some files. Since the file_url I get redirects to another URL (302 redirect), I use another method, handle_redirect, to resolve the redirected URL, and I customise the files pipeline like this:

class MyFilesPipeline(FilesPipeline):
    def handle_redirect(self, file_url):
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        redirect_url = self
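Calling requests.head() inside the pipeline blocks Twisted's event loop, which is what makes the whole crawl appear to stall. Since Scrapy 1.3 the media pipelines can follow redirects themselves, so the blocking call may not be needed at all; a settings-only sketch, assuming the 302 target really is the file to download (the store path is a placeholder):

# settings.py sketch: let FilesPipeline follow the 302 itself instead of
# resolving it with a blocking requests.head() call.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "downloads"       # placeholder path
MEDIA_ALLOW_REDIRECTS = True    # available since Scrapy 1.3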