scrapy

Scraping from JavaScript using Scrapy

亡梦爱人 submitted on 2019-12-23 04:36:35

Question: I need to scrape the content of a JavaScript tag using Scrapy. The tag looks like this:

    <script type='text/javascript' id='script-id'>
    attribute={"pid":"123","title":"abc","url":"http://example.com","date":"2014-07-31 14:56:39 CDT","channels":["test"],"tags":[],"authors":["james Catcher"]};
    </script>

I can extract the content using this XPath:

    response.xpath('id("script-id")//text()').extract()

Output:

    [u'\nattribute = {"pid":"123","title":"abc","url":"http:/example.com","date":"2014-07-30 15:34:10 ","channels":[
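Since the value assigned to attribute is a JSON-compatible object literal, a lightweight option is a regular expression plus json.loads; no JavaScript engine is needed. A minimal sketch (the spider name and start URL are placeholders, not from the question):

    import json
    import re

    import scrapy


    class ScriptSpider(scrapy.Spider):
        name = "script_attribute"
        start_urls = ["http://example.com"]  # placeholder

        def parse(self, response):
            # Raw text content of the target <script> element.
            script_text = response.xpath('//script[@id="script-id"]/text()').extract_first() or ""
            # Capture everything between 'attribute=' and the closing '};'.
            match = re.search(r'attribute\s*=\s*(\{.*?\})\s*;', script_text, re.DOTALL)
            if match:
                yield json.loads(match.group(1))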

Struggling with XPath expression for Scrapy

心已入冬 submitted on 2019-12-23 04:12:17

Question: Below is part of an HTML page (all the parameter names are in Russian). It has a main class and two inner classes. The detailed HTML code:

    <div class="obj-params">
      <div class="wrap">
        <div class="obj-params-col" style="min-width:50%;">
          <p><b>Param1_name</b>" Param1_value"</p>
          <p><strong>Param2_name</strong>" Param2_value</p>
          <p><strong>Param3_name</strong>" Param3_value"</p>
        </div>
      </div>
      <div class="wrap">
        <div class="obj-params-col">
          <p><b>Param4_name</b>Param4_value<
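One hedged way to pair each name with the value that follows it, based only on the snippet above: select the <b>/<strong> elements and read the adjacent text node with following-sibling::text()[1]. The spider name and URL are placeholders:

    import scrapy


    class ParamsSpider(scrapy.Spider):
        name = "params"
        start_urls = ["http://example.com/object-page"]  # placeholder

        def parse(self, response):
            # Each <b>/<strong> inside the params block is a parameter name;
            # the first text node right after it carries the value.
            for label in response.xpath(
                    '//div[@class="obj-params"]//p/*[self::b or self::strong]'):
                yield {
                    "name": label.xpath('normalize-space(text())').extract_first(),
                    "value": label.xpath(
                        'normalize-space(following-sibling::text()[1])').extract_first(),
                }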

ImportError: cannot import name unwrap

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-23 03:13:35

Question: I have installed Scrapy with pip install scrapy. But in the Python shell I am getting an ImportError:

    >>> from scrapy.spider import Spider
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/scrapy/__init__.py", line 56, in <module>
        from scrapy.spider import Spider
      File "/usr/local/lib/python2.7/dist-packages/scrapy/spider.py", line 7, in <module>
        from scrapy.http import Request
      File "/usr/local/lib/python2.7/dist-packages/scrapy
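Errors of the form "cannot import name X" deep inside Scrapy's import chain usually mean one of its dependencies is older than the installed Scrapy expects; for this particular missing name, reports from the same era often pointed at a stale six, though that is an assumption here since the traceback above is truncated before the failing import. A quick diagnostic sketch:

    import pkg_resources

    # Print the versions of Scrapy's core dependency stack; compare them
    # against the versions your Scrapy release requires.
    for pkg in ("Scrapy", "six", "w3lib", "Twisted"):
        try:
            print("%s %s" % (pkg, pkg_resources.get_distribution(pkg).version))
        except pkg_resources.DistributionNotFound:
            print("%s not installed" % pkg)

If a package is stale, upgrading it (for example pip install --upgrade six) and retrying the import is the usual next step.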

Scrapy, Celery, and multiple spiders

∥☆過路亽.° submitted on 2019-12-23 02:55:09

Question: I'm using Scrapy and I'm trying to use Celery to manage multiple spiders on one machine. The problem I have (a bit difficult to explain) is that the spiders get multiplied: if my first spider starts and I then start a second spider, the first spider executes twice. See my code here:

ProcessJob.py

    class ProcessJob():
        def processJob(self, job):
            # update job
            mysql = MysqlConnector.Mysql()
            db = mysql.getConnection();
            cur = db.cursor();
            job.status = 1
            update = "UPDATE job SET status=1 WHERE id=
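A common explanation (hedged, since the entry is truncated) is that the Twisted reactor is global per worker process, so crawler state from one task leaks into the next. A widely used workaround is to give every crawl its own process; the broker URL and names below are placeholders:

    from multiprocessing import Process

    from celery import Celery
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker


    def _run_spider(spider_name, **kwargs):
        # Fresh process => fresh Twisted reactor, no state shared between jobs.
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name, **kwargs)
        process.start()  # blocks until the crawl finishes


    @app.task
    def crawl(spider_name, **kwargs):
        p = Process(target=_run_spider, args=(spider_name,), kwargs=kwargs)
        p.start()
        p.join()

Under a daemonized Celery worker, multiprocessing may refuse to fork; billiard.Process (Celery's own fork of multiprocessing) is the usual substitute.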

Django: redirect to results page after Scrapy finishes

谁说我不能喝 submitted on 2019-12-23 02:52:40

Question: I have a Django project with a Scrapy application. After the user fills in some form fields, I pass the data to the spider and crawl some pages. Everything works like a charm and the database is populated, except for one thing: when the user presses the submit button, the results page is blank, because the spider hasn't finished crawling yet and the data isn't in the database. How can I, inside the same Django view that called the spider, know that the crawl has finished? Here goes my
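One approach is a sketch assuming the crochet library (not mentioned in the question): run the Twisted reactor in a background thread and block the view until the crawl's Deferred fires, so the results page renders only after the data is in the database. The spider name and argument are placeholders:

    import crochet
    crochet.setup()  # starts the Twisted reactor in a background thread

    from django.shortcuts import render
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    runner = CrawlerRunner(get_project_settings())


    @crochet.wait_for(timeout=600)
    def run_spider(**spider_args):
        # Returns the crawl Deferred; wait_for turns it into a blocking call.
        return runner.crawl("my_spider", **spider_args)  # placeholder spider name


    def results_view(request):
        run_spider(query=request.POST.get("query"))  # placeholder argument
        # The crawl has finished by this point; the models are populated.
        return render(request, "results.html")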

How to scrape an XML feed with XMLFeedSpider

ぐ巨炮叔叔 submitted on 2019-12-23 02:51:18

Question: I am trying to scrape an XML file with the format below, file_sample.xml:

    <rss version="2.0">
      <channel>
        <item>
          <title>SENIOR BUDGET ANALYST (new)</title>
          <link>https://hr.example.org/psp/hrapp&SeqId=1</link>
          <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
          <category>All Open Jobs</category>
        </item>
        <item>
          <title>BUDGET ANALYST (healthcare)</title>
          <link>https://hr.example.org/psp/hrapp&SeqId=2</link>
          <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
          <category>All category</category>
        </item>
        <
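A minimal XMLFeedSpider sketch for that feed (modern import path; the feed URL is a placeholder). itertag points the spider at each <item> node and parse_node is called once per node:

    from scrapy.spiders import XMLFeedSpider


    class JobsSpider(XMLFeedSpider):
        name = "jobs"
        start_urls = ["https://hr.example.org/file_sample.xml"]  # placeholder
        iterator = "iternodes"  # the default; streams nodes one at a time
        itertag = "item"

        def parse_node(self, response, node):
            yield {
                "title": node.xpath("title/text()").extract_first(),
                "link": node.xpath("link/text()").extract_first(),
                "pubDate": node.xpath("pubDate/text()").extract_first(),
                "category": node.xpath("category/text()").extract_first(),
            }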

Get Scrapy results inside a Django view

时光总嘲笑我的痴心妄想 submitted on 2019-12-23 02:49:17

Question: I'm scraping a page successfully, and it returns a single item. I don't want to save the scraped item in the database or to a file; I need to get it inside a Django view. My view is as follows:

    def start_crawl(process_number, court):
        """
        Starts the crawler.

        Args:
            process_number (str): Process number to be found.
            court (str): Court of the process.
        """
        runner = CrawlerRunner(get_project_settings())
        results = list()

        def crawler_results(sender, parse_result, **kwargs):
            results.append
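A hedged sketch completing that pattern: connect a handler to the item_scraped signal, then chain a callback on the crawl Deferred that resolves with whatever was collected. crochet (not in the question) keeps the call synchronous; the spider name is a placeholder:

    import crochet
    crochet.setup()

    from pydispatch import dispatcher
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings


    @crochet.wait_for(timeout=300)
    def start_crawl(process_number, court):
        results = []

        def collect_item(item, response, spider):
            results.append(item)

        # item_scraped fires once for every item the spider yields.
        dispatcher.connect(collect_item, signal=signals.item_scraped)
        runner = CrawlerRunner(get_project_settings())
        deferred = runner.crawl("court_spider", process_number=process_number, court=court)
        # Resolve the Deferred with the collected items so the caller gets them.
        return deferred.addCallback(lambda _: results)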

Scrapy: getting data from links within a table

喜欢而已 submitted on 2019-12-23 02:46:13

Question: I am trying to scrape data from the HTML table on the Texas Death Row page. I am able to pull the existing data from the table using the spider script below:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from texasdeath.items import DeathItem

    class DeathSpider(BaseSpider):
        name = "death"
        allowed_domains = ["tdcj.state.tx.us"]
        start_urls = [
            "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
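The usual pattern for pulling data from pages behind the links is: start an item from each table row, follow the row's link, and finish the item in a second callback via response.meta. A hedged sketch using the modern Scrapy API; the XPaths, column index, and item field names are hypothetical, not verified against the live page:

    import scrapy
    from texasdeath.items import DeathItem


    class DeathSpider(scrapy.Spider):
        name = "death"
        allowed_domains = ["tdcj.state.tx.us"]
        start_urls = ["https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"]

        def parse(self, response):
            for row in response.xpath('//table//tr[td]'):
                item = DeathItem()
                item["name"] = row.xpath('td[5]/text()').extract_first()  # column index is a guess
                detail_url = row.xpath('td/a/@href').extract_first()
                if detail_url:
                    # Carry the half-built item to the detail page.
                    yield scrapy.Request(response.urljoin(detail_url),
                                         callback=self.parse_detail,
                                         meta={"item": item})

        def parse_detail(self, response):
            item = response.meta["item"]
            item["statement"] = " ".join(response.xpath('//p/text()').extract())  # placeholder XPath
            yield item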

How to simulate an XHR request using Scrapy when trying to crawl data from an AJAX-based website?

夙愿已清 submitted on 2019-12-23 02:43:18

Question: I am new to crawling web pages with Scrapy and unfortunately chose a dynamic one to start with. I've successfully crawled part of it (120 links), thanks to someone helping me here, but not the links on the target website. After doing some research, I know that crawling an AJAX page comes down to a few simple ideas:

• open the browser developer tools, network tab
• go to the target site
• click the submit button and see what XHR request goes to the server
• simulate this XHR request in your spider

The last one
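A hedged sketch of that last step: replay the recorded XHR from the spider. The URL, payload, and response keys below are placeholders to be copied from the request captured in the network tab:

    import json

    import scrapy


    class AjaxSpider(scrapy.Spider):
        name = "ajax"

        def start_requests(self):
            yield scrapy.Request(
                url="http://example.com/api/items",          # placeholder endpoint
                method="POST",
                body=json.dumps({"page": 1}),                # placeholder payload
                headers={
                    "Content-Type": "application/json",
                    "X-Requested-With": "XMLHttpRequest",    # many endpoints check this
                },
                callback=self.parse_api,
            )

        def parse_api(self, response):
            # AJAX endpoints usually return JSON rather than HTML.
            data = json.loads(response.text)
            for entry in data.get("items", []):              # key is an assumption
                yield entry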

How to log Scrapy spiders run from a script

笑着哭i submitted on 2019-12-23 02:41:34

Question: I have multiple spiders running from a script. The script is scheduled to run once daily. I want to log the info and error messages separately; the log file names must be spider_infolog_[date] and spider_errlog_[date]. I am trying the following code in the spider __init__ file:

    from twisted.python import log
    import logging

    LOG_FILE = 'logs/spider.log'
    ERR_FILE = 'logs/spider_error.log'
    logging.basicConfig(level=logging.INFO, filemode='w+', filename=LOG_FILE)
    logging.basicConfig(level=logging.ERROR, filemode='w+',
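Note that logging.basicConfig does nothing once the root logger already has handlers, so the second call above is silently ignored. A sketch that instead attaches two date-stamped handlers with different levels, following the filename pattern from the question:

    import datetime
    import logging

    date = datetime.date.today().isoformat()

    info_handler = logging.FileHandler("logs/spider_infolog_%s.log" % date)
    info_handler.setLevel(logging.INFO)

    err_handler = logging.FileHandler("logs/spider_errlog_%s.log" % date)
    err_handler.setLevel(logging.ERROR)

    formatter = logging.Formatter("%(asctime)s [%(name)s] %(levelname)s: %(message)s")
    for handler in (info_handler, err_handler):
        handler.setFormatter(formatter)
        logging.root.addHandler(handler)

    logging.root.setLevel(logging.INFO)

The info file also receives ERROR records; attach a logging.Filter to info_handler if the two files must be disjoint.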