scrapy

Scraping from JavaScript using Scrapy

亡梦爱人 submitted on 2019-12-23 04:36:35

Question: I need to scrape the content of a JavaScript tag using Scrapy. The tag looks like this:

    <script type='text/javascript' id='script-id'>
    attribute={"pid":"123","title":"abc","url":"http://example.com","date":"2014-07-31 14:56:39 CDT","channels":["test"],"tags":[],"authors":["james Catcher"]};
    </script>

I can extract the content using this XPath:

    response.xpath('id("script-id")//text()').extract()

Output:

    [u'\nattribute = {"pid":"123","title":"abc","url":"http:/example.com","date":"2014-07-30 15:34:10 ","channels":[
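Since the value assigned to attribute is a JSON-compatible object literal, a lightweight option is a regular expression plus json.loads; no JavaScript engine is needed. A minimal sketch (the spider name and start URL are placeholders, not from the question):

    import json
    import re

    import scrapy


    class ScriptSpider(scrapy.Spider):
        name = "script_attribute"
        start_urls = ["http://example.com"]  # placeholder

        def parse(self, response):
            # Raw text content of the target <script> element.
            script_text = response.xpath('//script[@id="script-id"]/text()').extract_first() or ""
            # Capture everything between 'attribute=' and the closing '};'.
            match = re.search(r'attribute\s*=\s*(\{.*?\})\s*;', script_text, re.DOTALL)
            if match:
                yield json.loads(match.group(1))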

Struggling with XPath expression for Scrapy

心已入冬 submitted on 2019-12-23 04:12:17

Question: Below is part of an HTML page (all the parameter names are in Russian). It has a main class and two inner classes. The detailed HTML code:

    <div class="obj-params">
      <div class="wrap">
        <div class="obj-params-col" style="min-width:50%;">
          <p><b>Param1_name</b>" Param1_value"</p>
          <p><strong>Param2_name</strong>" Param2_value</p>
          <p><strong>Param3_name</strong>" Param3_value"</p>
        </div>
      </div>
      <div class="wrap">
        <div class="obj-params-col">
          <p><b>Param4_name</b>Param4_value<
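One hedged way to pair each name with the value that follows it, based only on the snippet above: select the <b>/<strong> elements and read the adjacent text node with following-sibling::text()[1]. The spider name and URL are placeholders:

    import scrapy


    class ParamsSpider(scrapy.Spider):
        name = "params"
        start_urls = ["http://example.com/object-page"]  # placeholder

        def parse(self, response):
            # Each <b>/<strong> inside the params block is a parameter name;
            # the first text node right after it carries the value.
            for label in response.xpath(
                    '//div[@class="obj-params"]//p/*[self::b or self::strong]'):
                yield {
                    "name": label.xpath('normalize-space(text())').extract_first(),
                    "value": label.xpath(
                        'normalize-space(following-sibling::text()[1])').extract_first(),
                }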

ImportError: cannot import name unwrap

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-23 03:13:35

Question: I have installed Scrapy with pip install scrapy. But in the Python shell I am getting an ImportError:

    >>> from scrapy.spider import Spider
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/scrapy/__init__.py", line 56, in <module>
        from scrapy.spider import Spider
      File "/usr/local/lib/python2.7/dist-packages/scrapy/spider.py", line 7, in <module>
        from scrapy.http import Request
      File "/usr/local/lib/python2.7/dist-packages/scrapy
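Errors of the form "cannot import name X" deep inside Scrapy's import chain usually mean one of its dependencies is older than the installed Scrapy expects; for this particular missing name, reports from the same era often pointed at a stale six, though that is an assumption here since the traceback above is truncated before the failing import. A quick diagnostic sketch:

    import pkg_resources

    # Print the versions of Scrapy's core dependency stack; compare them
    # against the versions your Scrapy release requires.
    for pkg in ("Scrapy", "six", "w3lib", "Twisted"):
        try:
            print("%s %s" % (pkg, pkg_resources.get_distribution(pkg).version))
        except pkg_resources.DistributionNotFound:
            print("%s not installed" % pkg)

If a package is stale, upgrading it (for example pip install --upgrade six) and retrying the import is the usual next step.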

Scrapy, Celery, and multiple spiders

∥☆過路亽.° submitted on 2019-12-23 02:55:09

Question: I'm using Scrapy and I'm trying to use Celery to manage multiple spiders on one machine. The problem I have (a bit difficult to explain) is that the spiders get multiplied: if my first spider starts and I then start a second spider, the first spider executes twice. See my code here:

ProcessJob.py

    class ProcessJob():
        def processJob(self, job):
            # update job
            mysql = MysqlConnector.Mysql()
            db = mysql.getConnection();
            cur = db.cursor();
            job.status = 1
            update = "UPDATE job SET status=1 WHERE id=
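A common explanation (hedged, since the entry is truncated) is that the Twisted reactor is global per worker process, so crawler state from one task leaks into the next. A widely used workaround is to give every crawl its own process; the broker URL and names below are placeholders:

    from multiprocessing import Process

    from celery import Celery
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker


    def _run_spider(spider_name, **kwargs):
        # Fresh process => fresh Twisted reactor, no state shared between jobs.
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name, **kwargs)
        process.start()  # blocks until the crawl finishes


    @app.task
    def crawl(spider_name, **kwargs):
        p = Process(target=_run_spider, args=(spider_name,), kwargs=kwargs)
        p.start()
        p.join()

Under a daemonized Celery worker, multiprocessing may refuse to fork; billiard.Process (Celery's own fork of multiprocessing) is the usual substitute.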

Django: redirect to results page after Scrapy finishes

谁说我不能喝 submitted on 2019-12-23 02:52:40

Question: I have a Django project with a Scrapy application. After the user fills in some form fields, I pass the data to the spider and crawl some pages. Everything works like a charm and the database is populated, except for one thing: when the user presses the submit button, the results page is blank, because the spider hasn't finished crawling yet and the data isn't in the database. How can I, inside the same Django view that called the spider, know that the crawl has finished? Here goes my
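One approach is a sketch assuming the crochet library (not mentioned in the question): run the Twisted reactor in a background thread and block the view until the crawl's Deferred fires, so the results page renders only after the data is in the database. The spider name and argument are placeholders:

    import crochet
    crochet.setup()  # starts the Twisted reactor in a background thread

    from django.shortcuts import render
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    runner = CrawlerRunner(get_project_settings())


    @crochet.wait_for(timeout=600)
    def run_spider(**spider_args):
        # Returns the crawl Deferred; wait_for turns it into a blocking call.
        return runner.crawl("my_spider", **spider_args)  # placeholder spider name


    def results_view(request):
        run_spider(query=request.POST.get("query"))  # placeholder argument
        # The crawl has finished by this point; the models are populated.
        return render(request, "results.html")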

How to scrape an XML feed with XMLFeedSpider

ぐ巨炮叔叔 submitted on 2019-12-23 02:51:18

Question: I am trying to scrape an XML file with the format below, file_sample.xml:

    <rss version="2.0">
      <channel>
        <item>
          <title>SENIOR BUDGET ANALYST (new)</title>
          <link>https://hr.example.org/psp/hrapp&SeqId=1</link>
          <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
          <category>All Open Jobs</category>
        </item>
        <item>
          <title>BUDGET ANALYST (healthcare)</title>
          <link>https://hr.example.org/psp/hrapp&SeqId=2</link>
          <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
          <category>All category</category>
        </item>
        <
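A minimal XMLFeedSpider sketch for that feed (modern import path; the feed URL is a placeholder). itertag points the spider at each <item> node and parse_node is called once per node:

    from scrapy.spiders import XMLFeedSpider


    class JobsSpider(XMLFeedSpider):
        name = "jobs"
        start_urls = ["https://hr.example.org/file_sample.xml"]  # placeholder
        iterator = "iternodes"  # the default; streams nodes one at a time
        itertag = "item"

        def parse_node(self, response, node):
            yield {
                "title": node.xpath("title/text()").extract_first(),
                "link": node.xpath("link/text()").extract_first(),
                "pubDate": node.xpath("pubDate/text()").extract_first(),
                "category": node.xpath("category/text()").extract_first(),
            }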

Get Scrapy results inside a Django view

时光总嘲笑我的痴心妄想 submitted on 2019-12-23 02:49:17

Question: I'm scraping a page successfully, and it returns a single item. I don't want to save the scraped item in the database or to a file; I need to get it inside a Django view. My view is as follows:

    def start_crawl(process_number, court):
        """
        Starts the crawler.

        Args:
            process_number (str): Process number to be found.
            court (str): Court of the process.
        """
        runner = CrawlerRunner(get_project_settings())
        results = list()

        def crawler_results(sender, parse_result, **kwargs):
            results.append
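A hedged sketch completing that pattern: connect a handler to the item_scraped signal, then chain a callback on the crawl Deferred that resolves with whatever was collected. crochet (not in the question) keeps the call synchronous; the spider name is a placeholder:

    import crochet
    crochet.setup()

    from pydispatch import dispatcher
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings


    @crochet.wait_for(timeout=300)
    def start_crawl(process_number, court):
        results = []

        def collect_item(item, response, spider):
            results.append(item)

        # item_scraped fires once for every item the spider yields.
        dispatcher.connect(collect_item, signal=signals.item_scraped)
        runner = CrawlerRunner(get_project_settings())
        deferred = runner.crawl("court_spider", process_number=process_number, court=court)
        # Resolve the Deferred with the collected items so the caller gets them.
        return deferred.addCallback(lambda _: results)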

Scrapy: getting data from links within a table

喜欢而已 submitted on 2019-12-23 02:46:13

Question: I am trying to scrape data from the HTML table on the Texas Death Row page. I am able to pull the existing data from the table using the spider script below:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from texasdeath.items import DeathItem

    class DeathSpider(BaseSpider):
        name = "death"
        allowed_domains = ["tdcj.state.tx.us"]
        start_urls = [
            "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
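The usual pattern for pulling data from pages behind the links is: start an item from each table row, follow the row's link, and finish the item in a second callback via response.meta. A hedged sketch using the modern Scrapy API; the XPaths, column index, and item field names are hypothetical, not verified against the live page:

    import scrapy
    from texasdeath.items import DeathItem


    class DeathSpider(scrapy.Spider):
        name = "death"
        allowed_domains = ["tdcj.state.tx.us"]
        start_urls = ["https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"]

        def parse(self, response):
            for row in response.xpath('//table//tr[td]'):
                item = DeathItem()
                item["name"] = row.xpath('td[5]/text()').extract_first()  # column index is a guess
                detail_url = row.xpath('td/a/@href').extract_first()
                if detail_url:
                    # Carry the half-built item to the detail page.
                    yield scrapy.Request(response.urljoin(detail_url),
                                         callback=self.parse_detail,
                                         meta={"item": item})

        def parse_detail(self, response):
            item = response.meta["item"]
            item["statement"] = " ".join(response.xpath('//p/text()').extract())  # placeholder XPath
            yield item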

How to simulate an XHR request using Scrapy when trying to crawl data from an AJAX-based website?

夙愿已清 submitted on 2019-12-23 02:43:18

Question: I am new to crawling web pages with Scrapy and unfortunately chose a dynamic one to start with. I've successfully crawled part of it (120 links), thanks to someone helping me here, but not the links on the target website. After doing some research, I know that crawling an AJAX page comes down to a few simple ideas:

• open the browser developer tools, network tab
• go to the target site
• click the submit button and see what XHR request goes to the server
• simulate this XHR request in your spider

The last one
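A hedged sketch of that last step: replay the recorded XHR from the spider. The URL, payload, and response keys below are placeholders to be copied from the request captured in the network tab:

    import json

    import scrapy


    class AjaxSpider(scrapy.Spider):
        name = "ajax"

        def start_requests(self):
            yield scrapy.Request(
                url="http://example.com/api/items",          # placeholder endpoint
                method="POST",
                body=json.dumps({"page": 1}),                # placeholder payload
                headers={
                    "Content-Type": "application/json",
                    "X-Requested-With": "XMLHttpRequest",    # many endpoints check this
                },
                callback=self.parse_api,
            )

        def parse_api(self, response):
            # AJAX endpoints usually return JSON rather than HTML.
            data = json.loads(response.text)
            for entry in data.get("items", []):              # key is an assumption
                yield entry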

How to log Scrapy spiders run from a script

笑着哭i submitted on 2019-12-23 02:41:34

Question: I have multiple spiders running from a script. The script is scheduled to run once daily. I want to log the info and error messages separately; the log file names must be spider_infolog_[date] and spider_errlog_[date]. I am trying the following code in the spider __init__ file:

    from twisted.python import log
    import logging

    LOG_FILE = 'logs/spider.log'
    ERR_FILE = 'logs/spider_error.log'
    logging.basicConfig(level=logging.INFO, filemode='w+', filename=LOG_FILE)
    logging.basicConfig(level=logging.ERROR, filemode='w+',
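Note that logging.basicConfig does nothing once the root logger already has handlers, so the second call above is silently ignored. A sketch that instead attaches two date-stamped handlers with different levels, following the filename pattern from the question:

    import datetime
    import logging

    date = datetime.date.today().isoformat()

    info_handler = logging.FileHandler("logs/spider_infolog_%s.log" % date)
    info_handler.setLevel(logging.INFO)

    err_handler = logging.FileHandler("logs/spider_errlog_%s.log" % date)
    err_handler.setLevel(logging.ERROR)

    formatter = logging.Formatter("%(asctime)s [%(name)s] %(levelname)s: %(message)s")
    for handler in (info_handler, err_handler):
        handler.setFormatter(formatter)
        logging.root.addHandler(handler)

    logging.root.setLevel(logging.INFO)

The info file also receives ERROR records; attach a logging.Filter to info_handler if the two files must be disjoint.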