scrapy-spider

Can't get rid of blank rows in csv output

Submitted by 旧城冷巷雨未停 on 2019-11-29 05:23:20
I've written a very small script in Python Scrapy to parse names, streets and phone numbers displayed across multiple pages of a yellowpages website. When I run my script it works smoothly. However, the only problem I encounter is the way the data end up in the CSV output: there is always a blank line (row) between two rows of data, i.e. the data are printed on every other row. The picture below shows what I mean. If it were not for Scrapy, I could have used [newline='']. But unfortunately I am totally stuck here. How can I get rid of the blank lines coming
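
One way to keep the newline='' trick while staying inside Scrapy is to bypass the built-in CSV feed export and write the rows from an item pipeline instead. The sketch below is only an illustration of that idea, not the answer originally given: the field names name, street and phone come from the question text, and the output path output.csv is a placeholder.

```python
import csv

class CsvWriterPipeline:
    """Write items to a CSV file opened with newline='' to avoid blank rows on Windows."""

    def open_spider(self, spider):
        # newline='' stops csv.writer from emitting the extra \r that shows up
        # as an empty row between records on Windows
        self.file = open('output.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['name', 'street', 'phone'])

    def process_item(self, item, spider):
        self.writer.writerow([item.get('name'), item.get('street'), item.get('phone')])
        return item

    def close_spider(self, spider):
        self.file.close()
```

Enable it with ITEM_PIPELINES = {'myproject.pipelines.CsvWriterPipeline': 300} (the module path is hypothetical) and run the spider without -o.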

Separate output file for every url given in start_urls list of spider in scrapy

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-11-29 01:41:55
Question: I want to create a separate output file for every URL I have set in the spider's start_urls, or somehow split the output files per start URL. This is the start_urls of my spider: start_urls = ['http://www.dmoz.org/Arts/', 'http://www.dmoz.org/Business/', 'http://www.dmoz.org/Computers/'] I want to create separate output files like Arts.xml, Business.xml and Computers.xml. I don't know exactly how to do this. I am thinking of achieving it by implementing something like the following in spider_opened
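
One possible approach, sketched under the assumption that the spider records which start URL each item came from (e.g. an item['category'] field holding 'Arts', 'Business' or 'Computers'), is an item pipeline that keeps one XmlItemExporter per category:

```python
from scrapy.exporters import XmlItemExporter

class PerStartUrlXmlPipeline:
    """Route each item to <category>.xml based on an assumed item['category'] field."""

    def open_spider(self, spider):
        self.exporters = {}
        self.files = {}

    def _exporter_for(self, category):
        if category not in self.exporters:
            file = open('%s.xml' % category, 'wb')  # e.g. Arts.xml
            exporter = XmlItemExporter(file)
            exporter.start_exporting()
            self.files[category] = file
            self.exporters[category] = exporter
        return self.exporters[category]

    def process_item(self, item, spider):
        self._exporter_for(item['category']).export_item(item)
        return item

    def close_spider(self, spider):
        for category, exporter in self.exporters.items():
            exporter.finish_exporting()
            self.files[category].close()
```

The spider can set item['category'] from response.url (or pass it along in request.meta), so the pipeline never needs to know the start_urls list itself.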

scrapy from script output in json

Submitted by 别来无恙 on 2019-11-28 23:10:17
I am running Scrapy in a Python script: def setup_crawler(domain): dispatcher.connect(stop_reactor, signal=signals.spider_closed) spider = ArgosSpider(domain=domain) settings = get_project_settings() crawler = Crawler(settings) crawler.configure() crawler.crawl(spider) crawler.start() reactor.run() It runs successfully and stops, but where is the result? I want the result in JSON format; how can I do that? Something like result = responseInJSON, the way we get it with the command scrapy crawl argos -o result.json -t json. alecxe: You need to set the FEED_FORMAT and FEED_URI settings manually: settings.overrides['FEED_FORMAT']
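
settings.overrides is the old pre-1.0 settings API. On newer Scrapy versions the same idea, shown here only as a sketch, uses Settings.set() before creating the crawler (ArgosSpider comes from the question; the import path and the example domain are placeholders; FEED_FORMAT/FEED_URI were later superseded by the FEEDS dict but are still widely recognized):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import ArgosSpider  # hypothetical import path for the question's spider

def setup_crawler(domain):
    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'json')      # what -t json does on the command line
    settings.set('FEED_URI', 'result.json')  # what -o result.json does
    process = CrawlerProcess(settings)
    process.crawl(ArgosSpider, domain=domain)
    process.start()  # blocks until the crawl finishes; result.json is then on disk

setup_crawler('argos.co.uk')  # the domain value is only an example
```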

Running Multiple spiders in scrapy for 1 website in parallel?

Submitted by 此生再无相见时 on 2019-11-28 22:05:12
I want to crawl a website with two parts, and my script is not as fast as I need. Is it possible to launch two spiders, one for scraping the first part and a second one for the second part? I tried having two different classes and running them with scrapy crawl firstSpider and scrapy crawl secondSpider, but I don't think that is smart. I read the documentation of scrapyd, but I don't know whether it is right for my case. I think what you are looking for is something like this: import scrapy from scrapy.crawler import CrawlerProcess class MySpider1(scrapy.Spider): # Your first spider definition ... class MySpider2
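
For reference, the pattern that answer starts to quote is the run-several-spiders-in-one-process recipe from the Scrapy documentation. A filled-in sketch (the spider bodies and URLs here are placeholders, not the asker's real spiders):

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    name = 'firstSpider'
    start_urls = ['http://example.com/part1']  # placeholder URL for the first part

    def parse(self, response):
        yield {'part': 1, 'url': response.url}

class MySpider2(scrapy.Spider):
    name = 'secondSpider'
    start_urls = ['http://example.com/part2']  # placeholder URL for the second part

    def parse(self, response):
        yield {'part': 2, 'url': response.url}

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # both spiders run concurrently in the same Twisted reactor
```

Because both crawls share one process and one reactor, this is usually faster than running scrapy crawl twice in sequence.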

Pass Scrapy Spider a list of URLs to crawl via .txt file

Submitted by ﹥>﹥吖頭↗ on 2019-11-28 20:40:44
Question: I'm a little new to Python and very new to Scrapy. I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable. For example: class LinkChecker(BaseSpider): name = 'linkchecker' start_urls = [] # Here I want the spider to start crawling a list of urls from a text file passed via the command line. I've done a little bit of research and keep coming up empty-handed. I've seen this type of example (How to pass a user
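
A common way to do this, sketched under the assumption that the file name is passed on the command line with -a url_file=urls.txt, is to read the file in start_requests() instead of filling start_urls (on current Scrapy versions scrapy.Spider replaces the old BaseSpider):

```python
import scrapy

class LinkCheckerSpider(scrapy.Spider):
    name = 'linkchecker'

    def __init__(self, url_file=None, *args, **kwargs):
        # run as: scrapy crawl linkchecker -a url_file=urls.txt
        super().__init__(*args, **kwargs)
        self.url_file = url_file

    def start_requests(self):
        with open(self.url_file) as f:
            for line in f:
                url = line.strip()
                if url:  # skip blank lines in the .txt file
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}
```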

Passing arguments to process.crawl in Scrapy python

Submitted by 不问归期 on 2019-11-28 20:25:27
I would like to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json My script is as follows: import scrapy from linkedin_anonymous_spider import LinkedInAnonymousSpider from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings spider = LinkedInAnonymousSpider(None, "James", "Bond") process = CrawlerProcess(get_project_settings()) process.crawl(spider) ## <-------------- (1) process.start() I found out that process.crawl() in (1) is creating another LinkedInAnonymousSpider where first and
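
What usually fixes this is handing process.crawl() the spider class together with its arguments instead of a pre-built instance; CrawlerProcess then constructs the spider itself, which is why passing an instance ends up creating a second LinkedInAnonymousSpider with empty arguments. A hedged sketch (the positional arguments mirror the LinkedInAnonymousSpider(None, "James", "Bond") call in the question, and the feed settings are an assumed stand-in for -o output.json):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from linkedin_anonymous_spider import LinkedInAnonymousSpider

settings = get_project_settings()
settings.set('FEED_URI', 'output.json')   # rough equivalent of -o output.json
settings.set('FEED_FORMAT', 'json')

process = CrawlerProcess(settings)
# pass the class plus the spider arguments; CrawlerProcess instantiates it
process.crawl(LinkedInAnonymousSpider, None, "James", "Bond")
process.start()
```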

Multiprocessing of Scrapy Spiders in Parallel Processes

Submitted by 拥有回忆 on 2019-11-28 08:44:40
There are several similar questions that I have already read on Stack Overflow. Unfortunately, I lost the links to all of them because my browsing history got deleted unexpectedly. None of those questions could help me: some of them use Celery and some use Scrapyd, while I want to use the multiprocessing library. Also, the official Scrapy documentation only shows how to run multiple spiders in a SINGLE process, not in MULTIPLE processes, and hence I decided to ask this question. After several tries, I came up with this code. My output: Enter a
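
A minimal sketch of the multiprocessing idea, assuming two hypothetical spider classes SpiderA and SpiderB living in myproject.spiders: each child process owns its own Twisted reactor, which sidesteps the ReactorNotRestartable error you get when calling process.start() more than once in a single process.

```python
from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import SpiderA, SpiderB  # hypothetical spider classes

def run_spider(spider_cls):
    # runs inside a child process, so it gets a fresh Twisted reactor
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()

if __name__ == '__main__':
    procs = [Process(target=run_spider, args=(cls,)) for cls in (SpiderA, SpiderB)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```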

Get Scrapy crawler output/results in script file function

Submitted by 时光毁灭记忆、已成空白 on 2019-11-28 07:57:40
Question: I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider output/results inside some function in that script file. I do not want to save the output/results to any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script: from twisted.internet import reactor from scrapy.crawler import CrawlerRunner from scrapy.utils.log import configure_logging from scrapy.utils
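
One way to get the results back into the script without writing a file or a DB is to subscribe to the item_scraped signal and collect items in a plain list. A hedged sketch using CrawlerProcess (the docs snippet quoted above uses CrawlerRunner, which works the same way but needs manual reactor handling); MySpider stands in for whatever spider the project defines:

```python
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_and_collect(spider_cls):
    items = []

    def item_scraped(item, response, spider):
        items.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(item_scraped, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()   # blocks until the crawl is finished
    return items      # now available to any function in the script

# results = run_and_collect(MySpider)
```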

Export csv file from scrapy (not via command line)

Submitted by ↘锁芯ラ on 2019-11-27 14:37:46
Question: I successfully exported my items into a csv file from the command line, like: scrapy crawl spiderName -o filename.csv My question is: what is the easiest solution to do the same in code? I need this because I extract the filename from another file. The end scenario should be that I call scrapy crawl spiderName and it writes the items into filename.csv. Answer 1: Why not use an item pipeline? WriteToCsv.py import csv from YOUR_PROJECT_NAME_HERE import settings def write_to_csv(item): writer = csv
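
The answer's code is cut off above; a minimal sketch of how that pipeline idea could continue, assuming a csv_file_path value in the project's settings.py that carries the extracted filename (the project name, the setting name and the file layout are all placeholders):

```python
# WriteToCsv.py
import csv

from YOUR_PROJECT_NAME_HERE import settings  # placeholder project name, as in the answer

def write_to_csv(item):
    # append one row per item to the path configured in settings.py
    with open(settings.csv_file_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([item[key] for key in item.keys()])

class WriteToCsv:
    def process_item(self, item, spider):
        write_to_csv(item)
        return item
```

Registered under ITEM_PIPELINES, this writes every scraped item to the configured CSV path, so the filename only has to be set in settings before scrapy crawl spiderName runs.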