scrapy-spider

Can't get rid of blank rows in csv output

Submitted by 旧城冷巷雨未停 on 2019-11-29 05:23:20
I've written a very small script in Python Scrapy to parse names, streets and phone numbers displayed across multiple pages of a yellowpages website. When I run my script it works smoothly. However, the only problem I encounter is the way the data end up in the CSV output: there is always a blank line (row) between two rows of data, i.e. the data are printed on every other row. The picture below shows what I mean. If it were not for Scrapy, I could have used [newline='']. But unfortunately I am totally stuck here. How can I get rid of the blank lines coming
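
One way to keep the newline='' trick while staying inside Scrapy is to bypass the built-in CSV feed export and write the rows from an item pipeline instead. The sketch below is only an illustration of that idea, not the answer originally given: the field names name, street and phone come from the question text, and the output path output.csv is a placeholder.

```python
import csv

class CsvWriterPipeline:
    """Write items to a CSV file opened with newline='' to avoid blank rows on Windows."""

    def open_spider(self, spider):
        # newline='' stops csv.writer from emitting the extra \r that shows up
        # as an empty row between records on Windows
        self.file = open('output.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['name', 'street', 'phone'])

    def process_item(self, item, spider):
        self.writer.writerow([item.get('name'), item.get('street'), item.get('phone')])
        return item

    def close_spider(self, spider):
        self.file.close()
```

Enable it with ITEM_PIPELINES = {'myproject.pipelines.CsvWriterPipeline': 300} (the module path is hypothetical) and run the spider without -o.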

Separate output file for every url given in start_urls list of spider in scrapy

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-11-29 01:41:55
Question: I want to create a separate output file for every URL I have set in the spider's start_urls, or somehow split the output files per start URL. This is the start_urls of my spider: start_urls = ['http://www.dmoz.org/Arts/', 'http://www.dmoz.org/Business/', 'http://www.dmoz.org/Computers/'] I want to create separate output files like Arts.xml, Business.xml and Computers.xml. I don't know exactly how to do this. I am thinking of achieving it by implementing something like the following in spider_opened
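
One possible approach, sketched under the assumption that the spider records which start URL each item came from (e.g. an item['category'] field holding 'Arts', 'Business' or 'Computers'), is an item pipeline that keeps one XmlItemExporter per category:

```python
from scrapy.exporters import XmlItemExporter

class PerStartUrlXmlPipeline:
    """Route each item to <category>.xml based on an assumed item['category'] field."""

    def open_spider(self, spider):
        self.exporters = {}
        self.files = {}

    def _exporter_for(self, category):
        if category not in self.exporters:
            file = open('%s.xml' % category, 'wb')  # e.g. Arts.xml
            exporter = XmlItemExporter(file)
            exporter.start_exporting()
            self.files[category] = file
            self.exporters[category] = exporter
        return self.exporters[category]

    def process_item(self, item, spider):
        self._exporter_for(item['category']).export_item(item)
        return item

    def close_spider(self, spider):
        for category, exporter in self.exporters.items():
            exporter.finish_exporting()
            self.files[category].close()
```

The spider can set item['category'] from response.url (or pass it along in request.meta), so the pipeline never needs to know the start_urls list itself.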

scrapy from script output in json

Submitted by 别来无恙 on 2019-11-28 23:10:17
I am running Scrapy in a Python script: def setup_crawler(domain): dispatcher.connect(stop_reactor, signal=signals.spider_closed) spider = ArgosSpider(domain=domain) settings = get_project_settings() crawler = Crawler(settings) crawler.configure() crawler.crawl(spider) crawler.start() reactor.run() It runs successfully and stops, but where is the result? I want the result in JSON format; how can I do that? Something like result = responseInJSON, the way we get it with the command scrapy crawl argos -o result.json -t json. alecxe: You need to set the FEED_FORMAT and FEED_URI settings manually: settings.overrides['FEED_FORMAT']
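
settings.overrides is the old pre-1.0 settings API. On newer Scrapy versions the same idea, shown here only as a sketch, uses Settings.set() before creating the crawler (ArgosSpider comes from the question; the import path and the example domain are placeholders; FEED_FORMAT/FEED_URI were later superseded by the FEEDS dict but are still widely recognized):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import ArgosSpider  # hypothetical import path for the question's spider

def setup_crawler(domain):
    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'json')      # what -t json does on the command line
    settings.set('FEED_URI', 'result.json')  # what -o result.json does
    process = CrawlerProcess(settings)
    process.crawl(ArgosSpider, domain=domain)
    process.start()  # blocks until the crawl finishes; result.json is then on disk

setup_crawler('argos.co.uk')  # the domain value is only an example
```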

Running Multiple spiders in scrapy for 1 website in parallel?

Submitted by 此生再无相见时 on 2019-11-28 22:05:12
I want to crawl a website with two parts, and my script is not as fast as I need. Is it possible to launch two spiders, one for scraping the first part and a second one for the second part? I tried having two different classes and running them with scrapy crawl firstSpider and scrapy crawl secondSpider, but I don't think that is smart. I read the documentation of scrapyd, but I don't know whether it is right for my case. I think what you are looking for is something like this: import scrapy from scrapy.crawler import CrawlerProcess class MySpider1(scrapy.Spider): # Your first spider definition ... class MySpider2
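
For reference, the pattern that answer starts to quote is the run-several-spiders-in-one-process recipe from the Scrapy documentation. A filled-in sketch (the spider bodies and URLs here are placeholders, not the asker's real spiders):

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    name = 'firstSpider'
    start_urls = ['http://example.com/part1']  # placeholder URL for the first part

    def parse(self, response):
        yield {'part': 1, 'url': response.url}

class MySpider2(scrapy.Spider):
    name = 'secondSpider'
    start_urls = ['http://example.com/part2']  # placeholder URL for the second part

    def parse(self, response):
        yield {'part': 2, 'url': response.url}

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # both spiders run concurrently in the same Twisted reactor
```

Because both crawls share one process and one reactor, this is usually faster than running scrapy crawl twice in sequence.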

Pass Scrapy Spider a list of URLs to crawl via .txt file

Submitted by ﹥>﹥吖頭↗ on 2019-11-28 20:40:44
Question: I'm a little new to Python and very new to Scrapy. I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable. For example: class LinkChecker(BaseSpider): name = 'linkchecker' start_urls = [] # Here I want the spider to start crawling a list of urls from a text file passed via the command line. I've done a little bit of research and keep coming up empty-handed. I've seen this type of example (How to pass a user
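
A common way to do this, sketched under the assumption that the file name is passed on the command line with -a url_file=urls.txt, is to read the file in start_requests() instead of filling start_urls (on current Scrapy versions scrapy.Spider replaces the old BaseSpider):

```python
import scrapy

class LinkCheckerSpider(scrapy.Spider):
    name = 'linkchecker'

    def __init__(self, url_file=None, *args, **kwargs):
        # run as: scrapy crawl linkchecker -a url_file=urls.txt
        super().__init__(*args, **kwargs)
        self.url_file = url_file

    def start_requests(self):
        with open(self.url_file) as f:
            for line in f:
                url = line.strip()
                if url:  # skip blank lines in the .txt file
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}
```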

Passing arguments to process.crawl in Scrapy python

Submitted by 不问归期 on 2019-11-28 20:25:27
I would like to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json My script is as follows: import scrapy from linkedin_anonymous_spider import LinkedInAnonymousSpider from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings spider = LinkedInAnonymousSpider(None, "James", "Bond") process = CrawlerProcess(get_project_settings()) process.crawl(spider) ## <-------------- (1) process.start() I found out that process.crawl() in (1) is creating another LinkedInAnonymousSpider where first and
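
What usually fixes this is handing process.crawl() the spider class together with its arguments instead of a pre-built instance; CrawlerProcess then constructs the spider itself, which is why passing an instance ends up creating a second LinkedInAnonymousSpider with empty arguments. A hedged sketch (the positional arguments mirror the LinkedInAnonymousSpider(None, "James", "Bond") call in the question, and the feed settings are an assumed stand-in for -o output.json):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from linkedin_anonymous_spider import LinkedInAnonymousSpider

settings = get_project_settings()
settings.set('FEED_URI', 'output.json')   # rough equivalent of -o output.json
settings.set('FEED_FORMAT', 'json')

process = CrawlerProcess(settings)
# pass the class plus the spider arguments; CrawlerProcess instantiates it
process.crawl(LinkedInAnonymousSpider, None, "James", "Bond")
process.start()
```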

Multiprocessing of Scrapy Spiders in Parallel Processes

Submitted by 拥有回忆 on 2019-11-28 08:44:40
There are several similar questions that I have already read on Stack Overflow. Unfortunately, I lost the links to all of them because my browsing history got deleted unexpectedly. None of those questions could help me: some of them use Celery and some use Scrapyd, while I want to use the multiprocessing library. Also, the official Scrapy documentation only shows how to run multiple spiders in a SINGLE process, not in MULTIPLE processes, and hence I decided to ask this question. After several tries, I came up with this code. My output: Enter a
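
A minimal sketch of the multiprocessing idea, assuming two hypothetical spider classes SpiderA and SpiderB living in myproject.spiders: each child process owns its own Twisted reactor, which sidesteps the ReactorNotRestartable error you get when calling process.start() more than once in a single process.

```python
from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import SpiderA, SpiderB  # hypothetical spider classes

def run_spider(spider_cls):
    # runs inside a child process, so it gets a fresh Twisted reactor
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()

if __name__ == '__main__':
    procs = [Process(target=run_spider, args=(cls,)) for cls in (SpiderA, SpiderB)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```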

Get Scrapy crawler output/results in script file function

Submitted by 时光毁灭记忆、已成空白 on 2019-11-28 07:57:40
Question: I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider output/results inside some function in that script file. I do not want to save the output/results to any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script: from twisted.internet import reactor from scrapy.crawler import CrawlerRunner from scrapy.utils.log import configure_logging from scrapy.utils
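
One way to get the results back into the script without writing a file or a DB is to subscribe to the item_scraped signal and collect items in a plain list. A hedged sketch using CrawlerProcess (the docs snippet quoted above uses CrawlerRunner, which works the same way but needs manual reactor handling); MySpider stands in for whatever spider the project defines:

```python
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_and_collect(spider_cls):
    items = []

    def item_scraped(item, response, spider):
        items.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(item_scraped, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()   # blocks until the crawl is finished
    return items      # now available to any function in the script

# results = run_and_collect(MySpider)
```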

Export csv file from scrapy (not via command line)

Submitted by ↘锁芯ラ on 2019-11-27 14:37:46
Question: I successfully exported my items into a csv file from the command line, like: scrapy crawl spiderName -o filename.csv My question is: what is the easiest solution to do the same in code? I need this because I extract the filename from another file. The end scenario should be that I call scrapy crawl spiderName and it writes the items into filename.csv. Answer 1: Why not use an item pipeline? WriteToCsv.py import csv from YOUR_PROJECT_NAME_HERE import settings def write_to_csv(item): writer = csv
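
The answer's code is cut off above; a minimal sketch of how that pipeline idea could continue, assuming a csv_file_path value in the project's settings.py that carries the extracted filename (the project name, the setting name and the file layout are all placeholders):

```python
# WriteToCsv.py
import csv

from YOUR_PROJECT_NAME_HERE import settings  # placeholder project name, as in the answer

def write_to_csv(item):
    # append one row per item to the path configured in settings.py
    with open(settings.csv_file_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([item[key] for key in item.keys()])

class WriteToCsv:
    def process_item(self, item, spider):
        write_to_csv(item)
        return item
```

Registered under ITEM_PIPELINES, this writes every scraped item to the configured CSV path, so the filename only has to be set in settings before scrapy crawl spiderName runs.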