scrapy

Start scrapy from Flask route

落花浮王杯 Submitted on 2019-12-19 08:00:55
Question: I want to build a crawler that takes the URL of a webpage to be scraped and returns the result back to a webpage. Right now I start Scrapy from the terminal and store the response in a file. How can I start the crawler when some input is posted to the Flask app, process it, and return a response?

Answer 1: You need to create a CrawlerProcess inside your Flask application and run the crawl programmatically. See the docs.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy
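Below is a minimal sketch of the approach from Answer 1: run a spider programmatically with CrawlerProcess when a URL is POSTed to a Flask route. The spider class, route name, and collected fields are illustrative assumptions, not the original poster's code. Note that Twisted's reactor can only be started once per process, so this works for a single request in a simple setup; repeated requests usually require running the crawl in a subprocess or using CrawlerRunner with crochet.

import scrapy
from scrapy.crawler import CrawlerProcess
from flask import Flask, request, jsonify

app = Flask(__name__)
results = []  # collected items (illustrative; a real app would scope this per request)

class MySpider(scrapy.Spider):
    name = "my_spider"

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        # Grab something simple from the page, here its <title>.
        results.append({"url": response.url,
                        "title": response.css("title::text").get()})

@app.route("/crawl", methods=["POST"])
def crawl():
    url = request.form["url"]
    process = CrawlerProcess(settings={"LOG_ENABLED": False})
    process.crawl(MySpider, start_url=url)
    process.start()  # blocks until the crawl finishes; the reactor cannot be restarted
    return jsonify(results)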

Recording the total time taken for running a spider in scrapy

泪湿孤枕 Submitted on 2019-12-19 07:30:07
Question: I am using Scrapy to scrape a site. I have written a spider that fetches all the items from the page and saves them to a CSV file. Now I want to save the total execution time Scrapy takes to run the spider. After the spider finishes, the terminal displays results such as start time, end time, and so on. In my program I need to calculate the total time Scrapy took to run the spider and store it somewhere.... Can anyone
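As one possible answer, here is a minimal sketch (not the asker's code; the spider name, URL, and output file are assumptions): remember the wall-clock time when the spider is created and write the elapsed time in the spider's closed() hook, which Scrapy calls when the crawl finishes.

import time
import scrapy

class TimedSpider(scrapy.Spider):
    name = "timed_spider"
    start_urls = ["https://example.com"]  # placeholder

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._started = time.time()  # wall-clock start of the spider

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

    def closed(self, reason):
        # Called automatically when the spider finishes.
        elapsed = time.time() - self._started
        with open("run_time.txt", "a") as f:
            f.write(f"{self.name}: {elapsed:.2f} seconds ({reason})\n")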

'NoneType' object has no attribute '_app_data' in scrapy\twisted\openssl

限于喜欢 Submitted on 2019-12-19 05:25:15
Question: During scraping with Scrapy, one error appears in my logs from time to time. It doesn't seem to come from anywhere in my code, and it looks like something inside twisted\openssl. Any ideas what causes this and how to get rid of it? Stack trace here:

[Launcher,27487/stderr] Error during info_callback
Traceback (most recent call last):
  File "/opt/webapps/link_crawler/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
    self._write(bytes)
  File "/opt/webapps/link

Dynamic rules based on start_urls for Scrapy CrawlSpider?

梦想与她 Submitted on 2019-12-19 05:09:04
Question: I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, follow their internal links, and scrape the contents of any external links (links whose domain differs from the original domain). I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites, I run into a problem because I don't know which "start_url" I'm currently on, so I can't adjust the rule appropriately. Here's what I came up with so far; it
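One common way to handle this, sketched below under assumed names: pass the start URL as a spider argument, derive its domain in __init__, and build the rules there before calling the CrawlSpider constructor, which is where the rules get compiled.

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExternalLinkSpider(CrawlSpider):
    name = "external_links"

    def __init__(self, start_url="https://example.com", *args, **kwargs):
        domain = urlparse(start_url).netloc
        self.start_urls = [start_url]
        self.rules = (
            # Follow internal links without scraping them.
            Rule(LinkExtractor(allow_domains=[domain]), follow=True),
            # Scrape external links (any other domain) with a callback.
            Rule(LinkExtractor(deny_domains=[domain]),
                 callback="parse_external", follow=False),
        )
        super().__init__(*args, **kwargs)  # CrawlSpider compiles self.rules here

    def parse_external(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}

Run it with, for example: scrapy runspider external_spider.py -a start_url="https://example.com"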

Why does Scrapy return an iframe?

那年仲夏 Submitted on 2019-12-19 05:08:30
Question: I want to crawl this site with Python-Scrapy. I tried this:

class Parik(scrapy.Spider):
    name = "ooshop"
    allowed_domains = ["http://www.ooshop.com/courses-en-ligne/Home.aspx"]

    def __init__(self, idcrawl=None, proxy=None, *args, **kwargs):
        super(Parik, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.ooshop.com/courses-en-ligne/Home.aspx']

    def parse(self, response):
        print response.css('body').extract_first()

but I don't get the first page; I get an empty iframe: 2016-09-06 19:09:24
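Here is a minimal sketch of the usual fix when a page only renders an iframe: extract the iframe's src attribute from the outer page and request that URL, parsing the real content in a second callback. The selectors and the yielded fields are assumptions.

import scrapy

class IframeSpider(scrapy.Spider):
    name = "iframe_example"
    start_urls = ['http://www.ooshop.com/courses-en-ligne/Home.aspx']

    def parse(self, response):
        # The outer page is just a frame; follow its src to reach the content.
        iframe_src = response.css("iframe::attr(src)").get()
        if iframe_src:
            yield response.follow(iframe_src, callback=self.parse_frame)

    def parse_frame(self, response):
        # The body of the framed page now holds the actual markup.
        yield {"body_length": len(response.css("body").get() or "")}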

I'm trying to store more than one spider's results into separate tables in MySQL

旧时模样 Submitted on 2019-12-19 04:22:47
Question: Here is my pipelines.py. I have two spiders, one called bristol.py and one called bath.py. When I run 'scrapy crawl bristol', it automatically adds the results to my MySQL database table called 'Bristol'. I want to run 'scrapy crawl bath' and store the results in the MySQL database under the table name 'Bath'. I've tried adding the exact same line of code as for the 'Bristol' table, but I receive an error. This is the code I've tried putting directly underneath the first self.cursor.execute
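One way to do this, sketched below, is to key the table name on spider.name in a single pipeline instead of hard-coding 'Bristol'. The MySQLdb driver, credentials, and column names are assumptions; adapt them to the real schema.

import MySQLdb

class MultiTablePipeline:
    # Map each spider to its table; unknown spiders fall back to their own name.
    TABLES = {"bristol": "Bristol", "bath": "Bath"}

    def open_spider(self, spider):
        self.conn = MySQLdb.connect(host="localhost", user="user",
                                    passwd="password", db="scraped")
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        table = self.TABLES.get(spider.name, spider.name)
        # Table names come from the fixed TABLES mapping; values are parameterized.
        self.cursor.execute(
            "INSERT INTO {} (title, link) VALUES (%s, %s)".format(table),
            (item.get("title"), item.get("link")),
        )
        return item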

Export scrapy items to different files

杀马特。学长 韩版系。学妹 Submitted on 2019-12-19 04:07:04
Question: I'm scraping reviews from MOOCs like this one. From there I'm getting all the course details (5 items) and another 6 items from each review itself. This is the code I have for the course details:

def parse_reviews(self, response):
    l = ItemLoader(item=MoocsItem(), response=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '/
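One way to split the output, sketched below: a pipeline that opens one CsvItemExporter per item type and routes each item by its class name. MoocsItem comes from the question; MoocsReviewItem and the file names are assumptions.

from scrapy.exporters import CsvItemExporter

class PerTypeCsvPipeline:
    def open_spider(self, spider):
        self.files = {
            "MoocsItem": open("courses.csv", "wb"),
            "MoocsReviewItem": open("reviews.csv", "wb"),
        }
        self.exporters = {name: CsvItemExporter(f)
                          for name, f in self.files.items()}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # Route the item to the exporter registered for its class.
        name = type(item).__name__
        if name in self.exporters:
            self.exporters[name].export_item(item)
        return item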

How to dynamically set Scrapy rules?

江枫思渺然 Submitted on 2019-12-19 04:04:06
Question: I have a class running some code before the init:

class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams

I am running this Scrapy code with the following command:

> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt

Now, I want the static variable named rules to
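As with the dynamic-rules question above, one possible sketch (assumed names; LinkExtractor stands in for the deprecated SgmlLinkExtractor) is to build the rules from moreparams in __init__ before calling the CrawlSpider constructor, since that is where the rules get compiled.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NoFollowSpider(CrawlSpider):
    name = "nofollow"
    start_urls = ["https://example.com"]  # placeholder

    def __init__(self, moreparams=None, *args, **kwargs):
        self.moreparams = moreparams
        # Use the command-line value as the allow pattern for followed links.
        self.rules = (
            Rule(LinkExtractor(allow=(moreparams or "",)),
                 callback="parse_items", follow=True),
        )
        super().__init__(*args, **kwargs)

    def parse_items(self, response):
        yield {"url": response.url}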

[Python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice)] 1.8.2 - Installing Scrapy

情到浓时终转凉″ Submitted on 2019-12-19 03:30:23
Scrapy is a very powerful crawling framework with quite a few dependencies; at a minimum it requires Twisted 14.0, lxml 3.4, and pyOpenSSL 0.14. The libraries it depends on also differ between platforms, so before installing it is best to make sure the basic libraries are already in place. This section introduces how to install Scrapy on different platforms.

1. Related links

Official website: https://scrapy.org
Official documentation: https://docs.scrapy.org
PyPI: https://pypi.python.org/pypi/Scrapy
GitHub: https://github.com/scrapy/scrapy
Chinese documentation: http://scrapy-chs.readthedocs.io

2. Installing with Anaconda

This is a relatively simple way to install Scrapy (especially on Windows). If your Python was installed with Anaconda, or if you have not installed Python yet, you can use this method; it is simple and saves effort. Of course, if your Python was not installed through Anaconda, you can read on for the other methods.

For how to install Anaconda, see Section 1.1; it is not repeated here.

If Anaconda is already installed, you can install Scrapy with the conda command:

conda install Scrapy

3. Installation on Windows

Scraping text without javascript code using scrapy

↘锁芯ラ Submitted on 2019-12-19 03:22:40
Question: I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the target sites. The problem is: sometimes my target node contains a <script> tag, so the scraped text contains JavaScript code. Here is a link to a real example of what I'm working with. In this case my target node is //td[@id='contenuStory']. The problem is that there's a <script> tag in the first child div. I've spent a lot of time
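A minimal sketch of the usual XPath fix: select only text nodes that have no <script> ancestor. The target XPath //td[@id='contenuStory'] comes from the question; the spider name and URL are placeholders.

import scrapy

class ArticleTextSpider(scrapy.Spider):
    name = "article_text"
    start_urls = ["https://example.com/article"]  # placeholder

    def parse(self, response):
        # Text nodes inside the target cell, skipping anything inside <script>.
        texts = response.xpath(
            "//td[@id='contenuStory']//text()[not(ancestor::script)]"
        ).getall()
        yield {"text": " ".join(t.strip() for t in texts if t.strip())}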