scrapy

Start scrapy from Flask route

落花浮王杯 Submitted on 2019-12-19 08:00:55
Question: I want to build a crawler that takes the URL of a webpage to be scraped and returns the result back to a webpage. Right now I start Scrapy from the terminal and store the response in a file. How can I start the crawler when some input is posted to the Flask app, process it, and return a response?

Answer 1: You need to create a CrawlerProcess inside your Flask application and run the crawl programmatically. See the docs.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy
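Below is a minimal sketch of the approach from Answer 1: run a spider programmatically with CrawlerProcess when a URL is POSTed to a Flask route. The spider class, route name, and collected fields are illustrative assumptions, not the original poster's code. Note that Twisted's reactor can only be started once per process, so this works for a single request in a simple setup; repeated requests usually require running the crawl in a subprocess or using CrawlerRunner with crochet.

import scrapy
from scrapy.crawler import CrawlerProcess
from flask import Flask, request, jsonify

app = Flask(__name__)
results = []  # collected items (illustrative; a real app would scope this per request)

class MySpider(scrapy.Spider):
    name = "my_spider"

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        # Grab something simple from the page, here its <title>.
        results.append({"url": response.url,
                        "title": response.css("title::text").get()})

@app.route("/crawl", methods=["POST"])
def crawl():
    url = request.form["url"]
    process = CrawlerProcess(settings={"LOG_ENABLED": False})
    process.crawl(MySpider, start_url=url)
    process.start()  # blocks until the crawl finishes; the reactor cannot be restarted
    return jsonify(results)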

Recording the total time taken for running a spider in scrapy

泪湿孤枕 Submitted on 2019-12-19 07:30:07
Question: I am using Scrapy to scrape a site. I have written a spider that fetches all the items from the page and saves them to a CSV file. Now I want to save the total execution time Scrapy takes to run the spider. After the spider finishes, the terminal displays results such as start time, end time, and so on. In my program I need to calculate the total time Scrapy took to run the spider and store it somewhere.... Can anyone
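As one possible answer, here is a minimal sketch (not the asker's code; the spider name, URL, and output file are assumptions): remember the wall-clock time when the spider is created and write the elapsed time in the spider's closed() hook, which Scrapy calls when the crawl finishes.

import time
import scrapy

class TimedSpider(scrapy.Spider):
    name = "timed_spider"
    start_urls = ["https://example.com"]  # placeholder

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._started = time.time()  # wall-clock start of the spider

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

    def closed(self, reason):
        # Called automatically when the spider finishes.
        elapsed = time.time() - self._started
        with open("run_time.txt", "a") as f:
            f.write(f"{self.name}: {elapsed:.2f} seconds ({reason})\n")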

'NoneType' object has no attribute '_app_data' in scrapy\twisted\openssl

限于喜欢 Submitted on 2019-12-19 05:25:15
Question: During scraping with Scrapy, one error appears in my logs from time to time. It doesn't seem to come from anywhere in my code, and it looks like something inside twisted\openssl. Any ideas what causes this and how to get rid of it? Stack trace here:

[Launcher,27487/stderr] Error during info_callback
Traceback (most recent call last):
  File "/opt/webapps/link_crawler/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
    self._write(bytes)
  File "/opt/webapps/link

Dynamic rules based on start_urls for Scrapy CrawlSpider?

梦想与她 Submitted on 2019-12-19 05:09:04
Question: I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, follow their internal links, and scrape the contents of any external links (links whose domain differs from the original domain). I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites, I run into a problem because I don't know which "start_url" I'm currently on, so I can't adjust the rule appropriately. Here's what I came up with so far; it
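One common way to handle this, sketched below under assumed names: pass the start URL as a spider argument, derive its domain in __init__, and build the rules there before calling the CrawlSpider constructor, which is where the rules get compiled.

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExternalLinkSpider(CrawlSpider):
    name = "external_links"

    def __init__(self, start_url="https://example.com", *args, **kwargs):
        domain = urlparse(start_url).netloc
        self.start_urls = [start_url]
        self.rules = (
            # Follow internal links without scraping them.
            Rule(LinkExtractor(allow_domains=[domain]), follow=True),
            # Scrape external links (any other domain) with a callback.
            Rule(LinkExtractor(deny_domains=[domain]),
                 callback="parse_external", follow=False),
        )
        super().__init__(*args, **kwargs)  # CrawlSpider compiles self.rules here

    def parse_external(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}

Run it with, for example: scrapy runspider external_spider.py -a start_url="https://example.com"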

Why does Scrapy return an iframe?

那年仲夏 Submitted on 2019-12-19 05:08:30
Question: I want to crawl this site with Python-Scrapy. I tried this:

class Parik(scrapy.Spider):
    name = "ooshop"
    allowed_domains = ["http://www.ooshop.com/courses-en-ligne/Home.aspx"]

    def __init__(self, idcrawl=None, proxy=None, *args, **kwargs):
        super(Parik, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.ooshop.com/courses-en-ligne/Home.aspx']

    def parse(self, response):
        print response.css('body').extract_first()

but I don't get the first page; I get an empty iframe: 2016-09-06 19:09:24
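Here is a minimal sketch of the usual fix when a page only renders an iframe: extract the iframe's src attribute from the outer page and request that URL, parsing the real content in a second callback. The selectors and the yielded fields are assumptions.

import scrapy

class IframeSpider(scrapy.Spider):
    name = "iframe_example"
    start_urls = ['http://www.ooshop.com/courses-en-ligne/Home.aspx']

    def parse(self, response):
        # The outer page is just a frame; follow its src to reach the content.
        iframe_src = response.css("iframe::attr(src)").get()
        if iframe_src:
            yield response.follow(iframe_src, callback=self.parse_frame)

    def parse_frame(self, response):
        # The body of the framed page now holds the actual markup.
        yield {"body_length": len(response.css("body").get() or "")}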

I'm trying to store more than one spider's results into separate tables in MySQL

旧时模样 Submitted on 2019-12-19 04:22:47
Question: Here is my pipelines.py. I have two spiders, one called bristol.py and one called bath.py. When I run 'scrapy crawl bristol', it automatically adds the results to my MySQL database table called 'Bristol'. I want to run 'scrapy crawl bath' and store the results in the MySQL database under the table name 'Bath'. I've tried adding the exact same line of code as for the 'Bristol' table, but I receive an error. This is the code I've tried putting directly underneath the first self.cursor.execute
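One way to do this, sketched below, is to key the table name on spider.name in a single pipeline instead of hard-coding 'Bristol'. The MySQLdb driver, credentials, and column names are assumptions; adapt them to the real schema.

import MySQLdb

class MultiTablePipeline:
    # Map each spider to its table; unknown spiders fall back to their own name.
    TABLES = {"bristol": "Bristol", "bath": "Bath"}

    def open_spider(self, spider):
        self.conn = MySQLdb.connect(host="localhost", user="user",
                                    passwd="password", db="scraped")
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        table = self.TABLES.get(spider.name, spider.name)
        # Table names come from the fixed TABLES mapping; values are parameterized.
        self.cursor.execute(
            "INSERT INTO {} (title, link) VALUES (%s, %s)".format(table),
            (item.get("title"), item.get("link")),
        )
        return item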

Export scrapy items to different files

杀马特。学长 韩版系。学妹 Submitted on 2019-12-19 04:07:04
Question: I'm scraping reviews from MOOCs like this one. From there I'm getting all the course details (5 items) and another 6 items from each review itself. This is the code I have for the course details:

def parse_reviews(self, response):
    l = ItemLoader(item=MoocsItem(), response=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '/
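One way to split the output, sketched below: a pipeline that opens one CsvItemExporter per item type and routes each item by its class name. MoocsItem comes from the question; MoocsReviewItem and the file names are assumptions.

from scrapy.exporters import CsvItemExporter

class PerTypeCsvPipeline:
    def open_spider(self, spider):
        self.files = {
            "MoocsItem": open("courses.csv", "wb"),
            "MoocsReviewItem": open("reviews.csv", "wb"),
        }
        self.exporters = {name: CsvItemExporter(f)
                          for name, f in self.files.items()}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # Route the item to the exporter registered for its class.
        name = type(item).__name__
        if name in self.exporters:
            self.exporters[name].export_item(item)
        return item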

How to dynamically set Scrapy rules?

江枫思渺然 Submitted on 2019-12-19 04:04:06
Question: I have a class running some code before the init:

class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams

I am running this Scrapy code with the following command:

> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt

Now, I want the static variable named rules to
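As with the dynamic-rules question above, one possible sketch (assumed names; LinkExtractor stands in for the deprecated SgmlLinkExtractor) is to build the rules from moreparams in __init__ before calling the CrawlSpider constructor, since that is where the rules get compiled.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NoFollowSpider(CrawlSpider):
    name = "nofollow"
    start_urls = ["https://example.com"]  # placeholder

    def __init__(self, moreparams=None, *args, **kwargs):
        self.moreparams = moreparams
        # Use the command-line value as the allow pattern for followed links.
        self.rules = (
            Rule(LinkExtractor(allow=(moreparams or "",)),
                 callback="parse_items", follow=True),
        )
        super().__init__(*args, **kwargs)

    def parse_items(self, response):
        yield {"url": response.url}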

[Python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice)] 1.8.2 - Installing Scrapy

情到浓时终转凉″ Submitted on 2019-12-19 03:30:23
Scrapy is a very powerful crawling framework with quite a few dependencies; at a minimum it requires Twisted 14.0, lxml 3.4, and pyOpenSSL 0.14. The libraries it depends on also differ between platforms, so before installing it is best to make sure the basic libraries are already in place. This section introduces how to install Scrapy on different platforms.

1. Related links

Official website: https://scrapy.org
Official documentation: https://docs.scrapy.org
PyPI: https://pypi.python.org/pypi/Scrapy
GitHub: https://github.com/scrapy/scrapy
Chinese documentation: http://scrapy-chs.readthedocs.io

2. Installing with Anaconda

This is a relatively simple way to install Scrapy (especially on Windows). If your Python was installed with Anaconda, or if you have not installed Python yet, you can use this method; it is simple and saves effort. Of course, if your Python was not installed through Anaconda, you can read on for the other methods.

For how to install Anaconda, see Section 1.1; it is not repeated here.

If Anaconda is already installed, you can install Scrapy with the conda command:

conda install Scrapy

3. Installation on Windows

Scraping text without javascript code using scrapy

↘锁芯ラ Submitted on 2019-12-19 03:22:40
Question: I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the target sites. The problem is: sometimes my target node contains a <script> tag, so the scraped text contains JavaScript code. Here is a link to a real example of what I'm working with. In this case my target node is //td[@id='contenuStory']. The problem is that there's a <script> tag in the first child div. I've spent a lot of time
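A minimal sketch of the usual XPath fix: select only text nodes that have no <script> ancestor. The target XPath //td[@id='contenuStory'] comes from the question; the spider name and URL are placeholders.

import scrapy

class ArticleTextSpider(scrapy.Spider):
    name = "article_text"
    start_urls = ["https://example.com/article"]  # placeholder

    def parse(self, response):
        # Text nodes inside the target cell, skipping anything inside <script>.
        texts = response.xpath(
            "//td[@id='contenuStory']//text()[not(ancestor::script)]"
        ).getall()
        yield {"text": " ".join(t.strip() for t in texts if t.strip())}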