scrapy

Scrapy start_urls

橙三吉。 submitted on 2019-12-18 11:44:19
Question: The script (below) from this tutorial contains two start_urls.

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc
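For context (not part of the original question): Scrapy turns each entry in start_urls into an initial request and calls parse on every response, so the spider above crawls both the Books/ and Resources/ pages with the same callback. A minimal sketch of the equivalent explicit form:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    # equivalent to listing the two URLs in start_urls
    def start_requests(self):
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)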

Scrapy: Passing item between methods

三世轮回 submitted on 2019-12-18 11:08:13
Question: Suppose I have a BookItem; I need to add information to it in both the parse phase and the detail phase.

def parse(self, response):
    data = json.loads(response)
    for book in data['result']:
        item = BookItem()
        item['id'] = book['id']
        url = book['url']
        yield Request(url, callback=self.detail)

def detail(self, response):
    hxs = HtmlXPathSelector(response)
    item['price'] = ......  # I want to continue the same book item as from the for loop above

Using the code as is would lead to an undefined item in the detail
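The usual answer (a sketch, not taken from the truncated question above; the start URL, item import path, and price selector are assumptions) is to carry the item through the request's meta dict, so the detail callback keeps filling in the same object:

import json
import scrapy

from myproject.items import BookItem  # the question's item class; path hypothetical

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://example.com/books.json"]  # hypothetical

    def parse(self, response):
        data = json.loads(response.text)
        for book in data['result']:
            item = BookItem()
            item['id'] = book['id']
            # hand the partially filled item to the next callback
            yield scrapy.Request(book['url'], callback=self.detail,
                                 meta={'item': item})

    def detail(self, response):
        item = response.meta['item']  # the same BookItem created in parse
        item['price'] = response.css('.price::text').get()  # hypothetical selector
        yield item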

Python Scrapy - populate start_urls from mysql

£可爱£侵袭症+ submitted on 2019-12-18 10:57:13
Question: I am trying to populate start_urls with a SELECT from a MySQL table using spider.py. When I run "scrapy runspider spider.py" I get no output, just that it finished with no errors. I have tested the SELECT query in a python script and start_urls gets populated with the entries from the MySQL table.

spider.py:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
import MySQLdb

class ProductsSpider(BaseSpider):
    name = "Products"
    allowed_domains = ["test.com"]
    start_urls = []

    def
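A likely fix (a sketch, not the asker's code; the connection parameters and the products/url schema are assumptions): run the query in start_requests rather than trying to fill start_urls at class level, so the URLs are yielded when the crawl actually starts:

import MySQLdb
from scrapy.spider import BaseSpider
from scrapy.http import Request

class ProductsSpider(BaseSpider):
    name = "Products"
    allowed_domains = ["test.com"]

    def start_requests(self):
        # hypothetical connection parameters and table schema
        conn = MySQLdb.connect(host="localhost", user="user",
                               passwd="secret", db="mydb")
        cursor = conn.cursor()
        cursor.execute("SELECT url FROM products")
        for (url,) in cursor.fetchall():
            yield Request(url, callback=self.parse)
        conn.close()

    def parse(self, response):
        self.log("Visited %s" % response.url)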

Using Middleware to ignore duplicates in Scrapy

佐手、 submitted on 2019-12-18 10:56:34
Question: I'm a beginner in Python, and I'm using Scrapy for a personal web project. I use Scrapy to extract data from several websites repeatedly, so on every crawl I need to check whether a link is already in the database before adding it. I did this in a pipelines.py class:

class DuplicatesPipline(object):
    def process_item(self, item, spider):
        if memc2.get(item['link']) is None:
            return item
        else:
            raise DropItem('Duplication %s' % item['link'])

But I heard that using Middleware is better for this task. I
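For reference (a sketch, not an answer from the thread): the middleware version drops duplicate requests before they are even downloaded, which saves bandwidth compared to dropping items in a pipeline. Here memc2 is assumed to be the same memcached client used in the question's pipeline:

from scrapy.exceptions import IgnoreRequest

class IgnoreDuplicatesMiddleware(object):
    """Downloader middleware that skips already-seen links."""

    def process_request(self, request, spider):
        # memc2 is assumed to be the memcached client from the question
        if memc2.get(request.url) is not None:
            raise IgnoreRequest('Duplicate link: %s' % request.url)
        # returning None lets Scrapy continue downloading the request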

How to bypass cloudflare bot/ddos protection in Scrapy?

若如初见. submitted on 2019-12-18 10:44:56
Question: I used to scrape e-commerce webpages occasionally to get product price information. I had not used the scraper built with Scrapy in a while and yesterday was trying to use it - I ran into a problem with bot protection. The site is using CloudFlare's DDoS protection, which basically uses JavaScript evaluation to filter out browsers (and therefore scrapers) with JS disabled. Once the function is evaluated, the response with the calculated number is generated. In return, the service sends back two
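One commonly suggested workaround (not from the truncated question above; the library choice and target URL are assumptions) is to let the cfscrape package solve the JavaScript challenge once, then reuse the resulting clearance cookies and user agent in Scrapy requests:

import cfscrape  # pip install cfscrape; solves CloudFlare's JS challenge
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        url = 'https://www.example-shop.com/'  # hypothetical target
        # get_tokens returns the clearance cookies plus the user agent
        # that must be sent together with them
        tokens, user_agent = cfscrape.get_tokens(url)
        yield scrapy.Request(url, cookies=tokens,
                             headers={'User-Agent': user_agent},
                             callback=self.parse)

    def parse(self, response):
        pass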

Scrapy read list of URLs from file to scrape?

空扰寡人 submitted on 2019-12-18 10:24:40
Question: I've just installed Scrapy and followed their simple dmoz tutorial, which works. I just looked up basic file handling for Python and tried to get the crawler to read a list of URLs from a file, but got some errors. This is probably wrong but I gave it a shot. Would someone please show me an example of reading a list of URLs into Scrapy? Thanks in advance.

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    f = open("urls.txt")
    start
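A minimal sketch of the usual answer (assuming urls.txt holds one URL per line), using start_requests so the file is read when the crawl starts:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    def start_requests(self):
        # urls.txt is assumed to contain one URL per line
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield Request(url, callback=self.parse)

    def parse(self, response):
        self.log("Crawled %s" % response.url)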

Installing Scrapy-Python and Easy_install on Windows 7

大憨熊 submitted on 2019-12-18 09:26:17
Question: I'm trying to install Scrapy on Windows 7. I'm following these instructions: http://doc.scrapy.org/en/0.24/intro/install.html#intro-install I've downloaded and installed python-2.7.5.msi for Windows following this tutorial https://adesquared.wordpress.com/2013/07/07/setting-up-python-and-easy_install-on-windows-7/, and I set up the environment variables as mentioned, but when I try to run python in my command prompt I get this error:

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009
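The snippet is cut off before the actual error, but the usual cause in this setup (an assumption) is that the Python directory is missing from PATH, so cmd reports that 'python' is not recognized as an internal or external command. A sketch of the typical fix for a default 2.7 install:

rem make python.exe and easy_install.exe visible for the current session
set PATH=%PATH%;C:\Python27;C:\Python27\Scripts

rem verify that both commands now resolve
python --version
easy_install --version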

How to run several versions of one single spider at one time with Scrapy?

烂漫一生 submitted on 2019-12-18 09:22:56
Question: My problem is the following: to save time, I would like to run several versions of one single spider. The process (parsing definitions) is the same, the items are the same, and the collection in the database is the same. What changes is the start_url variable. It looks like this:

"https://www.website.com/details/{0}-{1}-{2}/{3}/meeting".format(year, month, day, type_of_meeting)

Considering the date is the same, for instance 2018-10-24, I would like to launch two versions at the same time:
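The standard way to do this (a sketch; the spider name and argument values are assumptions) is to pass the varying parts as spider arguments with -a, then launch one crawl process per variant:

import scrapy

class MeetingSpider(scrapy.Spider):
    name = 'meeting'

    def __init__(self, year=None, month=None, day=None,
                 type_of_meeting=None, *args, **kwargs):
        super(MeetingSpider, self).__init__(*args, **kwargs)
        # build the start URL from the command-line arguments
        self.start_urls = [
            "https://www.website.com/details/{0}-{1}-{2}/{3}/meeting".format(
                year, month, day, type_of_meeting)
        ]

    def parse(self, response):
        pass

Two versions can then run in parallel from two shells, e.g. scrapy crawl meeting -a year=2018 -a month=10 -a day=24 -a type_of_meeting=A (the type value here is hypothetical).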

How to get scraped items into a PyQt5 widget?

ぃ、小莉子 submitted on 2019-12-18 07:21:20
Question: I'm trying to make a simple GUI for a Scrapy crawler, where the user can push a Start button to run the scraping and see the scraped results in a textBrowser (or another Qt widget, please advise). My spider:

import scrapy, json

class CarSpider(scrapy.Spider):
    name = 'car'
    start_urls = ["https://www.target-website.com/"]

    def parse(self, response):
        """some code"""
        yield scrapy.Request(url=url, callback=self.parse_page)

    def parse_page(self, response):
        items = json.loads(response.body_as_unicode())['items']
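One way to bridge the two (a sketch under assumptions; none of this comes from the truncated question) is to collect items through Scrapy's item_scraped signal and update the Qt widget once the blocking crawl has finished:

from scrapy import signals
from scrapy.crawler import CrawlerProcess

def run_spider(text_browser):
    """Run CarSpider and append each scraped item to a QTextBrowser."""
    scraped = []

    def on_item(item, response, spider):
        scraped.append(item)

    process = CrawlerProcess()
    crawler = process.create_crawler(CarSpider)
    # item_scraped fires once per item that passes the pipelines
    crawler.signals.connect(on_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    for item in scraped:
        text_browser.append(str(item))

Because process.start() blocks the Qt event loop, a real GUI would run the crawl in a separate process (e.g. via multiprocessing) and stream results back to the widget.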

Scrapy in Practice (Part 6): Scraping image data and images from 360 Images

梦想与她 submitted on 2019-12-18 07:07:53
In this article we take 360 Images as an example to introduce the use of the Scrapy framework and the downloading of image data.

Target site: http://images.so.com/z?ch=photography

Approach: the target site loads its data via Ajax, so we request the data by constructing the target URLs ourselves, store the image files locally, and store each image's attributes in MongoDB.

1. First, define the fields we need to scrape:

import scrapy
from scrapy import Field

class ImageItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = 'images'  # the MongoDB collection name
    # the four fields below are the image id, link, title and thumbnail
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()

2. Construct the URLs we want to crawl. Since the target site is Ajax-loaded, the displayed data is served as JSON from http://images.so.com/zj?ch=photography&sn=30&listtype=new&temp=1. As you keep scrolling down the page, the only parameter that changes on each request is sn, which grows in increments of 30: sn=30 for the first page, 60 for the second, so the relationship is sn = page * 30. From this we can construct the url
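A minimal sketch of step 2 (the page count and spider name are assumptions; the URL pattern is the one derived above):

from urllib.parse import urlencode
import scrapy

class ImagesSpider(scrapy.Spider):
    name = 'images'
    MAX_PAGE = 50  # hypothetical number of pages to fetch

    def start_requests(self):
        base_url = 'http://images.so.com/zj?'
        params = {'ch': 'photography', 'listtype': 'new', 'temp': 1}
        for page in range(1, self.MAX_PAGE + 1):
            params['sn'] = page * 30  # sn = page * 30, as derived above
            yield scrapy.Request(base_url + urlencode(params),
                                 callback=self.parse)

    def parse(self, response):
        pass  # parse the JSON response and populate ImageItem here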