scrapy

Scrapy LinkExtractor - Limit the number of pages crawled per URL

Submitted by 天涯浪子 on 2019-12-23 20:24:20
Question: I am trying to limit the number of pages crawled per URL in a CrawlSpider in Scrapy. I have a list of start_urls and I want to set a limit on the number of pages crawled for each URL. Once the limit is reached, the spider should move on to the next start_url. I know there is the DEPTH_LIMIT setting, but that is not what I am looking for. Any help will be useful. Here is the code I currently have:

    class MySpider(CrawlSpider):
        name = 'test'
        allowed_domains = domainvarwebsite
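One way to do this (a sketch of an assumed approach, not the asker's accepted solution; it relies on Rule's process_request receiving both request and response, which requires Scrapy 2.0+) is to count responses per domain and drop further requests once a domain's budget is spent. The limit, spider name, and URLs below are illustrative:

    from urllib.parse import urlparse

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    MAX_PAGES_PER_DOMAIN = 50  # hypothetical per-URL budget

    class LimitedSpider(CrawlSpider):
        name = 'limited'
        start_urls = ['https://example.com/', 'https://example.org/']
        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True,
                 process_request='cap_requests'),
        )
        page_counts = {}

        def cap_requests(self, request, response):
            # Returning None drops the request once the budget is spent.
            domain = urlparse(request.url).netloc
            if self.page_counts.get(domain, 0) >= MAX_PAGES_PER_DOMAIN:
                return None
            return request

        def parse_item(self, response):
            # Count each crawled page against its domain's budget.
            domain = urlparse(response.url).netloc
            self.page_counts[domain] = self.page_counts.get(domain, 0) + 1
            yield {'url': response.url}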

Crawling dynamic content with scrapy

Submitted by 爷,独闯天下 on 2019-12-23 20:16:17
Question: I am trying to get the latest reviews from the Google Play store. I'm following this question for getting the latest reviews here. The method specified in the linked answer works fine in the scrapy shell, but when I try it in my crawler it gets completely ignored. Code snippet:

    import re
    import sys
    import time
    import urllib
    import urlparse
    from scrapy import Spider
    from scrapy.spider import BaseSpider
    from scrapy.http import Request, FormRequest
    from scrapy.contrib.spiders import CrawlSpider, Rule
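When a POST that works in the shell is silently dropped inside a spider, the usual suspects are the off-site and duplicate filters. A minimal sketch of issuing such a request from start_requests with the filters bypassed; the endpoint and form fields here are placeholders, not a verified Google Play API:

    from scrapy import Spider
    from scrapy.http import FormRequest

    class PlayReviewsSpider(Spider):
        name = 'play_reviews'

        def start_requests(self):
            # Placeholder endpoint/fields; reuse whatever worked in the
            # scrapy shell and verify the payload there first.
            yield FormRequest(
                'https://play.google.com/store/getreviews',
                formdata={'id': 'com.example.app', 'pageNum': '0', 'xhr': '1'},
                callback=self.parse_reviews,
                dont_filter=True,  # bypass off-site/duplicate filtering
            )

        def parse_reviews(self, response):
            self.logger.info('review payload: %d bytes', len(response.body))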

converting scrapy to lxml

Submitted by 家住魔仙堡 on 2019-12-23 20:12:20
Question: I have scrapy code that looks like this:

    for row in response.css("div#flexBox_flex_calendar_mainCal table tr.calendar_row"):
        print "================"
        print row.xpath(".//td[@class='time']/text()").extract()
        print row.xpath(".//td[@class='currency']/text()").extract()
        print row.xpath(".//td[@class='impact']/span/@title").extract()
        print row.xpath(".//td[@class='event']/span/text()").extract()
        print row.xpath(".//td[@class='actual']/text()").extract()
        print row.xpath(".//td[@class='forecast']
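The same extraction can be reproduced with plain lxml (a sketch, not the asker's final code; the request URL is a placeholder):

    import requests
    from lxml import html

    page = requests.get('https://example.com/calendar')  # placeholder URL
    tree = html.fromstring(page.content)

    rows = tree.xpath("//div[@id='flexBox_flex_calendar_mainCal']"
                      "//table//tr[contains(@class, 'calendar_row')]")
    for row in rows:
        print("================")
        # lxml's .xpath() returns plain lists, so no .extract() is needed
        print(row.xpath(".//td[@class='time']/text()"))
        print(row.xpath(".//td[@class='currency']/text()"))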

scrapy “Missing scheme in request url”

Submitted by 允我心安 on 2019-12-23 18:15:50
Question: Here's my code below:

    import scrapy
    from scrapy.http import Request

    class lyricsFetch(scrapy.Spider):
        name = "lyricsFetch"
        allowed_domains = ["metrolyrics.com"]

        print "\nEnter the name of the ARTIST of the song for which you want the lyrics for. Minimise the spelling mistakes, if possible."
        artist_name = raw_input('>')
        print "\nNow comes the main part. Enter the NAME of the song itself now. Again, try not to have any spelling mistakes."
        song_name = raw_input('>')
        artist_name = artist_name
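"Missing scheme in request url" almost always means the URL handed to Scrapy lacks an http:// or https:// prefix. A minimal sketch of building the start URL with an explicit scheme (the URL pattern and inputs are assumptions, not the site's real structure):

    import scrapy

    class LyricsSpider(scrapy.Spider):
        name = 'lyricsFetch'
        allowed_domains = ['metrolyrics.com']

        def start_requests(self):
            artist, song = 'adele', 'hello'  # hypothetical user input
            # The explicit scheme is what avoids the "Missing scheme" error.
            url = 'http://www.metrolyrics.com/{}-lyrics-{}.html'.format(song, artist)
            yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            yield {'title': response.css('title::text').get()}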

scrapy - item loader - default processors

Submitted by 那年仲夏 on 2019-12-23 18:11:01
Question: I'm new to Python and Scrapy, so I apologise in advance for any silly questions. I have some trouble with the item loader's default processors, and related questions. I use the default_input_processor variable to extract the first value from a list using the TakeFirst() processor, like this:

    class CaseLoader(scrapy.loader.ItemLoader):
        default_input_processor = TakeFirst()

and usage:

    def load_row_data(self, row):
        cl = CaseLoader(CaseItem(), row)
        cl.add_xpath('case_num', './/td[1]/a/text()')
        cl.add_xpath('case
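A likely source of confusion (a sketch, assuming 2019-era Scrapy where the processors live in scrapy.loader.processors): input processors run on each extracted value as it arrives, while output processors collapse the collected list into the final field value, so TakeFirst() normally belongs on the output side:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst

    class CaseItem(scrapy.Item):
        case_num = scrapy.Field()

    class CaseLoader(ItemLoader):
        default_item_class = CaseItem
        # Applied to every extracted value on the way in.
        default_input_processor = MapCompose(str.strip)
        # Collapses the collected list to one value on load_item().
        default_output_processor = TakeFirst()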

Scrapy on arabic letters returns something strange

Submitted by 血红的双手。 on 2019-12-23 17:27:56
Question: I am using scrapy on Arabic and English letters. The English letters work perfectly; however, the Arabic letters come out like this: gs300 2006 \u0644\u0643\u0632\u0633 \u062c\u064a. Any help, please? I am using Python with scrapy 0.20.2. The way I extract data is:

    site.xpath('my selector').extract()

and I run the JSON export from the command line like this:

    scrapy crawl dmoz -o items.json -t json

Answer 1: The strings \u0000 are Unicode code points. Each represents a single character (e.g. \u064a
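These escapes are valid JSON and decode back to the original Arabic text, so the data itself is intact; a quick sketch:

    import json

    raw = '"gs300 2006 \\u0644\\u0643\\u0632\\u0633 \\u062c\\u064a"'
    print(json.loads(raw))  # -> gs300 2006 لكزس جي

    # Newer Scrapy releases (1.2+, long after the asker's 0.20.2) can also
    # write the characters literally via FEED_EXPORT_ENCODING = 'utf-8'.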

Scrapy 1.8.0: Image Downloader

Submitted by 元气小坏坏 on 2019-12-23 17:15:10
Create the spider from a template:

    # global:
    scrapy startproject BMW
    # inside the project (note: the CrawlSpider template is "crawl", not "crwal"):
    scrapy genspider -t crawl bmw_spider "car.autohome.com.cn"

items.py:

    class BmwItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        category = scrapy.Field()
        # The two fields below are required: image_urls holds an iterable
        # of image URLs, and images stores the downloaded image info.
        image_urls = scrapy.Field()
        images = scrapy.Field()

spider.py:

    class BmwSpiderSpider(CrawlSpider):
        name = 'bmw_spider'
        allowed_domains = ['car.autohome.com.cn']
        start_urls = ['https://car.autohome.com.cn/pic/series/2139.html']
        rules = (
            Rule(LinkExtractor(
                allow=r
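The image_urls/images fields above only take effect once Scrapy's built-in images pipeline is enabled; a minimal settings sketch (the storage path is an assumption):

    # settings.py
    ITEM_PIPELINES = {
        'scrapy.pipelines.images.ImagesPipeline': 1,
    }
    IMAGES_STORE = './images'  # directory where downloaded files are saved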

What is the correct way to work with cookies in scrapy

Submitted by 三世轮回 on 2019-12-23 17:09:12
Question: I'm a newbie. I am working with scrapy on a site that uses cookies, and this is a problem for me: I can get data from a site without cookies, but getting data from a site with cookies is difficult for me. I have this code structure:

    class mySpider(BaseSpider):
        name = 'data'
        allowed_domains = []
        start_urls = ["http://...."]

        def parse(self, response):
            sel = HtmlXPathSelector(response)
            items = sel.xpath('//*[@id=..............')
            vlrs = []
            for item in items:
                myItem['img'] = item.xpath('....')
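One common pattern (a sketch of an assumed approach, not the asker's solution): pass the cookies explicitly on the first Request; Scrapy's cookies middleware then keeps the session for all follow-up requests. The URL, cookie, and XPaths are placeholders:

    import scrapy

    class CookieSpider(scrapy.Spider):
        name = 'data'

        def start_requests(self):
            yield scrapy.Request(
                'http://example.com/',            # placeholder URL
                cookies={'sessionid': 'abc123'},  # placeholder cookie
                callback=self.parse,
            )

        def parse(self, response):
            # Later requests reuse the session cookie automatically.
            for item in response.xpath('//div[@class="item"]'):
                yield {'img': item.xpath('.//img/@src').get()}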

Proper way to run multiple scrapy spiders

Submitted by 为君一笑 on 2019-12-23 16:56:46
Question: I just tried running multiple spiders in the same process using the new scrapy documentation, but I am getting:

    AttributeError: 'CrawlerProcess' object has no attribute 'crawl'

I found this SO post with the same problem, so I tried using the code from the 0.24 documentation and got:

    runspider: error: Unable to load 'price_comparator.py': No module named testspiders.spiders.followall

For 1.0 I imported:

    from scrapy.crawler import CrawlerProcess

and for 0.24 I imported:

    from twisted.internet
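For reference, a minimal self-contained sketch of the Scrapy 1.0+ pattern (the AttributeError above usually means the installed Scrapy is older than the docs being followed; the two spiders are placeholders):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class Spider1(scrapy.Spider):
        name = 'spider1'
        start_urls = ['https://example.com/']

        def parse(self, response):
            yield {'url': response.url}

    class Spider2(scrapy.Spider):
        name = 'spider2'
        start_urls = ['https://example.org/']

        def parse(self, response):
            yield {'url': response.url}

    process = CrawlerProcess()
    process.crawl(Spider1)
    process.crawl(Spider2)
    process.start()  # blocks here until both spiders finish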

Python/Scrapy: scraping a company's job postings and job details

Submitted by 大城市里の小女人 on 2019-12-23 16:34:13
1. Define the fields to scrape in items.py:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy

    class GosuncnItem(scrapy.Item):
        """Fields scraped by the spider."""
        # define the fields for your item here like:
        # name = scrapy.Field()
        platform = scrapy.Field()
        position = scrapy.Field()
        num = scrapy.Field()
        time = scrapy.Field()
        url = scrapy.Field()
        content = scrapy.Field()
        responsible = scrapy.Field()
        page = scrapy.Field()

2. Configure settings.py:

    # -*- coding: utf-8 -*-

    # Scrapy settings for gosuncn project
    #
    # For
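A short sketch (an assumption, not part of the original excerpt) of how a spider callback might fill these fields:

    # Inside the spider class; the XPath and platform label are guesses.
    def parse_job(self, response):
        item = GosuncnItem()
        item['platform'] = 'gosuncn'
        item['position'] = response.xpath('//h1/text()').get()
        item['url'] = response.url
        yield item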