scrapy

Running multiple spiders with the Scrapy framework

别说谁变了你拦得住时间么 submitted on 2021-02-04 19:29:42
With the following in run.py you can only run a single spider; to run several spiders in one go, a different approach is needed.

from scrapy import cmdline

"""Run a single spider."""
cmdline.execute('scrapy crawl runoob'.split())

Step 1: define a custom command:

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings

"""Run multiple spiders."""
class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()

Step 2: add the following to settings.py: #
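The truncated Step 2 presumably points Scrapy at the package that holds the custom command. A minimal sketch of how the pieces are usually wired together, assuming the Command class above is saved as myproject/commands/crawlall.py with an empty __init__.py beside it; the project name and file name are placeholders, not taken from the original post:

# settings.py: register the package containing the custom command
COMMANDS_MODULE = 'myproject.commands'

# run.py: invoke the new command, which starts every spider in the project
from scrapy import cmdline
cmdline.execute('scrapy crawlall'.split())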

Returning Items in scrapy's start_requests()

五迷三道 submitted on 2021-02-04 18:59:50
Question: I am writing a Scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method. Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which is forbidden by Scrapy. How can I circumvent this? I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects,
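The middleware idea the question ends on is workable, because a downloader middleware whose process_request() returns a Response makes Scrapy skip the actual download and hand that response to the request's callback. A minimal sketch under that assumption; the middleware name, the 'preclassified' meta key, and the placeholder URLs are illustrative, not from the question:

import scrapy
from scrapy.http import HtmlResponse

class PreclassifiedMiddleware:
    """Downloader middleware: skip the download for requests flagged in meta.

    Enable it under DOWNLOADER_MIDDLEWARES in settings.py.
    """

    def process_request(self, request, spider):
        if request.meta.get('preclassified'):
            # Returning a Response here means Scrapy never fetches the URL;
            # the empty response goes straight to the request's callback.
            return HtmlResponse(url=request.url, body=b'', encoding='utf-8', request=request)
        return None  # everything else is downloaded as usual

class ClassifyingSpider(scrapy.Spider):
    name = 'classifier'  # illustrative

    def start_requests(self):
        for url, category in [('http://example.com/a', 'news')]:  # placeholder input
            if category is not None:
                # Already classified: route through the short-circuit above.
                yield scrapy.Request(url, callback=self.emit_item,
                                     meta={'preclassified': True, 'category': category})
            else:
                yield scrapy.Request(url, callback=self.parse)

    def emit_item(self, response):
        yield {'url': response.url, 'category': response.meta['category']}

    def parse(self, response):
        # Normal case: classify from the downloaded page (placeholder logic).
        yield {'url': response.url, 'category': 'unknown'}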

Web crawlers: setting a proxy in the Scrapy framework

泪湿孤枕 submitted on 2021-01-31 05:30:48
Preliminaries: a quick look at os.environ. os.environ exposes the environment variables of the current process, and only the current process: if one program sets an environment variable, another program cannot see it. The environment is held as a dictionary, so values can be read and set with the usual dict operations.

Common os.environ keys:

Windows:
os.environ['HOMEPATH']: the current user's home directory.
os.environ['TEMP']: the temporary directory path.
os.environ['PATHEXT']: executable file extensions.
os.environ['SYSTEMROOT']: the system root directory.
os.environ['LOGONSERVER']: the machine name.
os.environ['PROMPT']: the prompt string.

Linux:
os.environ['USER']: the current user.
os.environ['LC_COLLATE']: the collation order used when sorting the results of path expansion.
os.environ['SHELL']: the shell in use.
os.environ['LANG']: the language in use.
os.environ['SSH_AUTH_SOCK']: the ssh agent socket path.

The built-in approach. Principle: Scrapy already ships with a mechanism for setting a proxy. It reads the proxy from the environment variables and then uses it, so all we need to do is put the proxy into the environment as key-value pairs before the program runs.
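A minimal sketch of that built-in route, assuming Scrapy's stock HttpProxyMiddleware is enabled (it is by default) and picks the proxy up from the standard http_proxy/https_proxy variables; the proxy address and spider name are placeholders:

import os

# Set the proxy before the crawler starts so the proxy middleware sees it.
os.environ['http_proxy'] = 'http://127.0.0.1:8888'
os.environ['https_proxy'] = 'http://127.0.0.1:8888'

from scrapy import cmdline
cmdline.execute('scrapy crawl myspider'.split())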

scrapy - can't scrape multiple tables at once

我怕爱的太早我们不能终老 submitted on 2021-01-29 18:25:52
Question: So I am trying to scrape a website with many tables. The problem is that when I use those two for loops, every month and year does get scraped as it should, but the data from different months and years gets mixed together instead of coming out in the order the loops define. Any idea how to solve this problem?

import scrapy
from ..items import RenItem
from scrapy.utils.response import open_in_browser
from scrapy.http import FormRequest

class ScrapeTableSpider(scrapy.Spider):
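Because Scrapy issues those requests concurrently, the responses come back in whatever order the server answers, not in loop order. A common fix is to tag each request with its year and month via meta so every scraped row stays attached to its source table, and to sort or split the output afterwards. A hedged sketch with placeholder URL, form field names, and selectors rather than the asker's real ones:

import scrapy
from scrapy.http import FormRequest

class OrderedTableSpider(scrapy.Spider):
    name = 'ordered_tables'                      # illustrative
    start_urls = ['https://example.com/report']  # placeholder URL

    def parse(self, response):
        for year in range(2018, 2021):           # placeholder ranges
            for month in range(1, 13):
                yield FormRequest.from_response(
                    response,
                    formdata={'year': str(year), 'month': str(month)},  # placeholder fields
                    callback=self.parse_table,
                    meta={'year': year, 'month': month},
                )

    def parse_table(self, response):
        # Each row carries its year/month, so arrival order no longer matters.
        for row in response.css('table tr'):
            yield {
                'year': response.meta['year'],
                'month': response.meta['month'],
                'cells': row.css('td::text').getall(),
            }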

Scrapy run multiple spiders from a script

雨燕双飞 submitted on 2021-01-29 15:53:31
Question: Hey, following question: I have a script I want to start Scrapy spiders from. For that I used a solution from another Stack Overflow post to integrate the settings, so I don't have to override them manually. So far I am able to start two crawlers from outside the Scrapy project:

from scrapy_bots.update_Database.update_Database.spiders.m import M
from scrapy_bots.update_Database.update_Database.spiders.p import P
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project
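For reference, a minimal sketch of the usual shape of such a script, assuming M and P are the two spider classes imported above and that the project settings should apply to both crawls:

from scrapy_bots.update_Database.update_Database.spiders.m import M
from scrapy_bots.update_Database.update_Database.spiders.p import P
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(M)   # first spider class
process.crawl(P)   # second spider class
process.start()    # blocks until both crawls have finished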

CsvItemExporter for multiple files in custom item pipeline not exporting all items

≯℡__Kan透↙ submitted on 2021-01-29 15:24:24
Question: I have created an item pipeline as an answer to this question. It is supposed to create a new file for every page, according to the page_no value set in the item. This works mostly fine. The problem is with the last CSV file generated by the pipeline/item exporter, page-10.csv. The last 10 values are not exported, so the file stays empty. What could be the reason for this behaviour?

pipelines.py:

from scrapy.exporters import CsvItemExporter

class PerFilenameExportPipeline:
    """Distribute items
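The symptom (only the final file missing its rows) usually points to exporters that are never finished, so their buffered rows are not flushed when the spider closes. A hedged sketch of a per-page pipeline that closes every exporter in close_spider; it follows the question's page_no and page-N.csv naming but is not the asker's actual code:

from scrapy.exporters import CsvItemExporter

class PerFilenameExportPipeline:
    """Write each item to page-<page_no>.csv, one exporter per page."""

    def open_spider(self, spider):
        self.exporters = {}  # page_no -> (exporter, file handle)

    def _exporter_for(self, page_no):
        if page_no not in self.exporters:
            f = open(f'page-{page_no}.csv', 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.exporters[page_no] = (exporter, f)
        return self.exporters[page_no][0]

    def process_item(self, item, spider):
        self._exporter_for(item['page_no']).export_item(item)
        return item

    def close_spider(self, spider):
        # Finishing every exporter flushes buffered rows, including the last page's.
        for exporter, f in self.exporters.values():
            exporter.finish_exporting()
            f.close()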

Scrapy yield items from multiple requests

本小妞迷上赌 submitted on 2021-01-29 15:05:14
Question: I am trying to yield items from different requests, as shown here. If I add items = PrintersItem() to each request, I get endless loops; if I take it out, other errors occur. I am not sure how to combine yield request with yield items for each.

import scrapy
from scrapy.http import Request, FormRequest
from ..items import PrintersItem
from scrapy.utils.response import open_in_browser

class PrinterSpider(scrapy.Spider):
    name = 'printers'
    start_urls = ['http://192.168.137.9', 'http://192.168.137.35',
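One common way out of this situation is to create the item once per chain of requests, pass it along via cb_kwargs (available since Scrapy 1.7), and yield it only from the final callback. A sketch under that assumption; the follow-up path, item fields, and CSS selectors are illustrative, not from the question:

import scrapy
from scrapy.http import Request
from ..items import PrintersItem

class PrinterSpider(scrapy.Spider):
    name = 'printers'
    start_urls = ['http://192.168.137.9', 'http://192.168.137.35']

    def parse(self, response):
        item = PrintersItem()
        item['status'] = response.css('title::text').get()   # illustrative field
        # Hand the half-filled item to the next request instead of yielding it now.
        yield Request(response.urljoin('/info.html'),         # illustrative path
                      callback=self.parse_info,
                      cb_kwargs={'item': item})

    def parse_info(self, response, item):
        item['model'] = response.css('h1::text').get()        # illustrative field
        yield item  # yield exactly once, at the end of the request chain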

django admin scrapy with UI

孤者浪人 submitted on 2021-01-29 14:49:45
Question: I'm new to this and struggling to find a way to integrate Scrapy into my Django admin, meaning:

- being able to trigger a Scrapy spider from my Django admin.

I have already created the spider to collect data, persisted the data into MongoDB (so no worries about performance w.r.t. using a wrapper), and got the Django admin interface to read that data and display it in the admin. Now I would like to add a feature where I can see my spider and trigger it with the click of a button. OR
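One lightweight way to get that button is a Django admin action (Django 3.2+ for the admin.action decorator) that shells out to the scrapy CLI. A hedged sketch; the CrawlJob model, its spider_name field, and the project path are hypothetical stand-ins, and production setups more often go through scrapyd or a task queue instead:

import subprocess
from django.contrib import admin
from .models import CrawlJob  # hypothetical model with a spider_name field

@admin.register(CrawlJob)
class CrawlJobAdmin(admin.ModelAdmin):
    actions = ['run_spider']

    @admin.action(description='Run the scraper for the selected jobs')
    def run_spider(self, request, queryset):
        for job in queryset:
            # Launch the crawl without blocking the admin request.
            subprocess.Popen(
                ['scrapy', 'crawl', job.spider_name],
                cwd='/path/to/scrapy/project',  # placeholder path
            )
        self.message_user(request, 'Spider(s) started.')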

Scrapy FormRequest login to imdb involving javascript form field

半城伤御伤魂 submitted on 2021-01-29 14:46:36
Question: I have an IMDb lists URL that I want to parse; I refer to it as base_url. I have searched a lot online but couldn't find anyone who managed to log in to IMDb, probably because of the almost 10 fields required in the FormRequest formdata, or some other complexity. I need to log in to IMDb before parsing, and that is not working at all. I understand, and strongly suspect, that there are multiple errors in this code that will pop up once the currently active error is fixed, so please be patient with how this matter is
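For context, the usual starting point for form logins in Scrapy is FormRequest.from_response(), which copies every hidden field already present in the login form's HTML so only the visible credentials need to be filled in; fields injected purely by JavaScript still have to be added by hand. A generic, hedged sketch; the URL, field names, and success check are placeholders and not IMDb's actual sign-in flow:

import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['https://example.com/login']  # placeholder, not the real IMDb URL

    def parse(self, response):
        # from_response() pre-fills the hidden inputs found in the form markup.
        yield FormRequest.from_response(
            response,
            formdata={'email': 'user@example.com', 'password': 'secret'},  # placeholders
            callback=self.after_login,
        )

    def after_login(self, response):
        if b'Sign out' in response.body:  # crude, placeholder success check
            self.logger.info('Logged in, continuing to the lists page')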

How to scrape the dynamic table data

ぐ巨炮叔叔 submitted on 2021-01-29 14:11:04
Question: I want to scrape the table data from http://5000best.com/websites/. The table content is paginated across several pages and is dynamic. I want to scrape the table data for each category. I can scrape the table manually for each category, but that is not what I want. Please have a look and suggest an approach. I am able to build the links for each category, i.e. http://5000best.com/websites/Movies/, http://5000best.com/websites/Games/, etc. But I am not sure how to make it further to
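When a table like this is filled in by JavaScript, the usual approach is to open the browser's dev tools (Network tab), find the request the page makes for each table page, and have the spider call that endpoint directly for every category and page number. A hedged sketch of that pattern; the URL template, page range, and selectors below are assumptions that would need to be confirmed against the site's real endpoint:

import scrapy

class DynamicTableSpider(scrapy.Spider):
    name = 'dynamic_table'
    categories = ['Movies', 'Games']  # extend with the remaining categories

    def start_requests(self):
        for cat in self.categories:
            for page in range(1, 11):  # assumed number of pages per category
                yield scrapy.Request(
                    f'http://5000best.com/websites/{cat}/?page={page}',  # assumed pattern
                    callback=self.parse_table,
                    meta={'category': cat, 'page': page},
                )

    def parse_table(self, response):
        for row in response.css('table tr'):
            yield {
                'category': response.meta['category'],
                'page': response.meta['page'],
                'cells': row.css('td::text').getall(),
            }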