scrapy

Crawling Bilibili danmu (bullet comments) to build a word cloud

╄→гoц情女王★ submitted on 2021-01-12 20:36:07
Crawl the danmu (bullet comments) from Bilibili: http://comment.bilibili.com/6315651.xml. You need to know the cid; you can open F12, refresh with F5, find the cid, and then splice it into the URL. You can also write code that parses the response to get the cid and then builds the URL. Either requests or urllib works; I use requests to request this link and fetch the XML file. Code to fetch the XML: def get_data(): res = requests.get('http://comment.bilibili.com/6315651.xml') res.encoding = 'utf8' with open('gugongdanmu.xml', 'a', encoding='utf8') as f: f.writelines(res.text) Parse the XML: def analyze_xml(): f1 = open("gugongdanmu.xml", "r", encoding='utf8') f2 = open("tanmu2.txt", "w", encoding='utf8') count = 0 # a regex to strip the extra XML tags dr = re.compile(r'<[^>]+>', re.S) while 1: line = f1.readline() if
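As a continuation of what the excerpt describes, here is a minimal sketch of the whole pipeline (fetch the XML, pull out the comment text, build the word cloud). It assumes the comment XML keeps each bullet comment in a `<d>` element and that requests, jieba, and wordcloud are installed; the font path is a placeholder you would point at a local CJK font.

```python
import xml.etree.ElementTree as ET

import jieba
import requests
from wordcloud import WordCloud


def fetch_danmu(url="http://comment.bilibili.com/6315651.xml"):
    """Download the comment XML and return the bullet-comment strings."""
    res = requests.get(url)
    # Each bullet comment sits in a <d> element (assumption about the layout).
    root = ET.fromstring(res.content)
    return [d.text for d in root.iter("d") if d.text]


def build_wordcloud(lines, out_path="danmu_wordcloud.png"):
    """Segment the Chinese text with jieba and render a word cloud image."""
    words = jieba.lcut(" ".join(lines))
    wc = WordCloud(font_path="simhei.ttf",  # a CJK font file is required
                   width=800, height=400, background_color="white")
    wc.generate(" ".join(words))
    wc.to_file(out_path)


if __name__ == "__main__":
    build_wordcloud(fetch_danmu())
```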

Python 3.7 Scrapy installation (Windows)

大憨熊 submitted on 2021-01-10 17:01:12
This article has two parts: the first, larger part walks through installing Scrapy manually on Windows, and at the end readers who are new to programming, or who simply don't want to install things by hand, are pointed to the Scrapy Chinese site, whose recommended Anaconda installation of Scrapy just works. The do-it-yourself, roll-up-your-sleeves route: Scrapy depends on quite a few libraries. Before installing it, make sure the following are already installed: wheel, lxml, pyOpenSSL, Twisted, and pywin32; install any that are missing first, then install Scrapy. Installing wheel. Purpose: pip is convenient, but installation sometimes fails. wheel and egg are both packaging formats that support installation without a compile or build step, and wheel is now considered Python's standard binary packaging format. Install command: pip install wheel. Note: if you have just installed Python and have never installed wheel, you can run the command above directly; if your pip is too old, update it first with python -m pip install --upgrade pip, then run the install command. Installing lxml. Purpose: a Python parsing library that supports HTML and XML parsing and the XPath syntax, with very high parsing efficiency. Install command: pip install lxml. Installing zope.interface. Purpose: Python itself does not ship an implementation of interfaces
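Once the libraries above are installed, a quick sanity check (my addition, not part of the original article) is to confirm that each of them imports cleanly; the pywin32 check only makes sense on Windows.

```python
# Try to import each dependency Scrapy needs on Windows and report the result;
# any "missing" line points at the package that still needs to be installed.
import importlib

for name in ("wheel", "lxml", "OpenSSL", "twisted", "win32api",
             "zope.interface", "scrapy"):
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError as exc:
        print(f"{name}: missing ({exc})")
```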

Python scraping framework: the architecture of Scrapy

强颜欢笑 submitted on 2021-01-07 07:26:43
I have been learning Python recently, and along with it how to scrape data with Python, which is how I discovered the very popular Python scraping framework Scrapy. Let's go through Scrapy's architecture so we can use the tool more effectively. 1. Overview. The diagram below shows Scrapy's overall architecture, including its main components and the data flow through the system (the green arrows); each component and the data processing flow are explained one by one below. 2. Components. (1) Scrapy Engine: the engine controls the data flow through the whole system and triggers events as processing happens; see the data flow description below for details. (2) Scheduler: the scheduler accepts requests from the engine, enqueues them, and hands them back when the engine asks for them. (3) Downloader: the downloader's main job is to fetch web pages and return the page content to the spiders. (4) Spiders: spiders are classes written by the Scrapy user to parse pages and extract content from the responses of specified URLs; each spider can handle one domain or a group of domains. In other words, spiders define the crawling and parsing rules for a particular site. A spider's full crawl cycle looks like this: it starts with an initial request for the first URL and a callback to invoke when the response comes back; the first requests are made by calling start_requests(), which by default generates requests from the URLs in start_urls and uses the parse method as the callback. In the callback
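A minimal spider makes the cycle concrete: the engine calls start_requests(), which by default turns start_urls into requests with parse as the callback, and each callback can yield items plus new requests that go back through the scheduler. The site and selectors below are the standard tutorial placeholders, not something taken from this post.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Items extracted here flow on to the item pipeline.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # New requests go back to the scheduler; the downloader fetches them
        # and the engine calls parse() again with each response.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```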

Why does scrapy crawler only work once in flask app?

巧了我就是萌 submitted on 2021-01-07 02:36:54
Question: I am currently working on a Flask app. The app takes a url from the user and then crawls that website and returns the links found in that website. This is what my code looks like: from flask import Flask, render_template, request, redirect, url_for, session, make_response from flask_executor import Executor from scrapy.spiders import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor from scrapy.crawler import CrawlerProcess from urllib.parse import urlparse from uuid import
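The usual culprit behind a crawl that only works on the first request is that Twisted's reactor cannot be restarted inside a long-running process. A common workaround, sketched below under that assumption (this is not the asker's code, and LinkSpider is a made-up stand-in), is to run each crawl in a fresh child process so every Flask request gets its own reactor.

```python
from multiprocessing import Process, Queue

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import Spider


class LinkSpider(Spider):
    name = "links"

    def __init__(self, start_url, results_queue, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]
        self.results_queue = results_queue
        self.found = []

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            self.found.append(response.urljoin(href))

    def closed(self, reason):
        # Hand the collected links back to the parent in one message.
        self.results_queue.put(self.found)


def _crawl(start_url, results_queue):
    process = CrawlerProcess(settings={"LOG_ENABLED": False})
    process.crawl(LinkSpider, start_url=start_url, results_queue=results_queue)
    process.start()  # blocks until the crawl (and its reactor) finishes


def run_crawl(start_url):
    """Call this from the Flask view; each call uses a fresh child process."""
    results = Queue()
    worker = Process(target=_crawl, args=(start_url, results))
    worker.start()
    links = results.get()  # read before join to avoid queue buffering issues
    worker.join()
    return links
```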

Scrapy does not find text in Xpath or Css

我怕爱的太早我们不能终老 submitted on 2021-01-07 02:19:03
Question: I've been at this one for a few days, and no matter how I try, I cannot get scrapy to extract text that is in one element. To spare you all the code, here are the important pieces. The setup does grab everything else off the page, just not this text. from scrapy.selector import Selector start_url = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html" #BASIC ITEM AND SPIDER YADA, SPARE
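Not part of the question, but a quick way to narrow this kind of problem down is to check whether the missing text exists in the raw HTML at all; TripAdvisor renders much of its content with JavaScript, which Scrapy does not execute. The phrase below is just a sample taken from the listing's title.

```python
import requests
from scrapy.selector import Selector

url = ("https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-"
       "On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_"
       "Fish-Manasota_Key_F.html")

html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# If this prints False, the text is injected client-side and no XPath/CSS
# expression against the downloaded page will ever find it.
print("phrase present in raw HTML:", "Sharks teeth" in html)

# The same page through Scrapy's selector, checking the <title> as a control.
sel = Selector(text=html)
print(sel.xpath("//title/text()").get())
```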

How to crawl a limited number of pages from a site using scrapy?

一曲冷凌霜 submitted on 2021-01-07 01:30:51
Question: I need to crawl a number of sites, and I only want to crawl a certain number of pages from each site. So how do I implement this? My thought is to use a dict whose key is the domain name and whose value is the number of pages from that domain that have been stored in MongoDB. So when a page is crawled and stored in the database successfully, the count for its domain increases by one; if the count exceeds the maximum, the spider should stop crawling that site. Below is my code
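One way to implement the idea described above is a downloader middleware that keeps the per-domain counter and drops further requests once a domain reaches its limit. This sketch counts successful responses rather than successful MongoDB inserts (tying it to the pipeline would work the same way), and MAX_PAGES_PER_DOMAIN is an invented setting name.

```python
from collections import defaultdict
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class DomainPageLimitMiddleware:
    """Drop requests for any domain that has already yielded enough pages."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.counts = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getint("MAX_PAGES_PER_DOMAIN", 100))

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        if self.counts[domain] >= self.max_pages:
            raise IgnoreRequest(f"page limit reached for {domain}")
        return None

    def process_response(self, request, response, spider):
        # Count only pages that actually came back successfully.
        if response.status == 200:
            self.counts[urlparse(request.url).netloc] += 1
        return response
```

Enable it through DOWNLOADER_MIDDLEWARES in settings.py and set MAX_PAGES_PER_DOMAIN there.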

Python | Learning Python: MySQL interaction in detail

好久不见. submitted on 2021-01-05 12:02:02
Preface: I have been learning scrapy-redis lately, and while reviewing Redis I decided to also review MySQL and MongoDB. This post covers MySQL; the examples are simple, but SQL sticks much better when you practice it hands-on. Installation and startup. Install: sudo apt-get install mysql-server Check the service: ps ajx | grep mysql Stop the service: sudo service mysql stop Start the service: sudo service mysql start Restart the service: sudo service mysql restart Connect to the database: mysql -uroot -p, then enter the password Check the version: select version(); Common database statements. List databases: show databases; Create a database: create database db_name [charset=utf8]; Show the create statement: show create database db_name; Use a database: use db_name; Drop a database: drop database db_name; Common table statements. List tables: show tables; Show a table's structure: desc table_name; Create a table: CREATE TABLE table_name( column1 datatype constraint, column2 datatype, column3 datatype, ..... columnN datatype, PRIMARY KEY
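The commands above all run in the mysql shell; when driving them from Python, pymysql is a common choice. A minimal sketch, assuming pymysql is installed, with placeholder credentials and a throwaway table:

```python
import pymysql

# Placeholder connection details; adjust to your own server.
conn = pymysql.connect(host="localhost", port=3306, user="root",
                       password="your_password", database="test",
                       charset="utf8")
try:
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS students(
                id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
                name VARCHAR(30) NOT NULL,
                age TINYINT UNSIGNED DEFAULT 0
            )
        """)
        # Parameterized insert: pymysql escapes the values for us.
        cur.execute("INSERT INTO students(name, age) VALUES (%s, %s)",
                    ("xiaoming", 18))
        conn.commit()

        cur.execute("SELECT id, name, age FROM students")
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```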

Can't install Scrapy on PC regarding multiple steps

删除回忆录丶 submitted on 2021-01-05 11:10:52
Question: I am trying to install scrapy on PC but keep getting error messages. I tried to install Microsoft Visual Studio and other types of things, but NOTHING works :( src/twisted/internet/iocpreactor/iocpsupport/iocpsupport.c(2229): warning C4047: '=': '__pyx_t_11iocpsupport_HANDLE' differs in levels of indirection from 'HANDLE' src/twisted/internet/iocpreactor/iocpsupport/iocpsupport.c(2377): warning C4022: 'CreateIoCompletionPort': pointer mismatch for actual parameter 1 src/twisted/internet

Scrapy: How to limit number of urls scraped in SitemapSpider

岁酱吖の submitted on 2021-01-05 06:24:06
Question: I'm working on a sitemap spider. This spider gets one sitemap url and scrapes all urls in this sitemap. I want to limit the number of urls to 100. I can't use CLOSESPIDER_PAGECOUNT because I use an XML export pipeline. It seems that when scrapy reaches the page count, it stops everything, including the XML generation, so the XML file is never closed and is therefore invalid. class MainSpider(SitemapSpider): name = 'main_spider' allowed_domains = ['doman.com'] sitemap_urls = ['http://doman.com/sitemap.xml'] def
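Not the asker's final solution, but one way to cap the number of URLs without touching CLOSESPIDER_PAGECOUNT is to trim the sitemap entries before any requests are generated, so the exporter simply never sees more than the limit. This relies on SitemapSpider.sitemap_filter, which is only available in newer Scrapy releases (an assumption about your version); MAX_URLS is an invented name, and the filter runs once per sitemap file, so a sitemap index would need a shared counter.

```python
from itertools import islice

from scrapy.spiders import SitemapSpider

MAX_URLS = 100


class MainSpider(SitemapSpider):
    name = "main_spider"
    allowed_domains = ["doman.com"]
    sitemap_urls = ["http://doman.com/sitemap.xml"]

    def sitemap_filter(self, entries):
        # Keep only the first MAX_URLS entries of this sitemap document;
        # requests are generated only for the entries yielded here.
        yield from islice(entries, MAX_URLS)

    def parse(self, response):
        yield {"url": response.url}
```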