Scrapy

How can I get all the plain text from a website with Scrapy?

Submitted by 旧城冷巷雨未停 on 2019-12-28 01:56:08
Question: I would like to get all the text visible on a website after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this?

Answer 1: The easiest option is to extract //body//text() and join everything found:

    ''.join(sel.select("//body//text()").extract()).strip()

where sel is a Selector instance. Another option is to use nltk's clean_html():

    >>> import nltk
    >>> nltk.clean_html(raw_html)

(Note that clean_html() was removed in NLTK 3; it now raises NotImplementedError and recommends BeautifulSoup's get_text() instead.)
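As a concrete illustration of the joining approach, here is a minimal self-contained sketch; the spider name and URL are placeholders, and it assumes a reasonably modern Scrapy where .getall() replaces .extract():

    import scrapy

    class TextSpider(scrapy.Spider):
        name = 'plaintext'
        start_urls = ['https://example.com/']  # placeholder URL

        def parse(self, response):
            # //body//text() matches every text node; join the pieces and strip
            text = ''.join(response.xpath('//body//text()').getall()).strip()
            yield {'text': text}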

Running Scrapy spiders in a Celery task

Submitted by 浪子不回头ぞ on 2019-12-27 18:21:32
Question: I have a Django site where a scrape happens when a user requests it, and my code kicks off a standalone Scrapy spider script in a new process. Naturally, this doesn't scale as users increase. Something like this:

    class StandAloneSpider(Spider):
        # a regular spider

    settings.overrides['LOG_ENABLED'] = True
    # more settings can be changed...

    crawler = CrawlerProcess(settings)
    crawler.install()
    crawler.configure()

    spider = StandAloneSpider()
    crawler.crawl(spider)
    crawler.start()

I've …
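A common pattern for this situation (a sketch of one approach, not the accepted answer from the original thread) is to run each crawl in a fresh process from inside the Celery task, so Twisted's non-restartable reactor never has to be reused:

    from multiprocessing import Process

    from celery import shared_task
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def _run_spider(spider_name):
        # a new process gets a fresh Twisted reactor, avoiding
        # ReactorNotRestartable on the second and later tasks
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name)
        process.start()

    @shared_task
    def crawl_task(spider_name):
        p = Process(target=_run_spider, args=(spider_name,))
        p.start()
        p.join()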

Getting text out of an image in Python with Scrapy, by combining it with the base URL?

Submitted by 纵然是瞬间 on 2019-12-27 02:35:09
Question: I tried this code:

    src1 = "https://hms.harvard.edu/"
    src = response.css('div.person-line > div > img::attr("src")').extract_first()
    # src == 'sites/default/files/hms-faculty-emails/BX0UVXkP.jpg'

    import urlparse
    src2 = urlparse.urljoin(src1, src)
    # src2 == 'https://hms.harvard.edu/sites/default/files/hms-faculty-emails/BX0UVXkP.jpg'

    email = pytesseract.image_to_string(Image.open(src2))

I'm getting this error: IOError: errno 22 invalid mode (…
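The error happens because PIL's Image.open() expects a local file path or a file object, not a URL. A minimal sketch of one fix (assuming the requests library is available) is to download the image bytes first and wrap them in a file-like object:

    import io

    import requests
    from PIL import Image
    import pytesseract

    src2 = 'https://hms.harvard.edu/sites/default/files/hms-faculty-emails/BX0UVXkP.jpg'

    # fetch the image bytes; Image.open() cannot read a URL directly
    resp = requests.get(src2)
    email = pytesseract.image_to_string(Image.open(io.BytesIO(resp.content)))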

Scrapy Spider Templates: CrawlSpider

Submitted by 喜你入骨 on 2019-12-27 02:31:54
Starting with this article, I will cover the Scrapy spider templates over three articles. Scrapy ships with four spider templates:

Basic: the most basic template, which we will not cover here;
CrawlSpider
XMLFeedSpider
CSVFeedSpider

In this article I will explain the CrawlSpider template, with a sketch after the parameter list below.

0. Overview

CrawlSpider is a commonly used Spider that follows links according to rules you define. For most websites, the crawl can be completed simply by adjusting those rules. CrawlSpider's most-used attribute is rules, a tuple of one or more Rule objects, each of which defines how links on the target site are crawled.

Tip: if multiple Rule objects match the same link, only the first Rule takes effect.

The signature of Rule is:

    Rule(link_extractor [, callback=None] [, cb_kwargs=None] [, follow=None] [, process_links=None] [, process_request=None])

Parameters:

link_extractor: a LinkExtractor object, typically configured with regular expressions, that defines which elements of a page are extracted as links to follow;
callback: the callback function …
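As an illustrative sketch (the domain and URL patterns below are placeholders, not from the article), a typical CrawlSpider wires its rules together like this:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ExampleSpider(CrawlSpider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['https://example.com/']

        rules = (
            # follow category pages without parsing them
            Rule(LinkExtractor(allow=r'/category/\d+'), follow=True),
            # parse item pages with parse_item; the first matching Rule wins
            Rule(LinkExtractor(allow=r'/item/\d+'), callback='parse_item'),
        )

        def parse_item(self, response):
            yield {'title': response.css('h1::text').get()}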

Crawling 笔趣阁 novels with Scrapy's CrawlSpider

Submitted by 六月ゝ 毕业季﹏ on 2019-12-26 19:56:37
Preface: This is my first post on a blog, so forgive the rough formatting. I've been watching some crawler tutorial videos lately and was inspired: back in college, pirated novel sites looked very profitable to me and I wanted to build one myself, so I figured I'd try crawling some novels first (I don't know anything about actually building a website...).

No single site has a complete catalog of novels, so I could only crawl the whole site with the crawl template. After finishing it, I found that crawling with Scrapy is not much faster than crawling with multithreaded requests, and saving is awkward: because Scrapy crawls asynchronously, it is hard to save each book to a local txt file in order, so I had to store the results in MongoDB instead (facepalm).

Below is the main code:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from biquge5200.items import Biquge5200Item

    class BqgSpider(CrawlSpider):
        name = 'bqg'
        allowed_domains = ['bqg5200.com']
        start_urls = ['https://www.bqg5200.com/']

        rules = (
            Rule(LinkExtractor(allow=r'https://www.bqg5200.com/book/\d+/'),
                 follow=True),
    …
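For reference, here is a minimal sketch of the kind of MongoDB item pipeline the preface alludes to; the database, collection, and connection values are assumptions, not from the post:

    import pymongo

    class MongoPipeline(object):
        def open_spider(self, spider):
            # connection details are assumed; adjust to your environment
            self.client = pymongo.MongoClient('localhost', 27017)
            self.db = self.client['biquge']

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # each scraped item is stored as one MongoDB document
            self.db['novels'].insert_one(dict(item))
            return item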

Installing and Using Scrapy

Submitted by 走远了吗. on 2019-12-26 19:09:04
Installing Scrapy: Environment: Windows 10, Python 3.6.4, 64-bit. Running pip install Scrapy in a command prompt produced:

    building 'twisted.test.raiser' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

The error persisted even after installing the build tools as the message suggests. Instead, download the matching Twisted whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (the cp tag is the Python version; amd64 means 64-bit), then run pip install on it:

    pip install C:\Users\E5-573G\Desktop\2018寒假\Python\爬虫\scrapy\Twisted-17.5.0-cp36-cp36m-win_amd64.whl

where the argument after install is the full path to the downloaded whl file. Then run pip install scrapy again and it installs successfully.

Scrapy command-line format and common Scrapy commands: …
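The snippet cuts off before listing the commands; for reference, here are a few of the Scrapy CLI commands such a list usually covers (project and spider names below are placeholders):

    scrapy startproject mybot              # create a new project skeleton
    scrapy genspider example example.com   # generate a spider from a template
    scrapy crawl example                   # run a spider by name
    scrapy shell https://example.com/      # interactive scraping console
    scrapy list                            # list spiders in the current project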

How to remove \r\n when scraping a page?

Submitted by 我与影子孤独终老i on 2019-12-26 13:34:10
Question: I wrote a spider that scrapes a page, but when I run it the output contains \r\n. I used the strip() function to remove \r\n, but it's not working. Why, and how do I remove \r\n? Here is the link: https://ibb.co/VtVV2fb

    import scrapy
    from ..items import FetchingItem

    class SiteFetching(scrapy.Spider):
        name = 'Site'
        start_urls = ['https://www.rev.com/freelancers']
        transcription_page = 'https://www.rev.com/freelancers/transcription'

        def parse(self, response):
            items = {
                'Heading': response.css('#sign-up:
    …
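str.strip() only trims whitespace at the two ends of a string, so \r\n in the middle of the extracted text survives. A minimal sketch of one common fix, inside parse() (the CSS selector here is a guess at the original's): strip each extracted fragment, drop the empty ones, and rejoin:

    def parse(self, response):
        # strip() each fragment individually instead of the joined whole
        texts = response.css('#sign-up ::text').getall()  # placeholder selector
        yield {'Heading': ' '.join(t.strip() for t in texts if t.strip())}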

How do I extract the email address using scrapy?

Submitted by 心已入冬 on 2019-12-25 18:49:06
Question: I'm trying to extract the email address of each restaurant on TripAdvisor. I've tried this, but it keeps returning []:

    response.xpath('//*[@class= "restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--89flT6"]')

A code snippet from the TripAdvisor page is below:

    <div class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard_
    …
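Exact-match @class comparisons are brittle with long, auto-generated class lists. A common workaround (a sketch, not the thread's accepted answer) is to match a single class token with contains() and pull the mailto: link if one exists; this runs inside the spider's callback:

    # match on one class token instead of the full class string
    card = response.xpath('//*[contains(@class, "LocationOverviewCard__contactItem")]')

    # email links are usually anchors with a mailto: href
    email = card.xpath('.//a[starts-with(@href, "mailto:")]/@href').get()
    if email:
        email = email.replace('mailto:', '')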

Targeted, bulk scraping of recruitment listings with the Python crawler framework Scrapy

Submitted by 无人久伴 on 2019-12-25 18:22:15
A crawler is a program that fetches data from all over the web, or from targeted parts of it. Put more precisely, it fetches the HTML of a particular site's pages. Since a site has a great many pages and we cannot know every URL in advance, how to guarantee that we fetch all of the site's HTML pages is the question to be worked out. The usual approach is to define an entry page; a page generally contains the URLs of other pages, so those URLs are added to the crawler's fetch queue, and the same procedure is applied recursively after entering each new page, much like a depth-first or breadth-first traversal.

Scrapy is a crawler framework based on Twisted and implemented in pure Python. A user only needs to customize a few modules to easily build a crawler that fetches page content and images of all kinds, which is very convenient. Scrapy uses the Twisted asynchronous networking library to handle network communication. Its architecture is clean and includes all sorts of middleware hooks, so it can flexibly satisfy all kinds of needs. (The original post shows an architecture diagram here.)

The green lines in the diagram are the data flow. Crawling starts from the initial URLs; the Scheduler hands them to the Downloader, and after download the responses go to the Spider for parsing. The Spider's output is of two kinds. One is links that need further crawling, such as the "next page" links analyzed earlier, which are sent back to the Scheduler. The other is data that needs saving, which is sent to the Item Pipeline for post-processing (detailed parsing, filtering, storage, etc. …
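A minimal sketch of how a Spider emits those two kinds of output (the URL, selectors, and field names are placeholders, not from the article):

    import scrapy

    class JobsSpider(scrapy.Spider):
        name = 'jobs'
        start_urls = ['https://example.com/jobs']  # placeholder

        def parse(self, response):
            # kind 1: data to save, which goes on to the Item Pipeline
            for row in response.css('div.job'):
                yield {'title': row.css('a::text').get()}

            # kind 2: links to crawl further, which go back to the Scheduler
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)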

Scrapy: Extracting data from source and its links

Submitted by 我们两清 on 2019-12-25 17:19:41
Question (edited to link to the original: Scrapy getting data from links within table): From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html I am trying to get info from the main table as well as the data within the other 2 links inside the table. I managed to pull from one, but the question is how to go to the other link and append the data in one line.

    from urlparse import urljoin
    import scrapy
    from texasdeath.items import DeathItem

    class DeathItem(Item):
        firstName =
    …
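The usual pattern for combining fields from a listing page and its detail pages into one record (a sketch under assumed column positions and field names, not the thread's answer) is to carry the partially filled item to the next callback via Request.meta:

    import scrapy

    class DeathSpider(scrapy.Spider):
        name = 'death_row'
        start_urls = ['https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html']

        def parse(self, response):
            for row in response.xpath('//table//tr'):
                item = {'name': row.xpath('./td[5]/text()').get()}  # assumed column
                detail_url = row.xpath('./td[3]/a/@href').get()     # assumed column
                if detail_url:
                    # pass the partial item along to the detail-page callback
                    yield response.follow(detail_url, callback=self.parse_detail,
                                          meta={'item': item})

        def parse_detail(self, response):
            item = response.meta['item']
            # append detail-page data to the same item, emitted as one record
            item['last_statement'] = ' '.join(response.css('p::text').getall()).strip()
            yield item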