scrapy

Behavior of the scrapy xpath selector on h1-h6 tags

Submitted by 流过昼夜 on 2019-12-19 11:55:12
Question: Why do the following two code snippets give different outputs? The only difference between them is that the h1 tag in the first case is replaced with an h tag in the second. Is this because the h1 tag has a special "meaning" in HTML? I tried h1 through h6 and all of them give [] as output, while with h7 it starts to give [u'xxx'] as output. from scrapy import Selector # scrapy version: 1.2.2 text = '<h1><p>xxx</p></h1>' print Selector(text=text).xpath('//h1/p/text()').extract()

How to extract data from javascript in a json format?

Submitted by 为君一笑 on 2019-12-19 11:39:32
Question: I am having a hard time extracting the data. First I need to extract the title and the posted date of the post at this URL: https://cheddar.com/media/safety-concerns-over-teslas-autopilot-from-consumer-reports-as-wall-street-turns-bearish Inside view-source there is a script in JSON format that contains the data I need. Something like this (I cropped the rest of the text to save space): <script> window.__RELAY_STORE__ = {"public_at":"2019-05-22T11:02:43-04:00","updated_at":
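One way to get at that blob without a JavaScript engine: slice the right-hand side of the assignment out of the page source and hand it to json.loads. A minimal sketch; the HTML below is a trimmed stand-in for the real page, and the simple regex assumes the object contains no nested braces (a real __RELAY_STORE__ would need a brace-matching slice or a proper JSON scanner instead).

```python
import json
import re

# Trimmed stand-in for the real page source; key names follow the snippet above.
html = '''<script>
window.__RELAY_STORE__ = {"title": "Safety concerns over Tesla autopilot",
                          "public_at": "2019-05-22T11:02:43-04:00"};
</script>'''

# Capture everything between the '=' and the closing '};' of the assignment.
# NOTE: the lazy {.*?} only works because this sample has no nested braces.
match = re.search(r'window\.__RELAY_STORE__\s*=\s*(\{.*?\})\s*;', html, re.S)
data = json.loads(match.group(1))
print(data['title'])
print(data['public_at'])   # the posted date the question asks for
```

From a Scrapy callback, the same regex could be applied to response.text, or to the matching script node selected with response.xpath('//script/text()').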

How to write customize Downloader Middleware for selenium and Scrapy?

Submitted by 天涯浪子 on 2019-12-19 11:32:49
Question: I am having an issue communicating between Selenium and Scrapy objects. I am using Selenium to log in to some site; once I get that response I want to use Scrapy's functionality to parse and process it. Can someone please help me write a middleware so that every request goes through the Selenium web driver and the response is passed to Scrapy? Thank you! Answer 1: It's pretty straightforward: create a middleware with a webdriver and use process_request to intercept the request, discard it and use the

scrapy convert_image

Submitted by 徘徊边缘 on 2019-12-19 10:22:59
Question: I use Scrapy to crawl some images; the images need to be cropped or watermarked. I overrode the convert_image function in pipelines.py but it didn't work. The code looks like this: class MyImagesPipeline(ImagesPipeline): def get_media_requests(self, item, info): for image_url in item['image_urls']: yield Request(image_url) def convert_image(self, image, size=None): if image.format == 'PNG' and image.mode == 'RGBA': background = Image.new('RGBA', image.size, (255, 255, 255)) background

Scrapy Shell

Submitted by 自作多情 on 2019-12-19 10:19:40
This article is very simple, arguably the shortest and simplest in the Scrapy series. It covers the Scrapy Shell. 0. Scrapy Shell. The Scrapy Shell is an interactive console that lets you debug code without starting a Spider. While developing a crawler we often use it to test whether the XPath or CSS expressions we have written extract the right data. Its syntax is also simple: scrapy shell [url] [settings]. The Scrapy Shell can fetch pages from the web as well as local files; all of the following are valid: scrapy shell ./html/1.html, scrapy shell ../html/2.html, scrapy shell /html/3.html, scrapy shell d:\\html\\4.html. Note that to open a local file you must prefix the path with ./ (or ../ for a relative path). It has four common commands:
- shelp(): print all available attributes and commands
- fetch(url[, redirect=True]): fetch a new response from the given URL; pass redirect=False to skip following redirects
- fetch(request): fetch a new response for the given request
- view(response): open the given response in your local browser
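An illustrative session tying the pieces together; the URL is a placeholder, and the >>> lines are typed at the shell's interactive prompt.

```shell
# Open a shell on a page, then probe it interactively at the >>> prompt:
scrapy shell "https://example.com"
#   >>> shelp()                                   # list everything available
#   >>> response.xpath('//title/text()').extract()
#   >>> fetch('https://example.com/page2', redirect=False)
#   >>> view(response)                            # open the response in a browser

# Local files work too, as long as the path starts with ./ or ../
scrapy shell ./html/1.html
```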

cl.exe' failed: No such file or directory when installing Scrapy

Submitted by 我怕爱的太早我们不能终老 on 2019-12-19 09:47:52
Question: I'm trying to install the Scrapy framework. After installing all of its dependencies and starting to install the setup.py file, I get this error message: "cl.exe' failed: No such file or directory". I'm working with Python 3.6 on Windows 7 64-bit. Here is the full error message: copying src\twisted\trial\_dist\test\test_worker.py -> build\lib.win32-3.6\twisted\trial\_dist\test copying src\twisted\trial\_dist\test\test_workerreporter.py -> build\lib.win32-3.6\twisted\trial\_dist\test copying src\twisted
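No answer survives in the excerpt, but the log shows the Twisted dependency being built from source, which requires Microsoft's C compiler (cl.exe). A common remedy, sketched here as an assumption rather than a verified fix for this exact setup, is to install prebuilt binary wheels so nothing needs compiling:

```shell
# Prefer prebuilt wheels so no C compiler is required (whether wheels exist
# for your exact Python version and platform is an assumption):
pip install --upgrade pip wheel
pip install --only-binary :all: Twisted lxml
pip install scrapy

# Alternative: install the "Microsoft Visual C++ Build Tools" so cl.exe
# exists on the machine, then rerun the original install.
```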

How to use scrapy.log module with custom log handler?

Submitted by 坚强是说给别人听的谎言 on 2019-12-19 09:17:21
Question: I have been working on a Scrapy project and so far everything works quite well. However, I'm not satisfied with Scrapy's logging configuration options. At the moment I have set LOG_FILE = 'my_spider.log' in my project's settings.py. When I execute scrapy crawl my_spider on the command line, it creates one big log file for the entire crawl, which is not feasible for my purposes. How can I use Python's custom log handlers in combination with the scrapy.log module?
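Since Scrapy 1.0 the old scrapy.log module is gone and Scrapy emits records through Python's standard logging module, so a plain stdlib handler can replace the global LOG_FILE setting. A minimal sketch (the file name and format string are illustrative); it could run in a spider's __init__ or a from_crawler hook to get one file per spider:

```python
import logging

# Attach our own handler to the 'scrapy' logger hierarchy instead of relying
# on the LOG_FILE setting; every scrapy.* logger propagates up to this one.
handler = logging.FileHandler('my_spider.log')
handler.setLevel(logging.INFO)
handler.setFormatter(logging.Formatter('%(asctime)s [%(name)s] %(levelname)s: %(message)s'))

scrapy_logger = logging.getLogger('scrapy')
scrapy_logger.setLevel(logging.INFO)
scrapy_logger.addHandler(handler)
```

Anything from the logging module, such as a RotatingFileHandler or a per-spider filter, can be attached the same way.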

How to get last OPTION from SELECT list using XPath - Scrapy

Submitted by 徘徊边缘 on 2019-12-19 09:14:23
Question: I am using this selector but it gives an error: //*[@id="quantity"]/option/[last()-1] How do I select the last OPTION? I am using the Scrapy framework. Answer 1: You have an extra / before the [, making the XPath expression invalid. Remove it: //*[@id="quantity"]/option[last()-1] (note that [last()-1] actually selects the second-to-last option; use option[last()] for the last one). You can also solve it on the Python/Scrapy side: response.xpath('//*[@id="quantity"]/option')[-1].extract() Or, in a CSS selector form: response.css('#quantity option:last-child').extract_first() response.css('#quantity
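The difference between the two predicates is easy to verify with nothing but the standard library, since ElementTree supports last() position predicates; the markup below is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# A toy version of the SELECT list from the question.
html = '<select id="quantity"><option>1</option><option>2</option><option>3</option></select>'
root = ET.fromstring(html)

print(root.find('.//option[last()]').text)    # the last option
print(root.find('.//option[last()-1]').text)  # second to last, which is what [last()-1] really selects
```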

How to get the pipeline object in Scrapy spider

Submitted by 笑着哭i on 2019-12-19 08:14:28
Question: I use MongoDB to store the crawled data. Now I want to query the last date in the data so that I can continue the crawl without restarting from the beginning of the URL list (the URLs are determined by the date, e.g. /2014-03-22.html). I want only one connection object to handle the database operations, and it lives in the pipeline. So I want to know how I can get that pipeline object (not a new one) in the spider. Or any better solution for incremental updates... Thanks in advance.
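One common pattern, shown here as a minimal sketch with all names illustrative and no real database calls: have the pipeline hand the spider a reference to itself when the spider opens, so the spider reuses the pipeline's single connection instead of constructing a new one.

```python
class MongoPipeline:
    """Owns the single database connection (sketch; no real pymongo here)."""

    def __init__(self):
        # A real pipeline would create one pymongo.MongoClient here and keep it.
        self.client = None

    def open_spider(self, spider):
        # Scrapy calls this when the spider opens; stash a reference on the
        # spider so spider code can reach this exact pipeline instance.
        spider.pipeline = self

    def last_crawled_date(self):
        # A real implementation would ask MongoDB for the newest stored date,
        # e.g. db.items.find().sort('date', -1).limit(1).  Hard-coded here.
        return '2014-03-22'

class MySpider:
    """Stand-in for a scrapy.Spider subclass."""

    def start_requests(self):
        # Resume from the last stored date instead of the start of the URL list.
        return ['/%s.html' % self.pipeline.last_crawled_date()]
```

The same hand-off can also be done the other way around, with the pipeline exposing itself via crawler signals, but the open_spider hook is the shortest route to "the pipeline object, not a new one".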
