scrapy

scrapy: sending pagination requests

☆樱花仙子☆ posted on 2019-12-23 05:35:44
1. The idea behind pagination requests
What do we do when we need to extract the data from every page shown in the figure? Recall how the requests module implements pagination: find the URL of the next page, then call requests.get(url). Scrapy's approach is the same: find the URL of the next page, build a request for that URL, and pass it to the engine.

2. Implementing pagination requests in Scrapy
2.1 Method: determine the URL; build the request with scrapy.Request(url, callback), where callback names the parse function that will handle the response returned by this request; hand the request to the engine with yield scrapy.Request(url, callback).
2.2 Tencent recruitment crawler: learn how to implement pagination by scraping the job postings on Tencent's recruitment site. Address: http://hr.tencent.com/position.php Approach: extract the data from the first page, then find the next page's URL, follow it, and extract that page's data.
Notes:
1. The robots policy can be set in settings: # False means ignore the site's robots.txt protocol; the default is True. ROBOTSTXT_OBEY = False
2. The User-Agent can be set in settings: # every request scrapy sends uses this User-Agent by default. USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X
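A minimal sketch of the pagination pattern described above, written for a recent Scrapy version; the spider name and the CSS selectors are illustrative assumptions, not taken from the original tutorial:

```python
import scrapy


class HrSpider(scrapy.Spider):
    """Pagination sketch: extract the current page, then follow the next page."""

    name = "hr_positions"                               # placeholder name
    start_urls = ["http://hr.tencent.com/position.php"]

    def parse(self, response):
        # 1) Extract the data on the current page (placeholder selectors).
        for row in response.css("tr.even, tr.odd"):
            yield {
                "title": row.css("a::text").get(),
                "link": response.urljoin(row.css("a::attr(href)").get() or ""),
            }

        # 2) Find the next page's URL, build a request, hand it to the engine.
        next_href = response.css("a#next::attr(href)").get()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)
```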

How to write scraped data into a CSV file in Scrapy?

泪湿孤枕 posted on 2019-12-23 05:33:06
Question: I am trying to scrape a website by extracting the sub-links and their titles, and then saving the extracted titles and their associated links into a CSV file. I run the following code; the CSV file is created but it is empty. Any help? My Spider.py file looks like this: from scrapy import cmdline from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor class HyperLinksSpider(CrawlSpider): name = "linksSpy" allowed_domains = ["some_website"]
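The usual reason the CSV ends up empty is that the rule callbacks never yield anything; once a callback yields items, Scrapy's built-in feed export (scrapy crawl linksSpy -o output.csv) writes them without extra code. A hedged sketch using the modern import paths (scrapy.contrib is long deprecated), with a placeholder domain:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class HyperLinksSpider(CrawlSpider):
    name = "linksSpy"
    allowed_domains = ["example.com"]        # placeholder domain
    start_urls = ["https://example.com/"]    # placeholder start URL

    # follow=True keeps crawling; callback="parse_item" handles each page.
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        # Yielding a dict per page is all the feed exporter needs:
        #   scrapy crawl linksSpy -o output.csv
        yield {
            "title": response.xpath("//title/text()").get(),
            "url": response.url,
        }
```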

Scrapy - Get all data within selector

霸气de小男生 posted on 2019-12-23 05:32:05
Question: If I have some HTML in the response that looks like: <body> Body text <div> Div text </div> </body> and I do response.xpath('//body/text()').extract(), I will only get [Body text]. I want to get everything inside <body>, including the tags, i.e. this whole thing: Body text <div> Div text </div> How can I accomplish that? Thank you. Answer 1: Try this: response.xpath('//body/node()/text()') Or if you want the tags too: response.xpath('//body/node()') Answer 2: Try //body/(descendant::text() | following::text()
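A small self-contained sketch of the selectors discussed above; the HtmlResponse is built by hand so the snippet can run outside a spider:

```python
from scrapy.http import HtmlResponse

html = b"<body> Body text <div> Div text </div> </body>"
response = HtmlResponse(url="http://example.com", body=html, encoding="utf-8")

# Direct text children of <body> only (what the question started with):
direct_text = response.xpath("//body/text()").getall()

# Everything inside <body>, tags included: serialize each child node and join.
inner_html = "".join(response.xpath("//body/node()").getall())

# All text at any depth, tags stripped:
all_text = response.xpath("//body//text()").getall()

print(direct_text)
print(inner_html)   # ' Body text <div> Div text </div> '
print(all_text)
```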

Scrapy item loader to get an absolute URL from an extracted URL

≡放荡痞女 posted on 2019-12-23 05:26:18
Question: I am using/learning Scrapy, the Python framework, to scrape a few web pages I am interested in. Along the way I extract the links on a page, but those links are relative in most cases. I used urljoin_rfc, which lives in scrapy.utils.url, to get the absolute path, and it worked fine. While learning I came across a feature called Item Loader, and now I want to do the same using an Item Loader. My urljoin_rfc() call sits in a user-defined function _urljoin(url, response). I want my loader to
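A hedged sketch of the Item Loader approach: on current Scrapy versions urljoin_rfc is deprecated, so this uses urllib.parse.urljoin inside an input processor that reads the response from the loader context. The item, field, and spider names are assumptions:

```python
from urllib.parse import urljoin

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst


def to_absolute(url, loader_context):
    # The loader is created with response=..., so the response is in the context.
    return urljoin(loader_context["response"].url, url)


class PageItem(scrapy.Item):
    link = scrapy.Field()


class PageLoader(ItemLoader):
    default_output_processor = TakeFirst()
    link_in = MapCompose(to_absolute)   # runs for every extracted relative href


class LinkSpider(scrapy.Spider):
    name = "links"                        # placeholder spider
    start_urls = ["https://example.com/"]

    def parse(self, response):
        loader = PageLoader(item=PageItem(), response=response)
        loader.add_xpath("link", "//a/@href")
        yield loader.load_item()
```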

Scrapy - access data while crawling and randomly change user agent

我与影子孤独终老i posted on 2019-12-23 05:26:12
Question: Is it possible to access the data while Scrapy is crawling? I have a script that finds a specific keyword and writes the keyword to a .csv file along with the link where it was found. However, I have to wait for Scrapy to finish crawling; only then does it actually output the data to the .csv file. I am also trying to change my user agent randomly, but it is not working. If I am not allowed two questions in one, I will post this as a separate question. #!/usr/bin/env python # -*- coding:
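Two hedged sketches, one per question: an item pipeline that writes each row the moment the item is scraped (so the CSV grows while the crawl runs), and a downloader middleware that sets a random User-Agent per request. File names, class names, and the UA strings are assumptions; both pieces still have to be enabled in settings.py via ITEM_PIPELINES and DOWNLOADER_MIDDLEWARES:

```python
import csv
import random


# pipelines.py -- writes each item as soon as it is scraped.
class CsvWriterPipeline:
    def open_spider(self, spider):
        self.file = open("results.csv", "w", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["keyword", "url"])

    def process_item(self, item, spider):
        self.writer.writerow([item.get("keyword"), item.get("url")])
        self.file.flush()            # make the row visible immediately
        return item

    def close_spider(self, spider):
        self.file.close()


# middlewares.py -- picks a random User-Agent for every outgoing request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",        # illustrative strings
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None
```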

Scrapy: use both cores in the system

江枫思渺然 posted on 2019-12-23 05:25:47
Question: I am running Scrapy using its internal API and everything is well and good so far. But I noticed that it is not fully using the concurrency of 16 set in the settings. I have changed the delay to 0 and done everything else I can. Looking at the HTTP requests being sent, it is clear that Scrapy is not downloading 16 sites at all points in time; at some points it downloads only 3 to 4 links, and the queue is not empty at that time. When I checked the core
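A frequent cause of seeing only 3 to 4 parallel downloads is that most queued URLs belong to one domain, so the per-domain cap (default 8) or AutoThrottle keeps the observed concurrency below the global limit. A sketch of the settings that control this, with values chosen for illustration:

```python
# settings.py -- knobs that cap parallel downloads.
# CONCURRENT_REQUESTS is the global cap, but the per-domain / per-IP caps
# and AutoThrottle can keep the observed concurrency lower.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16      # default is 8
CONCURRENT_REQUESTS_PER_IP = 0           # 0 disables the per-IP cap
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_ENABLED = False
REACTOR_THREADPOOL_MAXSIZE = 20          # helps when DNS lookups are the bottleneck
```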

JSON URL sometimes returns a null response

夙愿已清 posted on 2019-12-23 05:23:59
Question: I'm scraping a website which loads product data from individual JSON files. I found the URLs of the JSONs by inspecting the network traffic. The problem is this: when I follow the JSON URLs, most of the links return a JSON result, but the JSON URLs of products that have special characters in them, e.g. é, return a null response. The data is of course shown in the browser, but I can't seem to get the JSON response directly. Any tips? (I'm trying to find a similar website that acts in the
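One plausible explanation is that the é needs to be percent-encoded (UTF-8 based) exactly as the browser encodes it before the request is sent; a hedged sketch using urllib.parse.quote with a made-up API URL:

```python
import json
from urllib.parse import quote

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"                                   # placeholder spider

    def start_requests(self):
        slugs = ["plain-product", "café-au-lait"]       # the second contains é
        for slug in slugs:
            # Percent-encode non-ASCII characters the way a browser would
            # before requesting the per-product JSON file (URL is made up).
            url = "https://example.com/products/{}.json".format(quote(slug, safe="/"))
            yield scrapy.Request(url, callback=self.parse_json)

    def parse_json(self, response):
        yield json.loads(response.text)
```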

Using js2xml and Scrapy, how can I iterate through a json object to select a specific node?

守給你的承諾、 posted on 2019-12-23 05:17:11
Question: I'm trying to iterate through a JSON response from a page using js2xml. The question I have is: how do I select the 'stores' node and pass only that as my response? The JSON looks like this: <script> window.appData = { "ressSize": "large", "cssPath": "http://css.bbystatic.com/", "imgPath": "http://images.bbystatic.com/", "jsPath": "http://js.bbystatic.com/", "bbyDomain": "http://www.bestbuy.com/", "bbySslDomain": "https://www-ssl.bestbuy.com/", "isUserLoggedIn": false, "zipCode": "46801",
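A hedged sketch of one way to do this with js2xml's jsonlike helpers (assuming they behave as in the library's README): parse the script that assigns window.appData, convert the JSON-like literal into a plain Python dict, then read the "stores" key. The spider name and the XPath for locating the script are assumptions:

```python
import js2xml
import scrapy


class StoreSpider(scrapy.Spider):
    name = "stores"                               # placeholder spider
    start_urls = ["http://www.bestbuy.com/"]      # domain from the question's snippet

    def parse(self, response):
        # Grab the <script> that assigns window.appData and parse the JavaScript.
        script = response.xpath(
            "//script[contains(., 'window.appData')]/text()"
        ).get()
        parsed = js2xml.parse(script)

        # jsonlike.getall() converts every JSON-like literal in the script into
        # plain Python dicts/lists; appData should be the first one found.
        app_data = js2xml.jsonlike.getall(parsed)[0]

        # With a normal dict in hand, "stores" is an ordinary key lookup.
        for store in app_data.get("stores", []):
            yield store
```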

Scrapy CrawlSpider Crawls Nothing

天大地大妈咪最大 posted on 2019-12-23 04:53:25
Question: I am trying to crawl Booking.com. The spider opens and closes without crawling the URL (output screenshot: https://i.stack.imgur.com/9hDt6.png). I am new to Python and Scrapy. Here is the code I have written so far; please point out what I am doing wrong. import scrapy import urllib from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.item import Item from scrapy.loader import ItemLoader from CinemaScraper.items import CinemascraperItem
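A minimal CrawlSpider sketch showing the two things that most often cause "crawls nothing": a LinkExtractor whose allow pattern matches no links, and a callback named parse (which CrawlSpider reserves for its own rule handling). The domain, pattern, and selectors here are placeholders, not a working Booking.com scraper:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class HotelSpider(CrawlSpider):
    name = "hotels"                                  # placeholder name
    allowed_domains = ["booking.com"]
    start_urls = ["https://www.booking.com/"]

    # CrawlSpider reserves parse(); callbacks must use another name, and the
    # LinkExtractor pattern must actually match links present on the pages.
    rules = (
        Rule(LinkExtractor(allow=r"/hotel/"), callback="parse_hotel", follow=True),
    )

    def parse_hotel(self, response):
        yield {
            "name": response.xpath("//h2/text()").get(),   # placeholder selector
            "url": response.url,
        }
```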

Scrapy shell doesn't work

眉间皱痕 posted on 2019-12-23 04:47:30
Question: I am new to Scrapy, so I wanted to try scrapy shell to debug and learn, but strangely the shell command does not work at all. The website seems to be crawled successfully, but nothing more is printed; the program hangs, apparently dead, and I must press Ctrl-C to end it. Can you help figure out what's wrong? I'm using Anaconda + Scrapy 1.0.3. $ ping 135.251.157.2 Pinging 135.251.157.2 with 32 bytes of data: Reply from 135.251.157.2: bytes=32 time=13ms TTL=56 Reply from 135.251.157.2: