scrapy

Requesting URLs with base64-encoded data

落花浮王杯 submitted on 2019-12-24 19:18:53
Question: I'm trying to request a URL with base64-encoded data appended to it, like so:

    http://www.somepage.com/es_e/bla_bla#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ==

What I do is build a JSON object, encode it into base64, and append it to a URL like this:

    new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1}, "config": {"page": 2}}
    json_data = json.dumps(new_data)
    new_url = "http:/
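A minimal sketch of the flow described above (the spider name is made up; the payload and URL come from the question). One caveat: the part after '#' is a URL fragment, which is never sent to the server, so if the page applies the filter in JavaScript, Scrapy will only ever receive the unfiltered HTML:

```python
import base64
import json

import scrapy


class Base64FragmentSpider(scrapy.Spider):
    # Hypothetical spider; only the payload and URL scheme come from the question.
    name = "base64_fragment"

    def start_requests(self):
        payload = {
            "data": {"countryId": "ES", "regionId": "920",
                     "duration": 7, "minPersons": 1},
            "config": {"page": 2},
        }
        # Serialize to JSON, then base64-encode the UTF-8 bytes.
        encoded = base64.b64encode(
            json.dumps(payload).encode("utf-8")
        ).decode("ascii")
        # The encoded blob rides in the URL fragment, as in the question's example.
        yield scrapy.Request(
            "http://www.somepage.com/es_e/bla_bla#" + encoded,
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)
```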

Parsing stray text with Scrapy

旧时模样 submitted on 2019-12-24 19:08:10
Question: Any idea how to extract 'TEXT TO GRAB' from this piece of markup?

    <span class="navigation_page">
      <span>
        <a itemprop="url" href="http://www.example.com">
          <span itemprop="title">LINK</span>
        </a>
      </span>
      <span class="navigation-pipe">></span>
      TEXT TO GRAB
    </span>

Answer 1: It's not an ideal solution, but it should do the trick:

    from scrapy import Selector

    content = """
    <span class="navigation_page">
      <span>
        <a itemprop="url" href="http://www.example.com">
          <span itemprop="title">LINK</span>
        </a>
      </span>
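A hedged completion of that idea: parse the markup with a Selector and take only the text nodes that are direct, non-blank children of the outer span, which leaves exactly the stray text:

```python
from scrapy import Selector

content = """
<span class="navigation_page">
  <span>
    <a itemprop="url" href="http://www.example.com">
      <span itemprop="title">LINK</span>
    </a>
  </span>
  <span class="navigation-pipe">></span>
  TEXT TO GRAB
</span>
"""

sel = Selector(text=content)
# Direct text children of the outer span; whitespace-only nodes are discarded.
# 'TEXT TO GRAB' is such a child, while 'LINK' and '>' sit inside nested spans.
texts = sel.xpath(
    '//span[@class="navigation_page"]/text()[normalize-space()]'
).getall()
print([t.strip() for t in texts])  # ['TEXT TO GRAB']
```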

How to scrape paginated links in Scrapy?

纵饮孤独 submitted on 2019-12-24 18:56:22
Question: The code for my Scrapy spider is:

    import scrapy

    class DummymartSpider(scrapy.Spider):
        name = 'dummymart'
        allowed_domains = ['www.dummymart.com/product']
        start_urls = ['https://www.dummymart.net/product/auto-parts--118']

        def parse(self, response):
            Company = response.xpath('//*[@class="word-wrap item-title"]/text()').extract()
            for item in zip(Company):
                scraped_info = {
                    'Company': item[0],
                }
                yield scraped_info
            next_page_url = response.css('li >a::attr(href)').extract_first()
            #next_page_url = response
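A minimal pagination sketch in the usual Scrapy idiom. The 'li.next > a' selector is a guess at the next-page link; note too that allowed_domains should contain bare domains, not URL paths, or the offsite middleware may filter the requests:

```python
import scrapy


class DummymartSpider(scrapy.Spider):
    name = "dummymart"
    # allowed_domains takes domains only; a path such as "/product"
    # does not belong here.
    allowed_domains = ["dummymart.net"]
    start_urls = ["https://www.dummymart.net/product/auto-parts--118"]

    def parse(self, response):
        # Yield one item per company name on the current page.
        for company in response.xpath(
            '//*[@class="word-wrap item-title"]/text()'
        ).getall():
            yield {"Company": company}

        # Hypothetical selector: adjust to whatever marks the "next" link.
        next_page = response.css("li.next > a::attr(href)").get()
        if next_page:
            # response.follow resolves relative URLs and re-enters parse().
            yield response.follow(next_page, callback=self.parse)
```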

How to bypass a 'cookiewall' when using Scrapy?

时光毁灭记忆、已成空白 submitted on 2019-12-24 18:55:06
Question: I'm a new Scrapy user. After following the tutorials for extracting data from websites, I am trying to accomplish something similar on forums. What I want is to extract all posts on a forum page (to start with). However, this particular forum has a 'cookie wall', so when I want to extract from http://forum.fok.nl/topic/2413069, each session I first need to click the "Yes, I accept cookies" button. My very basic scraper currently looks like this:

    class FokSpider(scrapy.Spider):
        name = 'fok'
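A common pattern for cookie walls, sketched here with hypothetical form and button names (inspect the real page for the actual ones): submit the consent form with FormRequest.from_response, then re-enter the same callback once the wall is gone.

```python
import scrapy
from scrapy.http import FormRequest


class FokSpider(scrapy.Spider):
    name = "fok"
    start_urls = ["http://forum.fok.nl/topic/2413069"]

    def parse(self, response):
        # If the cookie wall intercepted us, submit its consent form first.
        # "form#cookiewall" and the "allow" button are placeholders.
        if response.css("form#cookiewall"):
            yield FormRequest.from_response(
                response,
                formcss="form#cookiewall",
                clickdata={"name": "allow"},
                callback=self.parse,  # retry the page once the cookie is set
            )
            return

        # Past the wall: extract the posts (selector is illustrative).
        for post in response.css("div.post"):
            yield {"text": post.css("::text").getall()}
```

Scrapy's cookie middleware keeps the consent cookie for the rest of the session, so subsequent requests pass straight through.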

Following information in nested div and span tags using Scrapy

有些话、适合烂在心里 submitted on 2019-12-24 18:44:52
Question: I am trying to build a web crawler with Scrapy in Python that extracts the information Google shows on the right side when you make a search; for example, I want to extract the information in the box on the right side. The link is: search in google. The source code: source code. Part of the HTML code is:

    <div class="g rhsvw kno-kp mnr-c g-blk" lang="es-419" data-hveid="CAoQAA" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQjh8oAHoECAoQAA">
      <div class="kp-blk knowledge-panel Wnoohf OJXvsb"
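A hedged sketch of walking that nested structure. The 'knowledge-panel' class token and the text-gathering XPath are assumptions, and Google's markup and anti-bot measures change frequently, so treat this purely as an illustration of descending nested div/span trees:

```python
import scrapy


class KnowledgePanelSpider(scrapy.Spider):
    # Illustrative only: Google actively blocks automated scraping.
    name = "knowledge_panel"
    start_urls = ["https://www.google.com/search?q=example"]

    def parse(self, response):
        # Match on the one class token that looks stable; the obfuscated
        # ones (Wnoohf, OJXvsb, ...) rotate between page builds.
        panel = response.xpath('//div[contains(@class, "knowledge-panel")]')
        # Gather every text node nested anywhere inside the panel's spans.
        texts = panel.xpath(".//span//text()").getall()
        yield {"panel_text": [t.strip() for t in texts if t.strip()]}
```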

How to run Scrapy/Portia on Azure Web App

孤者浪人 submitted on 2019-12-24 17:25:14
Question: I am trying to run Scrapy or Portia on a Microsoft Azure Web App. I installed Scrapy by creating a virtual environment:

    D:\Python27\Scripts\virtualenv.exe D:\home\Python

and then installed Scrapy:

    D:\home\Python\Scripts\pip install Scrapy

The installation seemed to work, but executing a spider returns the following output:

    D:\home\Python\Scripts\tutorial>d:\home\python\scripts\scrapy.exe crawl example
    2015-09-13 23:09:31 [scrapy] INFO: Scrapy 1.0.3 started (bot: tutorial)
    2015-09-13 23
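The excerpt cuts off before the actual error, but one commonly suggested workaround (an assumption here, not the asker's solution) is to skip the scrapy.exe console wrapper and run the crawl in-process from a plain Python script, which the virtual environment's python.exe can execute directly:

```python
# run_spider.py: launch the crawl without the scrapy.exe wrapper,
# which is often the piece that breaks on restricted hosts.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.example import ExampleSpider  # hypothetical import path

process = CrawlerProcess(get_project_settings())
process.crawl(ExampleSpider)
process.start()  # blocks until the crawl finishes
```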

How to split output from a list of URLs in Scrapy

强颜欢笑 submitted on 2019-12-24 16:27:26
Question: I am trying to generate a CSV file for each scraped URL from a list of URLs in Scrapy. I understand that I should modify pipeline.py; however, all my attempts have failed so far. I do not understand how I can pass the URL being scraped to the pipeline, use it as the name for the output, and split the output accordingly. Any help? Thanks. Here are the spider and the pipeline:

    from scrapy import Spider
    from scrapy.selector import Selector
    from vApp.items import fItem

    class VappSpider(Spider):
        name =
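One way this is commonly done, as a sketch: have the spider stamp each item with the page it came from (e.g. item["source_url"] = response.url), then keep one CSV writer per URL in the pipeline. The field name and the file-naming scheme are assumptions:

```python
# pipelines.py: one CSV file per source URL (a sketch; assumes plain-dict
# items that carry a "source_url" field set by the spider).
import csv
from urllib.parse import urlparse


class PerUrlCsvPipeline:
    def open_spider(self, spider):
        self.files = {}  # source URL -> (file handle, csv.DictWriter)

    def process_item(self, item, spider):
        # Set in the spider: item["source_url"] = response.url
        url = item.pop("source_url")
        if url not in self.files:
            # Derive a file name from the URL path, e.g. /a/b -> a_b.csv
            name = urlparse(url).path.strip("/").replace("/", "_") or "index"
            handle = open(f"{name}.csv", "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(handle, fieldnames=list(item.keys()))
            writer.writeheader()
            self.files[url] = (handle, writer)
        self.files[url][1].writerow(item)
        return item

    def close_spider(self, spider):
        for handle, _writer in self.files.values():
            handle.close()
```

Enable it in settings.py through ITEM_PIPELINES as with any other pipeline.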

Empty list returned by XPath in Scrapy

只愿长相守 submitted on 2019-12-24 15:48:20
Question: I am working with Scrapy, trying to gather some data from a site. Spider code:

    class NaaptolSpider(BaseSpider):
        name = "naaptol"
        domain_name = "www.naaptol.com"
        start_urls = ["http://www.naaptol.com/buy/mobile_phones/mobile_handsets.html"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            cell_matter = hxs.select('//div[@class="gridInfo"]/div[@class="gridProduct gridProduct_special"]')
            items = []
            for i in cell_matter:
                cell_names = i.select('//p[@class="proName"]/a/text()')
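The question is cut off, but this pattern contains a classic pitfall worth flagging: inside the loop, i.select('//p[...]') searches the whole document again rather than the current node; a relative './/p[...]' is almost always what's wanted. A modern-Scrapy sketch:

```python
import scrapy


class NaaptolSpider(scrapy.Spider):
    name = "naaptol"
    start_urls = ["http://www.naaptol.com/buy/mobile_phones/mobile_handsets.html"]

    def parse(self, response):
        products = response.xpath(
            '//div[@class="gridInfo"]'
            '/div[@class="gridProduct gridProduct_special"]'
        )
        for product in products:
            # Note the leading ".//": relative to this product node.
            # A bare "//p[...]" would search the entire page on every pass.
            yield {
                "name": product.xpath('.//p[@class="proName"]/a/text()').get(),
            }
```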

Python: Crawlers (Part 1)

随声附和 submitted on 2019-12-24 15:08:42
Crawler preparation; introduction to crawlers; urllib.

Crawler preparation. Reference material: Web Scraping with Python (Turing imprint), Learning Scrapy ("精通Python爬虫框架Scrapy", Posts & Telecom Press), Python 3 web crawling, and the official Scrapy tutorial. Prerequisites: URLs, the HTTP protocol, web front-end (HTML, CSS, JS), AJAX, re, XPath, XML.

Introduction to crawlers. Definition: a web crawler (also known as a web spider or web robot, and in the FOAF community more often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm. Two key traits: it downloads data or content as its author requires, and it moves around the network on its own. Three main steps: download the page; extract the correct information.

Source: CSDN. Author: 若尘. Link: https://blog.csdn.net/qq_29339467/article/details/103681146
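The post's first tool is urllib, so here is a minimal sketch of its "download the page" step, with example.com standing in as a placeholder URL:

```python
from urllib.request import urlopen

# Step 1 of a crawl: download the page (example.com is a placeholder).
with urlopen("http://example.com") as resp:
    html = resp.read().decode("utf-8")

print(html[:200])  # first 200 characters of the fetched markup
```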

Scrapy output issue

╄→гoц情女王★ submitted on 2019-12-24 14:46:36
Question: I am having trouble displaying my items the way I want. My code is as follows:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import request
    from scrapy.selector import HtmlXPathSelector
    from texashealth.items import TexashealthItem

    class texashealthspider(CrawlSpider):
        name = "texashealth"
        allowed_domains = ['jobs.texashealth.org']
        start_urls = ['http://jobs.texashealth.org/search/?&q=&title=Filter%3A%20title
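Whatever the output problem turns out to be, the scrapy.contrib import paths in this snippet have since been removed; on current Scrapy releases the equivalents are:

```python
# Modern equivalents of the deprecated scrapy.contrib imports above.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # replaces SgmlLinkExtractor
from scrapy.http import Request                  # note the capital R
```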