scrapy

Requesting URLs with base64-encoded data

落花浮王杯 submitted on 2019-12-24 19:18:53
Question: I'm trying to request a URL with base64-encoded data appended to it, like so:

    http://www.somepage.com/es_e/bla_bla#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ==

What I do is build a JSON object, encode it into base64, and append it to a URL like this:

    new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1}, "config": {"page": 2}}
    json_data = json.dumps(new_data)
    new_url = "http:/
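A minimal sketch of the flow described above (the spider name is made up; the payload and URL come from the question). One caveat: the part after '#' is a URL fragment, which is never sent to the server, so if the page applies the filter in JavaScript, Scrapy will only ever receive the unfiltered HTML:

```python
import base64
import json

import scrapy


class Base64FragmentSpider(scrapy.Spider):
    # Hypothetical spider; only the payload and URL scheme come from the question.
    name = "base64_fragment"

    def start_requests(self):
        payload = {
            "data": {"countryId": "ES", "regionId": "920",
                     "duration": 7, "minPersons": 1},
            "config": {"page": 2},
        }
        # Serialize to JSON, then base64-encode the UTF-8 bytes.
        encoded = base64.b64encode(
            json.dumps(payload).encode("utf-8")
        ).decode("ascii")
        # The encoded blob rides in the URL fragment, as in the question's example.
        yield scrapy.Request(
            "http://www.somepage.com/es_e/bla_bla#" + encoded,
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)
```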

Parsing stray text with Scrapy

旧时模样 submitted on 2019-12-24 19:08:10
Question: Any idea how to extract 'TEXT TO GRAB' from this piece of markup?

    <span class="navigation_page">
      <span>
        <a itemprop="url" href="http://www.example.com">
          <span itemprop="title">LINK</span>
        </a>
      </span>
      <span class="navigation-pipe">></span>
      TEXT TO GRAB
    </span>

Answer 1: It's not an ideal solution, but it should do the trick:

    from scrapy import Selector

    content = """
    <span class="navigation_page">
      <span>
        <a itemprop="url" href="http://www.example.com">
          <span itemprop="title">LINK</span>
        </a>
      </span>
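A hedged completion of that idea: parse the markup with a Selector and take only the text nodes that are direct, non-blank children of the outer span, which leaves exactly the stray text:

```python
from scrapy import Selector

content = """
<span class="navigation_page">
  <span>
    <a itemprop="url" href="http://www.example.com">
      <span itemprop="title">LINK</span>
    </a>
  </span>
  <span class="navigation-pipe">></span>
  TEXT TO GRAB
</span>
"""

sel = Selector(text=content)
# Direct text children of the outer span; whitespace-only nodes are discarded.
# 'TEXT TO GRAB' is such a child, while 'LINK' and '>' sit inside nested spans.
texts = sel.xpath(
    '//span[@class="navigation_page"]/text()[normalize-space()]'
).getall()
print([t.strip() for t in texts])  # ['TEXT TO GRAB']
```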

How to scrape paginated links in Scrapy?

纵饮孤独 submitted on 2019-12-24 18:56:22
Question: The code for my Scrapy spider is:

    import scrapy

    class DummymartSpider(scrapy.Spider):
        name = 'dummymart'
        allowed_domains = ['www.dummymart.com/product']
        start_urls = ['https://www.dummymart.net/product/auto-parts--118']

        def parse(self, response):
            Company = response.xpath('//*[@class="word-wrap item-title"]/text()').extract()
            for item in zip(Company):
                scraped_info = {
                    'Company': item[0],
                }
                yield scraped_info
            next_page_url = response.css('li >a::attr(href)').extract_first()
            #next_page_url = response
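A minimal pagination sketch in the usual Scrapy idiom. The 'li.next > a' selector is a guess at the next-page link; note too that allowed_domains should contain bare domains, not URL paths, or the offsite middleware may filter the requests:

```python
import scrapy


class DummymartSpider(scrapy.Spider):
    name = "dummymart"
    # allowed_domains takes domains only; a path such as "/product"
    # does not belong here.
    allowed_domains = ["dummymart.net"]
    start_urls = ["https://www.dummymart.net/product/auto-parts--118"]

    def parse(self, response):
        # Yield one item per company name on the current page.
        for company in response.xpath(
            '//*[@class="word-wrap item-title"]/text()'
        ).getall():
            yield {"Company": company}

        # Hypothetical selector: adjust to whatever marks the "next" link.
        next_page = response.css("li.next > a::attr(href)").get()
        if next_page:
            # response.follow resolves relative URLs and re-enters parse().
            yield response.follow(next_page, callback=self.parse)
```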

How to bypass a 'cookiewall' when using Scrapy?

时光毁灭记忆、已成空白 submitted on 2019-12-24 18:55:06
Question: I'm a new Scrapy user. After following the tutorials for extracting data from websites, I am trying to accomplish something similar on forums. What I want is to extract all posts on a forum page (to start with). However, this particular forum has a 'cookie wall', so when I want to extract from http://forum.fok.nl/topic/2413069, each session I first need to click the "Yes, I accept cookies" button. My very basic scraper currently looks like this:

    class FokSpider(scrapy.Spider):
        name = 'fok'
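A common pattern for cookie walls, sketched here with hypothetical form and button names (inspect the real page for the actual ones): submit the consent form with FormRequest.from_response, then re-enter the same callback once the wall is gone.

```python
import scrapy
from scrapy.http import FormRequest


class FokSpider(scrapy.Spider):
    name = "fok"
    start_urls = ["http://forum.fok.nl/topic/2413069"]

    def parse(self, response):
        # If the cookie wall intercepted us, submit its consent form first.
        # "form#cookiewall" and the "allow" button are placeholders.
        if response.css("form#cookiewall"):
            yield FormRequest.from_response(
                response,
                formcss="form#cookiewall",
                clickdata={"name": "allow"},
                callback=self.parse,  # retry the page once the cookie is set
            )
            return

        # Past the wall: extract the posts (selector is illustrative).
        for post in response.css("div.post"):
            yield {"text": post.css("::text").getall()}
```

Scrapy's cookie middleware keeps the consent cookie for the rest of the session, so subsequent requests pass straight through.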

Following information in nested div and span tags using Scrapy

有些话、适合烂在心里 submitted on 2019-12-24 18:44:52
Question: I am trying to build a web crawler with Scrapy in Python that extracts the information Google shows on the right side when you make a search; for example, I want to extract the information in the box on the right side. The link is: search in google. The source code: source code. Part of the HTML code is:

    <div class="g rhsvw kno-kp mnr-c g-blk" lang="es-419" data-hveid="CAoQAA" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQjh8oAHoECAoQAA">
      <div class="kp-blk knowledge-panel Wnoohf OJXvsb"
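A hedged sketch of walking that nested structure. The 'knowledge-panel' class token and the text-gathering XPath are assumptions, and Google's markup and anti-bot measures change frequently, so treat this purely as an illustration of descending nested div/span trees:

```python
import scrapy


class KnowledgePanelSpider(scrapy.Spider):
    # Illustrative only: Google actively blocks automated scraping.
    name = "knowledge_panel"
    start_urls = ["https://www.google.com/search?q=example"]

    def parse(self, response):
        # Match on the one class token that looks stable; the obfuscated
        # ones (Wnoohf, OJXvsb, ...) rotate between page builds.
        panel = response.xpath('//div[contains(@class, "knowledge-panel")]')
        # Gather every text node nested anywhere inside the panel's spans.
        texts = panel.xpath(".//span//text()").getall()
        yield {"panel_text": [t.strip() for t in texts if t.strip()]}
```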

How to run Scrapy/Portia on Azure Web App

孤者浪人 submitted on 2019-12-24 17:25:14
Question: I am trying to run Scrapy or Portia on a Microsoft Azure Web App. I installed Scrapy by creating a virtual environment:

    D:\Python27\Scripts\virtualenv.exe D:\home\Python

and then installed Scrapy:

    D:\home\Python\Scripts\pip install Scrapy

The installation seemed to work, but executing a spider returns the following output:

    D:\home\Python\Scripts\tutorial>d:\home\python\scripts\scrapy.exe crawl example
    2015-09-13 23:09:31 [scrapy] INFO: Scrapy 1.0.3 started (bot: tutorial)
    2015-09-13 23
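The excerpt cuts off before the actual error, but one commonly suggested workaround (an assumption here, not the asker's solution) is to skip the scrapy.exe console wrapper and run the crawl in-process from a plain Python script, which the virtual environment's python.exe can execute directly:

```python
# run_spider.py: launch the crawl without the scrapy.exe wrapper,
# which is often the piece that breaks on restricted hosts.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.example import ExampleSpider  # hypothetical import path

process = CrawlerProcess(get_project_settings())
process.crawl(ExampleSpider)
process.start()  # blocks until the crawl finishes
```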

How to split output from a list of URLs in Scrapy

强颜欢笑 submitted on 2019-12-24 16:27:26
Question: I am trying to generate a CSV file for each scraped URL from a list of URLs in Scrapy. I understand that I should modify pipeline.py; however, all my attempts have failed so far. I do not understand how I can pass the URL being scraped to the pipeline, use it as the name for the output, and split the output accordingly. Any help? Thanks. Here are the spider and the pipeline:

    from scrapy import Spider
    from scrapy.selector import Selector
    from vApp.items import fItem

    class VappSpider(Spider):
        name =
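One way this is commonly done, as a sketch: have the spider stamp each item with the page it came from (e.g. item["source_url"] = response.url), then keep one CSV writer per URL in the pipeline. The field name and the file-naming scheme are assumptions:

```python
# pipelines.py: one CSV file per source URL (a sketch; assumes plain-dict
# items that carry a "source_url" field set by the spider).
import csv
from urllib.parse import urlparse


class PerUrlCsvPipeline:
    def open_spider(self, spider):
        self.files = {}  # source URL -> (file handle, csv.DictWriter)

    def process_item(self, item, spider):
        # Set in the spider: item["source_url"] = response.url
        url = item.pop("source_url")
        if url not in self.files:
            # Derive a file name from the URL path, e.g. /a/b -> a_b.csv
            name = urlparse(url).path.strip("/").replace("/", "_") or "index"
            handle = open(f"{name}.csv", "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(handle, fieldnames=list(item.keys()))
            writer.writeheader()
            self.files[url] = (handle, writer)
        self.files[url][1].writerow(item)
        return item

    def close_spider(self, spider):
        for handle, _writer in self.files.values():
            handle.close()
```

Enable it in settings.py through ITEM_PIPELINES as with any other pipeline.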

Empty list returned by XPath in Scrapy

只愿长相守 submitted on 2019-12-24 15:48:20
Question: I am working with Scrapy, trying to gather some data from a site. Spider code:

    class NaaptolSpider(BaseSpider):
        name = "naaptol"
        domain_name = "www.naaptol.com"
        start_urls = ["http://www.naaptol.com/buy/mobile_phones/mobile_handsets.html"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            cell_matter = hxs.select('//div[@class="gridInfo"]/div[@class="gridProduct gridProduct_special"]')
            items = []
            for i in cell_matter:
                cell_names = i.select('//p[@class="proName"]/a/text()')
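The question is cut off, but this pattern contains a classic pitfall worth flagging: inside the loop, i.select('//p[...]') searches the whole document again rather than the current node; a relative './/p[...]' is almost always what's wanted. A modern-Scrapy sketch:

```python
import scrapy


class NaaptolSpider(scrapy.Spider):
    name = "naaptol"
    start_urls = ["http://www.naaptol.com/buy/mobile_phones/mobile_handsets.html"]

    def parse(self, response):
        products = response.xpath(
            '//div[@class="gridInfo"]'
            '/div[@class="gridProduct gridProduct_special"]'
        )
        for product in products:
            # Note the leading ".//": relative to this product node.
            # A bare "//p[...]" would search the entire page on every pass.
            yield {
                "name": product.xpath('.//p[@class="proName"]/a/text()').get(),
            }
```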

Python: Crawlers (Part 1)

随声附和 submitted on 2019-12-24 15:08:42
Crawler preparation; introduction to crawlers; urllib.

Crawler preparation. Reference material: Web Scraping with Python (Turing imprint), Learning Scrapy ("精通Python爬虫框架Scrapy", Posts & Telecom Press), Python 3 web crawling, and the official Scrapy tutorial. Prerequisites: URLs, the HTTP protocol, web front-end (HTML, CSS, JS), AJAX, re, XPath, XML.

Introduction to crawlers. Definition: a web crawler (also known as a web spider or web robot, and in the FOAF community more often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm. Two key traits: it downloads data or content as its author requires, and it moves around the network on its own. Three main steps: download the page; extract the correct information.

Source: CSDN. Author: 若尘. Link: https://blog.csdn.net/qq_29339467/article/details/103681146
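The post's first tool is urllib, so here is a minimal sketch of its "download the page" step, with example.com standing in as a placeholder URL:

```python
from urllib.request import urlopen

# Step 1 of a crawl: download the page (example.com is a placeholder).
with urlopen("http://example.com") as resp:
    html = resp.read().decode("utf-8")

print(html[:200])  # first 200 characters of the fetched markup
```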

Scrapy output issue

╄→гoц情女王★ submitted on 2019-12-24 14:46:36
Question: I am having trouble displaying my items the way I want. My code is as follows:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import request
    from scrapy.selector import HtmlXPathSelector
    from texashealth.items import TexashealthItem

    class texashealthspider(CrawlSpider):
        name = "texashealth"
        allowed_domains = ['jobs.texashealth.org']
        start_urls = ['http://jobs.texashealth.org/search/?&q=&title=Filter%3A%20title
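Whatever the output problem turns out to be, the scrapy.contrib import paths in this snippet have since been removed; on current Scrapy releases the equivalents are:

```python
# Modern equivalents of the deprecated scrapy.contrib imports above.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # replaces SgmlLinkExtractor
from scrapy.http import Request                  # note the capital R
```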