rdd | 易学教程

toDF does not compile though import sqlContext.implicits._ is used

Question: I have an issue compiling Spark Scala code when I use toDF to convert an RDD to a DataFrame. I checked both Spark 2.0.0 and 1.6.2, and the problem is the same in both. Below I provide my POM file and the piece of code: POM.xml <?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

GeoSpark: A Detailed Introduction

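GeoSpark
1. Overview
GeoSpark is a cluster computing system for processing large-scale spatial data. It extends Apache Spark / SparkSQL with SRDDs (Spatial Resilient Distributed Datasets) to efficiently load, process, and analyze large-scale spatial data across a cluster. GeoSpark has three layers overall: the top layer is the spatial query processing layer, the bottom layer is the geometric operations library, and the middle layer is the Spatial RDD layer.
2. Modules and concepts
2.1 Modules
GeoSpark consists of four modules: Core, SQL, Viz, and Zeppelin.
Core: provides SpatialRDDs, query operations, and related capabilities.
SQL: GeoSpark's SQL interface, which provides spatial processing for SQL/DataFrame.
Viz: mainly used to visualize SRDDs (Spatial RDDs) and DataFrames by converting a Spatial RDD / Spatial DataFrame into common image formats. GeoSparkViz is a large-scale in-memory spatial visualization system that supports visualizing Spatial RDDs and the results of spatial queries. This module also supports generating tile data.
Zeppelin: a GeoSpark plugin that can visualize spatial data; small datasets can be loaded directly into Zeppelin for visualization.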

[Spark] (7) Understanding Spark partitions / the difference between coalesce and repartition

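Contents: 1. Understanding Spark partitions; 2. The difference between coalesce and repartition (whenever coalesce is mentioned below, its shuffle parameter is assumed to be false); 3. Examples; 4. Summary
1. Understanding Spark partitions
Spark schedules tasks at the vcore level. When reading from HDFS, there are as many partitions as there are blocks. For example, suppose Spark SQL needs to read table T. If table T consists of 10,000 small files, there are 10,000 partitions, and reading becomes inefficient. Assume the resources are set to --executor-memory 2g --executor-cores 2 --num-executors 5. The steps are: take the 10 small files numbered 1-10 (that is, 10 partitions) and hand them to the 5 executors to read (Spark schedules in units of vcores, so in practice 5 executors run 10 tasks that read 10 partitions). If the 5 executors ran at the same speed, files 11-20 would then be handed to these 5 executors in turn. In practice the speeds are never exactly equal, so whichever task finishes first picks up the next partition, and so on. As a result, the time spent scheduling the reads often exceeds the time spent reading the files themselves, file handles are opened and closed frequently, precious I/O resources are wasted, and execution efficiency drops sharply.
2. coalesce and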
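The arithmetic behind the small-files example can be sketched in plain Scala, with no Spark involved. The object and method names here are hypothetical, and the calculation assumes every task takes the same time, which the text itself notes is not true in practice.

```scala
object PartitionWaves {
  // Number of tasks that can run at the same time: one task per vcore.
  def parallelSlots(numExecutors: Int, executorCores: Int): Int =
    numExecutors * executorCores

  // Minimum number of scheduling rounds needed to read all partitions,
  // assuming (as a simplification) that every task takes the same time.
  def schedulingRounds(numPartitions: Int, slots: Int): Int = {
    require(slots > 0, "need at least one task slot")
    (numPartitions + slots - 1) / slots // ceiling division
  }

  def main(args: Array[String]): Unit = {
    // Values taken from the example: 5 executors, 2 cores each, 10,000 files.
    val slots = parallelSlots(numExecutors = 5, executorCores = 2)
    println(s"slots=$slots rounds=${schedulingRounds(10000, slots)}")
  }
}
```

With 10 slots and 10,000 single-file partitions, at least 1,000 rounds of task scheduling are needed, which is why the scheduling overhead can dominate the actual read time.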

Big Data Optimization Approaches: Spark Case Study Optimization (1)

Contents: 1. Requirements; 2. Sample data; 3. Implementation 1; 4. Implementation 2: a custom partitioner with an overridden sort order; 5. Implementation 3: sorting within each partition during the shuffle, plus an alternative implementation using ShuffleRDD
1. Requirements
By analyzing logs of users browsing trending news topics, find the top N most-viewed users under each topic; that is, group by topic and sort within each group.
2. Sample data
Data format: topic, time, id of the viewed user
#高以翔去世#, 2019-11-29, u011
#高以翔去世#, 2019-11-29, u011
#高以翔去世#, 2019-11-29, u011
#高以翔去世#, 2019-11-29, u011
#高以翔去世#, 2019-11-29, u011
#高以翔去世#, 2019-11-29, u008
#高以翔去世#, 2019-11-29, u008
#高以翔去世#, 2019-11-29, u008
#高以翔去世#, 2019-11-29, u008
#高以翔去世#, 2019-11
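The requirement above (group by topic, then rank users within each group) can be sketched on plain Scala collections. This is not one of the article's three implementations, which are cut off in this excerpt. The `View` case class and the `topNPerTopic` helper are hypothetical names, and ties are broken by user id purely so that the result is deterministic.

```scala
object TopicTopN {
  // One log record: topic, date, id of the user that was viewed.
  final case class View(topic: String, date: String, userId: String)

  // For each topic, count views per user and keep the n most-viewed users.
  def topNPerTopic(views: Seq[View], n: Int): Map[String, Seq[(String, Int)]] =
    views
      .groupBy(_.topic)
      .map { case (topic, rows) =>
        val counts = rows
          .groupBy(_.userId)
          .map { case (user, rs) => (user, rs.size) }
          .toSeq
        // Highest count first; equal counts are ordered by user id.
        topic -> counts.sortBy { case (user, count) => (-count, user) }.take(n)
      }
}
```

On a cluster the same shape would need a distributed grouping step, where per-topic data skew and sort cost become the main concerns.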

Spark get top N highest score results for each (item1, item2, score)

Question: I have a DataFrame of the following format: item_id1: Long, item_id2: Long, similarity_score: Double. What I'm trying to do is get the top N records with the highest similarity_score for each item_id1. So, for example:
1 2 0.5
1 3 0.4
1 4 0.3
2 1 0.5
2 3 0.4
2 4 0.3
With top 2 similar items would give:
1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.4
I vaguely guess that it can be done by first grouping records by item_id1, then sorting in reverse by score and then limiting the results. But I'm stuck with how to
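The group, sort, and limit idea described in the question can be sketched on an in-memory collection as follows. This is a plain Scala illustration rather than a DataFrame solution; the `Sim` case class and `topN` helper are hypothetical names introduced for the example.

```scala
object TopNPerKey {
  // One row of the table in the question.
  final case class Sim(itemId1: Long, itemId2: Long, score: Double)

  // Keep the n highest-scoring rows for every itemId1.
  def topN(rows: Seq[Sim], n: Int): Seq[Sim] =
    rows
      .groupBy(_.itemId1)                  // group records by item_id1
      .toSeq
      .sortBy { case (id, _) => id }       // deterministic group order
      .flatMap { case (_, group) =>
        group.sortWith(_.score > _.score)  // sort in reverse by score
          .take(n)                         // limit the results
      }
}
```

On a Spark DataFrame the same result is commonly obtained with a window function, for instance a row number computed over a window partitioned by item_id1 and ordered by score descending; that variant is not shown here.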

36_SparkStreaming (2): Programming

SparkStreaming Programming
1 Transformation: advanced operators
1.1 updateStateByKey

/**
 * Word count
 *
 * Driver service:
 *   result of the previous run (the state)
 * Driver service:
 *   new data
 */
object UpdateStateBykeyWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetWordCount")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    ssc.checkpoint("hdfs://hadoop1:9000/streamingcheckpoint")
    /**
     * Data input
     */
    val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop1", 9999)
    /**
     * Data processing
     *
     * Option:
     *   Some: has a value
     *   None: no value
     *
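Since the excerpt stops before the update logic, here is a hedged sketch of the kind of function updateStateByKey expects: it receives the new values for a key in the current batch plus the previous state as an Option, and returns the new state. The `runBatches` helper is a hypothetical plain-Scala stand-in for successive micro-batches, not Spark code, and it only touches keys that appear in the current batch.

```scala
object RunningWordCount {
  // Same shape as an update function for updateStateByKey:
  // (new values for the key in this batch, previous state) => new state.
  def updateCount(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(newValues.sum + previous.getOrElse(0)) // None means "no state yet"

  // Plain-Scala stand-in for running the update over successive batches.
  def runBatches(batches: Seq[Seq[String]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      // Turn each word of the batch into a list of 1s, grouped by word.
      val onesPerWord = batch.groupBy(identity).map { case (w, ws) => (w, ws.map(_ => 1)) }
      onesPerWord.foldLeft(state) { case (acc, (word, ones)) =>
        updateCount(ones, acc.get(word)) match {
          case Some(count) => acc.updated(word, count)
          case None        => acc - word // returning None would drop the key
        }
      }
    }
}
```

The Option in the signature is what lets the first batch for a key start from zero, which matches the Some/None comment in the original code.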

Spark: difference of semantics between reduce and reduceByKey

Question: Spark's documentation says that the RDD method reduce requires an associative AND commutative binary function. However, the method reduceByKey ONLY requires an associative binary function. sc.textFile("file4kB", 4) I did some tests, and apparently that is the behavior I get. Why this difference? Why does reduceByKey ensure the binary function is always applied in a certain order (to accommodate the lack of commutativity) when reduce does not? For example, if I load some (small) text with 4
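To see why commutativity matters for a reduce over partitions, the following plain-Scala sketch reduces each partition locally and then combines the partial results in a given arrival order. It assumes, as a simplification, that partial results may be merged in whatever order partitions finish. `reduceInOrder` is a hypothetical helper, not a Spark API, and the sketch does not attempt to answer the question about reduceByKey.

```scala
object ReduceOrder {
  // Reduce every (non-empty) partition locally, then combine the partial
  // results in the given arrival order.
  def reduceInOrder[A](partitions: Seq[Seq[A]], arrival: Seq[Int])(f: (A, A) => A): A = {
    val partials = partitions.map(_.reduce(f)) // one partial result per partition
    arrival.map(i => partials(i)).reduce(f)    // merge in arrival order
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("a", "b"), Seq("c", "d"))
    // String concatenation is associative but not commutative.
    println(reduceInOrder(parts, Seq(0, 1))(_ + _))
    println(reduceInOrder(parts, Seq(1, 0))(_ + _))
  }
}
```

With an associative and commutative function such as integer addition, every arrival order gives the same answer; with concatenation, the result depends on which partition's partial result is merged first.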

Spark-SQL Interview Preparation 1

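Spark Knowledge NO.1
1. What is an RDD in Spark, and what are its properties?
Answer: An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel.
Resilient: elastic, meaning that (1) the data in an RDD can be stored in memory or on disk, and (2) the partitions of an RDD can change.
Distributed: the data can be computed in parallel across a cluster.
Dataset: a collection used to hold data.
The five main properties:
A list of partitions: the data in an RDD is stored in a list of partitions.
A function for computing each split: a function applied to each partition.
A list of dependencies on other RDDs: an RDD depends on other RDDs. This point is important, because the RDD fault-tolerance mechanism is built on this property.
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned): optional, and only present for key-value RDDs; it determines where the data comes from and where it goes after processing.
Optionally, data locality: the preferred (best) locations for the data.
2. Summarize the differences between the commonly used operators in Spark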

Creating RDDs

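There are three ways to create an RDD.
1. Create it from memory / a collection

val conf: SparkConf = new SparkConf().setAppName("test01").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd1: RDD[Int] = sc.parallelize(List(1, 2, 3, 4))
val rdd2: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))

Note: makeRDD and parallelize are equivalent; makeRDD calls parallelize internally.
parallelize and makeRDD take another important parameter: the number of partitions into which the dataset is split. Spark runs one task per partition, and normally Spark sets the number of partitions automatically based on your cluster. When creating an RDD with the methods above you can specify the number of partitions. If you do not, the number is determined by the argument passed to setMaster() when the conf was created. If you do, the data is split into that many partitions according to the following rule:

def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>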
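The excerpt cuts off before the body of positions, so the following is only a sketch of an even-split rule of the kind the signature suggests. The start and end formulas are an assumption rather than a quotation of the article, and `slicePositions` and `slice` are hypothetical names. Each slice i covers the half-open index range from i * length / numSlices to (i + 1) * length / numSlices.

```scala
object EvenSlices {
  // Assumed even-split rule: slice i covers indices [start, end).
  def slicePositions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    require(numSlices > 0, "need at least one slice")
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }

  // Apply the positions to an in-memory collection.
  def slice[A](data: Seq[A], numSlices: Int): Seq[Seq[A]] =
    slicePositions(data.length.toLong, numSlices)
      .map { case (start, end) => data.slice(start, end) }
      .toList
}
```

Under this rule, splitting List(1, 2, 3, 4) into 3 slices gives List(1), List(2), and List(3, 4): when the length is not divisible by the slice count, the later slices absorb the extra elements.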
