rdd

The difference between reduceByKey and groupByKey

Submitted by 好久不见 on 2019-11-29 05:35:26
reduceByKey: aggregates by key; a combine (map-side pre-aggregation) step runs before the shuffle; the result is an RDD[K, V].
groupByKey: groups by key and goes straight to the shuffle, with no pre-aggregation.
reduceByKey is the recommended choice, but check whether the pre-aggregation affects your business logic.
Source: https://www.cnblogs.com/xiangyuguan/p/11456759.html
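
To make the difference concrete, here is a minimal sketch (not from the post; it assumes a local SparkContext and made-up data) where both operators produce the same counts, but reduceByKey combines values inside each partition before the shuffle while groupByKey ships every pair across the network:

import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduceVsGroup").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)))

    // reduceByKey: values are pre-aggregated (combined) within each partition before the shuffle
    val reduced = pairs.reduceByKey(_ + _)

    // groupByKey: every (key, value) pair is shuffled, then summed afterwards
    val grouped = pairs.groupByKey().mapValues(_.sum)

    println(reduced.collect().toMap) // Map(a -> 3, b -> 2)
    println(grouped.collect().toMap) // same result, but more data moved across the network
    sc.stop()
  }
}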

Why does Spark save Map phase output to local disk?

Submitted by 馋奶兔 on 2019-11-29 04:50:26
Question: I'm trying to understand the Spark shuffle process in depth. While reading about it, I came across the following point: Spark writes the map task (ShuffleMapTask) output directly to disk on completion. I would like to understand the following with respect to Hadoop MapReduce: if both MapReduce and Spark write this data to the local disk, how is the Spark shuffle process different from Hadoop MapReduce's? And since data is represented as RDDs in Spark, why don't these outputs remain in the executors' memory?

Spark notes: how Spark Streaming works

Submitted by 半世苍凉 on 2019-11-29 04:32:16
2.1 How Spark Streaming works
Spark Streaming is a micro-batch streaming engine built on Spark. Its basic principle is to process the input data in batches at a fixed time interval; when the batch interval is shortened to the second level, it can be used to process near-real-time data streams.

2.2 Spark Streaming processing flow
Spark Streaming decomposes a streaming computation into a series of short batch jobs, with Spark Core as the batch engine. The input data is cut into segments according to the batch size (for example, 1 second), forming a Discretized Stream (DStream); each segment becomes a Spark RDD (Resilient Distributed Dataset), so the transformations applied to DStreams in Spark Streaming are executed as transformations on the underlying RDDs, whose intermediate results are kept in memory. Depending on business requirements, these intermediate results can be cached or written out to external storage. The figure below shows the overall flow. [Figure: Spark Streaming architecture diagram]

2.3 Spark Streaming fault tolerance
For streaming computation, fault tolerance is critical. First we need to be clear about the fault-tolerance mechanism of RDDs in Spark: every RDD is an immutable, distributed, recomputable dataset that records the deterministic lineage of the operations that produced it
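
A minimal word-count sketch of this micro-batch model (not from the notes; the socket source, host/port, and the 1-second batch interval are illustrative choices):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: streaming needs at least one thread for the receiver and one for processing
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Each 1-second batch of input becomes one RDD inside the DStream
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    // DStream transformations are applied as RDD transformations on every batch
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}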

how to interpret RDD.treeAggregate

Submitted by 天涯浪子 on 2019-11-29 02:55:41
Question: I ran into these lines in the Apache Spark source code:

val (gradientSum, lossSum, miniBatchSize) = data
  .sample(false, miniBatchFraction, 42 + i)
  .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    seqOp = (c, v) => {
      // c: (grad, loss, count), v: (label, features)
      val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
      (c._1, c._2 + l, c._3 + 1)
    },
    combOp = (c1, c2) => {
      // c: (grad, loss, count)
      (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    }
  )

I have multiple trouble
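
To get a feel for what the seqOp and combOp arguments do before digging into the gradient code, here is a simpler, self-contained sketch of treeAggregate (not from the question; the data, partition count, and depth are arbitrary illustrative choices) that computes a (sum, count) pair:

import org.apache.spark.{SparkConf, SparkContext}

object TreeAggregateDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("treeAggregateDemo").setMaster("local[*]"))
    val nums = sc.parallelize(1 to 100, numSlices = 8)

    // zero value: (running sum, running count)
    val (sum, count) = nums.treeAggregate((0L, 0L))(
      // seqOp: fold one element of a partition into that partition's local accumulator
      seqOp = (acc, x) => (acc._1 + x, acc._2 + 1),
      // combOp: merge two partial accumulators; applied in a tree pattern across executors
      combOp = (a, b) => (a._1 + b._1, a._2 + b._2),
      depth = 2
    )
    println(s"mean = ${sum.toDouble / count}") // 50.5
    sc.stop()
  }
}

Compared with a plain aggregate, treeAggregate merges the per-partition results in several levels (controlled by depth) instead of sending them all to the driver at once.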

DataFrame equality in Apache Spark

Submitted by 时光怂恿深爱的人放手 on 2019-11-29 02:31:33
Question: Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API. Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows and columns? The motivation for the question is that there are often many ways to compute some big data result, each
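
One common approach (sketched here, not taken from the original thread) is to compare the column sets and then run except in both directions; note that except has set semantics, so it will not catch rows whose duplicate counts differ. A helper you could paste into spark-shell:

import org.apache.spark.sql.DataFrame

// Rough equality check: same columns and same rows, ignoring the ordering of rows and columns.
def sameData(df1: DataFrame, df2: DataFrame): Boolean = {
  val cols = df1.columns.sorted
  if (!cols.sameElements(df2.columns.sorted)) return false
  // Align column order before comparing
  val a = df1.select(cols.head, cols.tail: _*)
  val b = df2.select(cols.head, cols.tail: _*)
  a.except(b).count() == 0L && b.except(a).count() == 0L
}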

Spark operators: aggregate

Submitted by 隐身守侯 on 2019-11-29 01:04:01
The aggregate function

I. Definition in the source code:

/**
 * Aggregate the elements of each partition, and then the results for all the partitions, using
 * given combine functions and a neutral "zero value". This function can return a different result
 * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
 * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
 * allowed to modify and return their first argument instead of creating a new U to avoid memory
 * allocation.
 *
 * @param zeroValue the initial value for the accumulated result of each
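
A small usage sketch (with made-up data) showing how the zero value, seqOp, and combOp described in the Scaladoc above fit together, computing a sum and a count in one pass (U = (Int, Int), T = Int):

import org.apache.spark.{SparkConf, SparkContext}

object AggregateDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregateDemo").setMaster("local[*]"))
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

    val (sum, count) = nums.aggregate((0, 0))(
      // seqOp: merge one element (a T) into the per-partition accumulator (a U)
      (acc, x) => (acc._1 + x, acc._2 + 1),
      // combOp: merge two accumulators (two U's) coming from different partitions
      (a, b) => (a._1 + b._1, a._2 + b._2)
    )
    println(s"sum = $sum, count = $count") // sum = 15, count = 5
    sc.stop()
  }
}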

Spark dataframe transform multiple rows to column

Submitted by 泪湿孤枕 on 2019-11-29 00:26:09
Question: I am new to Spark, and I want to transform the source dataframe below (loaded from a JSON file):

+--+-----+-----+
|A |count|major|
+--+-----+-----+
| a|    1|   m1|
| a|    1|   m2|
| a|    2|   m3|
| a|    3|   m4|
| b|    4|   m1|
| b|    1|   m2|
| b|    2|   m3|
| c|    3|   m1|
| c|    4|   m3|
| c|    5|   m4|
| d|    6|   m1|
| d|    1|   m2|
| d|    2|   m3|
| d|    3|   m4|
| d|    4|   m5|
| e|    4|   m1|
| e|    5|   m2|
| e|    1|   m3|
| e|    1|   m4|
| e|    1|   m5|
+--+-----+-----+

Into the result dataframe below:

+--+--+--+--+--+--+
|A |m1|m2|m3|m4|m5|
+--+--+--+--+--+--+
| a|
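
One way to get that shape (a sketch, not necessarily the answer from the original thread) is to group by A and pivot on major; only a few of the source rows are reproduced here for brevity:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object PivotDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pivotDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val src = Seq(("a", 1, "m1"), ("a", 1, "m2"), ("b", 4, "m1"), ("c", 3, "m1"))
      .toDF("A", "count", "major")

    // One row per A, one column per distinct value of major, cell = sum of count.
    // Missing (A, major) combinations come out as null; use .na.fill(0) if zeros are wanted.
    val wide = src.groupBy("A").pivot("major").agg(sum("count"))
    wide.show()
    spark.stop()
  }
}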

How to partition RDD by key in Spark?

Submitted by 谁说胖子不能爱 on 2019-11-28 23:51:16
Given that the HashPartitioner docs say:

[HashPartitioner] implements hash-based partitioning using Java's Object.hashCode.

Say I want to partition DeviceData by its kind.

case class DeviceData(kind: String, time: Long, data: String)

Would it be correct to partition an RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind? But given that HashPartitioner takes a number-of-partitions parameter, I am confused as to whether I need to know the number of kinds in advance, and what happens if there are more kinds than partitions? Is it correct that if I
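
Rather than overriding hashCode on the case class, the more common pattern is to key the RDD by kind and hand the resulting pair RDD to a HashPartitioner; keys that hash to the same bucket share a partition, so with fewer partitions than kinds, several kinds simply land in the same partition. A minimal sketch (the sample data and the partition count of 10 are illustrative choices):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

case class DeviceData(kind: String, time: Long, data: String)

object PartitionByKind {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitionByKind").setMaster("local[*]"))
    val devices = sc.parallelize(Seq(
      DeviceData("thermometer", 1L, "22.1"),
      DeviceData("barometer", 2L, "1013"),
      DeviceData("thermometer", 3L, "21.8")
    ))

    // Key by kind, then repartition the pair RDD: all records of one kind end up in one partition
    val byKind = devices.keyBy(_.kind).partitionBy(new HashPartitioner(10))

    println(byKind.partitioner) // Some(org.apache.spark.HashPartitioner@...)
    sc.stop()
  }
}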

Spark RDD: Section 1, RDD overview; Section 2, creating RDDs

Submitted by 丶灬走出姿态 on 2019-11-28 22:57:34
The Spark computation model: RDD

I. Course objectives
Objective 1: understand how RDDs work
Objective 2: use RDD operators fluently to complete computation tasks
Objective 3: understand narrow and wide RDD dependencies
Objective 4: understand the RDD caching mechanism
Objective 5: understand how stages are divided
Objective 6: understand Spark's task scheduling flow

II. Resilient Distributed Datasets (RDDs)
2. RDD overview
2.1 What is an RDD
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed on in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, locality-aware scheduling, and scalability. RDDs let users explicitly cache data in memory across multiple queries so that later queries can reuse it, which greatly improves query speed.
Dataset: a collection that holds the data.
Distributed: the data in an RDD is stored in a distributed fashion and can be used for distributed computation.
Resilient: the data in an RDD can be kept in memory or on disk.
2.2 Properties of an RDD
1) A list of partitions: a list of partitions, the basic building blocks of the dataset. Each partition of an RDD is processed by one compute task, and the partitioning determines the granularity of parallelism. Users can specify the number of partitions when creating an RDD; if they do not, a default is used (for example, an RDD produced by reading a data file on HDFS has as many partitions as the file has blocks). 2
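
A small sketch of the points above (not from the course material; the file path and partition counts are placeholders) showing an RDD created with an explicit partition count and cached for reuse:

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rddBasics").setMaster("local[*]"))

    // Create an RDD from a local collection, explicitly asking for 4 partitions
    val fromCollection = sc.parallelize(1 to 1000, numSlices = 4)

    // Creating an RDD from an HDFS file would default to one partition per block, e.g.:
    // val fromFile = sc.textFile("hdfs://namenode:8020/data/input.txt")  // placeholder path

    // Cache the derived RDD so later queries reuse the in-memory data instead of recomputing it
    val squares = fromCollection.map(x => x * x).cache()
    println(squares.count()) // first action: computes and caches
    println(squares.sum())   // second action: served from the cache
    sc.stop()
  }
}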

How spark read a large file (petabyte) when file can not be fit in spark's main memory

Submitted by ﹥>﹥吖頭↗ on 2019-11-28 21:41:18
What will happen for large files in these cases?
1) Spark gets the data's locations from the NameNode. Will Spark stop at that point because the data is too large according to the information from the NameNode?
2) Spark partitions the data according to the DataNode block size, but not all of that data can be held in main memory. We are not using StorageLevel here, so what happens?
3) Spark partitions the data, some of it is kept in main memory, and once that in-memory data has been processed, Spark loads the rest from disk.
First of all, Spark only starts reading in the data when an action (like
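
As a small illustration of the lazy-evaluation point the answer begins to make (a sketch with a placeholder path, not part of the original answer): nothing is read when textFile and the transformations are declared; the data is pulled in partition by partition only when an action runs, so the whole file never has to fit in memory at once:

import org.apache.spark.{SparkConf, SparkContext}

object LazyRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazyRead").setMaster("local[*]"))

    // No data is read here: textFile only records how the RDD's partitions map onto the file's blocks
    val lines = sc.textFile("hdfs://namenode:8020/very/large/dataset") // placeholder path

    // Still nothing read: transformations just extend the lineage
    val errors = lines.filter(_.contains("ERROR"))

    // The action triggers the actual reads; partitions are streamed through the executors task by task
    println(errors.count())
    sc.stop()
  }
}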