rdd

Why is the fold action necessary in Spark?

只谈情不闲聊 submitted on 2019-11-27 13:22:58
I have a silly question involving fold and reduce in PySpark. I understand the difference between these two methods, but, if both require that the applied function is a commutative monoid, I cannot figure out an example in which fold cannot be substituted by reduce. Besides, the PySpark implementation of fold uses acc = op(obj, acc); why is this operation order used instead of acc = op(acc, obj)? (This second order sounds closer to a leftFold to me.) Cheers

Tomas

Empty RDD. It cannot be substituted when the RDD is empty:

val rdd = sc.emptyRDD[Int]
rdd.reduce(_ + _) // java.lang
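A minimal Scala sketch of the point above, with a local SparkContext built purely for illustration: reduce has no zero element and fails on an empty RDD, while fold returns its explicit zero value.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FoldVsReduce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("fold-vs-reduce"))

    val empty = sc.emptyRDD[Int]

    // reduce has no zero element, so on an empty RDD it throws
    // java.lang.UnsupportedOperationException. Try captures the failure here.
    println(scala.util.Try(empty.reduce(_ + _))) // Failure(...)

    // fold takes an explicit zero value, so an empty RDD simply yields that value.
    println(empty.fold(0)(_ + _)) // 0

    sc.stop()
  }
}
```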

Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?

狂风中的少年 submitted on 2019-11-27 13:12:09
Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?

Daniel Darabos: No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](rdd, part, serializer)
    }
  }
}

Note however, that the
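A hedged sketch of the behaviour described above (names and data invented for illustration): pre-partitioning both sides with the same HashPartitioner lets the join use one-to-one dependencies, so the only shuffles in the lineage are the two partitionBy steps themselves.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("copartitioned-join"))
    val part = new HashPartitioner(4)

    // Both RDDs share the same partitioner and are cached so the partitioning
    // is not recomputed; the join can then use one-to-one dependencies.
    val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
    val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()

    val joined = rdd1.join(rdd2)

    // The lineage shows no ShuffledRDD beyond the two partitionBy steps.
    println(joined.toDebugString)
    joined.collect().foreach(println)

    sc.stop()
  }
}
```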

A good explanation of RDDs seen on Zhihu

℡╲_俬逩灬. submitted on 2019-11-27 12:06:07
The original answer comes from Zhihu user @昆吾.

----------------------------------------------- ORIGINAL ANSWER START ----------------------------------------

First, consider a question: how does Spark's computation model achieve parallelism? Suppose you have a box of apples and want three people to take them home and eat them (silly examples like this are all I can come up with). If you don't open the box it gets awkward, right? It's one box, so only one person can carry it away. Anyone with common sense knows it's better to open the box, pour out the apples, repack them into three smaller boxes, and let each person carry one home to munch on.

Spark, like many other distributed computing systems, borrows this idea to achieve parallelism: chop a huge dataset into N small piles, find M executors (M < N), let each take one or more chunks and work on them, and once results come out, collect them back together; that counts as the job being done. So one thing Spark does is insist that anything it computes must fit its requirements: whatever data Spark processes, it first turns it into a dataset with multiple partitions, and that dataset is called an RDD.

However, because some materials loudly advertise so-called in-memory computing, many people assume an RDD is something like a distributed in-memory store (such as memcache), which is clearly wrong. First, when you write a Spark application yourself and have an RDD in your code, that RDD contains none of the data to be processed (for details see the concepts of Spark's user space versus cluster space); the real data is only loaded at execution time
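A small Scala sketch of the two points above (a deliberately toy example): an RDD is just a partitioned, lazy description of a computation, and nothing is materialized until an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyPartitionedRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[3]").setAppName("lazy-rdd"))

    // Defining the RDD only records its lineage; no data is loaded here.
    val apples = sc.parallelize(1 to 12, numSlices = 3) // the "box" split into 3 smaller boxes
    val chewed = apples.map(n => n * n)                 // still just a recipe

    println(chewed.getNumPartitions) // 3 partitions that executors can process in parallel

    // Only an action forces the partitions to actually be computed.
    println(chewed.sum())

    sc.stop()
  }
}
```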

How to find spark RDD/Dataframe size?

百般思念 submitted on 2019-11-27 11:23:06
I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length)

but if I process it I do not get the file size. How can I find the RDD size?

Glennie Helles Sindholt: If you are simply looking to count the number of rows in the rdd, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the bytes, you can use the SizeEstimator:

import org.apache
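A hedged Scala sketch of the ideas the (truncated) answer points at, using a hypothetical local file: counting rows, summing line lengths for an approximate byte size, and SizeEstimator for the in-memory footprint of data already on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.SizeEstimator

object RddSizeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-size"))
    val distFile = sc.textFile("samplefile.txt") // hypothetical input file

    // Number of rows.
    println(distFile.count())

    // Approximate size of the text in bytes: sum of the UTF-8 length of every line.
    val bytes = distFile.map(_.getBytes("UTF-8").length.toLong).sum()
    println(s"~$bytes bytes of text")

    // SizeEstimator estimates the JVM heap footprint of an object on the driver,
    // so it is only meaningful for data actually brought into driver memory.
    val sample = distFile.take(1000)
    println(SizeEstimator.estimate(sample))

    sc.stop()
  }
}
```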

Join two ordinary RDDs with/without Spark SQL

非 Y 不嫁゛ submitted on 2019-11-27 11:14:13
Question: I need to join two ordinary RDDs on one or more columns. Logically this operation is equivalent to the database join operation of two tables. I wonder if this is possible only through Spark SQL or whether there are other ways of doing it.

As a concrete example, consider RDD r1 with primary key ITEM_ID:

(ITEM_ID, ITEM_NAME, ITEM_UNIT, COMPANY_ID)

and RDD r2 with primary key COMPANY_ID:

(COMPANY_ID, COMPANY_NAME, COMPANY_CITY)

I want to join r1 and r2. How can this be done?

Answer 1: Soumya Simanta gave a
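One way to do this with the plain RDD API, without Spark SQL: key both RDDs by the join column and use the PairRDD join. A sketch follows; the tuple layouts match the question, but the concrete sample rows are invented.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PlainRddJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-join"))

    // (ITEM_ID, ITEM_NAME, ITEM_UNIT, COMPANY_ID)
    val r1 = sc.parallelize(Seq((1, "pencil", "pc", 100), (2, "paper", "box", 101)))
    // (COMPANY_ID, COMPANY_NAME, COMPANY_CITY)
    val r2 = sc.parallelize(Seq((100, "Acme", "Oslo"), (101, "Globex", "Bergen")))

    // Re-key both RDDs by the join column (COMPANY_ID), then join.
    val byCompany1 = r1.map { case (itemId, name, unit, companyId) => (companyId, (itemId, name, unit)) }
    val byCompany2 = r2.map { case (companyId, companyName, city) => (companyId, (companyName, city)) }

    byCompany1.join(byCompany2).collect().foreach(println)
    // e.g. (100,((1,pencil,pc),(Acme,Oslo)))

    sc.stop()
  }
}
```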

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

半腔热情 submitted on 2019-11-27 11:08:05
What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession? Is there any method to convert or create a Context using a SparkSession? Can I completely replace all the Contexts using one single entry SparkSession? Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession? Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession? How can I create the following using a SparkSession: RDD, JavaRDD, JavaPairRDD, Dataset? Is there a method to transform a
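A short sketch of how these pieces relate since Spark 2.x, assuming a local build: SparkSession is the single entry point, the older SparkContext and SQLContext are reachable from it, and a JavaSparkContext can still be wrapped around the same SparkContext (new JavaSparkContext(sc)) for Java-facing code.

```scala
import org.apache.spark.sql.SparkSession

object SessionEntryPoint {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point; the older contexts hang off it.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("session-sketch")
      .getOrCreate()

    val sc = spark.sparkContext // the underlying SparkContext
    // spark.sqlContext is also available for code that still expects a SQLContext.

    // RDDs are still created through the SparkContext...
    val rdd = sc.parallelize(Seq(1, 2, 3))

    // ...while Datasets/DataFrames are created through the session itself.
    import spark.implicits._
    val ds = rdd.toDS()
    ds.show()

    spark.stop()
  }
}
```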

Explain the aggregate functionality in Spark

人盡茶涼 submitted on 2019-11-27 10:56:06
I am looking for a better explanation of the aggregate functionality that is available via Spark in Python. The example I have is as follows (using pyspark from Spark version 1.2.0):

sc.parallelize([1,2,3,4]).aggregate(
  (0, 0),
  (lambda acc, value: (acc[0] + value, acc[1] + 1)),
  (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))

Output: (10, 4)

I get the expected result (10, 4), which is the sum of 1+2+3+4 and the count of the 4 elements. If I change the initial value passed to the aggregate function from (0, 0) to (1, 0), I get the following result:

sc.parallelize([1,2,3,4]).aggregate(
  (1, 0),
  (lambda acc,
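The same computation sketched in Scala: the zero value is applied once at the start of every partition and once more when the per-partition results are combined on the driver, which is what makes a non-neutral zero such as (1, 0) inflate the sum.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("aggregate-sketch"))

    val data = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)

    // Accumulator is (runningSum, runningCount): seqOp folds elements inside a
    // partition, combOp merges the per-partition pairs on the driver.
    val (sum, count) = data.aggregate((0, 0))(
      (acc, value) => (acc._1 + value, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    )
    println(s"sum=$sum count=$count") // sum=10 count=4

    // With zero value (1, 0) and 2 partitions the sum is inflated by
    // numPartitions + 1 = 3, giving (13, 4).
    println(data.aggregate((1, 0))(
      (acc, value) => (acc._1 + value, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ))

    sc.stop()
  }
}
```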

Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

你说的曾经没有我的故事 submitted on 2019-11-27 10:48:47
I want to share this particular Apache Spark with Python solution because documentation for it is quite poor. I wanted to calculate the average value of K/V pairs (stored in a Pairwise RDD), by KEY. Here is what the sample data looks like:

>>> rdd1.take(10) # Show a small sample.
[(u'2013-10-09', 7.60117302052786),
 (u'2013-10-10', 9.322709163346612),
 (u'2013-10-10', 28.264462809917358),
 (u'2013-10-07', 9.664429530201343),
 (u'2013-10-07', 12.461538461538463),
 (u'2013-10-09', 20.76923076923077),
 (u'2013-10-08', 11.842105263157894),
 (u'2013-10-13', 32.32514177693762),
 (u'2013-10-13', 26
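Since the solution above is cut off, here is one common approach sketched in Scala (not necessarily the exact one the original post used), with the sample values rounded: pair each value with a count of 1, reduceByKey, then divide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AveragePerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("avg-per-key"))

    val rdd1 = sc.parallelize(Seq(
      ("2013-10-09", 7.60), ("2013-10-10", 9.32),
      ("2013-10-10", 28.26), ("2013-10-09", 20.77)
    ))

    // 1. Pair each value with a count of 1: (key, (value, 1)).
    // 2. reduceByKey sums values and counts per key in a single shuffle.
    // 3. mapValues divides sum by count; no groupByKey needed.
    val averages = rdd1
      .mapValues(v => (v, 1))
      .reduceByKey { case ((sumA, cntA), (sumB, cntB)) => (sumA + sumB, cntA + cntB) }
      .mapValues { case (sum, cnt) => sum / cnt }

    averages.collect().foreach(println)
    // e.g. (2013-10-10,18.79), (2013-10-09,14.185)

    sc.stop()
  }
}
```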

Spark: DAG, narrow and wide dependencies, Stage, Shuffle

别说谁变了你拦得住时间么 submitted on 2019-11-27 10:47:52
Spark DAG

A DAG (Directed Acyclic Graph) is a directed graph without cycles. The original RDDs form a DAG through a series of transformations: the dependency relationships between RDDs make up the DAG, and the DAG is divided into different Stages according to the kind of dependency between the RDDs.

Narrow and wide dependencies

Narrow dependency: the relationship between parent RDD partitions and child RDD partitions is one-to-one, or many-to-one in the case where each parent partition corresponds to only one child partition. No shuffle is produced: one partition of the parent RDD goes to one partition of the child RDD.

Wide dependency: the relationship between parent RDD partitions and child RDD partitions is one-to-many. A shuffle is produced: the data of one parent partition goes to different partitions of the child RDD.

Stage

A Spark job forms a DAG from the dependencies between its RDDs. The DAG is submitted to the DAGScheduler, which divides it into multiple mutually dependent stages; the basis for the division is the narrow/wide dependencies between the RDDs. A new stage is cut at every wide dependency, and each stage contains one or more tasks. These tasks are then submitted to the TaskScheduler as a TaskSet to run. A stage consists of a group of parallel tasks.

Stage cutting rule: working from the end of the lineage backwards, cut a stage whenever a wide dependency is encountered.

Shuffle
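A small Scala sketch of narrow versus wide dependencies (toy data, for illustration only): map keeps the one-to-one partition relationship, while reduceByKey introduces a shuffle and therefore a stage boundary, which toDebugString makes visible.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NarrowVsWide {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("narrow-vs-wide"))

    val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)

    val pairs  = words.map(w => (w, 1))                            // narrow dependency: no shuffle
    val counts = pairs.reduceByKey(_ + _)                          // wide dependency: shuffle, new stage
    val upper  = counts.map { case (w, n) => (w.toUpperCase, n) }  // narrow again

    // The lineage shows the ShuffledRDD introduced by reduceByKey; the DAGScheduler
    // cuts the job into two stages at exactly that wide dependency.
    println(upper.toDebugString)
    upper.collect().foreach(println)

    sc.stop()
  }
}
```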

Stateful computation in Spark Streaming: the updateStateByKey source code

笑着哭i submitted on 2019-11-27 10:27:24
Please credit the original address when reposting: https://www.cnblogs.com/dongxiao-yang/p/11358781.html

This article is based on Spark source code version 2.4.3.

Streaming computation often needs stateful computation, i.e. scenarios where the current result depends not only on the data received so far but also has to be merged with previous results. Because of Spark Streaming's mini-batch mechanism, the previous state must be stored in an RDD and fetched again in the next batch to be merged in; this is what the updateStateByKey method is for.

A simple example:

def main(args: Array[String]): Unit = {
  val host = "localhost"
  val port = "8001"
  StreamingExamples.setStreamingLogLevels()
  // Create the context with a 10 second batch size
  val sparkConf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  ssc.checkpoint("/Users/dyang/Desktop/checkpoittmp
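Since the example above is cut off, here is a hedged, self-contained sketch in the same spirit (the checkpoint path, host and port are placeholders): updateStateByKey merges the values that arrive in each batch with the state kept from previous batches.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[4]").setAppName("StatefulWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint-demo") // checkpointing is required for stateful operations

    // Merge this batch's values with the running count stored as state.
    val updateCounts: (Seq[Int], Option[Int]) => Option[Int] =
      (newValues, runningCount) => Some(newValues.sum + runningCount.getOrElse(0))

    val lines = ssc.socketTextStream("localhost", 8001)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey(updateCounts)

    counts.print() // running totals per word, refreshed every batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```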