rdd

Cross combine two RDDs using PySpark

Submitted by 感情迁移 on 2019-12-24 03:24:50
Question: How can I cross-combine (is that the right way to describe it?) two RDDs? Input: rdd1 = [a, b], rdd2 = [c, d]; output: rdd3 = [(a, c), (a, d), (b, c), (b, d)]. I tried rdd3 = rdd1.flatMap(lambda x: rdd2.map(lambda y: (x, y))), but it complains: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation." I guess that means you cannot nest RDD operations the way you would in a list comprehension, and one statement can only perform one action. Answer 1: So as you have
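The nesting fails because an RDD cannot be referenced from inside a transformation on another RDD. The answer above is truncated, but the cross product itself is exactly what RDD.cartesian computes, and PySpark exposes the same method as rdd1.cartesian(rdd2). A minimal Scala sketch, assuming an existing SparkContext named sc:

```scala
val rdd1 = sc.parallelize(Seq("a", "b"))
val rdd2 = sc.parallelize(Seq("c", "d"))

// cartesian pairs every element of rdd1 with every element of rdd2
val rdd3 = rdd1.cartesian(rdd2)   // RDD[(String, String)]
rdd3.collect()                    // the four combinations: (a,c), (a,d), (b,c), (b,d)
```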

Add a new calculated column from 2 values in RDD

Submitted by 纵然是瞬间 on 2019-12-24 03:23:46
Question: I have 2 paired RDDs that I joined together on the same key, and I now want to add a new calculated column using 2 columns from the values part. The joined RDD's type is RDD[((String, Int), Iterable[((String, DateTime, Int, Int), (String, DateTime, String, String))])]. I want to add another field to the new RDD that shows the delta between the 2 DateTime fields. How can I do this? Answer 1: You should be able to do this using map to extend the 2-tuples into 3-tuples, roughly as follows:
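The excerpt ends before the answer's own code. The following is only a sketch of the map-to-3-tuple idea it describes, assuming joda-time DateTime values (as the type signature suggests), a joined RDD named joined with the type shown above, and a delta expressed in milliseconds:

```scala
import org.joda.time.{DateTime, Duration}

// For each joined key, extend every 2-tuple of values into a 3-tuple that also
// carries the delta (in milliseconds) between the two DateTime fields.
val withDelta = joined.mapValues { pairs =>
  pairs.map { case (left @ (_, leftTime, _, _), right @ (_, rightTime, _, _)) =>
    (left, right, new Duration(leftTime, rightTime).getMillis)
  }
}
```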

Spark - Cannot access first element in a JavaRDD using first()

Submitted by 走远了吗. on 2019-12-24 03:07:25
Question: I am using Spark and its Java API. I have loaded data into a JavaRDD<CustomizedDataStructure> like this: JavaRDD<CustomizedDataStructure> myRDD; When I call myRDD.count(); it returns a value, which shows that the RDD does contain data and is not empty. But when I run myRDD.first(); it should return a <CustomizedDataStructure>, yet it gives this error: 14:30:39,782 ERROR [TaskSetManager] Task 0.0 in stage 0.0 (TID 0) had a not serializable result: Why is it not serializable? Answer 1: When you
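The answer excerpt is cut off after "When you". As general background rather than the original answer: first() returns the element to the driver as a task result, so CustomizedDataStructure has to be serializable, either by implementing java.io.Serializable or by switching the serializer to Kryo and registering the class. A minimal Scala sketch of the Kryo route (the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Kryo does not require the element type to implement java.io.Serializable,
// which sidesteps the "not serializable result" error when first() returns a value.
val conf = new SparkConf()
  .setAppName("FirstElementExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[CustomizedDataStructure]))
val sc = new SparkContext(conf)
```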

Scala implicits (implicit) explained in detail

Submitted by 对着背影说爱祢 on 2019-12-24 02:32:29
Article body: With implicit conversions, a programmer can deliberately leave some information out when writing Scala code and let the compiler try to infer it automatically at compile time. This feature can greatly reduce the amount of code and hide the parts that are verbose and overly detailed. 1. Implicits in Spark. Implicit conversion is a major feature of Scala; without a good grasp of it, reading the Spark source code gets confusing. Someone once asked me: the RDD class does not define functions such as reduceByKey or groupByKey, and neither do RDD's subclasses, yet the class PairRDDFunctions does seem to have them, so why can I call these functions on an RDD? The answer is Scala's implicit conversions. To call these functions on an RDD, two preconditions must be met: first, the rdd must be an RDD[(K, V)], i.e. a pair RDD; second, you need to import org.apache.spark.SparkContext._ before using these functions, otherwise the compiler reports that the function does not exist. Looking at the SparkContext object, we find more than ten xxToXx-style functions: implicit def intToIntWritable(i: Int) = new IntWritable(i) implicit def longToLongWritable(l: Long) = new LongWritable(l) implicit def
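To make the point above concrete, a minimal sketch (Spark 1.x-era API, where the conversions live on the SparkContext companion object; sc is an assumed existing SparkContext):

```scala
import org.apache.spark.SparkContext._   // brings rddToPairRDDFunctions into scope

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))  // RDD[(String, Int)]
// reduceByKey is not defined on RDD itself; the implicit conversion wraps the
// pair RDD in PairRDDFunctions, which is where reduceByKey actually lives.
val counts = pairs.reduceByKey(_ + _)
counts.collect()   // Array((a,4), (b,2))
```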

Analysis of the number of tasks and RDDs generated by Spark textFile

Submitted by 只愿长相守 on 2019-12-23 21:02:20
When we use Spark to read a file, it feels easy and fast. But have we thought about how it is implemented underneath? I have summarized a few small questions to think about: 1) How many RDDs are created? 2) When we use the textFile API and specify minPartitions = 3, why does the system create four partitions and four tasks? 3) When Spark reads a file, how is the file split? When we look at the Spark UI, we find that some tasks have input data, so why is the task's output record count 0, as shown in the figure below? Details: 1) Number of RDDs created: https://blog.csdn.net/qq_20064763/article/details/88391284 2) Details of partition and task creation counts: https://blog.csdn.net/qq_20064763/article/details/88393205 Source: CSDN. Author: 乖乖猪001. Link: https://blog.csdn.net/xiaozhaoshigedasb/article/details/103670930
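A minimal sketch of the textFile call the second question refers to (the path is a placeholder); the actual partition count can be checked directly and may exceed the requested minimum, which is the behaviour analysed in the linked posts:

```scala
val lines = sc.textFile("hdfs:///path/to/input.txt", 3)  // minPartitions = 3
// Spark may split the file into more partitions than requested (e.g. 4),
// and one task is scheduled per partition.
println(lines.partitions.length)
```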

How to use Spark intersection() by key or filter() with two RDDs?

Submitted by 余生长醉 on 2019-12-23 18:42:42
Question: I want to use intersection() by key, or filter(), in Spark, but I really don't know how to use intersection() by key, so I tried filter(), and it didn't work. Example - here are two RDDs: data1 //RDD[(String, Int)] = Array(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)) data2 //RDD[(String, Int)] = Array(("a", 3), ("b", 5)) val data3 = data2.map{_._1} data1.filter{_._1 == data3}.collect //Array[(String, Int)] = Array() I want to get a (key, value) pair with the same key as data1, based
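The answer excerpt is not included here. One common way to get the effect the question describes - keeping the pairs in data1 whose key also appears in data2 - is to collect data2's keys and broadcast them, which assumes data2 is small enough to fit on the driver. A sketch under that assumption:

```scala
// Collect the (small) set of keys from data2 and ship it to every executor once.
val keys2 = sc.broadcast(data2.keys.collect().toSet)

// Keep only the data1 pairs whose key is in that set.
val data3 = data1.filter { case (k, _) => keys2.value.contains(k) }
data3.collect()   // keeps the pairs with keys "a" and "b"
```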

When to persist and when to unpersist RDD in Spark

Submitted by 拈花ヽ惹草 on 2019-12-23 18:32:45
Question: Let's say I have the following: val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK) val dataset3 = dataset2.map(.....) If you do a transformation on dataset2, do you then have to persist it, pass it to dataset3, and unpersist the previous one, or not? I am trying to figure out when to persist and when to unpersist RDDs. Do I have to persist every new RDD that is created? Thanks. Answer 1: Spark automatically monitors cache usage on each node and drops out old data partitions in a least
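The answer continues past this excerpt (Spark's least-recently-used eviction). The sketch below shows only the usual manual pattern, with names from the question and a hypothetical transform function: persist an RDD that more than one action will reuse, and unpersist it after the last action that depends on it has run.

```scala
import org.apache.spark.storage.StorageLevel

val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(transform)     // transformation only; nothing is computed yet
dataset3.count()                           // first action materializes and caches dataset2
dataset3.saveAsTextFile("/tmp/out")        // second action reuses the cached partitions
dataset2.unpersist()                       // drop the cache once no later action needs it
```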

Spark - Use RDD.foreach to create a DataFrame and execute actions on the DataFrame

Submitted by 我们两清 on 2019-12-23 18:31:58
Question: I am new to Spark and am trying to figure out a better way to achieve the following scenario. There is a database table containing 3 fields: Category, Amount, Quantity. First I try to pull all the distinct categories from the database: val categories:RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString) Now, for each category, I want to execute the pipeline, which essentially creates a dataframe from each category and applies some machine learning: categories.foreach(executePipeline) def
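The excerpt stops at the definition of executePipeline. As general background rather than the original answer: DataFrames and the SparkSession cannot be used inside RDD.foreach, which runs on the executors, so a common alternative is to collect the distinct categories to the driver and loop there. A sketch, assuming executePipeline takes the per-category DataFrame and that CATEGORY names the column as in the question:

```scala
// Pull the distinct category values back to the driver (assumed to be few).
val categories: Array[String] =
  df.select(CATEGORY).distinct().rdd.map(_.getString(0)).collect()

// Build one DataFrame per category on the driver and hand it to the pipeline.
categories.foreach { category =>
  val categoryDF = df.filter(df(CATEGORY) === category)
  executePipeline(categoryDF)
}
```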

RDD size remains the same even after compressing

Submitted by 喜你入骨 on 2019-12-23 17:41:05
Question: I use a SparkListener to monitor the cached RDDs' sizes. However, I notice that no matter what I do, the RDDs' size always remains the same. I did the following to compress the RDDs: val conf = new SparkConf().setAppName("MyApp") conf.set("spark.rdd.compress","true") conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") ..... val sc = new SparkContext(conf) .... myrdd.persist(MEMORY_ONLY_SER) Even if I remove the second and third lines shown above, the Spark listener
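A sketch of a driver-side cross-check, separate from the asker's SparkListener setup: spark.rdd.compress applies to serialized storage levels such as MEMORY_ONLY_SER, and after an action forces the RDD to be cached, sc.getRDDStorageInfo reports the in-memory size actually used, which is where the effect of compression would show up. Names follow the question.

```scala
import org.apache.spark.storage.StorageLevel

myrdd.persist(StorageLevel.MEMORY_ONLY_SER)
myrdd.count()                              // force materialization so storage info is populated
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.memSize} bytes in memory across ${info.numCachedPartitions} partitions")
}
```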