rdd

Cross combine two RDDs using PySpark

Submitted by 感情迁移 on 2019-12-24 03:24:50
Question: How can I cross-combine (is that the right way to describe it?) two RDDs? Input: rdd1 = [a, b], rdd2 = [c, d]; output: rdd3 = [(a, c), (a, d), (b, c), (b, d)]. I tried rdd3 = rdd1.flatMap(lambda x: rdd2.map(lambda y: (x, y))), but it complains: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation." I guess that means you cannot nest RDD operations the way you would in a list comprehension, and one statement can only perform one action. Answer 1: So as you have
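The nesting fails because an RDD cannot be referenced from inside a transformation on another RDD. The answer above is truncated, but the cross product itself is exactly what RDD.cartesian computes, and PySpark exposes the same method as rdd1.cartesian(rdd2). A minimal Scala sketch, assuming an existing SparkContext named sc:

```scala
val rdd1 = sc.parallelize(Seq("a", "b"))
val rdd2 = sc.parallelize(Seq("c", "d"))

// cartesian pairs every element of rdd1 with every element of rdd2
val rdd3 = rdd1.cartesian(rdd2)   // RDD[(String, String)]
rdd3.collect()                    // the four combinations: (a,c), (a,d), (b,c), (b,d)
```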

Add a new calculated column from 2 values in RDD

Submitted by 纵然是瞬间 on 2019-12-24 03:23:46
Question: I have 2 paired RDDs that I joined together on the same key, and I now want to add a new calculated column using 2 columns from the values part. The joined RDD's type is RDD[((String, Int), Iterable[((String, DateTime, Int, Int), (String, DateTime, String, String))])]. I want to add another field to the new RDD that shows the delta between the 2 DateTime fields. How can I do this? Answer 1: You should be able to do this using map to extend the 2-tuples into 3-tuples, roughly as follows:
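The excerpt ends before the answer's own code. The following is only a sketch of the map-to-3-tuple idea it describes, assuming joda-time DateTime values (as the type signature suggests), a joined RDD named joined with the type shown above, and a delta expressed in milliseconds:

```scala
import org.joda.time.{DateTime, Duration}

// For each joined key, extend every 2-tuple of values into a 3-tuple that also
// carries the delta (in milliseconds) between the two DateTime fields.
val withDelta = joined.mapValues { pairs =>
  pairs.map { case (left @ (_, leftTime, _, _), right @ (_, rightTime, _, _)) =>
    (left, right, new Duration(leftTime, rightTime).getMillis)
  }
}
```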

Spark - Cannot access first element in a JavaRDD using first()

Submitted by 走远了吗. on 2019-12-24 03:07:25
Question: I am using Spark and its Java API. I have loaded data into a JavaRDD<CustomizedDataStructure> like this: JavaRDD<CustomizedDataStructure> myRDD; When I call myRDD.count(); it returns a value, which shows that the RDD does contain data and is not empty. But when I run myRDD.first(); it should return a <CustomizedDataStructure>, yet it gives this error: 14:30:39,782 ERROR [TaskSetManager] Task 0.0 in stage 0.0 (TID 0) had a not serializable result: Why is it not serializable? Answer 1: When you
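The answer excerpt is cut off after "When you". As general background rather than the original answer: first() returns the element to the driver as a task result, so CustomizedDataStructure has to be serializable, either by implementing java.io.Serializable or by switching the serializer to Kryo and registering the class. A minimal Scala sketch of the Kryo route (the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Kryo does not require the element type to implement java.io.Serializable,
// which sidesteps the "not serializable result" error when first() returns a value.
val conf = new SparkConf()
  .setAppName("FirstElementExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[CustomizedDataStructure]))
val sc = new SparkContext(conf)
```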

Scala implicits (implicit) explained in detail

Submitted by 对着背影说爱祢 on 2019-12-24 02:32:29
Article body: With implicit conversions, a programmer can deliberately leave some information out when writing Scala code and let the compiler try to infer it automatically at compile time. This feature can greatly reduce the amount of code and hide the parts that are verbose and overly detailed. 1. Implicits in Spark. Implicit conversion is a major feature of Scala; without a good grasp of it, reading the Spark source code gets confusing. Someone once asked me: the RDD class does not define functions such as reduceByKey or groupByKey, and neither do RDD's subclasses, yet the class PairRDDFunctions does seem to have them, so why can I call these functions on an RDD? The answer is Scala's implicit conversions. To call these functions on an RDD, two preconditions must be met: first, the rdd must be an RDD[(K, V)], i.e. a pair RDD; second, you need to import org.apache.spark.SparkContext._ before using these functions, otherwise the compiler reports that the function does not exist. Looking at the SparkContext object, we find more than ten xxToXx-style functions: implicit def intToIntWritable(i: Int) = new IntWritable(i) implicit def longToLongWritable(l: Long) = new LongWritable(l) implicit def
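To make the point above concrete, a minimal sketch (Spark 1.x-era API, where the conversions live on the SparkContext companion object; sc is an assumed existing SparkContext):

```scala
import org.apache.spark.SparkContext._   // brings rddToPairRDDFunctions into scope

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))  // RDD[(String, Int)]
// reduceByKey is not defined on RDD itself; the implicit conversion wraps the
// pair RDD in PairRDDFunctions, which is where reduceByKey actually lives.
val counts = pairs.reduceByKey(_ + _)
counts.collect()   // Array((a,4), (b,2))
```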

Analysis of the number of tasks and RDDs generated by Spark textFile

Submitted by 只愿长相守 on 2019-12-23 21:02:20
When we use Spark to read a file, it feels easy and fast. But have we thought about how it is implemented underneath? I have summarized a few small questions to think about: 1) How many RDDs are created? 2) When we use the textFile API and specify minPartitions = 3, why does the system create four partitions and four tasks? 3) When Spark reads a file, how is the file split? When we look at the Spark UI, we find that some tasks have input data, so why is the task's output record count 0, as shown in the figure below? Details: 1) Number of RDDs created: https://blog.csdn.net/qq_20064763/article/details/88391284 2) Details of partition and task creation counts: https://blog.csdn.net/qq_20064763/article/details/88393205 Source: CSDN. Author: 乖乖猪001. Link: https://blog.csdn.net/xiaozhaoshigedasb/article/details/103670930
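A minimal sketch of the textFile call the second question refers to (the path is a placeholder); the actual partition count can be checked directly and may exceed the requested minimum, which is the behaviour analysed in the linked posts:

```scala
val lines = sc.textFile("hdfs:///path/to/input.txt", 3)  // minPartitions = 3
// Spark may split the file into more partitions than requested (e.g. 4),
// and one task is scheduled per partition.
println(lines.partitions.length)
```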

How to use Spark intersection() by key or filter() with two RDDs?

Submitted by 余生长醉 on 2019-12-23 18:42:42
Question: I want to use intersection() by key, or filter(), in Spark, but I really don't know how to use intersection() by key, so I tried filter(), and it didn't work. Example - here are two RDDs: data1 //RDD[(String, Int)] = Array(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)) data2 //RDD[(String, Int)] = Array(("a", 3), ("b", 5)) val data3 = data2.map{_._1} data1.filter{_._1 == data3}.collect //Array[(String, Int)] = Array() I want to get a (key, value) pair with the same key as data1, based
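The answer excerpt is not included here. One common way to get the effect the question describes - keeping the pairs in data1 whose key also appears in data2 - is to collect data2's keys and broadcast them, which assumes data2 is small enough to fit on the driver. A sketch under that assumption:

```scala
// Collect the (small) set of keys from data2 and ship it to every executor once.
val keys2 = sc.broadcast(data2.keys.collect().toSet)

// Keep only the data1 pairs whose key is in that set.
val data3 = data1.filter { case (k, _) => keys2.value.contains(k) }
data3.collect()   // keeps the pairs with keys "a" and "b"
```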

When to persist and when to unpersist RDD in Spark

Submitted by 拈花ヽ惹草 on 2019-12-23 18:32:45
Question: Let's say I have the following: val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK) val dataset3 = dataset2.map(.....) If you do a transformation on dataset2, do you then have to persist it, pass it to dataset3, and unpersist the previous one, or not? I am trying to figure out when to persist and when to unpersist RDDs. Do I have to persist every new RDD that is created? Thanks. Answer 1: Spark automatically monitors cache usage on each node and drops out old data partitions in a least
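The answer continues past this excerpt (Spark's least-recently-used eviction). The sketch below shows only the usual manual pattern, with names from the question and a hypothetical transform function: persist an RDD that more than one action will reuse, and unpersist it after the last action that depends on it has run.

```scala
import org.apache.spark.storage.StorageLevel

val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(transform)     // transformation only; nothing is computed yet
dataset3.count()                           // first action materializes and caches dataset2
dataset3.saveAsTextFile("/tmp/out")        // second action reuses the cached partitions
dataset2.unpersist()                       // drop the cache once no later action needs it
```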

Spark - Use RDD.foreach to create a DataFrame and execute actions on the DataFrame

Submitted by 我们两清 on 2019-12-23 18:31:58
Question: I am new to Spark and am trying to figure out a better way to achieve the following scenario. There is a database table containing 3 fields: Category, Amount, Quantity. First I try to pull all the distinct categories from the database: val categories:RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString) Now, for each category, I want to execute the pipeline, which essentially creates a dataframe from each category and applies some machine learning: categories.foreach(executePipeline) def
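The excerpt stops at the definition of executePipeline. As general background rather than the original answer: DataFrames and the SparkSession cannot be used inside RDD.foreach, which runs on the executors, so a common alternative is to collect the distinct categories to the driver and loop there. A sketch, assuming executePipeline takes the per-category DataFrame and that CATEGORY names the column as in the question:

```scala
// Pull the distinct category values back to the driver (assumed to be few).
val categories: Array[String] =
  df.select(CATEGORY).distinct().rdd.map(_.getString(0)).collect()

// Build one DataFrame per category on the driver and hand it to the pipeline.
categories.foreach { category =>
  val categoryDF = df.filter(df(CATEGORY) === category)
  executePipeline(categoryDF)
}
```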

RDD size remains the same even after compressing

Submitted by 喜你入骨 on 2019-12-23 17:41:05
Question: I use a SparkListener to monitor the cached RDDs' sizes. However, I notice that no matter what I do, the RDDs' size always remains the same. I did the following to compress the RDDs: val conf = new SparkConf().setAppName("MyApp") conf.set("spark.rdd.compress","true") conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") ..... val sc = new SparkContext(conf) .... myrdd.persist(MEMORY_ONLY_SER) Even if I remove the second and third lines shown above, the Spark listener
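A sketch of a driver-side cross-check, separate from the asker's SparkListener setup: spark.rdd.compress applies to serialized storage levels such as MEMORY_ONLY_SER, and after an action forces the RDD to be cached, sc.getRDDStorageInfo reports the in-memory size actually used, which is where the effect of compression would show up. Names follow the question.

```scala
import org.apache.spark.storage.StorageLevel

myrdd.persist(StorageLevel.MEMORY_ONLY_SER)
myrdd.count()                              // force materialization so storage info is populated
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.memSize} bytes in memory across ${info.numCachedPartitions} partitions")
}
```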