rdd

Is groupByKey ever preferred over reduceByKey

血红的双手。 submitted on 2019-11-26 05:33:58
Question: I always use reduceByKey when I need to group data in RDDs, because it performs a map-side reduce before shuffling data, which often means that less data gets shuffled around and I thus get better performance. Even when the map-side reduce function collects all values and does not actually reduce the data amount, I still use reduceByKey, because I'm assuming that the performance of reduceByKey will never be worse than groupByKey. However, I'm wondering if this assumption is correct or if…
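As a hedged illustration of the trade-off the question describes (the SparkContext sc, the sample pairs, and all variable names below are assumptions added for illustration, not part of the original post):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // reduceByKey combines values per key within each partition before the shuffle,
    // so only partial results travel across the network
    val sums = pairs.reduceByKey(_ + _)                // RDD[(String, Int)]

    // groupByKey ships every value to the reducer and materializes the whole group
    val groups = pairs.groupByKey()                    // RDD[(String, Iterable[Int])]

    // "collecting all values" with reduceByKey, as in the question: the combiner still
    // runs, but the combined lists are roughly as large as the raw values
    val collected = pairs.mapValues(v => List(v)).reduceByKey(_ ++ _)   // RDD[(String, List[Int])]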

Difference between DataFrame, Dataset, and RDD in Spark

南楼画角 submitted on 2019-11-26 04:56:50
Question: I'm just wondering what is the difference between an RDD and DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark? Can you convert one to the other? Answer 1: A DataFrame is defined well by a google search for "DataFrame definition": A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. So, a DataFrame has additional metadata due to its tabular format,…
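A minimal sketch of the conversions the question asks about, assuming Spark 2.x; the SparkSession, the Person case class, and the variable names are illustrative, not taken from the question:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("rdd-df-ds").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

    // RDD -> DataFrame / Dataset
    val df = rdd.toDF()       // DataFrame, i.e. Dataset[Row]; column names come from the case class
    val ds = rdd.toDS()       // Dataset[Person], keeps the static type

    // DataFrame / Dataset -> RDD
    val rowRdd   = df.rdd     // RDD[Row]
    val typedRdd = ds.rdd     // RDD[Person]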

Spark performance for Scala vs Python

China☆狼群 submitted on 2019-11-26 04:29:24
Question: I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in Scala than in Python for obvious reasons. With that assumption, I thought I would learn and write the Scala version of some very common preprocessing code for about 1 GB of data. The data is taken from the SpringLeaf competition on Kaggle. Just to give an overview of the data: it contains 1936 dimensions and 145232 rows, and is composed of various types, e.g. int, float, string,…

Apache Spark: map vs mapPartitions?

柔情痞子 submitted on 2019-11-26 03:47:06
Question: What's the difference between an RDD's map and mapPartitions method? And does flatMap behave like map or like mapPartitions? Thanks. (edit) i.e. what is the difference (either semantically or in terms of execution) between def map[A, B](rdd: RDD[A], fn: (A => B)) (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = { rdd.mapPartitions({ iter: Iterator[A] => for (i <- iter) yield fn(i) }, preservesPartitioning = true) } And: def map[A, B](rdd: RDD[A], fn: (A => B)) (implicit a: Manifest[A],…
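A short, hedged sketch of the practical difference; the SparkContext sc and all names below are assumptions added for illustration:

    val nums = sc.parallelize(1 to 10, numSlices = 2)

    // map: the function is called once per element
    val doubled = nums.map(_ * 2)

    // mapPartitions: the function is called once per partition and receives an Iterator,
    // which makes it convenient for per-partition setup (e.g. opening one connection)
    val doubledPerPartition = nums.mapPartitions { iter =>
      iter.map(_ * 2)
    }

    // flatMap is element-wise like map, but each element may emit zero or more outputs
    val expanded = nums.flatMap(n => Seq(n, n * 10))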

(Why) do we need to call cache or persist on an RDD

被刻印的时光 ゝ submitted on 2019-11-26 03:24:18
Question: When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in memory by default? val textFile = sc.textFile("/user/emp.txt") As per my understanding, after the above step, textFile is an RDD and is available in all/some of the nodes' memory. If so, why do we need to call "cache" or "persist" on…
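A hedged sketch of why caching matters here; it assumes the SparkContext sc and the path from the question, and the two count actions are invented for illustration:

    val textFile = sc.textFile("/user/emp.txt")    // lazy: nothing is read or stored yet

    // Nothing is kept in memory by default: without cache/persist, every action
    // re-reads the file and recomputes the whole lineage
    textFile.cache()                               // shorthand for persist(StorageLevel.MEMORY_ONLY)
    // import org.apache.spark.storage.StorageLevel
    // textFile.persist(StorageLevel.MEMORY_AND_DISK)   // alternative with an explicit storage level

    val lineCount = textFile.count()                          // first action: reads the file, fills the cache
    val wordCount = textFile.flatMap(_.split(" ")).count()    // reuses the cached partitions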

What does “Stage Skipped” mean in Apache Spark web UI?

放肆的年华 submitted on 2019-11-26 02:19:56
Question: From my Spark UI, what does "skipped" mean? Answer 1: Typically it means that data has been fetched from cache and there was no need to re-execute the given stage. It is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data: Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used…
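A hedged sketch of the situation the answer describes; the SparkContext sc and the small word-count-style job are assumptions for illustration:

    val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
      .reduceByKey(_ + _)                // introduces a shuffle; shuffle files are written to disk

    counts.collect()                     // job 1: runs both the map-side and the reduce-side stage
    counts.mapValues(_ * 2).collect()    // job 2: the map-side stage can show up as "skipped",
                                         // because its shuffle output from job 1 is reused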

Case class equality in Apache Spark

…衆ロ難τιáo~ submitted on 2019-11-26 02:13:59
Question: Why does pattern matching in Spark not work the same as in Scala? See the example below... function f() tries to pattern match on class, which works in the Scala REPL but fails in Spark and results in all "???". f2() is a workaround that gets the desired result in Spark using .isInstanceOf(), but I understand that to be bad form in Scala. Any help on pattern matching the correct way in this scenario in Spark would be greatly appreciated. abstract class a extends Serializable {val a: Int} case…
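A hedged reconstruction of the two shapes being contrasted; the class names and bodies below are illustrative stand-ins (the original snippet is truncated), not the poster's exact code:

    abstract class Base extends Serializable { val a: Int }
    case class Foo(a: Int) extends Base
    case class Bar(a: Int) extends Base

    // f: pattern matching on the case classes
    def f(x: Base): String = x match {
      case Foo(_) => "Foo"
      case Bar(_) => "Bar"
      case _      => "???"
    }

    // f2: the isInstanceOf workaround mentioned in the question
    def f2(x: Base): String =
      if (x.isInstanceOf[Foo]) "Foo"
      else if (x.isInstanceOf[Bar]) "Bar"
      else "???"

    // sc.parallelize(Seq[Base](Foo(1), Bar(2))).map(f).collect()
    // When the classes are defined interactively in spark-shell, the match in f can fall
    // through to "???" on executors; compiling the classes into the application jar is the
    // usual way to make f behave as it does in the plain Scala REPL.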

Basic Spark operations in Java

こ雲淡風輕ζ submitted on 2019-11-26 00:30:39
1. The map operator

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.VoidFunction;

    private static void map() {
        // create the SparkConf
        SparkConf conf = new SparkConf()
                .setAppName("map")
                .setMaster("local");
        // create the JavaSparkContext
        JavaSparkContext sc = new JavaSparkContext(conf);
        // build a local collection
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        // parallelize the collection to create the initial RDD
        JavaRDD<Integer> numberRDD = sc.parallelize(numbers);
        // use the map operator to multiply every element by 2
        JavaRDD<Integer> multipleNumberRDD = numberRDD.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer v1) throws Exception {
                return v1 * 2;
            }
        });
        // print the new RDD (the original snippet is truncated here; printing each element
        // is what this comment describes, so that is how the body is completed)
        multipleNumberRDD.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer t) throws Exception {
                System.out.println(t);
            }
        });
        sc.close();
    }

How do I split an RDD into two or more RDDs?

我的未来我决定 submitted on 2019-11-26 00:25:53
Question: I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD?, which still produces a single RDD. If you're familiar with SAS, it is something like this: data work.split1, work.split2; set work.preSplit; if (condition1) output work.split1 else if (condition2) output work.split2 run; which results in two distinct data sets. It would have to be immediately persisted to get the results I intend... Answer 1: It is not possible to yield…
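A minimal sketch of the usual workaround (one filter per output), assuming a SparkContext sc; the data and the predicates are invented for illustration:

    // A single Spark transformation cannot emit several RDDs, so each split is its own
    // filter over the (cached) parent, analogous to the two SAS output data sets above.
    val preSplit = sc.parallelize(1 to 100).cache()   // cache so the source is computed only once

    val split1 = preSplit.filter(_ % 2 == 0)          // plays the role of work.split1
    val split2 = preSplit.filter(_ % 2 != 0)          // plays the role of work.split2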

Spark - repartition() vs coalesce()

你离开我真会死。 submitted on 2019-11-26 00:24:39
Question: According to Learning Spark: Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions. One difference I get is that with repartition() the number of partitions can be increased or decreased, but with coalesce() the number of partitions can only be decreased. If the partitions are spread across multiple machines…
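A short, hedged sketch of the two calls being compared; the SparkContext sc, the partition counts, and the data are assumptions for illustration:

    val rdd = sc.parallelize(1 to 1000, numSlices = 8)

    val more  = rdd.repartition(16)                      // full shuffle; can increase or decrease partitions
    val fewer = rdd.coalesce(2)                          // merges existing partitions, avoids a full shuffle
    val fewerShuffled = rdd.coalesce(2, shuffle = true)  // with shuffle = true it behaves like repartition

    println(more.getNumPartitions)    // 16
    println(fewer.getNumPartitions)   // 2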