rdd

How does Spark's RDD.randomSplit actually split the RDD?

懵懂的女人 submitted on 2019-11-27 02:38:42
Question: So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is partitioned across 100 partitions. When calling RDD.randomSplit(0.8, 0.2), does the function also shuffle the RDD? Or does the splitting simply sample 20% of the RDD contiguously? Or does it select 20% of the partitions randomly? Ideally, does the resulting split have the same class distribution as the original RDD (i.e. 2:1)? Thanks Answer 1: For each range defined by weights
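A quick way to check the class balance empirically is the sketch below; the data is made up to mirror the question. Since randomSplit samples each element independently against the weight ranges (it neither shuffles the RDD nor picks whole partitions), the 2:1 ratio should roughly carry over to both splits.
// Hypothetical data matching the question: 2000 rows of class 1 followed by
// 1000 rows of class 2, spread over 100 partitions.
val rdd = sc.parallelize(Seq.fill(2000)(1) ++ Seq.fill(1000)(2), 100)
// Per-element Bernoulli sampling against the weight ranges; no shuffle happens.
val Array(big, small) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)
// Count each class in the ~20% split to check that the ~2:1 ratio is preserved.
small.map(c => (c, 1)).reduceByKey(_ + _).collect()
// roughly Array((1, ~400), (2, ~200))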

How does the DAG work under the covers in RDDs?

青春壹個敷衍的年華 submitted on 2019-11-27 02:30:11
The Spark research paper proposed a new distributed programming model over classic Hadoop MapReduce, claiming simplification and a vast performance boost in many cases, especially for machine learning. However, the paper seems to lack material on the internal mechanics of Resilient Distributed Datasets and the Directed Acyclic Graph. Is this better learned by investigating the source code? Sathish: I too have been looking on the web to learn how Spark computes the DAG from the RDD and subsequently executes the tasks. At a high level, when any action is called on the
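One practical way to look at the lineage Spark builds before the DAG scheduler cuts it into stages is RDD.toDebugString. The word-count pipeline below is only an illustration (the input path is hypothetical):
// Build a small pipeline; nothing runs until an action is called.
val words  = sc.textFile("hdfs:///path/to/input")
val counts = words.flatMap(_.split("\\s+"))
                  .map((_, 1))
                  .reduceByKey(_ + _)   // introduces a shuffle (stage) boundary
// Print the lineage: the narrow/wide dependencies the scheduler turns into stages.
println(counts.toDebugString)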

Parsing multiline records in Scala

北城余情 submitted on 2019-11-27 02:13:26
Here is my RDD[String]:
M1 module1
PIP a Z A
PIP b Z B
PIP c Y n4
M2 module2
PIP a I n4
PIP b O D
PIP c O n5
and so on. Basically, I need an RDD keyed by the second word on the first line of each record, with the subsequent PIP lines as values that can be iterated over. I've tried the following:
val usgPairRDD = usgRDD.map(x => (x.split("\\n")(0), x))
but this gives me the following output:
(,)
(M1 module1,M1 module1)
(PIP a Z A,PIP a Z A)
(PIP b Z B,PIP b Z B)
(PIP c Y n4,PIP c Y n4)
(,)
(M2 module2,M2 module2)
(PIP a I n4,PIP a I n4)
(PIP b O D,PIP b O D)
(PIP c O n5,PIP c O n5)
Instead, I'd like the output to
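One common approach, assuming each record is separated by a blank line (the empty (,) pairs above suggest blank lines in the input), is to set a custom record delimiter so that every RDD element becomes a whole multiline record; the path and delimiter below are assumptions for illustration:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// Assumption: records are separated by a blank line.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")
val records = sc.newAPIHadoopFile(
    "/path/to/usage.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString.trim }
  .filter(_.nonEmpty)
// Key = second word of the first line, value = the remaining PIP lines.
val usgPairRDD = records.map { rec =>
  val lines = rec.split("\n")
  (lines.head.split("\\s+")(1), lines.tail.toSeq)
}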

Spark groupByKey alternative

风流意气都作罢 submitted on 2019-11-27 01:29:49
According to Databricks best practices, Spark's groupByKey should be avoided because it first shuffles all the data across the workers and only then does the processing. So, my question is: what are the alternatives to groupByKey that return the following in a distributed and fast way? // want this {"key1": "1", "key1": "2", "key1": "3", "key2": "55", "key2": "66"} // to become this {"key1": ["1","2","3"], "key2": ["55","66"]} It seems to me that maybe aggregateByKey or glom could do it first within the partition ( map ) and
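A minimal sketch of the aggregateByKey idea the question suggests, with sample pairs mirroring the JSON above; note that when the goal is literally to collect every value per key, the shuffled data volume ends up similar to groupByKey, the map-side combining just happens explicitly:
// Sample input pairs for illustration.
val pairs = sc.parallelize(Seq(
  ("key1", "1"), ("key1", "2"), ("key1", "3"),
  ("key2", "55"), ("key2", "66")))
// Build per-key lists inside each partition first, then merge across partitions.
val grouped = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // add one value to the partition-local list
  (a, b)   => a ::: b)    // merge lists coming from different partitions
grouped.collect()
// e.g. Array((key1,List(3, 2, 1)), (key2,List(66, 55))) -- element order is not guaranteed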

Spark RDD - are partitions always in RAM?

牧云@^-^@ submitted on 2019-11-27 00:38:39
Question: We all know Spark does its computation in memory. I am just curious about the following. If I create 10 RDDs in my PySpark shell from HDFS, does that mean the data of all 10 RDDs will reside in the Spark workers' memory? If I do not delete an RDD, will it stay in memory forever? If my dataset (file) size exceeds the available RAM, where will the data be stored? Answer 1: If I create 10 RDDs in my PySpark shell from HDFS, does that mean the data of all 10 RDDs will reside in Spark memory? Yes, all 10 RDDs' data will spread in
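A small sketch of how storage levels relate to these questions (shown in Scala for consistency with the other examples; the PySpark API is analogous, and the path is hypothetical):
import org.apache.spark.storage.StorageLevel
// Nothing is materialized until an action runs; persist() only marks the RDD for
// caching, and with MEMORY_AND_DISK any cached partitions that don't fit in RAM
// spill to disk instead of being recomputed.
val rdd = sc.textFile("hdfs:///path/to/data")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // keep in RAM, spill to disk if needed
rdd.count()                                 // first action actually populates the cache
rdd.unpersist()                             // explicitly free the storage memory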

Summary of commonly used Spark operators (2): flatMap

*爱你&永不变心* submitted on 2019-11-26 22:54:48
flatMap is similar to map; the difference is that each element of the original RDD produces exactly one element after map, whereas an element can produce multiple elements after flatMap.
val a = sc.parallelize(1 to 4, 2)
val b = a.flatMap(x => 1 to x) // expand each element x into the range 1 to x
b.collect
/* Result: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4) */
Source: https://www.cnblogs.com/pocahontas/p/11334558.html
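For contrast, a short sketch using the same RDD "a" as above: map keeps exactly one output element per input element, so the ranges are not flattened.
val c = a.map(x => 1 to x)   // one Range per input element, not flattened
c.collect
// an array of four ranges: Range 1 to 1, Range 1 to 2, Range 1 to 3, Range 1 to 4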

How do I select a range of elements in a Spark RDD?

喜夏-厌秋 submitted on 2019-11-26 22:53:18
Question: I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select elements 60 to 80. How do I do that? I see that RDD has a take(i: Int) method, which returns the first i elements, but there is no corresponding method to take the last i elements, or i elements from the middle starting at a certain index. Answer 1: I don't think there is an efficient method to do this yet. But the easy way is using filter(); let's say you have an
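A common workaround is to pair each element with its position via zipWithIndex and then filter the index range; this sketch assumes the RDD's ordering is meaningful (e.g. after sortBy):
val rdd = sc.parallelize(1 to 100)
val slice = rdd.zipWithIndex()
  .filter { case (_, idx) => idx >= 60 && idx < 80 }   // keep positions 60..79
  .map { case (value, _) => value }
slice.collect()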

Spark parquet partitioning: large number of files

梦想与她 submitted on 2019-11-26 22:36:03
Question: I am trying to leverage Spark partitioning. I was doing something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads when reading from the root directory. To avoid that I tried data.coalesce(numPart).write.partitionBy("key").parquet("/location") This however creates numPart parquet files in each partition. Now my partition sizes are different, so I would ideally like to have
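One common fix is to repartition by the same column used in partitionBy before writing, so each key lands in a single task and each output directory ends up with roughly one file; this is a sketch assuming "data" is a DataFrame with a "key" column, as in the question:
import org.apache.spark.sql.functions.col
data
  .repartition(col("key"))   // hash-partition by key: each key handled by one task
  .write
  .partitionBy("key")
  .parquet("/location")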

Which operations preserve RDD order?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-26 21:51:11
An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed by sortBy(), as explained in this reply. Now, which operations preserve that order? E.g., is it guaranteed that (after a.sortBy())
a.map(f).zip(a) === a.map(x => (f(x),x))
How about
a.filter(f).map(g) === a.map(x => (x,g(x))).filter(f(_._1)).map(_._2)
What about
a.filter(f).flatMap(g) === a.flatMap(x => g(x).map((x,_))).filter(f(_._1)).map(_._2)
Here "equality" === is understood as "functional equivalence", i.e., there is no way to distinguish the outcome using user-level operations
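A quick empirical check of the first equivalence (not a proof; the data and the function f are made up): map is a per-element, per-partition transformation, so after sortBy() the zipped pairs should line up.
val a = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6), 4).sortBy(identity)
val f = (x: Int) => x * 10
val left  = a.map(f).zip(a).collect()
val right = a.map(x => (f(x), x)).collect()
left.sameElements(right)   // expected: true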

Spark throws a stack overflow error when unioning a lot of RDDs

a 夏天 submitted on 2019-11-26 21:46:45
Question: When I use "++" to combine a lot of RDDs, I get a stack overflow error. Spark version: 1.3.1. Environment: yarn-client, --driver-memory 8G. There are more than 4000 RDDs, each read from a text file of about 1 GB. The union is generated in this way: val collection = (for (path <- files) yield sc.textFile(path)).reduce(_ union _) It works fine when the files are small, and here is the error. The error repeats itself. I guess it is a recursive function which is called too many
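The usual workaround is SparkContext.union, which builds a single flat UnionRDD instead of a deeply nested chain of binary unions, so the lineage (and the recursion over it) stays shallow; a minimal sketch, assuming "files" is a Seq[String] of input paths as in the question:
val rdds = files.map(path => sc.textFile(path))
val collection = sc.union(rdds)   // one UnionRDD over all inputs, not 4000 nested unions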