rdd

How to get a sample with an exact sample size in Spark RDD?

那年仲夏 submitted on 2019-12-17 10:53:35
Question: Why does the rdd.sample() function on a Spark RDD return a different number of elements even though the fraction parameter is the same? For example, with code like the following: val a = sc.parallelize(1 to 10000, 3) a.sample(false, 0.1).count Every time I run the second line it returns a different number, not equal to 1000. I actually expect to see 1000 every time, even though the 1000 elements themselves might differ. Can anyone tell me how I can get a sample with the sample size exactly equal
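
As an illustrative sketch (assuming a standard spark-shell `sc`): sample() performs per-element Bernoulli sampling, so the count only hovers around fraction × size, whereas takeSample() is an action that returns exactly the requested number of elements as a local Array, which is practical only when the sample fits in driver memory.

```scala
val a = sc.parallelize(1 to 10000, 3)

// sample() is a transformation: each element is kept with probability 0.1,
// so the resulting count fluctuates around 1000.
val approx = a.sample(withReplacement = false, fraction = 0.1)

// takeSample() is an action returning exactly `num` elements as a local Array,
// which is fine as long as the requested sample fits in driver memory.
val exact: Array[Int] = a.takeSample(withReplacement = false, num = 1000, seed = 42L)
```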

Explain the aggregate functionality in Spark

怎甘沉沦 submitted on 2019-12-17 10:15:51
Question: I am looking for a better explanation of the aggregate functionality available via Spark in Python. The example I have is as follows (using pyspark from Spark 1.2.0): sc.parallelize([1,2,3,4]).aggregate( (0, 0), (lambda acc, value: (acc[0] + value, acc[1] + 1)), (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))) Output: (10, 4) I get the expected result (10, 4), which is the sum 1+2+3+4 and the count of 4 elements. If I change the initial value passed to the aggregate function to
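
For comparison, a minimal Scala sketch of the same (sum, count) aggregation: aggregate takes a zero value, a seqOp applied within each partition, and a combOp that merges per-partition results. The zero value seeds the accumulator of every partition, which is why changing it changes the output.

```scala
val rdd = sc.parallelize(1 to 4)

// The zero value seeds the accumulator of every partition (and the final merge),
// so a non-neutral zero value shifts the result.
val (sum, count) = rdd.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),   // seqOp: within a partition
  (a, b) => (a._1 + b._1, a._2 + b._2)            // combOp: across partitions
)
// (sum, count) == (10, 4)
```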

What is the difference between cache and persist?

自闭症网瘾萝莉.ら submitted on 2019-12-17 10:10:01
Question: In terms of RDD persistence, what are the differences between cache() and persist() in Spark? Answer 1: With cache(), you use only the default storage level: MEMORY_ONLY for RDD, MEMORY_AND_DISK for Dataset. With persist(), you can specify which storage level you want, for both RDD and Dataset. From the official docs: you can mark an RDD to be persisted using the persist() or cache() methods on it; each persisted RDD can be stored using a different storage level. The cache() method is a
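
A short sketch of the difference in practice (storage levels as named in the Spark docs):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist() with the default level (MEMORY_ONLY for an RDD).
rdd.cache()

// persist() lets you choose the level explicitly, e.g. spill partitions to disk
// when they do not fit in memory.
val other = sc.parallelize(1 to 1000000)
other.persist(StorageLevel.MEMORY_AND_DISK)

// unpersist() frees the cached data when it is no longer needed.
other.unpersist()
```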

Why does sortBy transformation trigger a Spark job?

不打扰是莪最后的温柔 submitted on 2019-12-17 09:55:46
Question: As per the Spark documentation, only RDD actions can trigger a Spark job; transformations are lazily evaluated until an action is called. Yet I see that the sortBy transformation is applied immediately, and it shows up as a job in the Spark UI. Why? Answer 1: sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (partitioning function) is initialized eagerly and samples the input RDD
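
A small sketch of the observed behaviour (run in a spark-shell; the extra job in the UI is the point, not the output):

```scala
val rdd = sc.parallelize(Seq(5, 3, 1, 4, 2), numSlices = 3)

// sortBy is a transformation, but constructing its RangePartitioner samples the
// input to choose range boundaries, so a small job already appears in the Spark UI
// here, before any action is called on `sorted`.
val sorted = rdd.sortBy(identity)

// The sort itself still runs lazily, when an action is finally invoked.
sorted.collect()   // Array(1, 2, 3, 4, 5)
```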

How to control preferred locations of RDD partitions?

妖精的绣舞 submitted on 2019-12-17 09:52:24
Question: Is there a way to set the preferred locations of RDD partitions manually? I want to make sure a certain partition is computed on a certain machine. I'm using an array and the parallelize method to create an RDD from it. Also, I'm not using HDFS; the files are on the local disk. That's why I want to control the execution node. Answer 1: Is there a way to set the preferredLocations of RDD partitions manually? Yes, there is, but it's RDD-specific, and so different kinds of RDDs have different ways to
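
One built-in option for parallelized collections (a sketch; "host1"/"host2" are placeholder hostnames): SparkContext.makeRDD has an overload that accepts a list of preferred locations per element.

```scala
// Each element gets its own partition, with the given hostnames as preferred locations.
// "host1" and "host2" are placeholders for real worker hostnames.
val rdd = sc.makeRDD(Seq(
  (1, Seq("host1")),
  (2, Seq("host2")),
  (3, Seq("host1", "host2"))   // more than one acceptable location
))

rdd.partitions.indices.foreach { i =>
  println(s"partition $i -> ${rdd.preferredLocations(rdd.partitions(i))}")
}
```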

How to get element by Index in Spark RDD (Java)

不想你离开。 submitted on 2019-12-17 09:21:44
Question: I know the method rdd.first(), which gives me the first element of an RDD. There is also the method rdd.take(num), which gives me the first num elements. But isn't there a way to get an element by index? Thanks. Answer 1: This should be possible by first indexing the RDD. The transformation zipWithIndex provides stable indexing, numbering each element in its original order. Given rdd = (a,b,c): val withIndex = rdd.zipWithIndex // ((a,0),(b,1),(c,2)) To look up an element by index, this
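
Continuing that idea, a hedged sketch: swap the pair so the index becomes the key, then use lookup(), which is an action on pair RDDs.

```scala
val rdd = sc.parallelize(Seq("a", "b", "c"))

// Attach a stable index, then flip the pair so the index is the key.
val byIndex = rdd.zipWithIndex().map { case (value, idx) => (idx, value) }

// lookup() returns all values for a key; indices are unique, so at most one element.
val third: Seq[String] = byIndex.lookup(2L)   // Seq("c")
```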

Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

≯℡__Kan透↙ submitted on 2019-12-17 07:05:09
Question: I want to share this particular Apache Spark with Python solution because the documentation for it is quite poor. I wanted to calculate the average value of K/V pairs (stored in a pairwise RDD), by key. Here is what the sample data looks like: >>> rdd1.take(10) # Show a small sample. [(u'2013-10-09', 7.60117302052786), (u'2013-10-10', 9.322709163346612), (u'2013-10-10', 28.264462809917358), (u'2013-10-07', 9.664429530201343), (u'2013-10-07', 12.461538461538463), (u'2013-10-09', 20.76923076923077)
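
The usual pattern is to carry a (sum, count) pair per key and divide once at the end. A minimal sketch (in Scala rather than PySpark, with made-up values):

```scala
val pairs = sc.parallelize(Seq(
  ("2013-10-09", 7.6), ("2013-10-10", 9.3),
  ("2013-10-10", 28.3), ("2013-10-09", 20.8)
))

// Reduce each key to (sum, count), then divide once per key.
// This avoids groupByKey and keeps the shuffled data small.
val averages = pairs
  .mapValues(v => (v, 1L))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }

averages.collect().foreach(println)
```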

Explanation of fold method of spark RDD

自作多情 submitted on 2019-12-17 06:49:22
Question: I am running Spark 1.4.0 pre-built for Hadoop 2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v) and it gives a surprising result of 97.0. This is quite counter-intuitive compared to the Scala version of fold, Array(2., 3.).fold(0.0)((p, v) => p+v*v), which gives the expected answer 13.0. It seems quite likely that I have made some tricky mistake in the code due to a lack of understanding. I have
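
The short explanation: RDD.fold applies the same function both inside each partition and when merging the per-partition results, so the function must be associative and commutative over those partial results; p + v*v squares the already-summed partition totals during the merge. A sketch of the failure and two safe alternatives:

```scala
// With 2.0 and 3.0 in different partitions, the per-partition results are
// 0 + 2*2 = 4.0 and 0 + 3*3 = 9.0; the merge reuses the same function and squares
// them again: 0 + 4*4 + 9*9 = 97.0 (the exact value depends on partitioning).
sc.parallelize(Array(2.0, 3.0)).fold(0.0)((p, v) => p + v * v)              // 97.0

// Square first, then combine with plain (associative, commutative) addition:
sc.parallelize(Array(2.0, 3.0)).map(x => x * x).fold(0.0)(_ + _)            // 13.0

// Or use aggregate(), which takes separate in-partition and merge functions:
sc.parallelize(Array(2.0, 3.0)).aggregate(0.0)((p, v) => p + v * v, _ + _)  // 13.0
```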

Number of partitions in RDD and performance in Spark

痞子三分冷 submitted on 2019-12-17 06:34:43
Question: In PySpark, I can create an RDD from a list and decide how many partitions to have: sc = SparkContext() sc.parallelize(xrange(0, 10), 4) How does the number of partitions I choose for my RDD influence performance? And how does this depend on the number of cores my machine has? Answer 1: The primary effect comes from specifying too few partitions or far too many partitions. Too few partitions: you will not utilize all of the cores available in the cluster. Too many partitions: there
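
A Scala sketch of inspecting and adjusting the partition count (the Spark tuning guide suggests roughly 2-3 tasks per CPU core as a starting point):

```scala
// Ask for 4 partitions explicitly; without the argument, spark.default.parallelism
// (typically the number of available cores in local mode) is used.
val rdd = sc.parallelize(0 until 10, numSlices = 4)
println(rdd.getNumPartitions)    // 4
println(sc.defaultParallelism)   // e.g. the number of cores in local mode

// Too few partitions leave cores idle; far too many add scheduling overhead.
val wider    = rdd.repartition(16)   // full shuffle to increase the count
val narrower = wider.coalesce(4)     // shrink without a full shuffle
```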

How to transpose an RDD in Spark

自闭症网瘾萝莉.ら submitted on 2019-12-17 05:09:21
Question: I have an RDD like this:
1 2 3
4 5 6
7 8 9
It is a matrix. Now I want to transpose the RDD like this:
1 4 7
2 5 8
3 6 9
How can I do this? Answer 1: Say you have an N×M matrix. If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD, but transposing it is easy: val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))) val transposed = sc.parallelize(rdd.collect.toSeq.transpose) If N or M is so large that you cannot hold N or M
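
For the large case, one distributed approach (a sketch, not the truncated answer's continuation): tag every value with its row and column index, re-key by column, and rebuild each output row in the original row order.

```scala
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))

val transposed = rdd
  .zipWithIndex()                                    // (row, rowIndex)
  .flatMap { case (row, rowIdx) =>
    row.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
  }
  .groupByKey()                                      // one group per column of the input
  .sortByKey()                                       // columns become rows, in order
  .values
  .map(_.toList.sortBy(_._1).map(_._2))              // order each new row by old rowIndex

transposed.collect().foreach(println)
// List(1, 4, 7)
// List(2, 5, 8)
// List(3, 6, 9)
```

Note that groupByKey still brings each whole column to a single executor, so every individual row or column must fit in one executor's memory.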