rdd

Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

江枫思渺然 submitted on 2019-12-03 11:00:47
Question: I have the following Spark job, trying to keep everything in memory:

val myOutRDD = myInRDD.flatMap { fp =>
  val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
  :
  tuple2List
}.persist(StorageLevel.MEMORY_ONLY).reduceByKey { (p1, p2) =>
  myMergeFunction(p1, p2)
}.persist(StorageLevel.MEMORY_ONLY)

However, when I looked into the job tracker, I still see a lot of Shuffle Write and Shuffle spill to disk ...

Total task time across all tasks: 49.1 h
Input Size / Records: 21.6 GB / 102123058
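A general Spark point worth noting here (my note, not part of the original post): reduceByKey always writes map-side shuffle files, regardless of any persist or StorageLevel set on the upstream RDD, so persisting before the shuffle does not remove the Shuffle Write metric. A minimal sketch of a leaner pipeline, assuming myMergeFunction is associative and only the final result needs caching (myInRDD, myClass and myMergeFunction are the names from the question):

import scala.collection.mutable.ListBuffer
import org.apache.spark.storage.StorageLevel

// Sketch only: dropping the persist before reduceByKey avoids caching an
// intermediate RDD that is read exactly once; the shuffle files are written either way.
val myOutRDD = myInRDD
  .flatMap { fp =>
    val tuple2List = ListBuffer[(String, myClass)]()
    // ... build the (String, myClass) pairs here ...
    tuple2List
  }
  .reduceByKey(myMergeFunction(_, _))
  .persist(StorageLevel.MEMORY_ONLY)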

spark: access rdd inside another rdd

Anonymous (unverified) submitted on 2019-12-03 10:24:21
Question: I have a lookup RDD of size 6000, lookup_rdd: RDD[String]:

a1
a2
a3
a4
a5
.....

and another RDD, data_rdd: RDD[(String, Iterable[(String, Int)])]: (id, (item, count)), which has unique ids:

(id1,List((a1,2), (a3,4)))
(id2,List((a2,1), (a4,2), (a1,1)))
(id3,List((a5,1)))

For each element in lookup_rdd I want to check whether each id has that element or not; if it is there I put the count, and if it is not I put 0, and store the result in a file. What is an efficient way to achieve this? Is hashing possible? E.g. the output I want is:

id1,2,0,4,0,0
id2,1,1,0,2,0
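Since lookup_rdd holds only about 6000 strings, one common pattern is to collect it to the driver, broadcast it, and build one output line per id without any nested RDD access. A sketch under that assumption (lookup_rdd, data_rdd and the CSV layout come from the question; the broadcast strategy and the output path are mine):

// Assumes a spark-shell style SparkContext `sc`.
val lookupValues = lookup_rdd.collect()          // ~6000 entries, small enough for the driver
val lookupBc = sc.broadcast(lookupValues)

val lines = data_rdd.map { case (id, itemCounts) =>
  val counts = itemCounts.toMap                  // item -> count for this id
  val row = lookupBc.value.map(item => counts.getOrElse(item, 0))
  (id +: row.map(_.toString)).mkString(",")
}
lines.saveAsTextFile("output_path")              // hypothetical output location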

Spark - Nested RDD Operation

Anonymous (unverified) submitted on 2019-12-03 08:54:24
Question: I have two RDDs, say:

rdd1 =
id | created | destroyed | price
1  | 1       | 2         | 10
2  | 1       | 5         | 11
3  | 2       | 3         | 11
4  | 3       | 4         | 12
5  | 3       | 5         | 11

rdd2 = [1,2,3,4,5]  # let's call these values timestamps (ts)

rdd2 is basically generated using range(initial_value, end_value, interval). The params here can vary, and its size can be the same as or different from rdd1's. The idea is to fetch records from rdd1 into rdd2 based on the values of rdd2, using a filtering criteria (records from rdd1 can repeat while fetching, as you can see in the output). Filtering criteria: rdd1.created
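Referencing rdd1 inside a transformation on rdd2 (a nested RDD operation) is not allowed in Spark, so this kind of lookup is usually rewritten as a join or a cartesian product plus a filter. A sketch of that idea, following this page's Scala examples (the Record case class and the withinLifetime predicate are hypothetical, since the question's actual filtering criteria is cut off above):

// Assumes a spark-shell style SparkContext `sc`.
case class Record(id: Int, created: Int, destroyed: Int, price: Int)

// The table from the question.
val rdd1 = sc.parallelize(Seq(
  Record(1, 1, 2, 10), Record(2, 1, 5, 11), Record(3, 2, 3, 11),
  Record(4, 3, 4, 12), Record(5, 3, 5, 11)))

// The timestamps from the question.
val rdd2 = sc.parallelize(1 to 5)

// Hypothetical predicate, used only for illustration.
def withinLifetime(r: Record, ts: Int): Boolean = r.created <= ts && ts < r.destroyed

// Pair the two RDDs explicitly and filter; rdd1 records may repeat across timestamps.
val result = rdd2.cartesian(rdd1)
  .filter { case (ts, r) => withinLifetime(r, ts) }
  .map    { case (ts, r) => (ts, r.id, r.price) }

result.collect().foreach(println)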

Add PySpark RDD as new column to pyspark.sql.dataframe

Anonymous (unverified) submitted on 2019-12-03 08:46:08
Question: I have a pyspark.sql.dataframe where each row is a news article. I also have an RDD that represents the words contained in each article. I want to add the RDD of words as a column named 'words' to my DataFrame of news articles. I tried

df.withColumn('words', words_rdd)

but I get the error

AssertionError: col should be Column

The DataFrame looks something like this:

Articles
the cat and dog ran
we went to the park
today it will rain

but I have 3k news articles. I applied a function to clean the text, such as removing stop words, and I have an RDD
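The question is PySpark, but the underlying idea is the same and is sketched here in Scala to match the rest of this page: an RDD cannot be passed to withColumn directly, so the usual workaround is to give both the DataFrame and the RDD a row index, join on that index, and rebuild the DataFrame. The sample data and column names below are placeholders, and the zipWithIndex alignment is only safe if the words RDD was derived from the DataFrame without reordering.

// Assumes a SparkSession `spark` and SparkContext `sc` (spark-shell style).
import spark.implicits._

val df = Seq("the cat and dog ran", "we went to the park", "today it will rain")
  .toDF("Articles")

val wordsRdd = sc.parallelize(Seq(
  Seq("cat", "dog", "ran"), Seq("went", "park"), Seq("today", "rain")))

// Index both sides, join on the index, then drop it.
val dfIndexed    = df.rdd.zipWithIndex().map { case (row, i) => (i, row.getString(0)) }
val wordsIndexed = wordsRdd.zipWithIndex().map(_.swap)

val withWords = dfIndexed.join(wordsIndexed)          // (index, (article, words))
  .map { case (_, (article, words)) => (article, words) }
  .toDF("Articles", "words")

withWords.show(false)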

Why is my Spark streaming app so slow?

Anonymous (unverified) submitted on 2019-12-03 08:30:34
Question: I have a cluster with 4 nodes: 3 Spark nodes and 1 Solr node. Each machine has an 8-core CPU, 32 GB of memory, and SSD storage. I use Cassandra as my database. My data volume is 22 GB after 6 hours, and I now have around 3.4 million rows, which should be read in under 5 minutes. But it already cannot complete the task in that amount of time. My future plan is to read 100 million rows in under 5 minutes. I am not sure what I can increase or do better to achieve this result now, as well as to reach my future goal. Is that even possible, or would it be

How to partition a RDD

Anonymous (unverified) submitted on 2019-12-03 07:50:05
Question: I have a text file consisting of a large number of random floating-point values separated by spaces. I am loading this file into an RDD in Scala. How does this RDD get partitioned? Also, is there any method to generate custom partitions such that all partitions have an equal number of elements, along with an index for each partition?

val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
val keyval = dRDD.map(x => process(x.trim().split(' ').map(_.toDouble), query_norm, m, r))

Here I am loading multiple text files from HDFS, and process is a function I am
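For context (general Spark behaviour, not part of the original post): sc.textFile partitions the input according to the Hadoop input splits, roughly one partition per HDFS block, and accepts a minPartitions hint; repartition redistributes records into a fixed number of roughly equal partitions; mapPartitionsWithIndex exposes a per-partition index. A sketch using the question's path (the partition counts are arbitrary assumptions):

// Assumes a spark-shell style SparkContext `sc`.
val dRDD = sc.textFile("hdfs://master:54310/Data/input*", minPartitions = 16)

// Shuffle into a fixed number of roughly equal-sized partitions.
val balanced = dRDD.repartition(16)

// Attach the partition index to each parsed record.
val withIndex = balanced.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(line => (idx, line.trim.split(' ').map(_.toDouble)))
}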

How to reverse ordering for RDD.takeOrdered()?

只愿长相守 submitted on 2019-12-03 07:24:51
Question: What is the syntax to reverse the ordering for the takeOrdered() method of an RDD in Spark? For bonus points, what is the syntax for custom ordering for an RDD in Spark?

Answer 1:

Reverse order:

val seq = Seq(3,9,2,3,5,4)
val rdd = sc.parallelize(seq,2)
rdd.takeOrdered(2)(Ordering[Int].reverse)

The result will be Array(9,5).

Custom order: we will sort people by age.

case class Person(name:String, age:Int)
val people = Array(Person("bob", 30), Person("ann", 32), Person("carl", 19))
val rdd = sc
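The answer is cut off above. A sketch of how the custom-ordering example typically continues (the Ordering.by call is my own illustration, not necessarily the original answer's code):

val rdd = sc.parallelize(people, 2)
// takeOrdered returns the n smallest elements under the given Ordering,
// so ordering by age yields the two youngest people.
rdd.takeOrdered(2)(Ordering.by[Person, Int](_.age))
// Reverse the ordering for the two oldest instead.
rdd.takeOrdered(2)(Ordering.by[Person, Int](_.age).reverse)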

How can I efficiently join a large rdd to a very large rdd in spark?

僤鯓⒐⒋嵵緔 submitted on 2019-12-03 06:13:07
I have two RDDs. One RDD has between 5 and 10 million entries and the other has between 500 and 750 million entries. At some point, I have to join these two RDDs using a common key.

val rddA = someData.rdd.map { x => (x.key, x) } // 10 million
val rddB = someData.rdd.map { y => (y.key, y) } // 600 million
var joinRDD = rddA.join(rddB)

When Spark decides to do this join, it decides to do a ShuffledHashJoin. This causes many of the items in rddB to be shuffled over the network; likewise, some of rddA are also shuffled over the network. In this case, rddA is too "big" to use as a broadcast
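One commonly discussed mitigation (a sketch only, not necessarily the accepted answer to this question): hash-partition both sides with the same partitioner and persist the large side, so it is shuffled once by partitionBy and the join (and any later joins against it) can reuse that partitioning instead of reshuffling the 600-million-row RDD each time. The partition count below is a tuning assumption.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val partitioner = new HashPartitioner(1000)

val rddAPart = rddA.partitionBy(partitioner)
val rddBPart = rddB.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)

// Both inputs now share the partitioner, so the join itself does not reshuffle them.
val joinRDD = rddAPart.join(rddBPart)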

Spark: difference of semantics between reduce and reduceByKey

╄→尐↘猪︶ㄣ submitted on 2019-12-03 06:02:35
In Spark's documentation, it says that the RDD method reduce requires an associative AND commutative binary function, whereas the method reduceByKey ONLY requires an associative binary function. I did some tests, and apparently that is the behavior I get. Why this difference? Why does reduceByKey ensure the binary function is always applied in a certain order (to accommodate the lack of commutativity) when reduce does not? For example, if I load some (small) text with 4 partitions (minimum):

val r = sc.textFile("file4kB", 4)

then:

r.reduce(_ + _)

returns a string where parts
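A small experiment along the lines the question describes (the data is made up for illustration; assumes a spark-shell style sc). String concatenation is associative but not commutative, so any order sensitivity in reduce shows up as results that can differ between runs, because the per-partition results are merged on the driver in whatever order the partitions complete.

val r = sc.parallelize(Seq("a", "b", "c", "d", "e", "f", "g", "h"), 4)

// Non-commutative function: with reduce, the final concatenation order
// depends on the order in which partition results arrive at the driver.
val concatenated = r.reduce(_ + _)

// Same single key for every element, so the per-key merge mirrors the global reduce,
// but it happens during the shuffle rather than on the driver.
val perKey = r.map(s => (1, s)).reduceByKey(_ + _).collect()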

How to create a DataFrame from a text file in Spark

可紊 submitted on 2019-12-03 05:08:43
Question: I have a text file on HDFS and I want to convert it to a DataFrame in Spark. I am using the Spark context to load the file and then try to generate individual columns from that file.

val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x => x.split(";"))

After doing this, I am trying the following operation:

myFile1.toDF()

I am getting issues since the elements in the myFile1 RDD are now of array type. How can I solve this issue?

Answer 1: Update - as of Spark 1.6, you can simply use the
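The answer is truncated above. For context, a sketch of the older, general pattern it is contrasting with: map each split line into a case class so that toDF() can infer a schema. The three columns and their names here are assumptions made for illustration.

// Assumes a SparkSession `spark` and SparkContext `sc` (spark-shell style).
import spark.implicits._

case class Line(col1: String, col2: String, col3: String)

val myDF = sc.textFile("file.txt")
  .map(_.split(";"))
  .map(a => Line(a(0), a(1), a(2)))
  .toDF()

myDF.show()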