rdd

Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

江枫思渺然 submitted on 2019-12-03 11:00:47
Question: I have the following Spark job, trying to keep everything in memory:

val myOutRDD = myInRDD.flatMap { fp =>
  val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
  :
  tuple2List
}.persist(StorageLevel.MEMORY_ONLY).reduceByKey { (p1, p2) =>
  myMergeFunction(p1, p2)
}.persist(StorageLevel.MEMORY_ONLY)

However, when I looked into the job tracker, I still see a lot of Shuffle Write and Shuffle spill to disk ...

Total task time across all tasks: 49.1 h
Input Size / Records: 21.6 GB / 102123058
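A general Spark point worth noting here (my note, not part of the original post): reduceByKey always writes map-side shuffle files, regardless of any persist or StorageLevel set on the upstream RDD, so persisting before the shuffle does not remove the Shuffle Write metric. A minimal sketch of a leaner pipeline, assuming myMergeFunction is associative and only the final result needs caching (myInRDD, myClass and myMergeFunction are the names from the question):

import scala.collection.mutable.ListBuffer
import org.apache.spark.storage.StorageLevel

// Sketch only: dropping the persist before reduceByKey avoids caching an
// intermediate RDD that is read exactly once; the shuffle files are written either way.
val myOutRDD = myInRDD
  .flatMap { fp =>
    val tuple2List = ListBuffer[(String, myClass)]()
    // ... build the (String, myClass) pairs here ...
    tuple2List
  }
  .reduceByKey(myMergeFunction(_, _))
  .persist(StorageLevel.MEMORY_ONLY)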

spark: access rdd inside another rdd

Anonymous (unverified) submitted on 2019-12-03 10:24:21
Question: I have a lookup RDD of size 6000, lookup_rdd: RDD[String]:

a1
a2
a3
a4
a5
.....

and another RDD, data_rdd: RDD[(String, Iterable[(String, Int)])]: (id, (item, count)), which has unique ids:

(id1,List((a1,2), (a3,4)))
(id2,List((a2,1), (a4,2), (a1,1)))
(id3,List((a5,1)))

For each element in lookup_rdd I want to check whether each id has that element or not; if it is there I put the count, and if it is not I put 0, and store the result in a file. What is an efficient way to achieve this? Is hashing possible? E.g. the output I want is:

id1,2,0,4,0,0
id2,1,1,0,2,0
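Since lookup_rdd holds only about 6000 strings, one common pattern is to collect it to the driver, broadcast it, and build one output line per id without any nested RDD access. A sketch under that assumption (lookup_rdd, data_rdd and the CSV layout come from the question; the broadcast strategy and the output path are mine):

// Assumes a spark-shell style SparkContext `sc`.
val lookupValues = lookup_rdd.collect()          // ~6000 entries, small enough for the driver
val lookupBc = sc.broadcast(lookupValues)

val lines = data_rdd.map { case (id, itemCounts) =>
  val counts = itemCounts.toMap                  // item -> count for this id
  val row = lookupBc.value.map(item => counts.getOrElse(item, 0))
  (id +: row.map(_.toString)).mkString(",")
}
lines.saveAsTextFile("output_path")              // hypothetical output location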

Spark - Nested RDD Operation

Anonymous (unverified) submitted on 2019-12-03 08:54:24
Question: I have two RDDs, say:

rdd1 =
id | created | destroyed | price
1  | 1       | 2         | 10
2  | 1       | 5         | 11
3  | 2       | 3         | 11
4  | 3       | 4         | 12
5  | 3       | 5         | 11

rdd2 = [1,2,3,4,5]  # let's call these values timestamps (ts)

rdd2 is basically generated using range(initial_value, end_value, interval). The params here can vary, and its size can be the same as or different from rdd1's. The idea is to fetch records from rdd1 into rdd2 based on the values of rdd2, using a filtering criteria (records from rdd1 can repeat while fetching, as you can see in the output). Filtering criteria: rdd1.created
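Referencing rdd1 inside a transformation on rdd2 (a nested RDD operation) is not allowed in Spark, so this kind of lookup is usually rewritten as a join or a cartesian product plus a filter. A sketch of that idea, following this page's Scala examples (the Record case class and the withinLifetime predicate are hypothetical, since the question's actual filtering criteria is cut off above):

// Assumes a spark-shell style SparkContext `sc`.
case class Record(id: Int, created: Int, destroyed: Int, price: Int)

// The table from the question.
val rdd1 = sc.parallelize(Seq(
  Record(1, 1, 2, 10), Record(2, 1, 5, 11), Record(3, 2, 3, 11),
  Record(4, 3, 4, 12), Record(5, 3, 5, 11)))

// The timestamps from the question.
val rdd2 = sc.parallelize(1 to 5)

// Hypothetical predicate, used only for illustration.
def withinLifetime(r: Record, ts: Int): Boolean = r.created <= ts && ts < r.destroyed

// Pair the two RDDs explicitly and filter; rdd1 records may repeat across timestamps.
val result = rdd2.cartesian(rdd1)
  .filter { case (ts, r) => withinLifetime(r, ts) }
  .map    { case (ts, r) => (ts, r.id, r.price) }

result.collect().foreach(println)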

Add PySpark RDD as new column to pyspark.sql.dataframe

Anonymous (unverified) submitted on 2019-12-03 08:46:08
Question: I have a pyspark.sql.dataframe where each row is a news article. I also have an RDD that represents the words contained in each article. I want to add the RDD of words as a column named 'words' to my DataFrame of news articles. I tried

df.withColumn('words', words_rdd)

but I get the error

AssertionError: col should be Column

The DataFrame looks something like this:

Articles
the cat and dog ran
we went to the park
today it will rain

but I have 3k news articles. I applied a function to clean the text, such as removing stop words, and I have an RDD
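The question is PySpark, but the underlying idea is the same and is sketched here in Scala to match the rest of this page: an RDD cannot be passed to withColumn directly, so the usual workaround is to give both the DataFrame and the RDD a row index, join on that index, and rebuild the DataFrame. The sample data and column names below are placeholders, and the zipWithIndex alignment is only safe if the words RDD was derived from the DataFrame without reordering.

// Assumes a SparkSession `spark` and SparkContext `sc` (spark-shell style).
import spark.implicits._

val df = Seq("the cat and dog ran", "we went to the park", "today it will rain")
  .toDF("Articles")

val wordsRdd = sc.parallelize(Seq(
  Seq("cat", "dog", "ran"), Seq("went", "park"), Seq("today", "rain")))

// Index both sides, join on the index, then drop it.
val dfIndexed    = df.rdd.zipWithIndex().map { case (row, i) => (i, row.getString(0)) }
val wordsIndexed = wordsRdd.zipWithIndex().map(_.swap)

val withWords = dfIndexed.join(wordsIndexed)          // (index, (article, words))
  .map { case (_, (article, words)) => (article, words) }
  .toDF("Articles", "words")

withWords.show(false)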

Why is my Spark streaming app so slow?

Anonymous (unverified) submitted on 2019-12-03 08:30:34
Question: I have a cluster with 4 nodes: 3 Spark nodes and 1 Solr node. Each machine has an 8-core CPU, 32 GB of memory, and SSD storage. I use Cassandra as my database. My data volume is 22 GB after 6 hours, and I now have around 3.4 million rows, which should be read in under 5 minutes. But it already cannot complete the task in that amount of time. My future plan is to read 100 million rows in under 5 minutes. I am not sure what I can increase or do better to achieve this result now, as well as to reach my future goal. Is that even possible, or would it be

How to partition a RDD

Anonymous (unverified) submitted on 2019-12-03 07:50:05
Question: I have a text file consisting of a large number of random floating-point values separated by spaces. I am loading this file into an RDD in Scala. How does this RDD get partitioned? Also, is there any method to generate custom partitions such that all partitions have an equal number of elements, along with an index for each partition?

val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
val keyval = dRDD.map(x => process(x.trim().split(' ').map(_.toDouble), query_norm, m, r))

Here I am loading multiple text files from HDFS, and process is a function I am
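For context (general Spark behaviour, not part of the original post): sc.textFile partitions the input according to the Hadoop input splits, roughly one partition per HDFS block, and accepts a minPartitions hint; repartition redistributes records into a fixed number of roughly equal partitions; mapPartitionsWithIndex exposes a per-partition index. A sketch using the question's path (the partition counts are arbitrary assumptions):

// Assumes a spark-shell style SparkContext `sc`.
val dRDD = sc.textFile("hdfs://master:54310/Data/input*", minPartitions = 16)

// Shuffle into a fixed number of roughly equal-sized partitions.
val balanced = dRDD.repartition(16)

// Attach the partition index to each parsed record.
val withIndex = balanced.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(line => (idx, line.trim.split(' ').map(_.toDouble)))
}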

How to reverse ordering for RDD.takeOrdered()?

只愿长相守 submitted on 2019-12-03 07:24:51
Question: What is the syntax to reverse the ordering for the takeOrdered() method of an RDD in Spark? For bonus points, what is the syntax for custom ordering for an RDD in Spark?

Answer 1:

Reverse order:

val seq = Seq(3,9,2,3,5,4)
val rdd = sc.parallelize(seq,2)
rdd.takeOrdered(2)(Ordering[Int].reverse)

The result will be Array(9,5).

Custom order: we will sort people by age.

case class Person(name:String, age:Int)
val people = Array(Person("bob", 30), Person("ann", 32), Person("carl", 19))
val rdd = sc
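The answer is cut off above. A sketch of how the custom-ordering example typically continues (the Ordering.by call is my own illustration, not necessarily the original answer's code):

val rdd = sc.parallelize(people, 2)
// takeOrdered returns the n smallest elements under the given Ordering,
// so ordering by age yields the two youngest people.
rdd.takeOrdered(2)(Ordering.by[Person, Int](_.age))
// Reverse the ordering for the two oldest instead.
rdd.takeOrdered(2)(Ordering.by[Person, Int](_.age).reverse)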

How can I efficiently join a large rdd to a very large rdd in spark?

僤鯓⒐⒋嵵緔 submitted on 2019-12-03 06:13:07
I have two RDDs. One RDD has between 5 and 10 million entries and the other has between 500 and 750 million entries. At some point, I have to join these two RDDs using a common key.

val rddA = someData.rdd.map { x => (x.key, x) } // 10 million
val rddB = someData.rdd.map { y => (y.key, y) } // 600 million
var joinRDD = rddA.join(rddB)

When Spark decides to do this join, it decides to do a ShuffledHashJoin. This causes many of the items in rddB to be shuffled over the network; likewise, some of rddA are also shuffled over the network. In this case, rddA is too "big" to use as a broadcast
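One commonly discussed mitigation (a sketch only, not necessarily the accepted answer to this question): hash-partition both sides with the same partitioner and persist the large side, so it is shuffled once by partitionBy and the join (and any later joins against it) can reuse that partitioning instead of reshuffling the 600-million-row RDD each time. The partition count below is a tuning assumption.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val partitioner = new HashPartitioner(1000)

val rddAPart = rddA.partitionBy(partitioner)
val rddBPart = rddB.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)

// Both inputs now share the partitioner, so the join itself does not reshuffle them.
val joinRDD = rddAPart.join(rddBPart)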

Spark: difference of semantics between reduce and reduceByKey

╄→尐↘猪︶ㄣ submitted on 2019-12-03 06:02:35
In Spark's documentation, it says that the RDD method reduce requires an associative AND commutative binary function, whereas the method reduceByKey ONLY requires an associative binary function. I did some tests, and apparently that is the behavior I get. Why this difference? Why does reduceByKey ensure the binary function is always applied in a certain order (to accommodate the lack of commutativity) when reduce does not? For example, if I load some (small) text with 4 partitions (minimum):

val r = sc.textFile("file4kB", 4)

then:

r.reduce(_ + _)

returns a string where parts
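A small experiment along the lines the question describes (the data is made up for illustration; assumes a spark-shell style sc). String concatenation is associative but not commutative, so any order sensitivity in reduce shows up as results that can differ between runs, because the per-partition results are merged on the driver in whatever order the partitions complete.

val r = sc.parallelize(Seq("a", "b", "c", "d", "e", "f", "g", "h"), 4)

// Non-commutative function: with reduce, the final concatenation order
// depends on the order in which partition results arrive at the driver.
val concatenated = r.reduce(_ + _)

// Same single key for every element, so the per-key merge mirrors the global reduce,
// but it happens during the shuffle rather than on the driver.
val perKey = r.map(s => (1, s)).reduceByKey(_ + _).collect()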

How to create a DataFrame from a text file in Spark

可紊 submitted on 2019-12-03 05:08:43
Question: I have a text file on HDFS and I want to convert it to a DataFrame in Spark. I am using the Spark context to load the file and then try to generate individual columns from that file.

val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x => x.split(";"))

After doing this, I am trying the following operation:

myFile1.toDF()

I am getting issues since the elements in the myFile1 RDD are now of array type. How can I solve this issue?

Answer 1: Update - as of Spark 1.6, you can simply use the
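The answer is truncated above. For context, a sketch of the older, general pattern it is contrasting with: map each split line into a case class so that toDF() can infer a schema. The three columns and their names here are assumptions made for illustration.

// Assumes a SparkSession `spark` and SparkContext `sc` (spark-shell style).
import spark.implicits._

case class Line(col1: String, col2: String, col3: String)

val myDF = sc.textFile("file.txt")
  .map(_.split(";"))
  .map(a => Line(a(0), a(1), a(2)))
  .toDF()

myDF.show()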