rdd

How to reverse the result of reduceByKey using RDD API?

Submitted by 我的未来我决定 on 2019-12-11 05:38:25
Question: I have an RDD of (key, value) pairs that I transformed into an RDD of (key, List(value1, value2, value3)) as follows:

val rddInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10)))
val rddReduced = rddInit.groupByKey.mapValues(_.toList)
rddReduced.take(3).foreach(println)

This code gives me the following RDD:

(1,List(2, 3))
(2,List(5, 7))
(3,List(10))

But now I would like to go back to rddInit from the RDD I just computed (the rddReduced RDD). My first guess is to realise some kind of …
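One way to undo the grouping, as a minimal sketch building on the rddReduced defined in the question (not necessarily the answer the thread gives, which is truncated here), is to re-expand each list with flatMapValues:

// flatMapValues emits one (key, value) pair per element of each List,
// reversing the groupByKey above; the original ordering is not guaranteed.
val rddRestored = rddReduced.flatMapValues(identity)
rddRestored.collect.foreach(println)
// expected pairs: (1,2) (1,3) (2,5) (2,7) (3,10)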

Spark Coalesce More Partitions

Submitted by 大憨熊 on 2019-12-11 05:18:01
Question: I have a Spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3 I want to reduce the number of partitions, since each partition is written out as a file. In some other cases I may only have 50 partitions during processing. If I wanted to coalesce rather than repartition for performance reasons, what would happen? The docs say coalesce should only be used if the number of output …
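For context, a minimal sketch of the two options (the S3 paths, the `processed` RDD and the target count of 50 are illustrative): coalesce merges existing partitions without a full shuffle, while repartition always performs a full shuffle to rebalance the data.

// Assuming `processed` stands for the RDD produced by the job's transformations.
val processed: org.apache.spark.rdd.RDD[String] =
  sc.textFile("s3a://my-bucket/input").map(_.trim)

// coalesce(50): combines neighbouring partitions locally, no full shuffle,
// but the resulting files can be skewed in size.
processed.coalesce(50).saveAsTextFile("s3a://my-bucket/out-coalesced")

// repartition(50): full shuffle, more expensive, but evenly sized partitions.
processed.repartition(50).saveAsTextFile("s3a://my-bucket/out-repartitioned")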

Why does pre-partitioning benefit a Spark job by reducing shuffling?

Submitted by 混江龙づ霸主 on 2019-12-11 05:01:49
Question: Many tutorials mention that pre-partitioning an RDD will optimize the data shuffling of Spark jobs. What confuses me is that, to my understanding, pre-partitioning itself also leads to a shuffle, so why does shuffling in advance benefit some later operations? Especially since Spark itself optimizes a set of transformations. For example: if I want to join two datasets, country (id, country) and income (id, (income, month, year)), what is the difference between these two kinds of operation? (I use …
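A minimal sketch of the idea, using illustrative data for the country and income datasets named in the question: pre-partitioning pays off when the same keyed RDD is reused, because its shuffle is paid once and later joins can reuse the known partitioning.

import org.apache.spark.HashPartitioner

val country = sc.parallelize(Seq((1, "US"), (2, "FR"), (3, "DE")))
val income  = sc.parallelize(Seq((1, (1000, 1, 2019)), (2, (800, 2, 2019))))

// Shuffle `country` once into a known partitioning and keep it in memory.
val countryPartitioned = country.partitionBy(new HashPartitioner(8)).persist()

// A join on the same key can now reuse countryPartitioned's partitioner:
// only `income` is shuffled, instead of both sides on every join.
val joined = countryPartitioned.join(income)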

how to combine 3 pair RDDs

Submitted by 笑着哭i on 2019-12-11 04:45:49
Question: I have a sort of complex requirement:

1) For Pinterest: twitter handle, pinterest_post, pinterest_likes
handle "what", 7
JavaPairRDD<String, Pinterest> PintRDD

2) For Instagram: twitter handle, instagram_post, instagram_likes
handle "hello", 10
handle2 "hi", 20
JavaPairRDD<String, Pinterest> instRDD

3) For ontologies: twitter handle, categories, sub_categories
handle, Products, MakeUp
handle, Products, MakeUp
handle2, Services, Face
JavaPairRDD<String, ontologies1> ontologiesPair …
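The question uses the Java API and its record types are not shown, so the following is only a hedged Scala sketch with hypothetical case classes, showing how cogroup can combine three pair RDDs keyed by twitter handle in a single pass:

// Hypothetical stand-ins for the Pinterest, Instagram and ontologies records.
case class Pin(post: String, likes: Int)
case class Insta(post: String, likes: Int)
case class Onto(category: String, subCategory: String)

val pintRDD = sc.parallelize(Seq(("handle", Pin("what", 7))))
val instRDD = sc.parallelize(Seq(("handle", Insta("hello", 10)), ("handle2", Insta("hi", 20))))
val ontoRDD = sc.parallelize(Seq(("handle", Onto("Products", "MakeUp")), ("handle2", Onto("Services", "Face"))))

// cogroup keeps every handle and gathers all values from the three RDDs per key.
val combined = pintRDD.cogroup(instRDD, ontoRDD)
// combined: RDD[(String, (Iterable[Pin], Iterable[Insta], Iterable[Onto]))]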

the sample method of Spark RDD does not work as expected

Submitted by 不打扰是莪最后的温柔 on 2019-12-11 04:45:16
Question: I am experimenting with the sample method of RDD on Spark 1.6.1:

scala> val nu = sc.parallelize(1 to 10)
scala> val sp = nu.sample(true, 0.2)
scala> sp.collect.foreach(println(_))
3
8
scala> val sp2 = nu.sample(true, 0.2)
scala> sp2.collect.foreach(println(_))
2
4
7
8
10

I cannot understand why sp2 contains 2, 4, 7, 8, 10. I think there should be only two numbers printed. Is there anything wrong?

Answer 1: Elaborating on the previous answer: in the documentation (scroll down to sample) it is mentioned (emphasis …
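A minimal sketch of the distinction (not the full answer, which is truncated above): the fraction passed to sample is a per-element probability, i.e. an expected size rather than an exact count, so the result size varies from run to run; takeSample returns an exact number of elements.

val nu = sc.parallelize(1 to 10)

// The sample size is random (expected around 2 here) and differs on every run.
val approx = nu.sample(withReplacement = true, fraction = 0.2)

// takeSample returns exactly `num` elements, as a local Array on the driver.
val exact = nu.takeSample(withReplacement = false, num = 2)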

Calling function inside RDD map function in Spark cluster

Submitted by 纵饮孤独 on 2019-12-11 04:34:44
Question: I was testing a simple string parser function defined in my code, but one of the worker nodes always fails at execution time. Here is the dummy code that I've been testing:

/* JUST A SIMPLE PARSER TO CLEAN PARENTHESIS */
def parseString(field: String): String = {
  val Pattern = "(.*.)".r
  field match {
    case "null" => "null"
    case Pattern(field) => field.replace('(', ' ').replace(')', ' ').replace(" ", "")
  }
}

/* CREATE TWO DISTRIBUTED RDDs TO JOIN THEM */
val emp = sc.parallelize(Seq((1, …
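The accepted answer is not shown above, so this is only a hedged sketch of two common fixes for this symptom: making the match exhaustive, so inputs that fit neither case cannot raise a scala.MatchError on a worker, and defining the helper in a serializable object so Spark can ship it inside the map closure.

object Parsers extends Serializable {
  // A total function: any non-"null" string falls into the second case.
  def parseString(field: String): String = field match {
    case "null" => "null"
    case s      => s.replace("(", "").replace(")", "").replace(" ", "")
  }
}

val cleaned = sc.parallelize(Seq("(a b)", "null", "plain")).map(Parsers.parseString)
cleaned.collect.foreach(println)   // ab, null, plain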

Saving users and items features to HDFS in Spark Collaborative filtering RDD

Submitted by 心不动则不痛 on 2019-12-11 04:16:11
Question: I want to extract the user and item features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, …
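A hedged sketch of the continuation, assuming `ratings` is the RDD built above and treating the rank/iteration/lambda values and the output paths as illustrative: the latent factors live in model.userFeatures and model.productFeatures, both of type RDD[(Int, Array[Double])], and can be saved like any other RDD.

val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda

model.userFeatures
  .map { case (id, factors) => s"$id,${factors.mkString(",")}" }
  .saveAsTextFile("myhdfs/outputdirectory/userFeatures")

model.productFeatures
  .map { case (id, factors) => s"$id,${factors.mkString(",")}" }
  .saveAsTextFile("myhdfs/outputdirectory/productFeatures")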

How to create dynamic groups in a PySpark dataframe?

Submitted by 我的梦境 on 2019-12-11 04:07:07
Question: Though the problem is about creating multiple groups on the basis of the values of two or more columns across consecutive rows, I am simplifying it this way. Suppose I have a PySpark dataframe like this:

>>> df = sqlContext.createDataFrame([
...     Row(SN=1, age=45, gender='M', name='Bob'),
...     Row(SN=2, age=28, gender='M', name='Albert'),
...     Row(SN=3, age=33, gender='F', name='Laura'),
...     Row(SN=4, age=43, gender='F', name='Gloria'),
...     Row(SN=5, age=18, gender='T', name='Simone'),
...     Row(SN=6, age=45, …
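The question is truncated before it states its exact grouping rule, so the following is only a hedged Scala sketch (the DataFrame API has the same shape as in PySpark, and grouping on gender is an assumption) of one common way to build group ids over consecutive rows: flag each row whose gender differs from the previous row, then take a running sum of the flags.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// `df` stands for the dataframe built in the question, ordered by SN.
val w = Window.orderBy("SN")

val grouped = df
  .withColumn("changed",
    when(lag("gender", 1).over(w) === col("gender"), 0).otherwise(1))
  .withColumn("group_id", sum("changed").over(w))   // running sum = group number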

How to transfer a binary file into an RDD in Spark?

Submitted by 馋奶兔 on 2019-12-11 04:06:28
Question: I am trying to load SEG-Y type files into Spark and transfer them into an RDD for a MapReduce operation, but I have failed to turn them into an RDD. Can anyone offer help?

Answer 1: You could use the binaryRecords() PySpark call to convert a binary file's content into an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords

binaryRecords(path, recordLength)
Load data from a flat binary file, assuming each record is a set of numbers with the specified …
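The same call exists on the Scala SparkContext; a minimal sketch, where the path and the 240-byte record length are illustrative (the real length depends on the SEG-Y layout being read):

// Each element of `records` is one fixed-length record as a byte array.
val records: org.apache.spark.rdd.RDD[Array[Byte]] =
  sc.binaryRecords("hdfs:///data/input.segy", 240)

// Decode the bytes however the record layout requires, e.g. read one Int.
val firstInts = records.map(bytes => java.nio.ByteBuffer.wrap(bytes).getInt)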

Spark: Cannot add RDD elements into a mutable HashMap inside a closure

Submitted by [亡魂溺海] on 2019-12-11 02:38:48
Question: I have the following code, where rddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))] and myHashMap is a scala.collection.mutable.HashMap. I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map. However, even though println(" t " + t) is printing things, myHashMap afterwards still has only the one element I manually put in at the beginning, ("test1", ("10", "20")). Nothing from rddMap is put into myHashMap. Snippet code:

val myHashMap = new HashMap[String, (String, …
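The answers are not shown above, so this is only a hedged sketch of the usual explanation and workaround (the sample data is illustrative): the closure passed to map is serialized and executed on the executors, so each task mutates its own copy of myHashMap and the driver's copy never changes; bringing the data back to the driver explicitly avoids mutating driver state from a closure.

import scala.collection.mutable.HashMap

val rddMap: org.apache.spark.rdd.RDD[(String, (String, String))] =
  sc.parallelize(Seq(("test2", ("30", "40")), ("test3", ("50", "60"))))

val myHashMap = HashMap("test1" -> ("10", "20"))

// collectAsMap materialises the RDD's pairs on the driver,
// where the mutable HashMap actually lives.
myHashMap ++= rddMap.collectAsMap()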