rdd

How to reverse the result of reduceByKey using RDD API?

Submitted by 我的未来我决定 on 2019-12-11 05:38:25
Question: I have an RDD of (key, value) pairs that I transformed into an RDD of (key, List(value1, value2, value3)) as follows:

val rddInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10)))
val rddReduced = rddInit.groupByKey.mapValues(_.toList)
rddReduced.take(3).foreach(println)

This code gives me the following RDD:

(1,List(2, 3))
(2,List(5, 7))
(3,List(10))

But now I would like to go back to rddInit from the RDD I just computed (the rddReduced RDD). My first guess is to realise some kind of …
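One way to undo the grouping, as a minimal sketch building on the rddReduced defined in the question (not necessarily the answer the thread gives, which is truncated here), is to re-expand each list with flatMapValues:

// flatMapValues emits one (key, value) pair per element of each List,
// reversing the groupByKey above; the original ordering is not guaranteed.
val rddRestored = rddReduced.flatMapValues(identity)
rddRestored.collect.foreach(println)
// expected pairs: (1,2) (1,3) (2,5) (2,7) (3,10)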

Spark Coalesce More Partitions

Submitted by 大憨熊 on 2019-12-11 05:18:01
Question: I have a Spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3 I want to reduce the number of partitions, since each partition is written out as a file. In some other cases I may only have 50 partitions during processing. If I wanted to coalesce rather than repartition for performance reasons, what would happen? The docs say coalesce should only be used if the number of output …
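For context, a minimal sketch of the two options (the S3 paths, the `processed` RDD and the target count of 50 are illustrative): coalesce merges existing partitions without a full shuffle, while repartition always performs a full shuffle to rebalance the data.

// Assuming `processed` stands for the RDD produced by the job's transformations.
val processed: org.apache.spark.rdd.RDD[String] =
  sc.textFile("s3a://my-bucket/input").map(_.trim)

// coalesce(50): combines neighbouring partitions locally, no full shuffle,
// but the resulting files can be skewed in size.
processed.coalesce(50).saveAsTextFile("s3a://my-bucket/out-coalesced")

// repartition(50): full shuffle, more expensive, but evenly sized partitions.
processed.repartition(50).saveAsTextFile("s3a://my-bucket/out-repartitioned")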

Why does pre-partitioning benefit a Spark job by reducing shuffling?

Submitted by 混江龙づ霸主 on 2019-12-11 05:01:49
Question: Many tutorials mention that pre-partitioning an RDD will optimize the data shuffling of Spark jobs. What confuses me is that, to my understanding, pre-partitioning itself also leads to a shuffle, so why does shuffling in advance benefit some later operations? Especially since Spark itself optimizes a set of transformations. For example: if I want to join two datasets, country (id, country) and income (id, (income, month, year)), what is the difference between these two kinds of operation? (I use …
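A minimal sketch of the idea, using illustrative data for the country and income datasets named in the question: pre-partitioning pays off when the same keyed RDD is reused, because its shuffle is paid once and later joins can reuse the known partitioning.

import org.apache.spark.HashPartitioner

val country = sc.parallelize(Seq((1, "US"), (2, "FR"), (3, "DE")))
val income  = sc.parallelize(Seq((1, (1000, 1, 2019)), (2, (800, 2, 2019))))

// Shuffle `country` once into a known partitioning and keep it in memory.
val countryPartitioned = country.partitionBy(new HashPartitioner(8)).persist()

// A join on the same key can now reuse countryPartitioned's partitioner:
// only `income` is shuffled, instead of both sides on every join.
val joined = countryPartitioned.join(income)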

how to combine 3 pair RDDs

Submitted by 笑着哭i on 2019-12-11 04:45:49
Question: I have a sort of complex requirement:

1) For Pinterest: twitter handle, pinterest_post, pinterest_likes
handle "what", 7
JavaPairRDD<String, Pinterest> PintRDD

2) For Instagram: twitter handle, instagram_post, instagram_likes
handle "hello", 10
handle2 "hi", 20
JavaPairRDD<String, Pinterest> instRDD

3) For ontologies: twitter handle, categories, sub_categories
handle, Products, MakeUp
handle, Products, MakeUp
handle2, Services, Face
JavaPairRDD<String, ontologies1> ontologiesPair …
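The question uses the Java API and its record types are not shown, so the following is only a hedged Scala sketch with hypothetical case classes, showing how cogroup can combine three pair RDDs keyed by twitter handle in a single pass:

// Hypothetical stand-ins for the Pinterest, Instagram and ontologies records.
case class Pin(post: String, likes: Int)
case class Insta(post: String, likes: Int)
case class Onto(category: String, subCategory: String)

val pintRDD = sc.parallelize(Seq(("handle", Pin("what", 7))))
val instRDD = sc.parallelize(Seq(("handle", Insta("hello", 10)), ("handle2", Insta("hi", 20))))
val ontoRDD = sc.parallelize(Seq(("handle", Onto("Products", "MakeUp")), ("handle2", Onto("Services", "Face"))))

// cogroup keeps every handle and gathers all values from the three RDDs per key.
val combined = pintRDD.cogroup(instRDD, ontoRDD)
// combined: RDD[(String, (Iterable[Pin], Iterable[Insta], Iterable[Onto]))]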

the sample method of Spark RDD does not work as expected

Submitted by 不打扰是莪最后的温柔 on 2019-12-11 04:45:16
Question: I am experimenting with the sample method of RDD on Spark 1.6.1:

scala> val nu = sc.parallelize(1 to 10)
scala> val sp = nu.sample(true, 0.2)
scala> sp.collect.foreach(println(_))
3
8
scala> val sp2 = nu.sample(true, 0.2)
scala> sp2.collect.foreach(println(_))
2
4
7
8
10

I cannot understand why sp2 contains 2, 4, 7, 8, 10. I think there should be only two numbers printed. Is there anything wrong?

Answer 1: Elaborating on the previous answer: in the documentation (scroll down to sample) it is mentioned (emphasis …
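A minimal sketch of the distinction (not the full answer, which is truncated above): the fraction passed to sample is a per-element probability, i.e. an expected size rather than an exact count, so the result size varies from run to run; takeSample returns an exact number of elements.

val nu = sc.parallelize(1 to 10)

// The sample size is random (expected around 2 here) and differs on every run.
val approx = nu.sample(withReplacement = true, fraction = 0.2)

// takeSample returns exactly `num` elements, as a local Array on the driver.
val exact = nu.takeSample(withReplacement = false, num = 2)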

Calling function inside RDD map function in Spark cluster

Submitted by 纵饮孤独 on 2019-12-11 04:34:44
Question: I was testing a simple string parser function defined in my code, but one of the worker nodes always fails at execution time. Here is the dummy code that I've been testing:

/* JUST A SIMPLE PARSER TO CLEAN PARENTHESIS */
def parseString(field: String): String = {
  val Pattern = "(.*.)".r
  field match {
    case "null" => "null"
    case Pattern(field) => field.replace('(', ' ').replace(')', ' ').replace(" ", "")
  }
}

/* CREATE TWO DISTRIBUTED RDDs TO JOIN THEM */
val emp = sc.parallelize(Seq((1, …
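The accepted answer is not shown above, so this is only a hedged sketch of two common fixes for this symptom: making the match exhaustive, so inputs that fit neither case cannot raise a scala.MatchError on a worker, and defining the helper in a serializable object so Spark can ship it inside the map closure.

object Parsers extends Serializable {
  // A total function: any non-"null" string falls into the second case.
  def parseString(field: String): String = field match {
    case "null" => "null"
    case s      => s.replace("(", "").replace(")", "").replace(" ", "")
  }
}

val cleaned = sc.parallelize(Seq("(a b)", "null", "plain")).map(Parsers.parseString)
cleaned.collect.foreach(println)   // ab, null, plain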

Saving users and items features to HDFS in Spark Collaborative filtering RDD

Submitted by 心不动则不痛 on 2019-12-11 04:16:11
Question: I want to extract the user and item features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, …
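A hedged sketch of the continuation, assuming `ratings` is the RDD built above and treating the rank/iteration/lambda values and the output paths as illustrative: the latent factors live in model.userFeatures and model.productFeatures, both of type RDD[(Int, Array[Double])], and can be saved like any other RDD.

val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda

model.userFeatures
  .map { case (id, factors) => s"$id,${factors.mkString(",")}" }
  .saveAsTextFile("myhdfs/outputdirectory/userFeatures")

model.productFeatures
  .map { case (id, factors) => s"$id,${factors.mkString(",")}" }
  .saveAsTextFile("myhdfs/outputdirectory/productFeatures")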

How to create dynamic groups in a PySpark dataframe?

Submitted by 我的梦境 on 2019-12-11 04:07:07
Question: Though the problem is about creating multiple groups on the basis of the values of two or more columns across consecutive rows, I am simplifying it this way. Suppose I have a PySpark dataframe like this:

>>> df = sqlContext.createDataFrame([
...     Row(SN=1, age=45, gender='M', name='Bob'),
...     Row(SN=2, age=28, gender='M', name='Albert'),
...     Row(SN=3, age=33, gender='F', name='Laura'),
...     Row(SN=4, age=43, gender='F', name='Gloria'),
...     Row(SN=5, age=18, gender='T', name='Simone'),
...     Row(SN=6, age=45, …
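The question is truncated before it states its exact grouping rule, so the following is only a hedged Scala sketch (the DataFrame API has the same shape as in PySpark, and grouping on gender is an assumption) of one common way to build group ids over consecutive rows: flag each row whose gender differs from the previous row, then take a running sum of the flags.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// `df` stands for the dataframe built in the question, ordered by SN.
val w = Window.orderBy("SN")

val grouped = df
  .withColumn("changed",
    when(lag("gender", 1).over(w) === col("gender"), 0).otherwise(1))
  .withColumn("group_id", sum("changed").over(w))   // running sum = group number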

How to transfer a binary file into an RDD in Spark?

Submitted by 馋奶兔 on 2019-12-11 04:06:28
Question: I am trying to load SEG-Y type files into Spark and transfer them into an RDD for a MapReduce operation, but I have failed to turn them into an RDD. Can anyone offer help?

Answer 1: You could use the binaryRecords() PySpark call to convert a binary file's content into an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords

binaryRecords(path, recordLength)
Load data from a flat binary file, assuming each record is a set of numbers with the specified …
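The same call exists on the Scala SparkContext; a minimal sketch, where the path and the 240-byte record length are illustrative (the real length depends on the SEG-Y layout being read):

// Each element of `records` is one fixed-length record as a byte array.
val records: org.apache.spark.rdd.RDD[Array[Byte]] =
  sc.binaryRecords("hdfs:///data/input.segy", 240)

// Decode the bytes however the record layout requires, e.g. read one Int.
val firstInts = records.map(bytes => java.nio.ByteBuffer.wrap(bytes).getInt)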

Spark: Cannot add RDD elements into a mutable HashMap inside a closure

Submitted by [亡魂溺海] on 2019-12-11 02:38:48
Question: I have the following code, where rddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))] and myHashMap is a scala.collection.mutable.HashMap. I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map. However, even though println(" t " + t) is printing things, myHashMap afterwards still has only the one element I manually put in at the beginning, ("test1", ("10", "20")). Nothing from rddMap is put into myHashMap. Snippet code:

val myHashMap = new HashMap[String, (String, …
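The answers are not shown above, so this is only a hedged sketch of the usual explanation and workaround (the sample data is illustrative): the closure passed to map is serialized and executed on the executors, so each task mutates its own copy of myHashMap and the driver's copy never changes; bringing the data back to the driver explicitly avoids mutating driver state from a closure.

import scala.collection.mutable.HashMap

val rddMap: org.apache.spark.rdd.RDD[(String, (String, String))] =
  sc.parallelize(Seq(("test2", ("30", "40")), ("test3", ("50", "60"))))

val myHashMap = HashMap("test1" -> ("10", "20"))

// collectAsMap materialises the RDD's pairs on the driver,
// where the mutable HashMap actually lives.
myHashMap ++= rddMap.collectAsMap()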