rdd

Joining two RDD[String] - Spark Scala

Anonymous (unverified), submitted on 2019-12-03 02:35:01
Question: I have two RDDs: rdd1 [String, String, String]: Name, Address, Zipcode and rdd2 [String, String, String]: Name, Address, Landmark. I am trying to join these two RDDs using rdd1.join(rdd2), but I am getting an error: error: value fullOuterJoin is not a member of org.apache.spark.rdd.RDD[String]. The join should combine the two RDDs so that the output looks like rddOutput: Name, Address, Zipcode, Landmark, and I want to save the result as a JSON file at the end. Can someone help me with this? Answer 1: As said in the…
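A minimal Scala sketch of the usual fix, not the asker's exact data: join (and fullOuterJoin) is only defined on key-value RDDs, so key both sides by the shared (Name, Address) fields first. The 3-tuple record layout, sample rows, and output path below are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("JoinRdds").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// Hypothetical sample records standing in for the question's rdd1 / rdd2.
val rdd1 = sc.parallelize(Seq(("Ann", "1 Main St", "94107")))   // (name, address, zipcode)
val rdd2 = sc.parallelize(Seq(("Ann", "1 Main St", "Bridge")))  // (name, address, landmark)

// join works on pair RDDs, so key both sides by (name, address) first.
val joined = rdd1.map { case (n, a, z) => ((n, a), z) }
  .join(rdd2.map { case (n, a, l) => ((n, a), l) })
  .map { case ((n, a), (z, l)) => (n, a, z, l) }

// One way to end up with JSON output: go through a DataFrame.
joined.toDF("Name", "Address", "Zipcode", "Landmark")
  .write.json("/tmp/joined_output")   // output path is a placeholder
```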

How can I save an RDD into HDFS and later read it back?

℡╲_俬逩灬. submitted on 2019-12-03 02:31:54
Question: I have an RDD whose elements are of type (Long, String). For some reason I want to save the whole RDD to HDFS, and later read that RDD back in a Spark program. Is that possible, and if so, how? Answer: It is possible. An RDD has the saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can parse them later. Reading can be done with the textFile function of SparkContext, followed by a .map to strip the parentheses. So, version 1: rdd.saveAsTextFile("hdfs:///test1/"); // later, in another program val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map(x…
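Both options from the answer, expanded into a hedged sketch; the HDFS paths and sample data are placeholders, and the text-file variant assumes the String values can be recovered by stripping a single pair of parentheses.

```scala
// Paths below are placeholders; both options assume the (Long, String) element type.
val rdd: org.apache.spark.rdd.RDD[(Long, String)] =
  sc.parallelize(Seq((1L, "a"), (2L, "b")))

// Option 1: object files keep the tuple type, so no parsing is needed on read.
rdd.saveAsObjectFile("hdfs:///test1_obj")
val restored1 = sc.objectFile[(Long, String)]("hdfs:///test1_obj")

// Option 2: text files store each tuple as "(1,a)", so strip the parentheses and split.
// This assumes the String values do not themselves end with ")".
rdd.saveAsTextFile("hdfs:///test1_txt")
val restored2 = sc.textFile("hdfs:///test1_txt/part-*").map { line =>
  val Array(k, v) = line.stripPrefix("(").stripSuffix(")").split(",", 2)
  (k.toLong, v)
}
```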

RDD transformations and actions can only be invoked by the driver

Anonymous (unverified), submitted on 2019-12-03 02:20:02
Question: Error: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. def computeRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = { val numDistinctUsers = test_data.map(x => x.user).…
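A hedged sketch of the usual fix for SPARK-5063 in this setting: keep the model's RDD-based predict call on the driver side instead of nesting it inside a transformation. Because the question's code is truncated, the final metric below is a made-up placeholder, not the asker's actual ratio.

```scala
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Never touch another RDD (or the model's RDD-based API) inside a transformation.
// Predict for all (user, product) pairs in one driver-side call, then join the
// predictions with the test data.
def computeRatio(model: MatrixFactorizationModel, testData: RDD[Rating]): Double = {
  val userProducts = testData.map(r => (r.user, r.product))
  val predictions = model.predict(userProducts)            // RDD[Rating], built once
    .map(p => ((p.user, p.product), p.rating))
  val actuals = testData.map(r => ((r.user, r.product), r.rating))

  // Placeholder metric (the question's code is truncated): fraction of predictions
  // that land within 1.0 of the actual rating.
  val joined = actuals.join(predictions)
  joined.filter { case (_, (a, p)) => math.abs(a - p) <= 1.0 }.count().toDouble / joined.count()
}
```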

Modify collection inside a Spark RDD foreach

Anonymous (unverified), submitted on 2019-12-03 02:05:01
Question: I'm trying to add elements to a map while iterating over the elements of an RDD. I'm not getting any errors, but the modifications don't happen. It all works fine when adding directly or iterating over other collections: scala> val myMap = new collection.mutable.HashMap[String,String] myMap: scala.collection.mutable.HashMap[String,String] = Map() scala> myMap("test1")="test1" scala> myMap res44: scala.collection.mutable.HashMap[String,String] = Map(test1 -> test1) scala> List("test2", "test3").foreach(w => myMap(w) = w) scala> myMap res46: scala…
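A short sketch of why the map stays empty and two common driver-side workarounds; the sample elements are placeholders, and both approaches assume the data involved is small enough to collect.

```scala
// The closure passed to foreach is serialized and run on the executors, so each task
// mutates its own copy of myMap; the driver's map never changes.
val rdd = sc.parallelize(Seq("test4", "test5"))

// 1. Bring the elements back to the driver and mutate the map there.
val myMap = collection.mutable.HashMap[String, String]()
rdd.collect().foreach(w => myMap(w) = w)

// 2. Or build the pairs distributedly and collect them as a map in one step.
val asMap: Map[String, String] = rdd.map(w => (w, w)).collectAsMap().toMap
```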

Spark list all cached RDD names

Anonymous (unverified), submitted on 2019-12-03 02:03:01
Question: I am new to Apache Spark. I created several RDDs and DataFrames and cached them, and now I want to unpersist some of them using rddName.unpersist(), but I can't remember their names. I used sc.getPersistentRDDs, but the output does not include the names. I also used the browser to view the cached RDDs, but again there is no name information. Am I missing something? Answer 1: @Dikei's answer is actually correct, but I believe what you are looking for is sc.getPersistentRDDs: scala> val rdd1 = sc.makeRDD(1 to 100) # rdd1: org…
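A sketch of the usual pattern, assuming you are free to name the RDDs when caching them: setName attaches a label that sc.getPersistentRDDs (an id -> RDD map) then exposes via rdd.name.

```scala
// Names are not assigned automatically; setName gives a cached RDD a label.
val rdd1 = sc.makeRDD(1 to 100).setName("numbers").cache()
rdd1.count()   // materialize the cache

sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
}

// Unpersist by name without having kept a reference around.
sc.getPersistentRDDs.values.filter(_.name == "numbers").foreach(_.unpersist())
```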

Spark RDD to DataFrame python

Anonymous (unverified), submitted on 2019-12-03 01:52:01
Question: I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and examples where the schema is passed to the sqlContext.createDataFrame(rdd, schema) function. But I have 38 columns/fields, and this number will increase further. If I manually define the schema, specifying each field's information, it is going to be a very tedious job. Is there any other way to specify the schema without knowing the column information beforehand? Answer 1: See, there are two ways to convert an RDD to a DF in Spark: toDF() and createDataFrame(rdd, schema). I will…
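A hedged PySpark sketch of the two schema-light routes (the column names and sample tuples below are invented): supply only column names to toDF, or hand createDataFrame an RDD of Rows and let it infer the types.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(1, "a", 3.0), (2, "b", 4.0)])   # stand-in for the 38-column RDD

# 1. Supply only the column names and let Spark sample the data to infer the types.
df1 = rdd.toDF(["id", "label", "score"])

# 2. Or turn each tuple into a Row; createDataFrame then infers the schema from the Rows.
df2 = spark.createDataFrame(rdd.map(lambda t: Row(id=t[0], label=t[1], score=t[2])))

df1.printSchema()
```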

Pyspark can't convert float to Float :-/

Anonymous (unverified), submitted on 2019-12-03 01:41:02
Question: I have a PySpark RDD: proba_classe_0.take(2) [0.38030685472943737, 0.34728188900913715] I want to transform it into a DataFrame: from pyspark.sql.types import FloatType, StructField, StructType fields = [StructField('probabilite', FloatType())] schema = StructType(fields) df_proba_classe_1 = spark.createDataFrame(proba_classe_1, schema=schema) df_proba_classe_1.count() I get a strange error: TypeError: StructType can not accept object 0.6196931452705625 in type <class 'float'> Answer 1: You have to map the RDD, because the RDD elements are of type string: rdd = sc.parallelize(['0…
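A hedged sketch of the likely cause: the StructType schema expects each RDD element to be a row (tuple/list/Row), while this RDD holds bare floats, so wrapping each value in a 1-tuple is enough. The sample values are stand-ins.

```python
from pyspark.sql.types import StructType, StructField, FloatType

# Stand-in for the question's proba_classe_1 RDD of bare floats.
proba_classe_1 = spark.sparkContext.parallelize([0.6196931452705625, 0.3472818890091372])

schema = StructType([StructField("probabilite", FloatType())])

# Wrap each float in a 1-tuple so every element looks like a row matching the schema.
# (DoubleType would also be a natural fit, since Python floats are double precision.)
df = spark.createDataFrame(proba_classe_1.map(lambda x: (x,)), schema=schema)
df.show()
```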

DataFrame to RDD[(String, String)] conversion

Anonymous (unverified), submitted on 2019-12-03 01:39:01
Question: I want to convert an org.apache.spark.sql.DataFrame to an org.apache.spark.rdd.RDD[(String, String)] in Databricks. Can anyone help? Background (and a better solution is also welcome): I have a Kafka stream which (after some steps) becomes a two-column data frame. I would like to put this into a Redis cache, with the first column as the key and the second column as the value. More specifically, the type of the input is this: lastContacts: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: bigint]. I try to put it into Redis as follows: sc…
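A minimal Scala sketch of the conversion itself, using invented sample rows in place of the Kafka-fed DataFrame; it assumes a plain (non-streaming) DataFrame at this point and leaves the Redis write out.

```scala
import spark.implicits._

// Stand-in for the question's two-column DataFrame [serialNumber: string, lastModified: bigint].
val lastContacts = Seq(("SN-001", 1555000000L), ("SN-002", 1555000100L))
  .toDF("serialNumber", "lastModified")

// Pull each Row apart by column name and stringify the bigint to get RDD[(String, String)].
val kv: org.apache.spark.rdd.RDD[(String, String)] = lastContacts.rdd.map { row =>
  (row.getAs[String]("serialNumber"), row.getAs[Long]("lastModified").toString)
}

kv.take(2).foreach(println)
```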

Cannot deserialize RDD with different number of items in pair

Anonymous (unverified), submitted on 2019-12-03 01:38:01
Question: I have two RDDs which contain key-value pairs. I want to join them by key (and, for each key, get the cartesian product of all values), which I assumed could be done with the zip() function of PySpark. However, when I apply it, elemPairs = elems1.zip(elems2).reduceByKey(add) it gives me the error: Cannot deserialize RDD with different number of items in pair: (40, 10) And here are the two RDDs which I try to zip: elems1 => [((0, 0), ('A', 0, 90)), ((0, 1), ('A', 0, 90)), ((0, 2), ('A', 0, 90)), ((0, 3), ('A', 0, 90)), ((0, 4), ('A', 0, 90)), ((0,…
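A hedged PySpark sketch with stand-in data: zip is positional and requires the two RDDs to have matching partitioning and element counts (which is what the (40, 10) error is about), whereas join matches by key and already produces every per-key combination of values.

```python
# Stand-in data in the same shape as the question's elems1 / elems2.
elems1 = sc.parallelize([((0, 0), ('A', 0, 90)), ((0, 1), ('A', 0, 90))])
elems2 = sc.parallelize([((0, 0), ('B', 1, 45)), ((0, 0), ('C', 2, 30))])

# join yields one output record per combination of values sharing a key,
# i.e. the per-key cartesian product the question asks for.
elem_pairs = elems1.join(elems2)   # e.g. ((0, 0), (('A', 0, 90), ('B', 1, 45))), ...
print(elem_pairs.collect())
```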

Jaccard Similarity of an RDD with the help of Spark and Scala without Cartesian?

Anonymous (unverified), submitted on 2019-12-03 01:36:02
Question: I am working with pair RDDs. My aim is to calculate the Jaccard similarity between the sets of RDD values and cluster them according to a Jaccard similarity threshold. The structure of my RDD is: val a = [Key, Set(String)] // Pair RDD For example: India, [Country, Place, ....] USA, [Country, State, ..] Berlin, [City, PopulatedPlace, ..] After finding the Jaccard similarity, I will cluster the similar entities into one cluster. In the above example, India and USA will be clustered into one cluster based on some threshold value, whereas Berlin will be in the other…
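One cartesian-free approach, sketched under stated assumptions rather than taken from an answer: an inverted index (set element -> keys) limits the Jaccard computation to pairs that share at least one element. The threshold, the sample data, and the decision to broadcast the key-to-set map (so it must fit in memory) are all assumptions; the clustering step itself is left out.

```scala
// Jaccard(A, B) = |A intersect B| / |A union B|, computed only for candidate pairs.
val threshold = 0.3   // assumed threshold
val a: org.apache.spark.rdd.RDD[(String, Set[String])] = sc.parallelize(Seq(
  ("India",  Set("Country", "Place")),
  ("USA",    Set("Country", "State")),
  ("Berlin", Set("City", "PopulatedPlace"))))

val bySet = sc.broadcast(a.collectAsMap())   // assumes the key -> set map fits in memory

val candidatePairs = a
  .flatMap { case (key, set) => set.map(token => (token, key)) }   // inverted index
  .groupByKey()
  .flatMap { case (_, keys) =>
    val ks = keys.toSeq.sorted
    for (i <- ks.indices; j <- (i + 1) until ks.size) yield (ks(i), ks(j))
  }
  .distinct()

val similarPairs = candidatePairs.filter { case (k1, k2) =>
  val (s1, s2) = (bySet.value(k1), bySet.value(k2))
  s1.intersect(s2).size.toDouble / s1.union(s2).size >= threshold
}

similarPairs.collect().foreach(println)   // feed these pairs into whatever clustering step you use
```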