rdd

Spark RDD groupByKey + join vs join performance

Submitted by 瘦欲@ on 2019-12-12 02:24:12
Question: I am using Spark on a cluster that I share with other users, so running time alone is not a reliable way to tell which version of my code is more efficient: while my more efficient code runs, someone else may be running a huge job that slows mine down. So can I ask 2 questions here: I was using the join function to join 2 RDDs, and I am trying to use groupByKey() before the join, like this: rdd1.groupByKey().join(rdd2). It seems that it
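For reference, a minimal PySpark sketch (assuming plain (key, value) pair RDDs; the sample data is invented) that contrasts the two shapes: a plain join emits one (v1, v2) pair per matching key, while calling groupByKey() first adds an extra shuffle that materializes all of rdd1's values per key before they are joined.

```python
from pyspark import SparkContext

sc = SparkContext(appName="join-vs-groupByKey-join")

rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
rdd2 = sc.parallelize([("a", 10), ("b", 20)])

# Plain join: shuffles both sides once, yields (key, (v1, v2)) per matching pair.
joined = rdd1.join(rdd2)

# groupByKey first: an extra shuffle that collects all values per key,
# so the join pairs the whole iterable with each value from rdd2.
grouped_joined = rdd1.groupByKey().join(rdd2)

print(joined.collect())  # e.g. [('a', (1, 10)), ('a', (2, 10)), ('b', (3, 20))]
print(grouped_joined.mapValues(lambda kv: (list(kv[0]), kv[1])).collect())
```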

How to remove duplicates (more like filter based on multiple properties) with Spark RDD in Scala?

Submitted by 北慕城南 on 2019-12-12 02:16:34
Question: As a policy, we do not update our documents; instead we recreate them with updated values. When I process the events, I would like to keep only the updated ones, so I would like to filter items out of my RDD based on multiple values. For instance, an item might be: { "name": "Sample", "someId": "123", "createdAt": "2016-09-21T02:16:32+00:00" } and when it is updated: { "name": "Sample-Updated", "someId": "123", # This remains the same "createdAt": "2016-09-21T03:16:32+00:00" # This is greater
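One common pattern for this, sketched in PySpark under the assumption that the documents arrive as JSON strings and that the createdAt timestamps share a format and offset (so lexicographic comparison orders them correctly): key by someId and reduce, keeping the latest document per key.

```python
import json
from pyspark import SparkContext

sc = SparkContext(appName="keep-latest-per-id")

raw = sc.parallelize([
    '{"name": "Sample", "someId": "123", "createdAt": "2016-09-21T02:16:32+00:00"}',
    '{"name": "Sample-Updated", "someId": "123", "createdAt": "2016-09-21T03:16:32+00:00"}',
])

docs = raw.map(json.loads)

# Key by the identifier and keep the document with the latest createdAt.
latest = (docs
          .map(lambda d: (d["someId"], d))
          .reduceByKey(lambda a, b: a if a["createdAt"] >= b["createdAt"] else b)
          .values())

print(latest.collect())   # only the "Sample-Updated" document remains
```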

Apache Spark RDD substitution

Submitted by 与世无争的帅哥 on 2019-12-12 01:52:44
Question: I'm trying to solve a problem with a dataset like this: (1, 3) (1, 4) (1, 7) (1, 2) <- (2, 7) <- (6, 6) (3, 7) <- (7, 4) <- ... Since (1 -> 2) and (2 -> 7), I would like to replace the pair (2, 7) with (1, 7); similarly, since (3 -> 7) and (7 -> 4), replace (7, 4) with (3, 4). Hence, my dataset becomes (1, 3) (1, 4) (1, 7) (1, 2) (1, 7) (6, 6) (3, 7) (3, 4) ... Any idea how to solve or tackle this? Thanks Answer 1: This problem looks like a transitive closure of a graph, represented in the
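A PySpark sketch of a single substitution pass using a leftOuterJoin is shown below. Note it picks an arbitrary predecessor when a destination has several incoming pairs (so (7, 4) may become (1, 4) rather than (3, 4)), and a full solution would iterate or use a transitive-closure formulation as the answer suggests.

```python
from pyspark import SparkContext

sc = SparkContext(appName="pair-substitution")

pairs = sc.parallelize([(1, 3), (1, 4), (1, 7), (1, 2),
                        (2, 7), (6, 6), (3, 7), (7, 4)])

# For every pair (a, b), remember one predecessor of b: (b -> a).
# reduceByKey keeps an arbitrary predecessor if b has several.
predecessors = pairs.map(lambda p: (p[1], p[0])).reduceByKey(lambda a, b: a)

# Rewrite (b, c) to (a, c) whenever some (a, b) exists; otherwise keep (b, c).
substituted = (pairs
               .leftOuterJoin(predecessors)   # (b, (c, a-or-None))
               .map(lambda kv: (kv[1][1], kv[1][0])
                    if kv[1][1] is not None else (kv[0], kv[1][0])))

print(sorted(substituted.collect()))
```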

Exception org.apache.spark.rdd.RDD[(scala.collection.immutable.Map[String,Any], Int)] in scala/spark

Submitted by 别等时光非礼了梦想. on 2019-12-12 01:49:55
Question: Using the code below I am getting tweets for a particular filter: val topCounts60 = tweetMap.map((_, 1)). reduceByKeyAndWindow(_+_, Seconds(60*60)) One sample output of topCounts60, if I call topCounts60.println(), is in the following format: (Map(UserLang -> en, UserName -> Harmeet Singh, UserScreenName -> harmeetsingh060, HashTags -> , UserVerification -> false, Spam -> true, UserFollowersCount -> 44, UserLocation -> भारत, UserStatusCount -> 50, UserCreated -> 2016-07-04T06:32:49.000+0530,
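The question body is cut off, but the counting pattern it uses translates to PySpark Streaming roughly as in the sketch below. The socket source, port, and field names are placeholders, and each tweet is keyed by a hashable projection of its fields (a whole map/dict cannot itself serve as a key).

```python
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="tweet-window-counts")
ssc = StreamingContext(sc, batchDuration=10)

# Hypothetical source of one JSON tweet per line; in practice this would be
# a Twitter/Kafka receiver.
tweets = ssc.socketTextStream("localhost", 9999).map(json.loads)

# Key by a hashable projection of the tweet, then count occurrences
# over a sliding 60-minute window.
top_counts_60 = (tweets
                 .map(lambda t: ((t.get("UserScreenName"), t.get("HashTags")), 1))
                 .reduceByKeyAndWindow(lambda a, b: a + b, None, 3600, 10))

top_counts_60.pprint()

ssc.start()
ssc.awaitTermination()
```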

Apache Spark what am I persisting here?

Submitted by 廉价感情. on 2019-12-12 01:47:44
Question: In this line, which RDD is being persisted, dropResultsN or dataSetN? dropResultsN = dataSetN.map(s -> standin.call(s)).persist(StorageLevel.MEMORY_ONLY()); The question arises as a side issue from Apache Spark timing forEach operation on JavaRDD, where I am still looking for a good answer to the core question of how best to time RDD creation. Answer 1: dropResultsN is the persisted RDD (the RDD produced by mapping dataSetN through the method standin.call()). Answer 2: I found a good example of this
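A small PySpark analogue of that line (a sketch; standin.call is replaced by a trivial lambda) makes the answer visible at runtime: persist() is invoked on the RDD returned by map(), so it is the mapped RDD that gets cached, not the source RDD.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-which-rdd")

data_set_n = sc.parallelize(range(1000))

# persist() is called on the result of map(), so the mapped RDD is cached.
drop_results_n = data_set_n.map(lambda s: s * 2).persist(StorageLevel.MEMORY_ONLY)

drop_results_n.count()            # first action materializes and caches it
print(drop_results_n.is_cached)   # True
print(data_set_n.is_cached)       # False
```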

Filter columns from a PipelinedRDD

Submitted by 岁酱吖の on 2019-12-12 00:46:30
Question: I have a PipelinedRDD that contains a lot of key:value pairs; I need to keep only a few of them and ignore the rest. How can I achieve this? Sample RDD data: { "PLAN_ID": "7de7cc2d-95be-4b7f-bb2a-77482dc03853" ,"Week": "2017 Wk 11" ,"Demand": 0.0 ,"Sales": 0.0 ,"LostSales": 0.0 ,"InventoryBOP": 0.0 ,"InventoryEOP": 2666.0 ,"Receipt": 2666.0 ,"RecommendedReceipt": 2666.0 ,"WeeksOnHand": 0.0 ,"WeeksOfSales": 0.0} I want to keep only PLAN_ID, Receipt, RecommendedReceipt, InventoryEOP,
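One way to do this, sketched in PySpark under the assumption that each record is a JSON string: parse it and project out only the wanted keys with a dictionary comprehension.

```python
import json
from pyspark import SparkContext

sc = SparkContext(appName="select-keys")

raw = sc.parallelize(['{"PLAN_ID": "7de7cc2d-95be-4b7f-bb2a-77482dc03853",'
                      ' "Week": "2017 Wk 11", "Demand": 0.0, "Sales": 0.0,'
                      ' "InventoryEOP": 2666.0, "Receipt": 2666.0,'
                      ' "RecommendedReceipt": 2666.0}'])

wanted = {"PLAN_ID", "Receipt", "RecommendedReceipt", "InventoryEOP"}

# Keep only the wanted keys from each record.
projected = (raw.map(json.loads)
                .map(lambda d: {k: v for k, v in d.items() if k in wanted}))

print(projected.collect())
```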

python : reduce by key with if condition statement?

Submitted by 走远了吗. on 2019-12-11 19:13:41
Question: (K1, (v1, v2)) (K2, (v3, v4)) (K1, (v1, v5)) (K2, (v3, v6)) How can I sum the second values for records whose key and first value are both the same, so that I get (K1, (v1, v2+v5)) and (K2, (v3, v4+v6))? Answer 1: IIUC, you need to change the key before the reduce, and then map your values back into the desired format. You should be able to do the following: new_rdd = rdd.map(lambda row: ((row[0], row[1][0]), row[1][1]))\ .reduceByKey(lambda a, b: a + b)\ .map(lambda row: (row[0][0], (row[0][1], row[1]))) Source: https:/
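A runnable version of that recipe, with numeric second elements standing in for v2/v4/v5/v6 so the sum is concrete:

```python
from pyspark import SparkContext

sc = SparkContext(appName="reduce-by-composite-key")

rdd = sc.parallelize([("K1", ("v1", 2)), ("K2", ("v3", 4)),
                      ("K1", ("v1", 5)), ("K2", ("v3", 6))])

# Fold the first tuple element into the key, sum the rest, then restore the shape.
new_rdd = (rdd.map(lambda row: ((row[0], row[1][0]), row[1][1]))
              .reduceByKey(lambda a, b: a + b)
              .map(lambda row: (row[0][0], (row[0][1], row[1]))))

print(new_rdd.collect())   # e.g. [('K1', ('v1', 7)), ('K2', ('v3', 10))]
```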

Filtering RDD Based on condition and extracting matched data in Spark python

Submitted by 白昼怎懂夜的黑 on 2019-12-11 17:58:27
Question: I have data like: cl_id cn_id cn_value 10004, 77173296 ,390.0 10004, 77173299 ,376.0 10004, 77173300 ,0.0 20005, 77173296 ,0.0 20005, 77173299 ,6.0 2005, 77438800 ,2.0 Cl_id IDs: 10004, 20005 Filtered by 10004: 10004, 77173296 ,390.0 10004, 77173299 ,376.0 Filtered by 20005: 20005, 77173296 ,0.0 20005, 77173299 ,6.0 Now I want the returned RDD to look like: 10004,cn_id,x1(77173296.value,77173300.value) ==> 10004,77173296,390.0,376.0 20005,cn_id,x1(77173296.value,77173300.value) ==> 20005,77173296,0.0,6.0
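A PySpark sketch of one way to build that shape: group the rows by cl_id and look up the values of the desired cn_ids. Which two cn_ids to pull is ambiguous in the question (it names 77173300, but the example output shows the values of 77173296 and 77173299), so the wanted_cn_ids list below is an assumption.

```python
from pyspark import SparkContext

sc = SparkContext(appName="group-cn-values")

rows = sc.parallelize([
    ("10004", "77173296", 390.0), ("10004", "77173299", 376.0),
    ("10004", "77173300", 0.0),   ("20005", "77173296", 0.0),
    ("20005", "77173299", 6.0),   ("20005", "77438800", 2.0),
])

wanted_cn_ids = ["77173296", "77173299"]   # assumed pair of columns per cl_id

# Group by cl_id, then look up the value of each wanted cn_id.
result = (rows.map(lambda r: (r[0], (r[1], r[2])))
              .groupByKey()
              .mapValues(dict)
              .map(lambda kv: (kv[0], wanted_cn_ids[0],
                               kv[1].get(wanted_cn_ids[0]),
                               kv[1].get(wanted_cn_ids[1]))))

print(sorted(result.collect()))
# e.g. [('10004', '77173296', 390.0, 376.0), ('20005', '77173296', 0.0, 6.0)]
```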

Piping Scala RDD to Python code fails

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-11 16:31:43
Question: I am trying to execute Python code inside a Scala program, passing an RDD as data to the Python script. The Spark cluster initializes successfully, the data conversion to an RDD is fine, and running the Python script separately (outside the Scala code) works. However, executing the same Python script from inside the Scala code fails with: java.lang.IllegalStateException: Subprocess exited with status 2. Command ran: /{filePath}/{File}.py Looking deeper, it shows import: command not found when trying to execute
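For comparison, the equivalent pipe() call from PySpark looks like the sketch below (the script path is hypothetical). As a general note, one common cause of "import: command not found" when a .py file is executed directly as a subprocess is a missing shebang line, which leaves the shell rather than a Python interpreter to run the file.

```python
from pyspark import SparkContext

sc = SparkContext(appName="pipe-to-external-script")

rdd = sc.parallelize(["1", "2", "3"])

# The external script must be executable and should start with a shebang
# such as "#!/usr/bin/env python3" so the OS knows which interpreter to use.
piped = rdd.pipe("/path/to/script.py")   # hypothetical path

print(piped.collect())
```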

Spark: Mapping elements of an RDD using other elements from the same RDD

Submitted by 对着背影说爱祢 on 2019-12-11 15:24:16
Question: Suppose I have this RDD: val r = sc.parallelize(Array(1,4,2,3)) What I want to do is create a mapping, e.g.: r.map(val => val + func(all other elements in r)). Is this even possible? Answer 1: It's very likely that you will get an exception, e.g. the one below. rdd = sc.parallelize(range(100)) rdd = rdd.map(lambda x: x + sum(rdd.collect())) i.e. you are trying to broadcast the RDD itself, hence the exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or
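The usual workaround is to compute the aggregate on the driver (or broadcast it) before the transformation, so the closure only captures a plain value. A PySpark sketch, taking func to be the sum of the other elements:

```python
from pyspark import SparkContext

sc = SparkContext(appName="map-with-other-elements")

r = sc.parallelize([1, 4, 2, 3])

# Compute the aggregate on the driver first, then use the plain value
# inside the transformation instead of referencing the RDD itself.
total = r.sum()
mapped = r.map(lambda v: v + (total - v))   # v + sum of all other elements

print(mapped.collect())   # [10, 10, 10, 10]
```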