rdd

how to merge two RDD to one RDD [duplicate]

独自空忆成欢 submitted on 2019-12-11 02:31:21

Question: This question already has answers here: Concatenating datasets of different RDDs in Apache spark using scala (2 answers). Closed 2 years ago.

Help, I have two RDDs and I want to merge them into one RDD. This is my code:

val us1 = sc.parallelize(Array(("3L"), ("7L"), ("5L"), ("2L")))
val us2 = sc.parallelize(Array(("432L"), ("7123L"), ("513L"), ("1312L")))

Answer 1: Just use union:

val merged = us1.union(us2)

Documentation is here. A shortcut in Scala is:

val merged = us1 ++ us2

Answer 2: You need RDD.union. These don…
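A minimal, self-contained sketch of the union approach from the answers (the app name and local master are illustrative assumptions, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}

object UnionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-example").setMaster("local[*]"))
    val us1 = sc.parallelize(Seq("3L", "7L", "5L", "2L"))
    val us2 = sc.parallelize(Seq("432L", "7123L", "513L", "1312L"))
    // union simply concatenates the two RDDs; it does not deduplicate,
    // so call .distinct() afterwards if set semantics are needed
    val merged = us1.union(us2)   // equivalent to us1 ++ us2
    merged.collect().foreach(println)
    sc.stop()
  }
}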

Can't zip RDDs with unequal numbers of partitions

烈酒焚心 submitted on 2019-12-11 02:08:41

Question: Now I have 3 RDDs like this:

rdd1:
1 2
3 4
5 6
7 8
9 10

rdd2:
11 12
13 14

rdd3:
15 16
17 18
19 20

and I want to do this:

rdd1.zip(rdd2.union(rdd3))

and I want the result to be like this:

1 2 11 12
3 4 13 14
5 6 15 16
7 8 17 18
9 10 19 20

but I get an exception like this:

Exception in thread "main" java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions

Someone told me I can do this without the exception:

rdd1.zip(rdd2.union(rdd3).repartition(1))

But it seems like it is…
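One standard workaround, beyond the repartition(1) suggestion quoted above (this sketch is my own, not part of the truncated answer), is to index both sides and join, which avoids the requirement that both RDDs have identical partitioning:

val left  = rdd1.zipWithIndex().map(_.swap)                  // (position, element of rdd1)
val right = rdd2.union(rdd3).zipWithIndex().map(_.swap)      // (position, element of the union)
// join by position, then restore the original order
val zipped = left.join(right).sortByKey().values
zipped.collect().foreach(println)

Note that zipWithIndex triggers an extra Spark job when the RDD has more than one partition, so this trades some work for not having to collapse everything into a single partition.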

How to know which worker a partition is executed at?

前提是你 submitted on 2019-12-11 01:59:36

Question: I am just trying to find a way to get the locality of an RDD's partitions in Spark. After calling RDD.repartition() or PairRDD.combineByKey() the returned RDD is partitioned. I'd like to know which worker instances the partitions are at (for examining the partitioning behaviour). Can someone give a clue?

Answer 1: An interesting question that I'm sure has a not-so-interesting answer :) First of all, applying transformations to your RDD has nothing to do with worker instances, as they are separate…
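One way to observe this empirically (my own sketch, not part of the truncated answer) is to have each task report the host it ran on; this shows where each partition was computed for that particular job, which can change between runs:

val rdd = sc.parallelize(1 to 100, numSlices = 8).repartition(4)
val placement = rdd.mapPartitionsWithIndex { (partitionId, records) =>
  // this closure runs on the executor, so the hostname is the worker's hostname
  val host = java.net.InetAddress.getLocalHost.getHostName
  Iterator((partitionId, host, records.size))
}.collect()
placement.foreach { case (id, host, count) =>
  println(s"partition $id ($count records) was computed on $host")
}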

Load spark data into Mongo / Memcached for use by a Webservice

喜欢而已 submitted on 2019-12-11 01:52:53

Question: I am extremely new to Spark and have a specific workflow-related question. Although it is not really a coding question, it is more a question about Spark functionality, and I thought it would be appropriate here. Please feel free to redirect me to the correct site if you think this question is inappropriate for SO. So here goes:

1. I am planning to consume a stream of requests using Spark's sliding window functionality and calculate a recommendation model. Once the model is…
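For the sliding-window part of step 1, a rough Spark Streaming sketch might look like the following (the socket source, host, port, and window sizes are all assumptions for illustration; pushing the model to Mongo/Memcached would happen inside foreachRDD):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))            // 10-second batches
val requests = ssc.socketTextStream("localhost", 9999)
// keep the last 60 seconds of requests, re-evaluated every 10 seconds
val windowed = requests.window(Seconds(60), Seconds(10))
windowed.foreachRDD { rdd =>
  // recompute or update the recommendation model from the current window,
  // then write it to the external store used by the web service
  println(s"current window holds ${rdd.count()} requests")
}
ssc.start()
ssc.awaitTermination()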

Combine two RDDs in pyspark

别来无恙 submitted on 2019-12-11 01:28:58

Question: Assuming that I have the following RDDs:

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])

How do I combine these two RDDs into one RDD which would look like this:

[('a', 1), ('c', 2), ('d', 5), ('e', 3)]

Using a.union(b) just combines them in a list. Any idea?

Answer 1: You probably just want to b.zip(a) both RDDs (note the reversed order, since you want to key by b's values). Just read the Python docs carefully:

zip(other)
Zips this RDD with another one, returning key-value pairs…
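A quick sketch of the zip pattern, written in Scala for consistency with the other examples in this digest (the PySpark call b.zip(a) from the answer behaves the same way):

val a = sc.parallelize(Seq(1, 2, 5, 3))
val b = sc.parallelize(Seq("a", "c", "d", "e"))
// zip pairs elements by position; both RDDs must have the same number of
// partitions and the same number of elements in each partition
val combined = b.zip(a)               // RDD[(String, Int)]
combined.collect().foreach(println)   // (a,1), (c,2), (d,5), (e,3)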

Spark: Sort an RDD by multiple values in a tuple / columns

不羁的心 submitted on 2019-12-11 01:04:20

Question: So I have an RDD as follows:

RDD[(String, Int, String)]

And as an example:

('b', 1, 'a')
('a', 1, 'b')
('a', 0, 'b')
('a', 0, 'a')

The final result should look something like:

('a', 0, 'a')
('a', 0, 'b')
('a', 1, 'b')
('b', 1, 'a')

How would I do something like this?

Answer 1: Try this:

rdd.sortBy(r => r)

If you wanted to switch the sort order around, you could do this:

rdd.sortBy(r => (r._3, r._1, r._2))

For reverse order:

rdd.sortBy(r => r, false)

Source: https://stackoverflow.com/questions/36393224
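A self-contained sketch of the answer's approach (data taken from the question; sorting by the tuple itself relies on the built-in lexicographic ordering of tuples, i.e. first field, then second, then third):

val rdd = sc.parallelize(Seq(("b", 1, "a"), ("a", 1, "b"), ("a", 0, "b"), ("a", 0, "a")))
val sorted = rdd.sortBy(r => r)
sorted.collect().foreach(println)       // (a,0,a), (a,0,b), (a,1,b), (b,1,a)
// descending order instead:
val descending = rdd.sortBy(r => r, ascending = false)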

PySpark: Partitioning while reading a binary file using binaryFiles() function

醉酒当歌 submitted on 2019-12-11 00:57:06

Question:

sc = SparkContext("Local")
rdd = sc.binaryFiles(Path to the binary file, minPartitions = 5).partitionBy(8)

or

sc = SparkContext("Local")
rdd = sc.binaryFiles(Path to the binary file, minPartitions = 5).repartition(8)

Using either of the above snippets, I am trying to make 8 partitions in my RDD (wherein I want the data to be distributed evenly across all the partitions). When I print rdd.getNumPartitions(), the number of partitions shown is 8, but on the Spark UI I have observed…
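One way to check how evenly the files actually land in the partitions is to count the records per partition, sketched here in Scala for consistency with the rest of the digest (the path and partition counts are placeholders; PySpark's glom() works the same way):

val rdd = sc.binaryFiles("/path/to/binary/files", minPartitions = 5).repartition(8)
println(s"number of partitions: ${rdd.getNumPartitions}")
// glom() gathers each partition into an array so its size can be measured
rdd.glom().map(_.length).collect().zipWithIndex.foreach { case (count, idx) =>
  println(s"partition $idx holds $count files")
}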

How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?

谁说我不能喝 submitted on 2019-12-11 00:54:12

Question: I'm struggling to understand how the conversion among RDDs, Datasets and DataFrames works. I'm pretty new to Spark, and I get stuck every time I need to pass from one data model to another (especially from RDDs to Datasets and DataFrames). Could anyone explain to me the right way to do it? As an example, I now have an RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example a KMeans (Spark DataSet MLlib). So I need to convert it to a Dataset with a…
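A sketch of one common conversion path (my own, not from the truncated post), assuming an existing SparkSession named spark and SparkContext named sc; the column name "features" is what Spark ML estimators expect by default:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import spark.implicits._

val rdd: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(8.0, 9.0)))
// wrap each vector in a Tuple1 so it becomes a single-column DataFrame
val data = rdd.map(Tuple1.apply).toDF("features")
val model = new KMeans().setK(2).setFeaturesCol("features").fit(data)
model.clusterCenters.foreach(println)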

Transform Java-Pair-Rdd to Rdd

混江龙づ霸主 submitted on 2019-12-11 00:32:03

Question: I need to transform my JavaPairRDD to a CSV, so I am thinking of transforming it to an RDD to solve my problem. What I want is to have my RDD transformed from:

Key    Value
Jack   [a,b,c]

to:

Key    Value
Jack   a
Jack   b
Jack   c

I see that it is possible in that issue and in this issue (PySpark: Convert a pair RDD back to a regular RDD), so I am asking how to do that in Java?

Update of question: The type of my JavaPairRDD is:

JavaPairRDD<Tuple2<String,String>, Iterable<Tuple1<String>>>

and this is…
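The general idea is a flatMapValues: emit one (key, value) row per element of the iterable. Sketched in Scala for consistency with the rest of this digest (the Java API exposes the analogous JavaPairRDD.flatMapValues); the sample data is illustrative and simpler than the nested tuple type from the update:

val pairs = sc.parallelize(Seq(("Jack", Seq("a", "b", "c")), ("Jill", Seq("d", "e"))))
// one output record per element of each value collection
val flattened = pairs.flatMapValues(identity)            // ("Jack","a"), ("Jack","b"), ...
// render each record as a CSV line
val csvLines = flattened.map { case (key, value) => s"$key,$value" }
csvLines.collect().foreach(println)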

Programmatically generate the schema AND the data for a dataframe in Apache Spark

穿精又带淫゛_ submitted on 2019-12-11 00:19:38

Question: I would like to dynamically generate a dataframe containing a header record for a report, so I am creating a dataframe from the value of the string below:

val headerDescs : String = "Name,Age,Location"
val headerSchema = StructType(headerDescs.split(",").map(fieldName => StructField(fieldName, StringType, true)))

However, now I want to do the same for the data (which is in effect the same data, i.e. the metadata). I create an RDD:

val headerRDD = sc.parallelize(headerDescs.split(","))

I then…
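One way to produce a single data row that lines up with that three-column schema (my own sketch, not from the truncated question) is to wrap the split values in one Row rather than parallelizing them as three separate strings, assuming an existing SparkSession named spark:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val headerDescs: String = "Name,Age,Location"
val headerSchema = StructType(headerDescs.split(",").map(f => StructField(f, StringType, nullable = true)))
// one Row whose field values are the same strings as the column names
val headerRDD = sc.parallelize(Seq(Row.fromSeq(headerDescs.split(",").toSeq)))
val headerDF = spark.createDataFrame(headerRDD, headerSchema)
headerDF.show()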