rdd

Pyspark Dataframe get unique elements from column with string as list of elements

Submitted by 我的未来我决定 on 2021-02-19 07:34:05
Question: I have a dataframe (created by loading from multiple blobs in Azure) with a column that holds a list of IDs as a string. Now I want a list of the unique IDs across this entire column. Here is an example:

df -
| col1 | col2 | col3    |
| "a"  | "b"  | "[q,r]" |
| "c"  | "f"  | "[s,r]" |

Here is my expected response: resp = [q, r, s]

Any idea how to get there? My current approach is to convert the strings in col3 to Python lists and then maybe flatten them out somehow, but so far I have not been able to do so.
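A hedged sketch of the approach hinted at above: strip the brackets from col3, split it into an array, explode, and de-duplicate. Column names follow the example; the exact string format of col3 is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", "b", "[q,r]"), ("c", "f", "[s,r]")],
    ["col1", "col2", "col3"],
)

# Strip the brackets, split on ',', explode into one ID per row, then de-duplicate.
ids = (
    df.select(F.explode(F.split(F.regexp_replace("col3", r"[\[\]]", ""), ",")).alias("id"))
      .distinct()
)

resp = [row["id"] for row in ids.collect()]
print(resp)  # e.g. ['q', 'r', 's'] in some order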

PySpark takeOrdered Multiple Fields (Ascending and Descending)

Submitted by 余生长醉 on 2021-02-19 05:20:26
Question: The takeOrdered method from pyspark.RDD gets the N elements from an RDD ordered in ascending order, or as specified by the optional key function (see pyspark.RDD.takeOrdered). The example shows the following code with one key:

>>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
[10, 9, 7, 6, 5, 4]

Is it also possible to define more keys, e.g. x, y, z for data that has 3 columns, with the keys in different orders such as x = asc, y = desc, z = asc?
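For purely numeric columns, one common trick (a sketch, not the only option) is to return a tuple from the key function and negate the fields that should sort descending. The 3-column rows below are hypothetical.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rows = sc.parallelize([(1, 5, 2), (1, 9, 1), (2, 3, 7), (1, 9, 0)])

# x ascending, y descending, z ascending: tuples compare element-wise,
# and negating a numeric field flips its sort direction.
print(rows.takeOrdered(3, key=lambda r: (r[0], -r[1], r[2])))
# [(1, 9, 0), (1, 9, 1), (1, 5, 2)]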

Pyspark RDD collect first 163 Rows

Submitted by 不打扰是莪最后的温柔 on 2021-02-18 13:51:54
Question: Is there a way to get the first 163 rows of an RDD without converting it to a DataFrame? I've tried something like newrdd = rdd.take(163), but that returns a list, and rdd.collect() returns the whole RDD. Is there a way to do this? Or, if not, is there a way to convert a list into an RDD?

Answer 1: It is not very efficient, but you can zipWithIndex and filter:

rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

In practice it makes more sense to simply take and parallelize:

sc.parallelize(rdd.take(163))
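A self-contained sketch of both approaches from the answer, with the SparkContext setup and example data assumed:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(1000), 8)

# Approach 1: stay in RDD land by pairing each element with its index and filtering.
first_163 = rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

# Approach 2: take() returns a Python list; parallelize() turns it back into an RDD.
first_163_alt = sc.parallelize(rdd.take(163))

print(first_163.count(), first_163_alt.count())  # 163 163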

repartition() is not affecting RDD partition size

Submitted by 陌路散爱 on 2021-02-18 12:17:07
Question: I am trying to change the number of partitions of an RDD using the repartition() method. The method call on the RDD succeeds, but when I explicitly check the partition count using the rdd.partitions.size property, I get back the same number of partitions it originally had:

scala> rdd.partitions.size
res56: Int = 50

scala> rdd.repartition(10)
res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27

At this stage I perform some action like rdd.take(1) just to force evaluation...
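RDDs are immutable, so repartition() does not change the RDD it is called on; it returns a new RDD whose result has to be captured. A minimal PySpark sketch of the same check (partition counts chosen for illustration):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(1000), 50)
print(rdd.getNumPartitions())            # 50

# repartition() returns a new RDD; the original is left untouched.
repartitioned = rdd.repartition(10)
print(rdd.getNumPartitions())            # still 50
print(repartitioned.getNumPartitions())  # 10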

In Apache Spark, can I incrementally cache an RDD partition?

Submitted by 懵懂的女人 on 2021-02-11 13:56:45
Question: I was under the impression that both RDD execution and caching are lazy: namely, if an RDD is cached and only part of it is used, then the caching mechanism will only cache that part, and the other part will be computed on demand. Unfortunately, the following experiment seems to indicate otherwise:

val acc = new LongAccumulator()
TestSC.register(acc)
val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
  acc add 1
  v
}
rdd.persist()
val sliced = rdd.mapPartitions { itr =>
  itr.slice(0, 2)
}
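A PySpark sketch of the same experiment for anyone who wants to reproduce it; the accumulator counts how many elements the map function is actually invoked on (data size and partition count follow the question, everything else is illustrative):

import itertools
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

acc = sc.accumulator(0)

def tag(v):
    acc.add(1)        # count every element the map function touches
    return v

rdd = sc.parallelize(range(1, 101), 16).map(tag)
rdd.persist()

# Consume only the first 2 elements of every partition.
sliced = rdd.mapPartitions(lambda itr: itertools.islice(itr, 2))
sliced.count()

print(acc.value)      # compare against 100 to see how much of the RDD was computed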

How to sort a column with Date and time values in Spark?

Submitted by 假如想象 on 2021-02-08 15:12:08
Question: Note: I have this as a DataFrame in Spark. These time/date values constitute a single column in the DataFrame.

Input:
04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected output:
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM

Answer 1: As this format is not standard, you need to use the unix_timestamp function to parse the string and...
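A sketch along the lines of the answer, sorting by the parsed timestamp. The format string is an assumption matched to the sample data, and on Spark 3 the legacy date parser may be needed for this pattern.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# Uncomment on Spark 3 if the pattern below is rejected by the new parser:
# spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame(
    [("04-NOV-16 03.36.13.000000000 PM",),
     ("06-NOV-15 03.42.21.000000000 PM",),
     ("05-NOV-15 03.32.05.000000000 PM",),
     ("06-NOV-15 03.32.14.000000000 AM",)],
    ["dt"],
)

# Assumed format for these 'dd-MMM-yy hh.mm.ss.fractional AM/PM' strings;
# unix_timestamp has second precision, which is enough to order these rows.
fmt = "dd-MMM-yy hh.mm.ss.SSSSSSSSS a"
df.orderBy(F.unix_timestamp("dt", fmt)).show(truncate=False)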

Is there an effective partitioning method when using reduceByKey in Spark?

Submitted by 雨燕双飞 on 2021-02-07 14:21:45
Question: When I use reduceByKey or aggregateByKey, I'm confronted with partitioning problems, e.g. reduceByKey(_+_).map(code). In particular, if the input data is skewed, the partitioning problem becomes even worse when using the above methods. As a solution, I use the repartition method (for example, http://dev.sortable.com/spark-repartition/ is similar). This is good for partition distribution, but repartition is also expensive. Is there a way to solve the partitioning problem wisely?

Answer 1: You are...
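One widely used technique for skewed keys (not necessarily what the truncated answer proposes) is key salting, i.e. a two-stage aggregation; a minimal PySpark sketch with an assumed salt count and toy data:

import random
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

NUM_SALTS = 8  # assumption: tune to the degree of skew

pairs = sc.parallelize([("hot", 1)] * 1000 + [("cold", 1)] * 10)

salted_sums = (
    pairs
    # Stage 1: spread each key over NUM_SALTS salted keys so a single hot key
    # no longer funnels into one partition.
    .map(lambda kv: ((kv[0], random.randrange(NUM_SALTS)), kv[1]))
    .reduceByKey(lambda a, b: a + b)
    # Stage 2: drop the salt and combine the partial sums per original key.
    .map(lambda kv: (kv[0][0], kv[1]))
    .reduceByKey(lambda a, b: a + b)
)

print(salted_sums.collect())  # e.g. [('hot', 1000), ('cold', 10)]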

pyspark: groupby and then get max value of each group

Submitted by 末鹿安然 on 2021-02-07 13:12:55
Question: I would like to group by a value and then find the max value in each group using PySpark. I have the following code, but now I am a bit stuck on how to extract the max value.

# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')

# Create the triplet so I can index stuff
data_file = data_file.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))

# Group by the user i.e. r[0]
grouped = data_file.groupBy(lambda r: r[0])

# Here is where I'm stuck
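A hedged sketch of one way to finish this without groupBy: map each triplet to a (user, triplet) pair and reduceByKey with a comparison, so the whole max row survives. The inline sample data stands in for the file from the question.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Lines shaped like 'user item occurrences', standing in for some_file.txt.
lines = sc.parallelize(["u1 i1 3.0", "u1 i2 7.0", "u2 i1 5.0", "u2 i3 2.0"])

triplets = lines.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))

# Keep the full triplet as the value so the row with the max occurrences is returned,
# not just the number itself.
max_per_user = (
    triplets
    .map(lambda r: (r[0], r))
    .reduceByKey(lambda a, b: a if a[2] >= b[2] else b)
    .values()
)

print(max_per_user.collect())  # e.g. [('u1', 'i2', 7.0), ('u2', 'i1', 5.0)]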