rdd

Is there an API function to display “Fraction Cached” for an RDD?

Submitted by 落爺英雄遲暮 on 2021-02-07 10:59:26

Question: On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is Fraction Cached. How can I retrieve this percentage programmatically? I can use getStorageLevel() to get some information about RDD caching, but not Fraction Cached. Do I have to calculate it myself?

Answer 1: SparkContext.getRDDStorageInfo is probably the thing you're looking for. It returns an Array of RDDInfo which provides information about: Memory size.
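A rough PySpark sketch of calculating the fraction yourself: getRDDStorageInfo is not part of the public Python API, so this goes through the internal sc._jsc bridge to the JVM SparkContext (an assumption about the current internals, not a documented interface). The UI's Fraction Cached is numCachedPartitions divided by numPartitions.

```python
# Sketch only; assumes a live SparkContext `sc` (e.g. the PySpark shell) and
# uses the internal _jsc bridge, since getRDDStorageInfo is JVM-side only.
rdd = sc.parallelize(range(1000)).setName("demo").cache()
rdd.count()  # run an action so the cache is actually populated

for info in sc._jsc.sc().getRDDStorageInfo():
    fraction = info.numCachedPartitions() / info.numPartitions()
    print(info.name(), info.memSize(), "{:.0%} cached".format(fraction))
```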

How to convert an Iterable to an RDD

Submitted by 戏子无情 on 2021-02-07 10:45:26

Question: To be more specific, how can I convert a scala.Iterable to an org.apache.spark.rdd.RDD? I have an RDD of (String, Iterable[(String, Integer)]) and I want this to be converted into an RDD of (String, RDD[String, Integer]), so that I can apply a reduceByKey function to the internal RDD. E.g. I have an RDD where the key is the two-letter prefix of a person's name and the value is a List of pairs of the person's name and the hours they spent in an event. My RDD is: ("To", List(("Tom",50),("Tod","30"),("Tom",70
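The question is about Scala, but the underlying point holds in any Spark API: an RDD cannot contain other RDDs, so instead of nesting, flatten the grouped values back into key/value pairs and let reduceByKey do the per-person aggregation. A hedged PySpark sketch of that idea, with sample data mirroring the question:

```python
# PySpark sketch of the same idea as the Scala question: don't nest RDDs;
# flatten the Iterable back into pairs and aggregate with reduceByKey.
# Assumes a live SparkContext `sc`.
grouped = sc.parallelize([
    ("To", [("Tom", 50), ("Tod", 30), ("Tom", 70)]),
])

totals = (
    grouped
    .flatMap(lambda kv: [((kv[0], name), hours) for name, hours in kv[1]])
    .reduceByKey(lambda a, b: a + b)                 # total hours per (prefix, name)
    .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))   # back to (prefix, (name, hours))
)
print(totals.collect())  # [('To', ('Tom', 120)), ('To', ('Tod', 30))] in some order
```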

Replace groupByKey with reduceByKey in Spark

Submitted by 大憨熊 on 2021-02-07 04:28:19

Question: Hello, I often need to use groupByKey in my code, but I know it's a very heavy operation. Since I'm working to improve performance, I was wondering whether my approach of removing all groupByKey calls is efficient. What I do is create an RDD from another RDD, producing pairs of type (Int, Int): rdd1 = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)] and since I needed to obtain something like this: [(1, [2, 3]), (2, [3, 4]), (3, [5])] what I used was out = rdd1.groupByKey but since this approach might be
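A sketch of one common alternative, assuming the per-key lists really are the end goal: aggregateByKey builds each list with user-supplied merge functions instead of groupByKey's generic grouping. (The real performance win only appears when values can be reduced to something smaller than the full list, e.g. a sum with reduceByKey.)

```python
# Sketch: build the per-key lists with aggregateByKey instead of groupByKey.
# Assumes a live SparkContext `sc`.
rdd1 = sc.parallelize([(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)])

out = rdd1.aggregateByKey(
    [],                        # zero value: start each key with an empty list
    lambda acc, v: acc + [v],  # add a value within a partition
    lambda a, b: a + b,        # merge lists across partitions
)
print(sorted(out.collect()))   # [(1, [2, 3]), (2, [3, 4]), (3, [5])]
```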

Difference between loading a csv file into RDD and Dataframe in spark

Submitted by 孤街浪徒 on 2021-01-29 08:19:58

Question: I am not sure whether this specific question has been asked before. It could be a possible duplicate, but I was not able to find a use case pertaining to this. As we know, we can load a csv file directly into a dataframe, or we can load it into an RDD and then convert that RDD to a dataframe later. RDD = sc.textFile("pathlocation") We can apply some map, filter and other operations on this RDD and convert it into a dataframe. Alternatively, we can create a dataframe directly by reading a csv file: Dataframe =
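A short sketch of both loading paths side by side. The file path and column names here are hypothetical, and the shell's `sc` and `spark` (SparkSession) are assumed:

```python
# Hypothetical path and columns; assumes the shell's `sc` and `spark`.
path = "data/people.csv"

# 1) Load raw lines as an RDD, parse them yourself, then convert to a DataFrame.
lines = sc.textFile(path)
parsed = lines.map(lambda line: line.split(",")).map(lambda f: (f[0], int(f[1])))
df_from_rdd = parsed.toDF(["name", "age"])

# 2) Read the csv directly as a DataFrame; the reader handles parsing and schema.
df_direct = spark.read.csv(path, header=True, inferSchema=True)
```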

How can I get a distinct RDD of dicts in PySpark?

Submitted by 人走茶凉 on 2021-01-28 04:10:20

Question: I have an RDD of dictionaries, and I'd like to get an RDD of just the distinct elements. However, when I try to call rdd.distinct(), PySpark gives me the following error: TypeError: unhashable type: 'dict' at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD
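A minimal sketch of the usual workaround: distinct() needs hashable elements, so map each dict to a hashable representation first (here a frozenset of its items, which assumes the dict values are themselves hashable), deduplicate, and map back.

```python
# Sketch: dicts are unhashable, so deduplicate a hashable view of them instead.
# Assumes a live SparkContext `sc`.
dicts = sc.parallelize([{"a": 1}, {"a": 1}, {"b": 2}])

distinct_dicts = (
    dicts
    .map(lambda d: frozenset(d.items()))  # hashable stand-in for the dict
    .distinct()
    .map(dict)                            # turn the item pairs back into dicts
)
print(distinct_dicts.collect())  # [{'a': 1}, {'b': 2}] in some order
```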

Spark: Task not serializable Exception in forEach loop in Java

Submitted by 倖福魔咒の on 2021-01-28 00:56:39

Question: I'm trying to iterate over a JavaPairRDD and perform some calculations with the keys and values of the JavaPairRDD, then output the result for each pair into a processedData list. What I have already tried: making the variables I use inside the lambda function static; making the methods I call from the lambda foreach loop static; adding implements Serializable. Here is my code: List<String> processedData = new ArrayList<>(); JavaPairRDD<WebLabGroupObject, Iterable<WebLabPurchasesDataObject>> groupedByWebLabData
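The question is Java-specific, but the underlying pattern applies in PySpark too: closures that capture non-serializable state, and foreach loops that try to mutate a driver-side collection, both cause trouble. A hedged PySpark sketch of the usual restructuring, with made-up data in place of the WebLab objects: compute the per-group result with map and bring it back to the driver with collect.

```python
# Sketch with hypothetical data: instead of appending to an external list inside
# foreach (which only mutates the executors' copies), map and collect.
# Assumes a live SparkContext `sc`.
grouped = sc.parallelize([("groupA", [1, 2, 3]), ("groupB", [4, 5])])

def summarize(pair):
    key, values = pair
    return "{}: total={}".format(key, sum(values))

processed_data = grouped.map(summarize).collect()  # list of strings on the driver
print(processed_data)
```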

Does Spark write intermediate shuffle outputs to disk

Submitted by ぃ、小莉子 on 2021-01-26 16:46:28

Question: I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149: Spark's internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is
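A small PySpark sketch that makes the quoted behaviour observable: run two actions on the same shuffled RDD without caching it, and the second action reuses the shuffle files written by the first (the Spark UI shows the earlier stage as skipped).

```python
# Sketch: shuffle output is written to disk and reused across actions,
# even though the RDD was never explicitly persisted.
# Assumes a live SparkContext `sc`.
pairs = sc.parallelize(range(100)).map(lambda x: (x % 10, x))
sums = pairs.reduceByKey(lambda a, b: a + b)  # this step shuffles

sums.count()    # first action: performs the shuffle, writes shuffle files
sums.collect()  # second action: reuses the shuffle output; stage shows as skipped
```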

How to apply multiple filters in a for loop for pyspark

Submitted by 三世轮回 on 2021-01-23 11:09:18

Question: I am trying to apply a filter on several columns of an RDD. I want to pass in a list of indices as a parameter to specify which ones to filter on, but pyspark only applies the last filter. I've broken the code down into some simple test cases and tried the non-looped version, and they work. test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')] rdd = sc.parallelize(test_input, 1) # Index 0 needs to be longer than length 0 # Index 1 needs to be longer than length 1 for i in [0,1]: rdd =
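A sketch of the usual fix: the lambdas in the loop all close over the same variable i, and because RDD transformations are lazy they only read i after the loop has finished, so every filter sees its final value. Binding i as a default argument (or via functools.partial) freezes it per iteration.

```python
# Sketch: bind i at definition time so each lazily-evaluated filter keeps its own index.
# Assumes a live SparkContext `sc`.
test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]
rdd = sc.parallelize(test_input, 1)

for i in [0, 1]:
    rdd = rdd.filter(lambda x, i=i: len(x[i]) > i)

print(rdd.collect())  # [('0', '00')] -- the only row passing both filters
```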
