rdd

Spark list all cached RDD names and unpersist

霸气de小男生 submitted on 2019-12-19 03:19:10
Question: I am new to Apache Spark. I created several RDDs and DataFrames, cached them, and now I want to unpersist some of them with rddName.unpersist(), but I can't remember their names. I used sc.getPersistentRDDs, but the output does not include the names. I also used the browser to view the cached RDDs, but again there is no name information. Am I missing something?

Answer 1: @Dikei's answer is actually correct, but I believe what you are looking for is sc.getPersistentRDDs: scala> val rdd1 = sc
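The excerpt cuts off mid-example. As an illustration of the same idea, here is a minimal Scala sketch that lists the currently persisted RDDs and unpersists them; it assumes an existing SparkContext named sc, and note that the name field is only populated if setName was called before caching:

    // List every persisted RDD by id, name and storage level, then unpersist them all
    val cached = sc.getPersistentRDDs            // Map[Int, RDD[_]]
    cached.foreach { case (id, rdd) =>
      // rdd.name is null unless rdd.setName(...) was called before caching
      println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
    }
    cached.values.foreach(_.unpersist(blocking = true))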

Spark migrate sql window function to RDD for better performance

邮差的信 submitted on 2019-12-18 17:25:22
Question: A function should be executed for multiple columns in a data frame:

    def handleBias(df: DataFrame, colName: String, target: String = target) = {
      val w1 = Window.partitionBy(colName)
      val w2 = Window.partitionBy(colName, target)
      df.withColumn("cnt_group", count("*").over(w2))
        .withColumn("pre2_" + colName, mean(target).over(w1))
        .withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
        .drop("cnt_group")
    }

This can be written nicely as shown above in
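The excerpt is truncated, so as illustration only: one hedged way to avoid the window functions (which shuffle and sort per partition key) is to pre-aggregate per group and join the results back. This sketch mirrors the first two columns of handleBias above; the "pre_" column is omitted because cnt_foo_eq_1 is not shown in the excerpt:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{count, mean}

    // Hypothetical alternative: groupBy + join instead of Window functions
    def handleBiasNoWindow(df: DataFrame, colName: String, target: String): DataFrame = {
      // per (colName, target) counts, analogous to count("*").over(w2)
      val counts = df.groupBy(colName, target).agg(count("*").as("cnt_group"))
      // per colName mean of the target, analogous to mean(target).over(w1)
      val means  = df.groupBy(colName).agg(mean(target).as("pre2_" + colName))
      df.join(counts, Seq(colName, target))
        .join(means, Seq(colName))
    }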

Difference between sc.textFile and spark.read.text in Spark

与世无争的帅哥 submitted on 2019-12-18 16:51:51
Question: I am trying to read a simple text file into a Spark RDD and I see that there are two ways of doing so:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    sc = spark.sparkContext
    textRDD1 = sc.textFile("hobbit.txt")
    textRDD2 = spark.read.text('hobbit.txt').rdd

Then I look into the data and see that the two RDDs are structured differently:

    textRDD1.take(5)
    ['The king beneath the mountain', 'The king of carven stone', 'The lord of silver fountain',
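The answer is not included in the excerpt, but the core distinction can be sketched in Scala (assuming a SparkSession named spark and a local file "hobbit.txt"): sc.textFile yields plain strings, while spark.read.text yields rows with a single value column, so .rdd gives an RDD of Row.

    val sc = spark.sparkContext

    val lines: org.apache.spark.rdd.RDD[String] =
      sc.textFile("hobbit.txt")              // each element is a plain String

    val rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] =
      spark.read.text("hobbit.txt").rdd      // each element is a Row with one "value" column

    // To get back to plain strings from the DataFrame route:
    val linesAgain = rows.map(_.getString(0))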

Spark: java.io.IOException: No space left on device

断了今生、忘了曾经 submitted on 2019-12-18 16:39:09
Question: Now I am learning how to use Spark. I have a piece of code which can invert a matrix, and it works when the order of the matrix is small, like 100. But when the order of the matrix is big, like 2000, I get an exception like this:

    15/05/10 20:31:00 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/spark-local-20150510200122-effa/28/temp_shuffle_6ba230c3-afed-489b-87aa-91c046cadb22
    java.io.IOException: No space left on device

In my program I have lots of
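The excerpt is cut off, but the error itself points at Spark's scratch directory (spark.local.dir, /tmp by default) running out of space during the shuffle. A hedged configuration sketch, assuming a larger disk is mounted at a made-up path (on YARN the node manager's local dirs take precedence instead):

    import org.apache.spark.sql.SparkSession

    // Point shuffle/spill scratch space at a directory with enough free space.
    // "/mnt/bigdisk/spark-tmp" is an assumed path for illustration.
    val spark = SparkSession.builder()
      .appName("matrix-inverse")
      .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")
      .getOrCreate()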

Get the max value for each key in a Spark RDD

喜夏-厌秋 submitted on 2019-12-18 12:33:31
Question: What is the best way to return the max row (value) associated with each unique key in a Spark RDD? I'm using Python and I've tried Math max, mapping and reducing by keys, and aggregates. Is there an efficient way to do this? Possibly a UDF? I have, in RDD format:

    [(v, 3), (v, 1), (v, 1), (w, 7), (w, 1), (x, 3), (y, 1), (y, 1), (y, 2), (y, 3)]

And I need to return:

    [(v, 3), (w, 7), (x, 3), (y, 3)]

Ties can return the first value or a random one.

Answer 1: Actually you have a PairRDD. One of the best ways
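The answer is truncated; as a minimal sketch of the reduceByKey approach it alludes to (assuming an existing SparkContext sc and string keys for illustration):

    val pairs = sc.parallelize(Seq(
      ("v", 3), ("v", 1), ("v", 1), ("w", 7), ("w", 1),
      ("x", 3), ("y", 1), ("y", 1), ("y", 2), ("y", 3)
    ))

    // Keep the larger value per key; only one value per key crosses the shuffle.
    val maxPerKey = pairs.reduceByKey((a, b) => math.max(a, b))

    maxPerKey.collect().foreach(println)   // (v,3) (w,7) (x,3) (y,3), in some order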

Spill to disk and shuffle write spark

好久不见. submitted on 2019-12-18 11:34:43
Question: I'm getting confused about spill to disk and shuffle write. Using the default sort shuffle manager, we use an appendOnlyMap for aggregating and combining partition records, right? Then, when execution memory fills up, we start sorting the map, spilling it to disk, and then cleaning up the map for the next spill (if one occurs). My questions are: What is the difference between spill to disk and shuffle write? Both basically consist of creating a file on the local file system and also writing records. Admitting they are different, so
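The excerpt ends mid-question. For illustration only, here is a small Scala sketch (assuming an existing SparkContext sc) of a job whose map-side aggregation can both spill while the map grows and then produce shuffle-write files for its output; after running it, the stage's shuffle write and shuffle spill metrics in the Spark UI show the two quantities separately.

    val counts = sc.parallelize(1 to 10000000, 8)
      .map(i => (i % 100000, 1))    // many keys per partition, so the appendOnlyMap grows
      .reduceByKey(_ + _)           // map-side combine, then shuffle write of the combined output
    counts.count()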

How to print elements of particular RDD partition in Spark?

落花浮王杯 submitted on 2019-12-18 11:32:55
Question: How to print the elements of a particular partition, say the 5th, alone?

    val distData = sc.parallelize(1 to 50, 10)

Answer 1: Using Spark/Scala:

    val data = 1 to 50
    val distData = sc.parallelize(data, 10)
    distData.mapPartitionsWithIndex(
      (index: Int, it: Iterator[Int]) =>
        it.toList.map(x => if (index == 5) { println(x) }).iterator
    ).collect

produces:

    26
    27
    28
    29
    30

Answer 2: You could possibly use a counter against the foreachPartition() API to achieve it. Here is a Java program that prints the content of each partition
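The Java program in Answer 2 is cut off. As a hedged alternative to Answer 1, instead of calling println on the executors (where output goes to executor logs on a real cluster), one can return only the chosen partition's elements to the driver and print them there:

    val distData = sc.parallelize(1 to 50, 10)

    val fifth = distData.mapPartitionsWithIndex { (index, it) =>
      if (index == 5) it else Iterator.empty   // keep only partition 5, drop the rest
    }.collect()

    fifth.foreach(println)   // 26 27 28 29 30 for this example data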

pyspark - Grouping and calculating data

送分小仙女□ submitted on 2019-12-18 09:45:36
Question: I have the following csv file:

    Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
    0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
    1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
    2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
    3,1424696633919,1424696631928385290,-5.9427185,0.6761626999999999,8.128204,a,nexus4,nexus4_1,stand

I have to create a RDD where
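The excerpt cuts off before saying what has to be calculated, so the following is only a generic Scala sketch of the group-and-calculate pattern for this file layout: parse each csv row, key it by (User, gt), and average the z reading per group. The file path and the choice of aggregation (mean of z) are assumptions for illustration.

    val raw = sc.textFile("activity.csv")                // assumed path
    val header = raw.first()
    val rows = raw.filter(_ != header).map(_.split(","))

    // Columns: 0 Index, 1 Arrival_Time, 2 Creation_Time, 3 x, 4 y, 5 z, 6 User, 7 Model, 8 Device, 9 gt
    val meanZByUserAndLabel = rows
      .map(r => ((r(6), r(9)), (r(5).toDouble, 1L)))     // key: (User, gt), value: (z, count)
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

    meanZByUserAndLabel.collect().foreach(println)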

How to partition Spark RDD when importing Postgres using JDBC?

荒凉一梦 submitted on 2019-12-18 06:53:39
Question: I am importing a Postgres database into Spark. I know that I can partition on import, but that requires that I have a numeric column (I don't want to use the value column because it's all over the place and doesn't maintain order):

    df = spark.read.format('jdbc').options(url=url, dbtable='tableName', properties=properties).load()
    df.printSchema()

    root
     |-- id: string (nullable = false)
     |-- timestamp: timestamp (nullable = false)
     |-- key: string (nullable = false)
     |-- value: double (nullable =
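The question uses PySpark, but one common workaround for partitioning without a numeric column can be sketched in Scala using the predicates overload of DataFrameReader.jdbc, where each WHERE clause becomes one Spark partition. The url, credentials, and timestamp boundaries below are assumptions for illustration.

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "postgres")
    props.setProperty("password", "secret")        // placeholder credentials
    props.setProperty("driver", "org.postgresql.Driver")

    // One partition per predicate; each issues its own SELECT against Postgres.
    val predicates = Array(
      "timestamp <  '2019-01-01'",
      "timestamp >= '2019-01-01' AND timestamp < '2019-07-01'",
      "timestamp >= '2019-07-01'"
    )

    val df = spark.read.jdbc("jdbc:postgresql://host:5432/db", "tableName", predicates, props)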