rdd

Spark list all cached RDD names and unpersist

霸气de小男生 submitted on 2019-12-19 03:19:10
Question: I am new to Apache Spark. I created several RDDs and DataFrames, cached them, and now I want to unpersist some of them with rddName.unpersist(), but I can't remember their names. I used sc.getPersistentRDDs, but the output does not include the names. I also used the browser to view the cached RDDs, but again there is no name information. Am I missing something?

Answer 1: @Dikei's answer is actually correct, but I believe what you are looking for is sc.getPersistentRDDs: scala> val rdd1 = sc
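The excerpt cuts off mid-example. As an illustration of the same idea, here is a minimal Scala sketch that lists the currently persisted RDDs and unpersists them; it assumes an existing SparkContext named sc, and note that the name field is only populated if setName was called before caching:

    // List every persisted RDD by id, name and storage level, then unpersist them all
    val cached = sc.getPersistentRDDs            // Map[Int, RDD[_]]
    cached.foreach { case (id, rdd) =>
      // rdd.name is null unless rdd.setName(...) was called before caching
      println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
    }
    cached.values.foreach(_.unpersist(blocking = true))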

Spark migrate sql window function to RDD for better performance

邮差的信 submitted on 2019-12-18 17:25:22
Question: A function should be executed for multiple columns in a data frame:

    def handleBias(df: DataFrame, colName: String, target: String = target) = {
      val w1 = Window.partitionBy(colName)
      val w2 = Window.partitionBy(colName, target)
      df.withColumn("cnt_group", count("*").over(w2))
        .withColumn("pre2_" + colName, mean(target).over(w1))
        .withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
        .drop("cnt_group")
    }

This can be written nicely as shown above in
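The excerpt is truncated, so as illustration only: one hedged way to avoid the window functions (which shuffle and sort per partition key) is to pre-aggregate per group and join the results back. This sketch mirrors the first two columns of handleBias above; the "pre_" column is omitted because cnt_foo_eq_1 is not shown in the excerpt:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{count, mean}

    // Hypothetical alternative: groupBy + join instead of Window functions
    def handleBiasNoWindow(df: DataFrame, colName: String, target: String): DataFrame = {
      // per (colName, target) counts, analogous to count("*").over(w2)
      val counts = df.groupBy(colName, target).agg(count("*").as("cnt_group"))
      // per colName mean of the target, analogous to mean(target).over(w1)
      val means  = df.groupBy(colName).agg(mean(target).as("pre2_" + colName))
      df.join(counts, Seq(colName, target))
        .join(means, Seq(colName))
    }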

Difference between sc.textFile and spark.read.text in Spark

与世无争的帅哥 submitted on 2019-12-18 16:51:51
Question: I am trying to read a simple text file into a Spark RDD and I see that there are two ways of doing so:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    sc = spark.sparkContext
    textRDD1 = sc.textFile("hobbit.txt")
    textRDD2 = spark.read.text('hobbit.txt').rdd

Then I look into the data and see that the two RDDs are structured differently:

    textRDD1.take(5)
    ['The king beneath the mountain', 'The king of carven stone', 'The lord of silver fountain',
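The answer is not included in the excerpt, but the core distinction can be sketched in Scala (assuming a SparkSession named spark and a local file "hobbit.txt"): sc.textFile yields plain strings, while spark.read.text yields rows with a single value column, so .rdd gives an RDD of Row.

    val sc = spark.sparkContext

    val lines: org.apache.spark.rdd.RDD[String] =
      sc.textFile("hobbit.txt")              // each element is a plain String

    val rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] =
      spark.read.text("hobbit.txt").rdd      // each element is a Row with one "value" column

    // To get back to plain strings from the DataFrame route:
    val linesAgain = rows.map(_.getString(0))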

Spark: java.io.IOException: No space left on device

断了今生、忘了曾经 submitted on 2019-12-18 16:39:09
Question: Now I am learning how to use Spark. I have a piece of code which can invert a matrix, and it works when the order of the matrix is small, like 100. But when the order of the matrix is big, like 2000, I get an exception like this:

    15/05/10 20:31:00 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/spark-local-20150510200122-effa/28/temp_shuffle_6ba230c3-afed-489b-87aa-91c046cadb22
    java.io.IOException: No space left on device

In my program I have lots of
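The excerpt is cut off, but the error itself points at Spark's scratch directory (spark.local.dir, /tmp by default) running out of space during the shuffle. A hedged configuration sketch, assuming a larger disk is mounted at a made-up path (on YARN the node manager's local dirs take precedence instead):

    import org.apache.spark.sql.SparkSession

    // Point shuffle/spill scratch space at a directory with enough free space.
    // "/mnt/bigdisk/spark-tmp" is an assumed path for illustration.
    val spark = SparkSession.builder()
      .appName("matrix-inverse")
      .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")
      .getOrCreate()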

Get the max value for each key in a Spark RDD

喜夏-厌秋 submitted on 2019-12-18 12:33:31
Question: What is the best way to return the max row (value) associated with each unique key in a Spark RDD? I'm using Python and I've tried Math max, mapping and reducing by keys, and aggregates. Is there an efficient way to do this? Possibly a UDF? I have, in RDD format:

    [(v, 3), (v, 1), (v, 1), (w, 7), (w, 1), (x, 3), (y, 1), (y, 1), (y, 2), (y, 3)]

And I need to return:

    [(v, 3), (w, 7), (x, 3), (y, 3)]

Ties can return the first value or a random one.

Answer 1: Actually you have a PairRDD. One of the best ways
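The answer is truncated; as a minimal sketch of the reduceByKey approach it alludes to (assuming an existing SparkContext sc and string keys for illustration):

    val pairs = sc.parallelize(Seq(
      ("v", 3), ("v", 1), ("v", 1), ("w", 7), ("w", 1),
      ("x", 3), ("y", 1), ("y", 1), ("y", 2), ("y", 3)
    ))

    // Keep the larger value per key; only one value per key crosses the shuffle.
    val maxPerKey = pairs.reduceByKey((a, b) => math.max(a, b))

    maxPerKey.collect().foreach(println)   // (v,3) (w,7) (x,3) (y,3), in some order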

Spill to disk and shuffle write spark

好久不见. submitted on 2019-12-18 11:34:43
Question: I'm getting confused about spill to disk and shuffle write. Using the default sort shuffle manager, we use an appendOnlyMap for aggregating and combining partition records, right? Then, when execution memory fills up, we start sorting the map, spilling it to disk, and then cleaning up the map for the next spill (if one occurs). My questions are: What is the difference between spill to disk and shuffle write? Both basically consist of creating a file on the local file system and also writing records. Admitting they are different, so
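The excerpt ends mid-question. For illustration only, here is a small Scala sketch (assuming an existing SparkContext sc) of a job whose map-side aggregation can both spill while the map grows and then produce shuffle-write files for its output; after running it, the stage's shuffle write and shuffle spill metrics in the Spark UI show the two quantities separately.

    val counts = sc.parallelize(1 to 10000000, 8)
      .map(i => (i % 100000, 1))    // many keys per partition, so the appendOnlyMap grows
      .reduceByKey(_ + _)           // map-side combine, then shuffle write of the combined output
    counts.count()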

How to print elements of particular RDD partition in Spark?

落花浮王杯 submitted on 2019-12-18 11:32:55
Question: How to print the elements of a particular partition, say the 5th, alone?

    val distData = sc.parallelize(1 to 50, 10)

Answer 1: Using Spark/Scala:

    val data = 1 to 50
    val distData = sc.parallelize(data, 10)
    distData.mapPartitionsWithIndex(
      (index: Int, it: Iterator[Int]) =>
        it.toList.map(x => if (index == 5) { println(x) }).iterator
    ).collect

produces:

    26
    27
    28
    29
    30

Answer 2: You could possibly use a counter against the foreachPartition() API to achieve it. Here is a Java program that prints the content of each partition
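The Java program in Answer 2 is cut off. As a hedged alternative to Answer 1, instead of calling println on the executors (where output goes to executor logs on a real cluster), one can return only the chosen partition's elements to the driver and print them there:

    val distData = sc.parallelize(1 to 50, 10)

    val fifth = distData.mapPartitionsWithIndex { (index, it) =>
      if (index == 5) it else Iterator.empty   // keep only partition 5, drop the rest
    }.collect()

    fifth.foreach(println)   // 26 27 28 29 30 for this example data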

pyspark - Grouping and calculating data

送分小仙女□ submitted on 2019-12-18 09:45:36
Question: I have the following csv file:

    Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
    0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
    1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
    2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
    3,1424696633919,1424696631928385290,-5.9427185,0.6761626999999999,8.128204,a,nexus4,nexus4_1,stand

I have to create a RDD where
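The excerpt cuts off before saying what has to be calculated, so the following is only a generic Scala sketch of the group-and-calculate pattern for this file layout: parse each csv row, key it by (User, gt), and average the z reading per group. The file path and the choice of aggregation (mean of z) are assumptions for illustration.

    val raw = sc.textFile("activity.csv")                // assumed path
    val header = raw.first()
    val rows = raw.filter(_ != header).map(_.split(","))

    // Columns: 0 Index, 1 Arrival_Time, 2 Creation_Time, 3 x, 4 y, 5 z, 6 User, 7 Model, 8 Device, 9 gt
    val meanZByUserAndLabel = rows
      .map(r => ((r(6), r(9)), (r(5).toDouble, 1L)))     // key: (User, gt), value: (z, count)
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

    meanZByUserAndLabel.collect().foreach(println)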

How to partition Spark RDD when importing Postgres using JDBC?

荒凉一梦 submitted on 2019-12-18 06:53:39
Question: I am importing a Postgres database into Spark. I know that I can partition on import, but that requires that I have a numeric column (I don't want to use the value column because it's all over the place and doesn't maintain order):

    df = spark.read.format('jdbc').options(url=url, dbtable='tableName', properties=properties).load()
    df.printSchema()

    root
     |-- id: string (nullable = false)
     |-- timestamp: timestamp (nullable = false)
     |-- key: string (nullable = false)
     |-- value: double (nullable =
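The question uses PySpark, but one common workaround for partitioning without a numeric column can be sketched in Scala using the predicates overload of DataFrameReader.jdbc, where each WHERE clause becomes one Spark partition. The url, credentials, and timestamp boundaries below are assumptions for illustration.

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "postgres")
    props.setProperty("password", "secret")        // placeholder credentials
    props.setProperty("driver", "org.postgresql.Driver")

    // One partition per predicate; each issues its own SELECT against Postgres.
    val predicates = Array(
      "timestamp <  '2019-01-01'",
      "timestamp >= '2019-01-01' AND timestamp < '2019-07-01'",
      "timestamp >= '2019-07-01'"
    )

    val df = spark.read.jdbc("jdbc:postgresql://host:5432/db", "tableName", predicates, props)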